2019-05-09

Terraform is Terrible: Part 4

If I'd just stopped here I'd probably have been fine. I mean it. Terraform is a janky, buggy, badly-documented, and brittle piece of software, but I'd finally fought it long enough to get a working VM out of it. I was going to be able to sleep that night. And in retrospect, maybe I should have quit on a high note. I'd gotten enough material out of my suffering for a blog post or three and, hopefully, other people wouldn't step on the same landmines I had as I was trying to get Terraform working.

Critics might say, "Wait a sec, all your problems are with Azure/modules/Windows resources. Terraform works great on AWS!" and that may be so. I don't know. I use Azure for my cloud compute, and nowhere on the Terraform website does it claim that this isn't an option, or that trying to use Azure with Terraform is an open invitation to waste your evening wrestling with errors from within nested objects you don't see and can't control, all the while lamenting all your life's choices that brought you to test out this piece of trash program.

If Azure isn't a functional, supported platform for their tool, they shouldn't pretend like it is. Remove it, call it an alpha test, or at the very least put a big warning at the top and bottom of every page in big red letters: "Azure is NOT a 1st-Party Provider for Terraform!!! Use at your own risk!!!"

At least then I'd have been duly warned. Until that happens, if you want to run an Azure service with Terraform, you are but lambs for the slaughter. I really can't overstate this. Terraform is broken. Its tools do not work as described.

Back to my woes. I'd fought with Terraform. I'd fought with Terraform modules. I'd fought with bad Terraform deployment defaults. I finally had a working Terraform config. terraform apply and terraform destroy worked as expected now.

And as I said, if I had stopped there, I'd probably have been fine. But I didn't, because a single VM sitting in its own subnet is not what I wanted, because that's not what most cloud services look like. It had taken me 93 minutes to get that initial VM deployed. I felt bolstered to try to build a very simple, but very usable, second deployment from there. There was just one little catch.

I already had a cloud service.

Most of us do. You're not typically building a brand new service entirely from scratch every time you want to deploy something to the cloud. You have existing machines, old networking configs, and all sorts of legacy devices lying around doing important odds and ends, and that's how your business runs. If you were using hand-tuned scripts to manage everything yesterday, I am sure as hell not going to tell you to start migrating your resources over to Terraform tomorrow.

Because you can't.

Terraform is so fragile that it is a Herculean labor to lean on it to move your cloud infrastructure over to Terraform configs at a pace that fits you and your service demands. In theory, Terraform supports "data" sources. Like a resource, data sources are Terraform objects that wrap around existing cloud components so you can leverage them in your config. In theory, if you already have a VM running in Azure, you can tell that to Terraform and then create something new using Terraform's knowledge about that existing VM. In theory, it might look like this:

data "azurerm_virtual_machine" "existingwebserver" {
  name                = "oldwebservervm1"
  resource_group_name = "webserver-rg"
}

resource "azurerm_virtual_machine" "newwebserver" {
  name                  = "newwebservervm1"
  location              = "${data.azurerm_virtual_machine.existingwebserver.location}"
  resource_group_name   = "${data.azurerm_virtual_machine.existingwebserver.resource_group_name}"
  ...[more config here]...
}

The idea here is that we don't explicitly define a location or a resource group for our new web server VM, but we define a data source object with the name and resource group name for something we know already exists. We can reference what we found when we looked for it, then set the location and the resource group name to match the values of the "existingwebserver" Terraform object. Sounds great, right?

You've probably already noticed that we really did explicitly define a resource group name, in the data block, because we had to define it. Azure data sources are much, much simpler and less versatile than AWS data sources, and they lack an important feature called filtering. With filtering, you wouldn't necessarily need to define a name and a resource group. You could query your cloud service and pick the right VM based on, say, just its name. Then whatever its resource group happens to be wouldn't need to be hard-coded in the configuration file. As far as Azure networking data sources are concerned, you have two required fields, name and resource group name, and that's it. If you don't already know what you're looking for, Terraform sure ain't gonna help you find it.

This is a big deal.

It matters because I have a couple of different virtual networks living in Azure and their names are defined based on their geographical location. So I might have "West-US-2-VNET-133" and "Central-US-VNET-2496" and if I put a VM in one of those locations, I want it to automagically look up the name of the right vnet to use based on one very simple criterion: find the vnet that sits in the same geographical location as the new VM and whose resource group name is "InternalNet".

That's it. That's all I want.

Because I cannot query Azure with a data source, I need to give both a vnet name and a vnet resource group and, frankly, I only know one of those off the top of my head. The other is location-dependent and this platform-agnostic tool doesn't have a built-in mechanism for even grabbing a list through which I could iterate or pattern match or... or anything.

What a fool I have been.

I read. I researched. I found external data sources. External data sources look useful, but the documentation around them is wrapped in multiple warnings:

Warning: This mechanism is provided as an "escape hatch" for exceptional situations where a first-class Terraform provider is not more appropriate. Its capabilities are limited in comparison to a true data source[.]

That's right. Querying your existing infrastructure ("There's something out there in my cloud and I know a name or a resource group, but not both, can you please find it?") is what Terraform considers an "exceptional situation". An external data source is really just an open-ended exec() mechanism that allows you to plug in a script or another piece of code that can go and actually talk to your cloud and get answers that Terraform doesn't know how to ask.
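For the curious, here is a rough sketch of what such a plug-in program could look like in Python. The part I'm sure of is the contract: Terraform hands the program a JSON object on stdin, expects a flat JSON object of string values on stdout, and treats a non-zero exit code (with a message on stderr) as failure. Everything else is an assumption made up for illustration: the KNOWN_VNETS list stands in for a real Azure query, the location strings I've paired with my vnet names are guesses, and the function names and the "resource_group"/"location" query keys are hypothetical.

```python
import json
import sys

# Hypothetical inventory standing in for a real Azure query.
# The name-to-location pairing is assumed for this sketch.
KNOWN_VNETS = [
    {"name": "West-US-2-VNET-133", "resource_group": "InternalNet", "location": "westus2"},
    {"name": "Central-US-VNET-2496", "resource_group": "InternalNet", "location": "centralus"},
]


def find_vnet(vnets, resource_group, location):
    """Return the one vnet in resource_group whose location matches."""
    matches = [
        v for v in vnets
        if v["resource_group"] == resource_group and v["location"] == location
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one vnet in {resource_group}/{location}, "
            f"found {len(matches)}"
        )
    return matches[0]


def handle_query(raw_query, vnets=KNOWN_VNETS):
    """Turn the JSON query text from Terraform into the JSON result text."""
    query = json.loads(raw_query)
    vnet = find_vnet(vnets, query["resource_group"], query["location"])
    # The result must be a flat JSON object whose values are all strings.
    return json.dumps({"name": vnet["name"], "resource_group": vnet["resource_group"]})


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Read one query, write one result, and return a process exit code."""
    try:
        result = handle_query(stdin.read())
    except (KeyError, ValueError, LookupError) as err:
        # Bad JSON, a missing query key, or no unique match: report and fail.
        print(f"vnet lookup failed: {err}", file=stderr)
        return 1
    print(result, file=stdout)
    return 0
```

To run it as an actual script you would end the file with sys.exit(main()) and swap the hard-coded list for a real call out to Azure. Which is to say, you would write the custom script anyway.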

In other words, I spent just a hair under 4 hours getting kicked in the teeth by Terraform, trying to replace my custom scripts with something better, only to discover that the best way to interact with my Azure resources in Terraform... is with a custom script.

My current scripts already do this, without difficulty, and without Terraform:

Get-AzureRmVirtualNetwork -ResourceGroupName "InternalNet" | ? { $_.Location -eq $vm.Location }

That was it. That was the last straw. I deleted the resource group (through the Azure portal, if you were wondering). I shut my machine down. I stood up and I took a walk. I needed to clear my head. I needed to think about what I'd just experienced. I needed a drink.

I'm not looking forward to learning libcloud on Python. But what choice do I have?

Next time: Where do we go from here? Bugger this.
