2019-05-08

Terraform is Terrible: Part 3

After I took a look around the Github issues for the (as I would soon learn) utterly broken AzureRM Terraform modules, I found a couple of proposed solutions to my compute problems. One suggested using the latest version of the module by fetching it straight from Github:

module "compute" {
-   source            = "Azure/compute/azurerm"
-   version           = "1.2.0"
+   source            = "github.com/Azure/terraform-azurerm-compute.git"

This is a neat trick for distributing Terraform modules, so that's what I did. Which meant I needed to install Git on my machine and then run Terraform from inside a Git Bash window instead of a command prompt, but whatever! I tried it.

It still didn't work.

Another comment in a different open Github issue said to add is_windows_image = true, so I did that too, and this finally convinced Terraform that my vm_os_simple = "WindowsServer" meant I wanted a Windows server. I ran "terraform init" and "terraform plan". Again. And again, my deployment failed.
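For reference, the module block at this point looked something like this (reconstructed from memory, with the same throwaway credentials I'd been using all along):

module "compute" {
  source            = "github.com/Azure/terraform-azurerm-compute.git"
  location          = "${var.location}"
  admin_username    = "plankton"
  admin_password    = "Password1234!"
  vm_os_simple      = "WindowsServer"
  is_windows_image  = true # the magic line from the Github issue
}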

The error this time was about an invalid storage account type for my region.

But I hadn't defined a storage account type. The module's supposed to do that for me.

The online examples of using the AzureRM compute module didn't mention storage account types. At all. I'd tried to create a VM in a region, the module had defined the storage account type for the VM's OS disk as "Premium_LRS", and the region I'd chosen didn't offer "Premium_LRS". I didn't see a way to change it. So, like a battered spouse who's convinced it's something they did wrong, I went back to reading the now highly dubious Terraform docs. I found nothing about overriding module defaults, and I couldn't find an example of how to print out the secret nested attributes inside a compute module object so I could define my own values. I still don't know what to put here:

module "compute" {
  source            = "github.com/Azure/terraform-azurerm-compute.git"
  location          = "${var.location}"
  admin_username    = "plankton"
  admin_password    = "Password1234!"
  vm_os_simple      = "WindowsServer"
  ?????????????     = "Standard_LRS"
}

So I yanked out the modules. All of them. There was an AzureRM "compute" module and a "network" module, and in my opinion they both need to be taken out back and shot. I rewrote my dinky little main.tf config file with full resources this time:

provider "azurerm" {
  version         = "=1.27.0"
  subscription_id = "bba8a111-a014-4dbf-aa90-9692362fd971"
}

resource "azurerm_resource_group" "mygroup" {
  name     = "${var.resource_prefix}-test"
  location = "${var.location}"
}

resource "azurerm_availability_set" "myavailset" {
  name                = "${var.resource_prefix}-availset"
  location            = "${var.location}"
  resource_group_name = "${azurerm_resource_group.mygroup.name}"
  managed             = true # poorly-documented but required argument
}

resource "azurerm_virtual_network" "myvnet" {
  name                = "${var.resource_prefix}-vnet"
  location            = "${azurerm_resource_group.mygroup.location}"
  resource_group_name = "${azurerm_resource_group.mygroup.name}"
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "mysubnet" {
  name                 = "Subnet-1"
  resource_group_name  = "${azurerm_resource_group.mygroup.name}"
  virtual_network_name = "${azurerm_virtual_network.myvnet.name}"
  address_prefix       = "10.0.1.0/24"
}

resource "azurerm_network_interface" "myvnetif" {
  name                = "${var.resource_prefix}-nic"
  location            = "${azurerm_resource_group.mygroup.location}"
  resource_group_name = "${azurerm_resource_group.mygroup.name}"

  ip_configuration {
    name                          = "ipconfig-1"
    subnet_id                     = "${azurerm_subnet.mysubnet.id}"
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_virtual_machine" "myvm" {
  name                  = "${var.resource_prefix}-vm1"
  location              = "${azurerm_resource_group.mygroup.location}"
  resource_group_name   = "${azurerm_resource_group.mygroup.name}"
  availability_set_id   = "${azurerm_availability_set.myavailset.id}"

  vm_size               = "Standard_D2_v2"

  network_interface_ids = ["${azurerm_network_interface.myvnetif.id}"]

  # pull preferred values here from the output of "az vm image list"
  storage_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2019-Datacenter"
    version   = "latest"
  }

  storage_os_disk {
    name              = "myosdisk1"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Standard_LRS"
  }

  os_profile {
    computer_name  = "${var.resource_prefix}-vm1"
    admin_username = "${var.admin_username}"
    admin_password = "${var.admin_password}"
  }

  os_profile_windows_config {
    provision_vm_agent = false
  }

  boot_diagnostics {
    enabled     = false
    storage_uri = "" # required, even if enabled is false
  }

  delete_os_disk_on_termination    = true
  delete_data_disks_on_termination = true
}

This creates a longer file that's harder to read, but at least it's not totally broken. Right? (Spoilers: It is.)
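(In case you're wondering about all the var.* references: they point at a separate variables.tf, which looked roughly like this. The defaults below are placeholders, not my actual values.)

variable "location" {
  default = "westus2" # placeholder region
}

variable "resource_prefix" {
  default = "tftest"
}

variable "admin_username" {
  default = "plankton"
}

variable "admin_password" {
  default = "Password1234!"
}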

I'm going to skip over how I had to figure out that an Azure availability set resource requires you to add the line managed = true or else your deployment will fail. I'm going to skip over the fact that boot_diagnostics.storage_uri needs to be set even when boot_diagnostics.enabled is false.

And yet this deployment failed, too. There's another unusable default in the availability set resource: it defines the number of fault domains as 3, but for some reason Azure would only accept a value between 1 and 2, inclusive. (Don't ask me why that's the range. The allowed range differs depending on whether the availability set is managed or unmanaged, and which of those you can use depends on whether your VM uses managed disks, a connection Terraform doesn't seem to make when planning out your deployment.)
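For what it's worth, the override I couldn't find at the time appears to be platform_fault_domain_count. A sketch of what the resource probably should have looked like, assuming my region allowed 2 fault domains:

resource "azurerm_availability_set" "myavailset" {
  name                        = "${var.resource_prefix}-availset"
  location                    = "${var.location}"
  resource_group_name         = "${azurerm_resource_group.mygroup.name}"
  managed                     = true
  platform_fault_domain_count = 2 # default is 3; drop it to whatever the region actually supports
}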

Now well and properly pissed off, I deleted the availability set resource and the VM's mention of it. I like availability sets and I think more people should use them, but I'd stopped caring about good service hygiene and I just wanted something, anything, to work. Contempt and frustration are both bad signs to see in your management tools.

- resource "azurerm_availability_set" "myavailset" {
-   name                = "${var.resource_prefix}-availset"
-   location            = "${var.location}"
-   resource_group_name = "${azurerm_resource_group.mygroup.name}"
-   managed             = true # poorly-documented but required argument
- }

resource "azurerm_virtual_machine" "myvm" {
  name                  = "${var.resource_prefix}-vm1"
  location              = "${azurerm_resource_group.mygroup.location}"
  resource_group_name   = "${azurerm_resource_group.mygroup.name}"
- availability_set_id   = "${azurerm_availability_set.myavailset.id}"

  vm_size               = "Standard_D2_v2"

I finally, finally, ran the magic commands:

rm -rf .terraform *.tfstate # Always do this, I don't care what the docs say
terraform init
terraform plan -out my.plan
terraform apply my.plan

And it worked! I had a VM! I checked the clock. It had been 1 hour and 33 minutes from my first attempt to run main.tf to a successful deployment of something, anything, I could call a win. Those 93 minutes of my life I'll never get back didn't include installing Terraform or the Azure CLI. They went to copying and pasting config examples from Github and the Terraform website, desperately trying to get the damn thing to work, plus installing Git and refetching the modules a dozen times or so. But I had finally gotten something working. I was relieved. Everything was going to be OK, because the first deployment is always the hardest, right?

What a fool I was.

Next time: A data source without the data is just a source... of suffering.
