2019-05-24

Taco Bell Programming and The Time I Summoned The Ghost of Doug McIlroy (Who Is Still Alive at the Time of This Writing)

Don't overthink things.

If I had one piece of advice for software developers, engineers, and other folks working in careers that value attention to details, it would be this. Don't overthink things.

Think about things, sure. I'm not advocating an "ignorance is bliss" policy or saying that the simplest solution is the only correct one. Far from it; if you have a problem, you should consider all the variables before you implement a solution if you have the luxury of time to do so. Too much time is wasted on trying to forge a perfect bespoke widget that is the exact size and shape of the problem when a much more generic option exists that can get the job done in a fraction of the time.

A recent BSD Now episode (#291) discussed an article about Taco Bell Programming. The idea is that Taco Bell is one of the most successful fast food chains in the US and its entire menu is composed of only a few basic ingredients: meat, cheese, lettuce, tomato, onion, flour tortilla, corn flour tortilla, spices. All basic stuff, but they can be reorganized and arranged into a dozen or more permutations.

It reminded me of the 1986 Jon Bentley challenge. The challenge was simple enough by today's standards: given a text file, do a word frequency analysis on it and print the n most common words in the file. He enlisted Don Knuth to solve the problem, which he did, using WEB, the literate programming system he'd designed. Knuth's program was mathematically elegant and used a bespoke data structure perfectly tuned to solve the problem. It also had bugs that made it incorrect.

Doug McIlroy, when presented with Knuth's WEB program, complimented its design and cleverness, then produced a new implementation of it in a six-part UNIX shell script:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q

Not only is McIlroy's solution more correct than Knuth's custom WEB app (har har), it's short and sweet. In fact, it's trivial to understand if you use UNIX regularly and still easy to digest in simple terms if you don't. It uses four different programs found in the base system, and it only calls sed because head hadn't been written yet. It doesn't require you to download and install the WEB runtime and compile "knuth-1.wb" into a binary that you will only use once in your lifetime.
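
If you want to see it run, save the pipeline as a script and hand it a word count on the command line and some text on stdin. A quick sketch; the script name and input file below are just examples:

# print the 10 most common words in moby-dick.txt
sh wordfreq.sh 10 < moby-dick.txt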

McIlroy's solution was Taco Bell Programming at its finest, and he taught a valuable lesson that is all too often forgotten, especially now in this era of cloud computing/machine learning/blockchain/distributed buzzword soup, when companies push their complex software as "platforms". The Taco Bell Programming philosophy says that new systems mean new problems, which reminds me of the time I had to pull a McIlroy move of my own several years ago.

I was just a junior sysadmin sitting at my desk, wondering what TV I was going to watch that evening when my boss and his boss walked into my office. When that happens, it's never a good thing.

"John's on vacation and there's a hotfix that the Green Team needs to deploy, can you do it?"

I was on the Blue Team, and we were almost entirely separate from the Green Team, connected only by a few endpoints that relayed our data over to them sometimes, and they never invited us to see how their sausage was made. I say "them", but really the Green Team was one guy: John. John was a senior ops engineer and there was no junior to support him. He inherited the role when the last senior left, and he never backfilled a replacement because he never found a suitable apprentice.

At one point, after his rejection of the umpteenth potential candidate, a coworker asked him, "John, remember when you started working here? If that version of you showed up for an interview today... would you hire him?" John still never found a replacement. A few years later he'd be forced by management to take on one of my Blue Team vendors, and I hope that worked out for everyone.

But on this particular afternoon John was the only ops engineer the Green Team had and John was on vacation for at least a couple more days. They needed to deploy a hotfix and since John wasn't available, they just needed a keyboard monkey with the right credentials to hit the keys.

I agreed. How could I refuse?

I was assured that this was A Big Deal: customer-impacting, gotta get it fixed, now-now-now, and that the Green Team Developers were hard at work upstairs — developers were never on the same floor as the lowly operations guys — hammering out a fix for the problem. "OK!" I thought. I will just wait for the devs to finish writing their thing, testing it, confirming that it works in their test environment, and then I'll just run a script or swap out a binary or twelve, restart a service or two, and everyone will be happy.

I wanted to make everyone happy. I was young.

I sat in my office, fiddling with whatever I was supposed to be doing, but anxious about getting this hotfix deployed. I'd never touched the Green Team servers. I could screw it up. "You won't screw it up," I was assured by my boss. I was to be given the usernames and passwords I needed, all the server names were documented. All the repair instructions were going to be handed to me on a silver platter. The dev team was going to make absolutely certain that I could not mess up their system. I couldn't. They built it to be un-screw-uppable.

Throughout that afternoon, I'd get status updates from upstairs. "They're almost done coding." "They just handed it to the test team." "Test team is going to sign off on it as soon as the turboencabulator finishes reticulating its splines." I barely understood what any of the statuses I was being given meant, but I understood the bottom line. I was going to be working late.

Not too too late. Test team was going to finish their testing by 5 PM, I was going to deploy the fix, verify it, let everybody shake hands on a job well done, send out an all-clear e-mail, and be on my way. Twenty minutes. Tops.

So at about 5 o'clock, people I'd never seen before or since started to file into my office one or two at a time. They said hi, introduced themselves as so-and-so from the Green Team upstairs, and then parked themselves somewhere to await the final blessèd bits to be delivered into my waiting embrace.

By 5:30, I'd say there were a dozen people crammed into my office. They were sitting on my guest chair, sitting on my love seat, resting on its arms, milling in the corners, and hanging out anywhere there was a spot to stand. I had so many people interested in this hotfix that they were standing outside in the hallway because my office was too crowded to get anyone else in it.

Then the word was passed that the fix was ready and some senior program manager with a look that just screamed "all business, all the time" came in, said hi, and sat next to me to get started. The sea of devs parted to let him get access to lowly me, the guy who was going to do the typing.

He had some notes and he handed them to me. I looked over what I needed to do and it didn't seem that crazy. Copy a directory from A to B, open a file, confirm it says C in it, restart a service. Basic stuff, and the guys upstairs had spent all damn day making sure this was going to fix everything automagically and the day would be saved.

So with all eyes on me, I got to work. Go to A, find the directory, log into B, credentials worked (whew!), make a backup on B (brownie points for that), copy the directory to a safe place on B, move it into position, open the file, do the thing. I made sure at each point that I was moving just slowly enough that if I started to go off the rails one of the more than 24 eyes would catch my mistake and stop me before I broke something. I confirmed with the crowd. "OK, directory is Foo.Bar.Baz-7." If I'd said "Foo.Bar.Baz-8", they'd correct me. Things were going smoothly, even though the tension in the air was palpable.

The temperature in my office was climbing just from body heat and hot breath. It was a little after 6. Everything was going according to plan. I was going to go home soon.

The old files had been backed up, the new files had been put into place. It was time to restart the service and split. I restarted it, confirmed it was up and running, no errors, great! I walked through the verification steps.

It wasn't working.

The program was running. It hadn't frozen, it hadn't crashed. But the hotfix was not actually fixing anything. There were sighs and groans, and the tension in the room ticked up another few notches.

Someone stepped out of the office. I'm not sure I ever saw them again. Maybe they went to pack up their belongings and leave the company.

The PM sitting next to me didn't lose the weird intensity he'd had when he arrived. It didn't even waver. The troubleshooting began immediately, and from a dozen people at once. Backseat driving pales in comparison to backseat debugging, especially when so many people are doing it simultaneously that they could go form a soccer team. They even made their condescending opinions known about my choice of text editor: WordPad was met with snide disapproval.

"Open this log file," said one. "Check the Event Viewer," said another, and we were off, trying to understand what this program was doing and why it wasn't doing what we wanted it to do. I say "we", but I honestly didn't know what this program was supposed to do in the first place and this wasn't my team. But I dutifully opened all the log files and checked all the tire pressures and patted my head and rubbed my stomach at the same time.

Around 7 PM or so, with tabs opened and events viewed and logs grepped for arcane bits of data only God could know the meaning of, I finally asked what the problem was. What was so terrible in the world of the Green Team that they'd dropped everything that day to write a fix for it and were now settling into hours of unpaid overtime to debug all its heaping helpings of failure in production over the shoulders of a borrowed Blue Team involunteer?

"Files are arriving on Machine One," the PM said, "and they're getting stuck there. They need to get dequeued and sent to Machine Two."

Without thinking, I asked, "Why not use robocopy?"

Tongues were clucked. Maybe someone laughed, the smug kind of chuckle parents don't even try to hide when their kid asks in all seriousness if clouds are made of candy. Their meaning was clear: robocopy isn't an enterprise-grade solution, you silly boy! You are looking at a complex pipeline of data management services! A dinky built-in file utility that our test team hasn't signed off on is not a proper solution! The development team has spent all day carefully crafting expert algorithms to handle this issue and they've got a proven pipeline of tools to demonstrably create and test reliable software...

...idiot.

Hours passed. The devs couldn't figure out what was going wrong, the testers couldn't figure out why this all of a sudden wasn't working when they swore up and down it had in their lab. There was muttering and bickering amongst the throng, but it didn't result in any enlightenment. People started filing out. It was late. They had families to get to, cold dinners to reheat, and really, they were only there to see their hotfix work so they could get cake & ice cream from their management. Once it was obvious their update was a complete wreck and the cake party would be cancelled, it was time for them to quietly exit my office. The teeming mass slowly dwindled to just a few people over the next few hours as plans B, C, and D were thought up, attempted, and rejected.

We were heading into hour four. This isn't counting me staying late just waiting for the dev and test teams to finish coding up their stupendous little notfix. This was just four hours of pure deployment fail.

By 10 o'clock, only the program manager, One Last Dev, and myself remained. We were tired, we were hungry, and we were out of ideas. So I mentioned it again.

"If you're just trying to move files from one machine to the other, and that's it? Right? Why. Not. Just. Use. Robocopy?"

I can't say I'd persuaded him. By this point in the evening he was already sunk, treading water, and had lost a lot of energy. I wasn't influencing him with a brilliant display of lateral thought. I was giving him a straw to grasp. He relented, and asked what that would entail.

I ran "robocopy /?" and pointed out that it was designed to copy files and it can do it between machines with UNC paths. And you could remove the copied files from the source machine if you wanted. And you could just have it run every few minutes. Which, it turns out, was exactly what their holy Foo.Bar.Baz-7 software was designed to do, poorly.

It took me all of about a minute to write up "fix.bat":

@echo on
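rem /MOV deletes each file from the source once it has been copied
rem /MOT:5 keeps robocopy running, re-copying whenever the source changes (checking every 5 minutes)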
robocopy.exe D:\Data\Files \\other-machine\D$\Data\New-files /MOV /MOT:5

The PM was shocked. Not in an "I'm impressed" way. Shocked like he'd just spent twelve hours getting screamed at that everything was on fire and now this Blue Team bozo was showing him a two-line batch script that was going to save the day. It was janky. It wasn't tested. It didn't comply with the Big Book of Secure Coding Practices. It didn't have a code spec document that had been reviewed and approved by the code committee. It hadn't been checked into source control or pair programmed or anything. And it was written by some guy in Ops. Ugh!

But it worked.

It worked well enough that we could go home. The files were moving from one machine to the other and that's all we needed to get things working again. The devs could come back in the morning and start rewriting their algorithms and they could try their homebrew solution again when John returned.

The PM could write his bosses and, I expect, understate the waste of a day's and night's work on a bogus fix that didn't do anything. He could report that the live-site issue was mitigated and that additional cleanup work remains to be done to ensure a permanent resolution to the problem.

He thanked me. He and the other guy left my office. I packed up, went home, and microwaved something for dinner.

Don't overthink things.

2019-05-19

ZFS Native Encryption on Linux: It's Here Now! Kinda!

ZFS is getting native encryption. I'm pumped. It's a game changer.

ZFS almost had native encryption back when Sun Microsystems still existed, but the implementing team was apparently not packed with A-listers and the feature was scrapped. Sun died out before they could take a second swing at it, so over the years we've had to settle for compromises like GELI and LUKS to handle our full disk encryption needs. And it was good.

But it's going to get better.

This is a brave new world and we are all going to have to explore it.

The framework for native encryption was added back in the v0.7.x branch, but full-on, real-deal encryption is coming in v0.8.0 which ships Real Soon Now. The ZFS on Linux project is up to 0.8.0 release candidate 5, so we're really, really close to ZFS encryption Nirvana.

But we're not there yet.

I was lamenting this fact last month while sitting through a ZFS talk that Allan Jude was giving at LinuxFest Northwest. I was thinking that even if ZFS on Linux ships native encryption tomorrow, becoming the first open source ZFS implementation to have it, it will still be months before that magic release gets adopted into Debian/Ubuntu/Mint and becomes an out-of-the-box feature I can use. And I became depressed.

So one of the things that has constantly vexed me about using ZFS on Debian-based distros is that, other than Ubuntu, ZFS needs to be added as a DKMS module and that makes upgrades a delicate matter to approach. I've covered how to handle this in the past.

Impatiently, I set out to look into how to get the raw ZFS on Linux source to compile, without the middle man. And, surprisingly, it's doable. It's even more fragile than using DKMS, but it works. So while at some point the major Linux distros and FreeBSD will have native ZFS encryption that you can use right away when you set out to install your OS, you can get started, carefully, today. And that's what I'm going to cover in this howto.

The quick version first: this is NOT going to cover how to deploy native encryption in a ZFS-on-root Linux setup. We'll get there, in time. Every baby learns to crawl before it learns to walk.

Instead, we are going to end up creating a portable set of .deb package files we can install on any suitable machine, provided it matches our exact kernel version and architecture. These instructions are largely taken from https://www.klaus-hartnegg.de/gpo/2017-11-29-ZFS-in-Devuan.html, but they're slightly out of date, so consider this an addendum to that page.

  1. First things first, we download an install ISO of Devuan Linux and boot it. I use the file "devuan_ascii_2.0.0_amd64_minimal-live.iso" either copied to a USB drive with Rufus or Etcher, or attached to a virtual DVD drive when creating a VM. We are going to install a regular ol' Devuan distro on a regular uncool file system like ext2. This is going to be our ZFS source compiler machine, and it can easily be done in a VM so long as you give it at least 4GB of RAM. I also recommend 4 cores on your CPU, but you can get by with fewer if you're patient. This howto assumes an amd64 architecture. Your actual mileage may vary.

  2. Install Devuan. Basically this involves: partitioning your disk, formatting the new partition, mounting the new partition, running debootstrap in order to download and extract core OS packages to the new partition, mounting several mountpoints (/dev, /dev/pts, /proc, and /sys) inside the new partition, setting a root password, setting timezone and locale data, configuring the network, and setting the bootloader. This seems like a lot, but we've done this a bunch of times on this blog and it should start to feel pretty routine by now.

    If you really want a rundown of what to run in the Devuan minimal ISO to install Devuan, it'd look something like this:

    DEVICE=/dev/sda
    PARTITIONNUMBER=1
    TARGET=/mnt
    ARCH=amd64
    BRANCH=ascii
    MIRROR=https://pkgmaster.devuan.org/merged
    PKGS=console-setup,kbd,locales,tmux,openssh-client
    KEYRINGDIR=/usr/share/keyrings
    
    dd if=/dev/zero of=${DEVICE} bs=1M count=2
    /sbin/parted --script --align opt ${DEVICE} mklabel msdos
    /sbin/parted --script --align opt ${DEVICE} mkpart pri 1MiB 100%
    /sbin/parted --script --align opt ${DEVICE} set ${PARTITIONNUMBER} boot on
    
    mkfs.ext2 ${DEVICE}${PARTITIONNUMBER}
    mount ${DEVICE}${PARTITIONNUMBER} ${TARGET}
    
    dhclient eth0
    
    /usr/sbin/debootstrap \
      --arch=${ARCH} \
      --include=${PKGS} \
      ${BRANCH} \
      ${TARGET} \
      ${MIRROR}
    
    cp -v -r -p ${KEYRINGDIR} ${TARGET}/usr/share/
    
    mkdir -p ${TARGET}/etc/apt/sources.list.d
    mkdir -p ${TARGET}/usr/share/keymaps
    mkdir -p ${TARGET}/etc/network
    
    # Add eth0 DHCP config to /etc/network/interfaces
    cp -p ${TARGET}/etc/network/interfaces ${TARGET}/etc/network/interfaces.bak
    cp -p ${TARGET}/etc/network/interfaces ${TARGET}/etc/network/interfaces.new
    echo "auto eth0" >> ${TARGET}/etc/network/interfaces.new
    echo "iface eth0 inet dhcp" >> ${TARGET}/etc/network/interfaces.new
    chmod 0644 ${TARGET}/etc/network/interfaces.new
    mv -v -f ${TARGET}/etc/network/interfaces.new ${TARGET}/etc/network/interfaces
    
    cp -p ${TARGET}/etc/apt/sources.list ${TARGET}/etc/apt/ # this will get ascii-security too
    
    install -m0644 /etc/hostname ${TARGET}/etc/
    echo 'en_US.UTF-8 UTF-8' > ${TARGET}/etc/locale.gen
    ln -sf /proc/self/mounts ${TARGET}/etc/mtab
    
    cat /etc/resolv.conf > ${TARGET}/etc/resolv.conf.new
    chmod 0644 ${TARGET}/etc/resolv.conf.new
    mv -v -f ${TARGET}/etc/resolv.conf.new ${TARGET}/etc/resolv.conf
    
    for i in /dev /dev/pts /proc /sys
    do
      echo -n "mount $i..."
      mount -B $i ${TARGET}$i
      echo 'done!'
    done
    
    chroot /mnt env DEBIAN_FRONTEND=noninteractive dpkg-reconfigure locales
    chroot /mnt env DEBIAN_FRONTEND=noninteractive dpkg-reconfigure tzdata
    chroot /mnt apt-get update
    chroot /mnt apt-get install -y linux-image-${ARCH}
    chroot /mnt env DEBIAN_FRONTEND=noninteractive apt-get install -y grub-pc
    chroot /mnt passwd -u root
    chroot /mnt passwd root
    < enter a root password >
    chroot /mnt update-initramfs -u -k all
    chroot /mnt update-grub
    chroot /mnt grub-install ${DEVICE}
    
    for i in sys proc dev/pts dev
    do
      umount ${TARGET}/$i
    done
    
    halt -p

  3. Remove the live CD and restart the machine. This should give you a working Devuan install, albeit a pretty sparse one. In order to compile the ZFS on Linux source on it, you'll need to login and begin the real setup:

    apt-get update
    apt-get install -y \
      alien autoconf build-essential dirmngr fakeroot gawk \
      gnupg2 ksh libattr1-dev libblkid-dev libselinux1-dev \
      libssl-dev libtool libudev-dev linux-headers-$(uname -r) \
      lsscsi parted python3 python3-dev python3-pip \
      uuid-dev zlib1g-dev
    
    pip3 install setuptools
    pip3 install cffi

  4. Don't compile things as root. Make a new user account to use for the rest of the procedure:

    groupadd source
    useradd -g source -d /home/source -s /bin/bash -m source
    su -l source

    Download ZFS v0.8.0. (Hard to do, because at the time of this writing it doesn't exist yet.) I'm using the rc5 release:

    https://github.com/zfsonlinux/zfs/releases/download/zfs-0.8.0-rc5/zfs-0.8.0-rc5.tar.gz

    https://github.com/zfsonlinux/zfs/releases/download/zfs-0.8.0-rc5/zfs-0.8.0-rc5.tar.gz.asc

    These will soon be obsoleted, but ya gotta start somewhere.
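
    One hedged way to fetch both files from inside the minimal install (wget isn't in the package list above, so install it as root first, or scp the files over with the openssh-client that is):

    wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.8.0-rc5/zfs-0.8.0-rc5.tar.gz
    wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.8.0-rc5/zfs-0.8.0-rc5.tar.gz.asc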

  5. Verify the tar.gz you downloaded has been signed by the project's signing key.

    /usr/bin/gpg2 --verbose --keyserver keys.gnupg.net --recv-key 0AB9E991C6AF658B
    /usr/bin/gpg2 --verbose --verify zfs-0.8.0-rc5.tar.gz.asc

  6. If the signature is good, extract the tarball.

    SHORTVERSION=0.8.0
    LONGVERSION=${SHORTVERSION}-rc5
    
    gzip -d < ./zfs-${LONGVERSION}.tar.gz | tar -xf -
    cd ~source/zfs-${SHORTVERSION}
    sh ./autogen.sh
    ./configure
    make -j $(nproc)
    make deb
    

  7. This will create a number of .deb package files that you should relocate to a safe location, like another machine or a USB thumb drive, or both. To install ZFS on this machine, or any other, install the .deb files manually with dpkg as root:

    ARCH=amd64
    KERNELVERSION=$(uname -r)
    SHORTVERSION=0.8.0
    LONGVERSION=${SHORTVERSION}-0
    
    cd ~source/zfs-${SHORTVERSION}
    dpkg -i zfs_${LONGVERSION}_${ARCH}.deb
    dpkg -i kmod-zfs-${KERNELVERSION}_${LONGVERSION}_${ARCH}.deb
    dpkg -i libnvpair1_${LONGVERSION}_${ARCH}.deb
    dpkg -i libuutil1_${LONGVERSION}_${ARCH}.deb
    dpkg -i libzpool2_${LONGVERSION}_${ARCH}.deb
    dpkg -i libzfs2_${LONGVERSION}_${ARCH}.deb

  8. To use ZFS, make sure your kernel modules are loaded:

    modprobe zfs

From here, you can begin creating zpools and zfs datasets on this box, which may or may not be useful to you if you are hellbent on a ZFS-on-root setup, but this is enough for you to start working with the encryption feature of ZFS v0.8.0 as a learning tool.
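
And to actually play with the headline feature, here's a minimal sketch of creating an encrypted dataset. It assumes you have a spare, blank disk at /dev/sdb to sacrifice, and the pool and dataset names are just placeholders:

    # create a throwaway pool on the spare disk
    zpool create testpool /dev/sdb
    # create a natively encrypted dataset; zfs will prompt for a passphrase
    zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt testpool/secret
    # after an export/import or a reboot, load the key again before mounting
    zfs load-key testpool/secret
    zfs mount testpool/secret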

Note: I only very briefly touched on ascii-security. This apt branch is useful for installing old versions of the kernel and kernel headers. The ascii install ISO uses a slightly outdated kernel version, 4.9.0-6, so if you install that exact kernel on your source compiler machine, you can re-use the .deb files it creates in conjunction with the install ISO to create a ZFS-on-root Devuan machine, in much the same manner as described in the first steps here. In other words, making this change:

- chroot /mnt apt-get install -y linux-image-${ARCH}
+ chroot /mnt apt-get install -y linux-image-$(uname -r)

during the initial setup of your compiling machine will put the same kernel on that machine as the ascii ISO live CD uses. Thus the .deb files you build on it can be installed into the live CD environment with dpkg to create a zpool. You can even do this on the same hardware once you've built the .deb files, at the cost of having to install Devuan twice and of repeating the process whenever you choose to change the kernel.

Note also that while the ZFS compiling process creates a zfs-initramfs .deb, we don't install it. For a ZFS-on-root scenario, you'd want to make sure that it's included as well or else your machine will be unbootable.
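
If you do go that route, installing it should be just one more dpkg call, assuming the package follows the same naming pattern as the others (I haven't verified the exact filename):

    dpkg -i zfs-initramfs_${LONGVERSION}_${ARCH}.deb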

2019-05-10

Terraform is Terrible: Part 5

I spend a lot of time here extolling the benefits of lighting a candle instead of cursing the darkness. In that spirit, I wanted to end on a high note after a week spent pointing out how bad and broken Terraform is when you try to use it on Azure.

But I have to be honest with you: I'm not sure I know how to light this candle.

It's not that I didn't try. On a typical Tuesday I sat down to play with Terraform and lost four hours catching just a speck, a sliver, of its vast and dimensionless horrors.

The next day I still wasn't OK with what I'd just experienced, so I wrote four days' worth of blog posts in a single sitting.



The words of Spider Jerusalem were echoing in my mind as I hammered out this week's Monday through Thursday entries: "Home entertainment system: give me fire." But I didn't have an ending. The day after my writing binge, I went back and relearned Terraform.

Seriously.

OpenBSD 6.5 came out just a week or two ago and I was excited to use Terraform to manage my OpenBSD Azure deployments, so I built a new 6.5 image and then I used Terraform to deploy it to Azure anyway. I paved over every pothole, hard-coded every vnet, and wrote up an epic set of main.tf files to put my OpenBSD VM out there. And it worked. Kinda.

There were still more pitfalls waiting for me to fall into and if, figuratively speaking, I'd been trying to run a marathon instead of a 100-meter dash, I imagine I'd have fallen into many more. But my scope was narrow and by Saturday I'd built a library of main.tfs that accurately (and verbosely) describes my infrastructure, and I used it. I'm running OpenBSD on Azure right now thanks to Terraform. And it was a miserable process.

But that misery helped me to think about what I'd improve. I had to walk the mile, or at least 100 meters, in a Terraform user's shoes before I could finally put the last nail in this piece of shit software's coffin. And I couldn't do it.

Terraform is such a nice idea. It's multi-platform. It's got an installer for OpenBSD for Christ's sake. I really want to see this disaster of a project succeed.

So I went digging and I found the AzureRM availability set file that has the garbage defaults. It'd be easy to file a Github issue to get those defaults changed, so I whipped up another main.tf to reproduce the exact error message, so I could include it in my Github issue for both posterity and search engines to find later.

And it deployed just fine. >:(

This is when I learned that the availability set resource's defaults of unmanaged + 3 fault domains are just fine... unless you want to use the set to deploy a VM with Azure managed disks, which is the recommended config now. Huh? Yeah. The availability set defaults aren't actually wrong, they're just outdated. You need to try to deploy a VM, too, to hit the error:

Error: Error applying plan:

1 error(s) occurred:

* azurerm_availability_set.myavailset: 1 error(s) occurred:

* azurerm_availability_set.myavailset: compute.AvailabilitySetsClient#CreateOrUpdate:
  Failure responding to request: StatusCode=400 --
  Original Error: autorest/azure: Service returned an error.
  Status=400 Code="InvalidParameter"
  Message="The specified fault domain count 3 must fall in the range 1 to 2."
  Target="platformFaultDomainCount"

(Note that the error occurs in the azurerm_availability_set resource, but the error is that it's incompatible with your azurerm_virtual_machine resource, which isn't mentioned in the error. This garbage is how software support contracts get sold.)

So what's the fix here? I'm still not sure. Terraform would need to perform some kind of check with the cloud provider frontend to see if your settings are valid, but Terraform doesn't actually do that until you run terraform apply, and by then it's too late to ask for a do-over. You'd think that Terraform would figure this out during its planning stage, but planning is just that: Terraform puts together an idea of what it needs to do to make your main.tf's wishes come true. It doesn't actually deal with all the logistics until you pull the trigger and apply the plan, at which point Terraform is happy to let your plans get shredded to bits faster than an army ground charge in World War I.

(Side note: I watched the movie Regeneration once twenty years ago and I really wish it would come out on DVD or BluRay because it's an excellent film that should show up one day on the Friendly Fire podcast. I'd rate it five armbands.)

Then in the same repo I found the data source for AzureRM virtual networks. This is the place that needs a filter so you can let Terraform get all your vnets and then return the one you need based on some user-defined criteria. Then I cross-referenced it against the AWS repo which has filters... and I couldn't make heads or tails of it.

Because Terraform is written in Go, it's pretty versatile with respect to being able to run on multiple different platforms. And because it's written in Go, I can't write patches for it, because I can neither read nor write Go.

Yet.

So my journey begins. I'm putting Rust aside for now and I'm going to teach myself Go. At least enough to be able to read what this filtration code is doing and maybe, somehow, write something similar for Azure. It's not going to be quick, or easy, or fun.

But I'm sick of this darkness.

2019-05-09

Terraform is Terrible: Part 4

If I'd just stopped here I'd probably have been fine. I mean it. Terraform is a janky, buggy, badly-documented, and brittle piece of software, but I'd finally fought it long enough to get a working VM out of it. I was going to be able to sleep that night. And in retrospect, maybe I should have quit on a high note. I'd gotten enough material out of my suffering for a blog post or three and, hopefully, other people wouldn't step on the same landmines I had while trying to get Terraform working.

Critics might say, "Wait a sec, all your problems are with Azure/modules/Windows resources. Terraform works great on AWS!" and that may be so. I don't know. I use Azure for my cloud compute, and nowhere on the Terraform website does it claim that this isn't an option, or that trying to use Azure with Terraform is an open invitation to waste your evening wrestling with errors from within nested objects you don't see and can't control, all the while lamenting all your life's choices that brought you to test out this piece of trash program.

If Azure isn't a functional, supported platform for their tool, they shouldn't pretend like it is. Remove it, call it an alpha test, or at the very least put a big warning at the top and bottom of every page in big red letters: "Azure is NOT a 1st-Party Provider for Terraform!!! Use at your own risk!!!"

At least then I'd have been duly warned. Until that happens, if you want to run an Azure service with Terraform, you are but lambs for the slaughter. I really can't overstate this. Terraform is broken. Its tools do not work as described.

Back to my woes. I'd fought with Terraform. I'd fought with Terraform modules. I'd fought with bad Terraform deployment defaults. I finally had a working Terraform config. terraform apply and terraform destroy worked as expected now.

And as I said, if I had stopped there, I'd probably have been fine. But I didn't, because a single VM sitting in its own subnet is not what I wanted, because that's not what most cloud services look like. It had taken me 93 minutes to get that initial VM deployed. I felt bolstered to try to build a very simple, but very usable, second deployment from there. There was just one little catch.

I already had a cloud service.

Most of us do. You're not typically building a brand new service entirely from scratch every time you want to deploy something to the cloud. You have existing machines, old networking configs, and all sorts of legacy devices laying around doing important odds and ends and that's how your business runs. If you were using hand-tuned scripts to manage everything yesterday, I am sure as hell not going to tell you to start migrating your resources over to Terraform tomorrow.

Because you can't.

Terraform is so fragile that moving your existing cloud infrastructure over to Terraform configs, at a pace that fits you and your service's demands, is a Herculean labor. In theory, Terraform supports "data" sources. Like a resource, a data source is a Terraform object that wraps around an existing cloud component so you can leverage it in your config. In theory, if you already have a VM running in Azure, you can tell Terraform about it and then create something new using Terraform's knowledge of that existing VM. In theory, it might look like this:

data "azurerm_virtual_machine" "existingwebserver" {
  name                = "oldwebservervm1"
  resource_group_name = "webserver-rg"
}

resource "azurerm_virtual_machine" "newwebserver" {
  name                  = "newwebservervm1"
  location              = "${data.azurerm_virtual_machine.existingwebserver.location}"
  resource_group_name   = "${data.azurerm_virtual_machine.existingwebserver.resource_group_name}"
  ...[more config here]...
}

The idea here is that we don't explicitly define a location or a resource group for our new web server VM, but we define a data source object with the name and resource group name for something we know already exists. We can reference what we found when we looked for it, then set the location and the resource group name to match the values of the "existingwebserver" Terraform object. Sounds great, right?

You've probably already noticed that we really did explicitly define a resource group name, in the data block, because we had to. Azure data sources are much, much simpler and less versatile than AWS data sources, and they lack an important feature called filtering. With filtering, you wouldn't necessarily need to define both a name and a resource group; you could query your cloud service and pick the right VM based on, say, just its name. Then whatever its resource group happens to be wouldn't need to be hard-coded in the configuration file. As far as Azure networking data sources are concerned, you have two required fields, name and resource group name, and that's it. If you don't already know what you're looking for, Terraform sure ain't gonna help you find it.

This is a big deal.

It matters because I have a couple of different virtual networks living in Azure, and their names are defined based on their geographical location. So I might have "West-US-2-VNET-133" and "Central-US-VNET-2496", and if I put a VM in one of those locations, I want it to automagically look up the name of the right vnet to use based on one very simple criterion: find the vnet in the resource group "InternalNet" that lives in the same geographical location as the new VM.

That's it. That's all I want.

Because I cannot query Azure with a data source, I need to give both a vnet name and a vnet resource group and, frankly, I only know one of those off the top of my head. The other is location-dependent and this platform-agnostic tool doesn't have a built-in mechanism for even grabbing a list through which I could iterate or pattern match or... or anything.

What a fool I have been.

I read. I researched. I found external data sources. External data sources look useful and the documentation around them is wrapped in multiple warnings:

Warning: This mechanism is provided as an "escape hatch" for exceptional situations where a first-class Terraform provider is not more appropriate. Its capabilities are limited in comparison to a true data source[.]

That's right. Querying your existing infrastructure ("There's something out there in my cloud and I know a name or a resource group, but not both, can you please find it?") is what Terraform considers an "exceptional situation". An external data source is really just an open-ended exec() mechanism that lets you plug in a script or another piece of code that can go and actually talk to your cloud and get answers to the questions Terraform doesn't know how to ask.
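
To give you a flavor of what that escape hatch involves: the external program reads a JSON object of query arguments on stdin and must print a flat JSON object of strings on stdout. Here's a rough sketch of such a helper, assuming the Azure CLI and jq are installed and with the resource group hard-coded. It's illustrative only, not something I actually shipped:

#!/bin/sh
# Terraform passes its query arguments as a JSON object on stdin;
# pull out the hypothetical "location" key it sends us.
eval "$(jq -r '@sh "LOCATION=\(.location)"')"

# Ask Azure for the name of the one vnet in resource group InternalNet in that location.
VNET_NAME=$(az network vnet list --resource-group InternalNet \
  --query "[?location=='${LOCATION}'].name | [0]" --output tsv)

# Hand the answer back to Terraform as a JSON object of string values.
jq -n --arg name "${VNET_NAME}" '{"name": $name}'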

In other words I spent just a hair under 4 hours getting kicked in the teeth by Terraform, trying to replace my custom scripts with something better, only to discover that the best way to interact with my Azure resources in Terraform... is with a custom script.

My current scripts already do this, without difficulty, and without Terraform:

Get-AzureRmVirtualNetwork -ResourceGroupName "InternalNet" | ? { $_.Location -eq $vm.Location }

That was it. That was the last straw. I deleted the resource group (through the Azure portal, if you were wondering). I shut my machine down. I stood up and I took a walk. I needed to clear my head. I needed to think about what I'd just experienced. I needed a drink.

I'm not looking forward to learning libcloud on Python. But what choice do I have?

Next time: Where do we go from here? Bugger this.

2019-05-08

Terraform is Terrible: Part 3

After I took a look around the Github issues for the (as I would soon learn) utterly broken AzureRM Terraform modules, I found a couple of proposed solutions to my compute problems. One suggested using the latest version of the module by fetching it straight from Github:

module "compute" {
-   source            = "Azure/compute/azurerm"
-   version           = "1.2.0"
+   source            = "github.com/Azure/terraform-azurerm-compute.git"

This is a neat trick for distributing Terraform modules, so that's what I did. Which meant I needed to install Git on my machine and then run Terraform from inside a Git bash window instead of a command prompt, but whatever! I tried it.

It still didn't work.

Another comment in a different open Github issue said to add is_windows_image = true so I did that, too, and this finally convinced Terraform that my vm_os_simple = "WindowsServer" meant I wanted a Windows server. I ran "terraform init" and "terraform plan". Again. And again, my deployment failed.

The error this time was about an invalid storage account type for my region.

But I hadn't defined a storage account type. The module's supposed to do that for me.

The online examples of using the AzureRM compute module didn't mention storage account types. At all. I'd tried to create a VM in a region, and the module had defined the storage account type for the VM's OS disk as "Premium_LRS", but the region I'd chosen didn't offer "Premium_LRS". And I didn't see a way to change it. So like a battered spouse who's convinced it's something they did wrong, I went back to reading the now highly-dubious Terraform docs and found nothing about overriding module defaults, and I couldn't find an example of how to get a full printout of the secret nested attributes inside a compute module object so I could define my own values. I still don't know what to put here:

module "compute" {
  source            = "github.com/Azure/terraform-azurerm-compute.git"
  location          = "${var.location}"
  admin_username    = "plankton"
  admin_password    = "Password1234!"
  vm_os_simple      = "WindowsServer"
  ?????????????     = "Standard_LRS"
}

So I yanked out the modules. All of them. There was an AzureRM "compute" module and a "network" module and in my opinion they both need to be taken out back and shot. I rewrote my dinky little main.tf config file with full resources this time:

provider "azurerm" {
  version         = "=1.27.0"
  subscription_id = "bba8a111-a014-4dbf-aa90-9692362fd971"
}

resource "azurerm_resource_group" "mygroup" {
  name     = "${var.resource_prefix}-test"
  location = "${var.location}"
}

resource "azurerm_availability_set" "myavailset" {
  name                = "${var.resource_prefix}-availset"
  location            = "${var.location}"
  resource_group_name = "${azurerm_resource_group.mygroup.name}"
  managed             = true # poorly-documented but required argument
}

resource "azurerm_virtual_network" "myvnet" {
  name                = "${var.resource_prefix}-vnet"
  location            = "${azurerm_resource_group.mygroup.location}"
  resource_group_name = "${azurerm_resource_group.mygroup.name}"
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "mysubnet" {
  name                 = "Subnet-1"
  resource_group_name  = "${azurerm_resource_group.mygroup.name}"
  virtual_network_name = "${azurerm_virtual_network.myvnet.name}"
  address_prefix       = "10.0.1.0/24"
}

resource "azurerm_network_interface" "myvnetif" {
  name                = "${var.resource_prefix}-nic"
  location            = "${azurerm_resource_group.mygroup.location}"
  resource_group_name = "${azurerm_resource_group.mygroup.name}"

  ip_configuration {
    name                          = "ipconfig-1"
    subnet_id                     = "${azurerm_subnet.mysubnet.id}"
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_virtual_machine" "myvm" {
  name                  = "${var.resource_prefix}-vm1"
  location              = "${azurerm_resource_group.mygroup.location}"
  resource_group_name   = "${azurerm_resource_group.mygroup.name}"
  availability_set_id   = "${azurerm_availability_set.myavailset.id}"

  vm_size               = "Standard_D2_v2"

  network_interface_ids = ["${azurerm_network_interface.myvnetif.id}"]

  # pull preferred values here from the output of "az vm image list"
  storage_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2019-Datacenter"
    version   = "latest"
  }

  storage_os_disk {
    name              = "myosdisk1"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Standard_LRS"
  }

  os_profile {
    computer_name  = "${var.resource_prefix}-vm1"
    admin_username = "${var.admin_username}"
    admin_password = "${var.admin_password}"
  }

  os_profile_windows_config {
    provision_vm_agent = false
  }

  boot_diagnostics {
    enabled     = false
    storage_uri = "" # required, even if enabled is false
  }

  delete_os_disk_on_termination    = true
  delete_data_disks_on_termination = true
}

This creates a longer file that's harder to read, but at least it's not totally broken. Right? (Spoilers: It is.)

I'm going to skip over how I had to figure out that an Azure availability set resource requires you to add the line managed = true or else your deployment will fail. I'm going to skip over the fact that boot_diagnostics.storage_uri needs to be set even when boot_diagnostics.enabled is false.

And yet even this deployment failed, too. There's another unusable default value in the availability set resource that defines the number of fault domains as 3. For some reason, Azure only permits this value to be in the range 1 to 2, inclusive. (Don't ask me why that's the range. The range is different depending on whether the availability set is managed or unmanaged, but which one you can use depends on whether your VM uses managed disks, and Terraform doesn't seem to make this connection when planning out your deployment.)

Now well and properly pissed off, I deleted the availability set resource and the VM's mention of it. I like availability sets and I think more people should use them, but I'd stopped caring about good service hygiene and I just wanted something, anything, to work. Contempt and frustration are both bad signs to see in your management tools.

- resource "azurerm_availability_set" "myavailset" {
-   name                = "${var.resource_prefix}-availset"
-   location            = "${var.location}"
-   resource_group_name = "${azurerm_resource_group.mygroup.name}"
-   managed             = true # poorly-documented but required argument
- }

resource "azurerm_virtual_machine" "myvm" {
  name                  = "${var.resource_prefix}-vm1"
  location              = "${azurerm_resource_group.mygroup.location}"
  resource_group_name   = "${azurerm_resource_group.mygroup.name}"
- availability_set_id   = "${azurerm_availability_set.myavailset.id}"

  vm_size               = "Standard_D2_v2"

I finally, finally, ran the magic commands:

rm -rf .terraform *.tfstate # Always do this, I don't care what the docs say
terraform init
terraform plan -out my.plan
terraform apply my.plan

And it worked! I had a VM! I checked the clock. It had been 1 hour and 33 minutes from my first attempt to run main.tf to a successful deployment of something, anything, I could call a win. Those 93 minutes of my life I'll never get back didn't include installing Terraform or Azure CLI. It was just copying and pasting config examples from Github and the Terraform website, desperately trying to get the damn thing to work. I'd needed to install Git and refetch the modules a dozen times or so, but I had finally gotten something working. I was relieved. Everything was going to be OK because the first deployment is always the hardest, right?

What a fool I was.

Next time: A data source without the data is just a source... of suffering.