2020-05-01

Loopy (Part 2)

Back in my old SQL days, I'd discovered that ConEmu let me run a dozen or so command prompts simultaneously without (totally) losing my mind. That benefit came back to me here in a big way: I could open ConEmu, spawn a handful of tabs, and cleanly kick off concurrent deployments of, say, Red1, Red2, Blue1, Blue4, and Green5.

And so it went. For about a year, keeping all 1,220 plates spinning became my full-time job. And I really don't want to exaggerate the term "full-time" here. At a meeting, another ops guy and I (he isn't in the rest of this story because he was dead-set against ever using my code) were asked how much of our time was spent managing our respective pools. We both said "two weeks a month", and we weren't kidding.

It took about a year for me to witness all the weird and wacky things this pool was capable of doing. Every month, I would dutifully set up and reserve one new machine of each color on the Monday before the second Tuesday, in preparation for the release of that month's patches. That alone would buy me about half a day of time. On Tuesday morning I'd try to get all of the new machines patched and sysprepped so I could start the copying operation, which would usually run overnight if all went well.

This got old right around month number 2. But I kept doing it by necessity, since there was no alternative to managing this custom pool of VMs. After about a year, I'd gotten the process optimized down from fourteen days (evenings and weekends were useful, off-peak times to deploy, so I capitalized on them) to about seven days if nothing went wrong, but something always seemed to go wrong.

Over that year, I started writing more diffs for the deployment system. There was no programmatic mechanism to take a machine out of rotation or put one back in — the vendors got paid to do that — so I wrote a script to Enable-Machine or Disable-Machine as needed. Throw in a few more helper functions over time and I eventually had enough custom code to put it into its own PowerShell module, so that's what I did.
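For illustration, a module exposing those two functions might have looked something like the sketch below. The post doesn't say how machines were actually taken out of rotation, so the REST endpoint and `$PoolApiUri` variable here are invented stand-ins for whatever mechanism the vendors exposed:

```powershell
# Hypothetical sketch of the rotation helpers; $PoolApiUri and the
# /enable and /disable endpoints are invented for illustration.

function Disable-Machine {
    param(
        [Parameter(Mandatory)]
        [string]$Name
    )
    # Take the named machine out of rotation (hypothetical API call).
    Invoke-RestMethod -Method Post -Uri "$PoolApiUri/machines/$Name/disable"
}

function Enable-Machine {
    param(
        [Parameter(Mandatory)]
        [string]$Name
    )
    # Put the named machine back into rotation (hypothetical API call).
    Invoke-RestMethod -Method Post -Uri "$PoolApiUri/machines/$Name/enable"
}

# Saved as a .psm1, this is the seed of the module described above.
Export-ModuleMember -Function Enable-Machine, Disable-Machine
```

The `Verb-Noun` naming follows standard PowerShell convention, which is what makes helpers like these feel natural to collect into a module.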

The Patch Tuesday for October that year was the last straw for me. Even after all of my tweaks and improvements, I was still killing myself keeping the pool going, and I'd finally gained enough experience to be confident with the exact procedure needed to roll out changes... and the skill to know how and when to take tear-it-down-and-rebuild-it repair actions. Worse, the stint of evenings and weekends was wearing me down socially and emotionally, and earning me absolutely no praise from my boss. (In a private meeting between the two of us I was hoping he'd find an ounce of sympathy in his cold black heart when I told him "I'm working evenings and weekends to keep things going," and he just cocked his head to the side like a dog does and asked, "... Why?" It wasn't a moment of zen-like enlightenment for either of us. It was the moment I decided to stop checking my work e-mail at home.)

Experience told me that any given deployment in this pool was just a finite matrix of choices. Combine: machine color, resource group name, image name, and just one or two other deployment-time options. Certain credentials were also needed depending on the options, but those are not choices about the deployment; they are just commodities needed at certain points to proceed. I had copy-and-pastable code that could handle each individual permutation, but I'd never really taken the time to stitch them all together. I'd just put in my two weeks every month, sometimes three weeks, and when the pool was healthy and current again I'd try to get something else done at work with the rest of my month. Not this time.
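That "finite matrix of choices" is just a cross product of a few small sets, which PowerShell can enumerate directly. The specific values below (resource group and image names) are invented; the post only names the colors:

```powershell
# Hypothetical sketch: every valid deployment is one row in the cross
# product of a few small option sets. Values other than the colors are
# invented for illustration.
$colors         = 'Red', 'Blue', 'Green'
$resourceGroups = 'rg-east', 'rg-west'          # invented names
$images         = 'base-2019', 'base-2019-iis'  # invented names

$deployments = foreach ($color in $colors) {
    foreach ($rg in $resourceGroups) {
        foreach ($image in $images) {
            [pscustomobject]@{
                Color         = $color
                ResourceGroup = $rg
                Image         = $image
            }
        }
    }
}

$deployments.Count  # 3 x 2 x 2 = 12 permutations
```

With the matrix enumerated like this, stitching the copy-and-pastable pieces together becomes a loop over `$deployments` instead of a by-hand checklist.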

Next time: Software got me into this mess; can software get me out of it?
