2020-04-30

Loopy (Part 1)

A few years back, I found myself inexplicably in charge of about 1,220 Windows VMs.

I didn't ask to be in charge of them. It was a gradual thing, of course, building up over time. I was invited to a meeting at work and some folks I'd never seen before spoke about how their team was splitting up into two teams, and part of the old team was moving over there and part of the team was coming over here, and there was a service infrastructure that the old team relied on having. The other team was keeping all the hardware for themselves, so the new team was going to need the same infrastructure in place when they came over to our side of the fence after the split.

I kept getting invited to these meetings, so I kept going. And then around meeting four or five, one of them turned to me and asked when I'd be able to start cloning their workflow and have a copied infrastructure ready. And that's when I realized I was going to be in charge of 1,220 Windows VMs.

That army of VMs was split into about five different base images and those images were reconfigured and mixed-and-matched into eight or so different configurations. I didn't have to build this network from scratch, though. I was given All The Code and copies of functional base images, which was a big help to me. On day one I opened up the code repo to start reading it and... I didn't understand a damn thing about it.

By day two though, I'd found most of the key lines that screamed out "That Team Over There" and my first diff was changing those lines to "Our Team Right Here". Eventually I got things squared away, and those first few test VMs grew and grew, until I had about 1,220 of them and everyone was happy.

Except me.

In inheriting the responsibility of running a thousand machines, we also inherited the uptime requirements of keeping those machines humming along. The old team had a rotating troupe of vendors working 24/5 watching their pool of machines and manually taking the troublemakers out of rotation and rebuilding them as needed. (24/5 seems like an odd choice for support coverage. I'm sure it was a compromise between users wanting support and budget, plus actual usage metrics.) We were given responsibility for building a work-alike network for our new teammates, but we didn't get the vendors to manage it for us. They stayed on the old team.

So that responsibility fell on me.

Astute readers will note that I am not a team of multiple people. So one other guy and I started working our asses off to automate as much of the service's day-to-day maintenance as possible. And automate we did. We built a health check and the other guy figured out a brilliant way to trigger it without requiring extra overhead. (He also built a flagging service that would disable a VM if it was failing and mark it as "Redeploy" automatically. Eventually the redeployment service broke and we could never figure out why. Hence the rest of this article.)

With daily maintenance of the pool covered (Mostly covered. Mostly.), all I needed to worry about was the non-default behaviors of these machines.

Y'know. Simple stuff... like patching.

Patching a Windows host is, at times, an exercise in Kafkaesque persecution by shadowy and unidentifiable accusers. It's gotten better over the years, but it's never been a process I'd consider to be fun. Now multiply that by 1,220.

The deployment process for these machines was manual and demanded time and effort, what they'd call in today's terms "a high-touch operation". I cannot for one second criticize the developer who put this system together. His name was Barney. Barney was assured that a team of people would always be ready to run his tools, so he designed the whole system with the objective of "if you need something, find the right script to do it". A better objective might have been "make sure no one has to look for any scripts whatsoever", but that simply wasn't a constraint on the old team.

Even more to the point, Barney was a genius who'd discovered ways to get features working reliably in the cloud years before the cloud provider itself was able to offer comparable officially-supported functionality. Snapshotting a VM is a built-in option for everyone today; I was managing VM snapshots years before you could just click a button. I didn't figure out how to do that. Barney did. So Barney is OK in my book.

The deployment process was simple enough. When you want to rebuild a VM, you:

  • Manually stop the VM
  • Manually delete the VM
  • Manually delete the snapshot of the VM's base VHD
  • Manually delete the VM's base VHD file
  • Run the C:\Repo\Deploy-<Color>Machine.ps1 script with a bunch of custom arguments
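
For illustration only, the four deletion steps in that checklist can be chained into one function. This is a minimal sketch, not the actual tooling: every helper name below is a hypothetical stand-in for a real cloud call that isn't shown here, 'RED-0001' is a made-up VM name, and each helper only reports what it would do. The final deploy step is left out.

```powershell
# Hypothetical stand-ins for the real cloud calls, which are not shown in this post.
# Each one only reports what it would do, so the sketch is safe to run anywhere.
function Stop-PoolVm        { param([string]$Name) "stop VM $Name" }
function Remove-PoolVm      { param([string]$Name) "delete VM $Name" }
function Remove-VhdSnapshot { param([string]$Name) "delete snapshot of $Name base VHD" }
function Remove-BaseVhd     { param([string]$Name) "delete base VHD of $Name" }

# Hypothetical wrapper: runs the teardown in the same order as the manual checklist.
function Remove-PoolMachine {
    param([Parameter(Mandatory)][string]$Name)
    Stop-PoolVm        -Name $Name
    Remove-PoolVm      -Name $Name
    Remove-VhdSnapshot -Name $Name
    Remove-BaseVhd     -Name $Name
}

# 'RED-0001' is an invented machine name used only for this example.
$steps = Remove-PoolMachine -Name 'RED-0001'
$steps
```

Keeping the order in one place is the whole point: nobody has to remember which of the four deletions comes first.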

And so that's what I did, by hand. That lasted about two days before I wrote a "Remove-Machine" script that would do all the deleting for me. But deploying was still a thorny and complex process I won't describe further here. I was able to take the "Deploy-RedMachine.ps1" script and the "Deploy-BlueMachine.ps1" script and merge them all into a single "Deploy-Machine.ps1" script and control the various settings by crafting parameters. A typical deployment was still a behemoth of arguments:

$deploy_args = @{
  'authToken'          = 'TOKEN_GOES_HERE';
  'adminUserName'      = 'admin';
  'adminPassword'      = 'ADMIN_PASSWORD_GOES_HERE';
  'DomainUserName'     = 'DOM\user';
  'DomainPassword'     = 'DOMAIN_PASSWORD_GOES_HERE';
  'ServiceUserName'    = 'serviceUser';
  'ServicePassword'    = 'SERVICE_PASSWORD_GOES_HERE';
  'ImageName'          = 'vm_image_YYYYMMDD.vhd';
  'Subscription'       = 'SUBSCRIPTION_ID_GOES_HERE';
  'StorageAccount'     = 'mylibrary';
  'ResourceGroupname'  = 'RGNRED1';
  'Location'           = 'West US 2';
  'PrivateNetwork'     = $False;
  'VirtualNetworkName' = 'default';
  'Red'                = $True;
  'PoolName'           = 'Our Team Pool';
}
pushd C:\Repo
Measure-Command { & .\Deploy-Machine.ps1 @deploy_args }

A few more lines, not included here, decrypted the actual values from encrypted files so that passwords and secrets never had to be hard-coded into the instructions. Believe it or not, this 20-line monstrosity was a big boost to my productivity. I had a text file full of notes for deploying a group of red VMs and blue VMs and every other color we had. When I needed to deploy something, I had that file ready to go so I could edit the lines I needed (changing the ResourceGroupName to RGNRED2, for example), copy them, paste them, and watch 'em run.
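
The edit-copy-paste routine can also be expressed in PowerShell itself. What follows is a minimal sketch under assumptions, not what the notes file actually contained: New-DeployArgs is a hypothetical helper, and the template is trimmed to a few of the keys from the block above. It copies a template hashtable and overrides only the keys that differ for one deployment.

```powershell
# Shared settings for a deployment (placeholder values, trimmed from the full list).
$template = @{
    'ImageName'         = 'vm_image_YYYYMMDD.vhd'
    'ResourceGroupname' = 'RGNRED1'
    'Location'          = 'West US 2'
    'Red'               = $True
}

# Hypothetical helper: copy the template, then apply per-deployment overrides.
function New-DeployArgs {
    param(
        [Parameter(Mandatory)][hashtable]$Template,
        [hashtable]$Overrides = @{}
    )
    $copy = $Template.Clone()   # shallow copy, so the template itself is untouched
    foreach ($key in $Overrides.Keys) {
        $copy[$key] = $Overrides[$key]
    }
    return $copy
}

# Same settings as the template, but aimed at a different resource group.
$red2_args = New-DeployArgs -Template $template -Overrides @{ 'ResourceGroupname' = 'RGNRED2' }
$red2_args['ResourceGroupname']   # RGNRED2
$template['ResourceGroupname']    # RGNRED1
```

The resulting hashtable can be splatted the same way as before, with @red2_args in place of @deploy_args.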

Next time: Repetitive tasks start feeling repetitive, repeatedly.
