2020-05-05

Loopy (Part 4)

So for the rest of October and the first week of November, I wrote my loop-deploy script and put "a metric fuckton of retry loops" into it. The process was pretty simple: if you want to do a thing, then, in a for loop, try to do the thing; if it fails, write a warning that it failed and sleep; if it succeeds, verify that the state of the world is what you want it to be and break out of the loop. If the world still isn't what you want after the loop has ended, write a bigger, scarier warning and keep iterating inside the SUPERLOOP that wraps the entire end-to-end process.

In other words:

loop A {
  loop B {
    take the target VMs out of rotation
  }
  loop C {
    wait until the target VMs are idle
  }
  loop D {
    Stop-VMs
  }
  loop E {
    Remove-VMs -Force
  }
  loop F {
    Deploy-VMs
  }
  loop G {
    WaitFor-VMs
  }
  loop H {
    HealthCheck-VMs
    if ($healthy_vm_count -eq $total_vm_count) {
      loop I {
        result = attempt to put the target VMs back into rotation
      }
      if (result -eq OK) { break A }
    }
  }
}
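
In practice, a single one of those inner loops looked something like the sketch below. The cmdlet names here (Stop-PoolVM, Test-PoolVMStopped) are made-up stand-ins for illustration, not my real module:

for ($try = 1; $try -le 5; $try++) {
  try {
    Stop-PoolVM -Name $vm_name -ErrorAction Stop   # do the thing
  }
  catch {
    Write-Warning "Attempt $($try): Stop-PoolVM failed: $_"
    Start-Sleep -Seconds 30
    continue
  }
  # verify the thing actually happened before moving on
  if (Test-PoolVMStopped -Name $vm_name) { break }
  Start-Sleep -Seconds 30
}
if (-not (Test-PoolVMStopped -Name $vm_name)) {
  Write-Warning "!!! $vm_name is still not stopped after 5 tries; the SUPERLOOP gets to deal with it"
}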

UPDATE: After some searching, I found an over-simplified version of the deployment SUPERLOOP (humbly called ":LOOP" in the code). This deployment loop was, of course, called in a loop, once for each item in [String[]] $ResourceGroupName:

:LOOP for ($retry = 1; $retry -le $Tries; $retry++) {

  if ($RemoveMachine) {
    :DELAGENTLOOP for ($i = 1; $i -le $DeleteTries; $i++) {
      Select-Machine -Enable:$False

      # get idle VMs
      if ((10 -eq $idle_vms.Count) -and ($SeriouslyNaughty)) {
        Remove-AzureRmResourceGroup -Name $ResourceGroupName -Force
        Start-Sleep -Seconds 300
        New-AzureRmResourceGroup -Name $ResourceGroupName -Location $Location -Force
        break DELAGENTLOOP
      }

      if (0 -lt $idle_vms.Count) { Remove-Machine.ps1 }
      if (0 -eq @(Get-AzureRmVM -ResourceGroupName $ResourceGroupName).Count) { break DELAGENTLOOP }
      if ($DeleteTries -gt (1 + $i)) { Start-Sleep -Seconds 300 }
    } # :DELAGENTLOOP
  }

  if ($DeleteOnly) { Return } # just remove hosts, do not redeploy

  Start-Sleep -Seconds (($retry-1) * 3600)

  # Deploy hosts
  if (-not ($HealthCheck)) { Return }

  # Check status
  :CHECKLOOP for ($check = 0; $check -lt 4; $check++) {
    Start-Sleep -Seconds 300

    if (2 -gt $offline_hosts) {
      Select-Machine -Enable:$True
      Return
    }
  } # :CHECKLOOP
} # :LOOP

Some of these loops had smaller loops inside them because I was completely done with running this pool by hand and I'd had 12 months of learning the hard way what the most likely causes of failure were. More specifically, I'd sussed out the best series of steps to take in the flaky situations to still get a good final result, and so the SUPERLOOP had a lot of epicyclic micro- and nano-loops to smooth over the bumpy parts of the deployment process, the places where waiting 30 seconds and trying again might be just the thing to get the deployment back on its feet.

The SUPERLOOP had its own sleep schedule, defined as (current attempt number - 1) * 3600 seconds. On multiple occasions a deployment would fail, wait an hour, fail again, wait two hours, and so on, only succeeding on the sixth attempt after five failures and 5 + 4 + 3 + 2 + 1 hours of waiting in between.

A lot of time is spent philosophizing about "service reliability", but that's really not the right way to frame it. Most folks seem to think a reliable service is one that never crashes or becomes unavailable, but the truth is that no service has 100% uptime, so the real question isn't "How can this program keep running?" but rather "How can this program adapt to failures and recover from them?" Lots of software handles this by writing its state to disk every couple of seconds; when it starts, it looks for that retry file to see where it left off and picks up from there. Or it sets a timer and, if an operation doesn't complete within that length of time, it performs some kind of rollback or recovery action. loop-deploy3 was not designed to keep a sinking ship afloat. Its myriad epicycles of loops were deliberately put in place to keep trying until an operation that was known to have worked before succeeded again, and if it didn't? loop-deploy3 was ready and willing to tear it all down, wait for the frontend to finish its hissy fit and calm down, and then try again.
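
For contrast, that checkpoint-and-resume style looks roughly like the skeleton below. The file path, the step names, and the Invoke-* commands are all made up for illustration; this is the opposite of how loop-deploy3 worked:

# Hypothetical checkpoint-and-resume skeleton, NOT loop-deploy3.
$state_file = 'C:\temp\deploy-state.xml'   # made-up path
$steps      = 'StopVMs', 'RemoveVMs', 'DeployVMs', 'HealthCheck'

# If a previous run left a state file behind, pick up where it stopped.
$done = if (Test-Path $state_file) { @(Import-Clixml $state_file) } else { @() }

foreach ($step in $steps) {
  if ($done -contains $step) { continue }   # finished on an earlier run
  & "Invoke-$step"                          # placeholder for the real work
  $done += $step
  $done | Export-Clixml $state_file         # checkpoint after every step
}
Remove-Item $state_file                     # clean finish: no state file left behind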

This seems on the surface to be a crude way of writing software, but writing loop-deploy3 really cemented the "livestock, not pets" methodology in my mind. I don't spend time naming servers cutesy things like "zeus" or "ilpostino" anymore. I stopped caring about keeping long-term VMs around that I could cultivate like a victory garden of delicate, exotic flowers. I just needed 1,220 VMs to be ready and able to get the job done, and cloud deployments mean I don't have to wonder about where the VMs came from or what they look like. I just need to provision disk space, cores, and a network interface and I'm off to the races. If one deployment fails? Throw it out and start another. And keep trying until you get what you asked for in the first place. The loop-deploy3 SUPERLOOP maximum attempt count was configurable, but the default was 10. Rarely were deployments so horribly broken that they'd make it to attempt #10. This gave me the satisfaction of knowing that, no matter how badly the cloud was acting that day, loop-deploy3 would keep trying until either (a) it got what I asked it to build for me, or (b) I'd come in the next day, see it was still trying and failing to deploy VMs, and stop it from any further attempts. I recall that (b) happened on at least one occasion.

The second Tuesday in November rolled around, and I put my loop-deploy script into practice for the rollout. And I immediately hit a bunch of showstopping problems with it and did some significant rewrites. It was during this time that loop-deploy1 and loop-deploy2 were taken out back behind the barn and never heard from again, while loop-deploy3 proved functional and effective enough to get me through the monthly rollout in under a week.

This was progress.

From there I refined loop-deploy3 until each piece of it, and the logging, was good enough for me to depend upon. I was still managing the pool by myself like before, but "redeploy one thousand machines" was no longer a two-week chore. With ten ConEmu tabs open I could run loop-deploy3 ten times at once with ten resource groups: "loop-deploy3 -ResourceGroupName @('RGNRED1', 'RGNRED2', ... 'RGNRED10')". Assuming ten VMs per resource group, from one host machine I could redeploy the whole pool in about a day if I was really in a hurry and things went well. The December patches were a snap.

That's a two-week monthly patching obligation boiled down to about 24 hours, but I kept going.

I'd also written automation to build the new VMs on Tuesday morning for me, then added software to the image so that if they recognized they weren't in production they would try to patch themselves, in a loop, until there were no more patches to be found, and then sysprep themselves and wait for me to bless them into the image library for distribution to Prod. That took an afternoon, and so I would normally have the images copying by early Tuesday evening and ready for loop-deploy3 to roll out on Wednesday.
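
Conceptually, that self-patching step on the image VM looks something like the sketch below. The PSWindowsUpdate cmdlets and the run-it-at-every-boot arrangement are my assumptions for illustration, not the original code:

# Illustrative patch-until-clean pass for a not-yet-production image VM.
# Meant to run at every boot (e.g. from a startup scheduled task) until
# Windows Update has nothing left to offer. Assumes the PSWindowsUpdate module.
Import-Module PSWindowsUpdate

if (@(Get-WindowsUpdate).Count -gt 0) {
  # Still patches on offer: take them all and reboot to go around again.
  Install-WindowsUpdate -AcceptAll -AutoReboot
}
else {
  # Fully patched: generalize and shut down, ready to be blessed into the image library.
  & "$env:SystemRoot\System32\Sysprep\sysprep.exe" /generalize /oobe /shutdown
}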

The PS module I wrote to support this pool could also query the VMs and figure out what version of the image they were all running and when they had last been deployed. So from there I just wrote a simple algorithm to determine when the second Tuesday of the current month falls (which pairs nicely with the image-patching step above), and then query which machines had not been deployed more recently than that. Most machines fit this profile, but the more accurate way to check was with an exact image name, which I ended up doing as a final validation step anyway to look for stragglers. Each month, as part of the file copying step, I'd update ImageTable.xml with the new image name, check it into source control, and let loop-deploy3 handle the rollout for me. If I was really in a hurry, I could distribute the base image to more than just the library where VMs knew to fetch their VHDs. If they already had a local copy of the VHD file they'd save an hour of copying time, so if I had a known-good updated file, I could (with a loop, of course) blast it out to a bunch of target locations at least an hour before I needed to deploy them. During security response operations I ended up doing this more than once, but generally I preferred to let Barney's process do what it was designed to do.
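
The second-Tuesday math itself is only a few lines of PowerShell, something along these lines (Get-SecondTuesday, $all_vms, and the LastDeployed property are placeholders for illustration, not the real module):

# Find Patch Tuesday (the second Tuesday) of the month containing $Date.
function Get-SecondTuesday {
  param([datetime] $Date = (Get-Date))
  $day = Get-Date -Year $Date.Year -Month $Date.Month -Day 1
  while ($day.DayOfWeek -ne [DayOfWeek]::Tuesday) { $day = $day.AddDays(1) }
  return $day.Date.AddDays(7)   # first Tuesday + 7 days = second Tuesday
}

# Anything not deployed since Patch Tuesday is a candidate for redeployment.
$stale_vms = $all_vms | Where-Object { $_.LastDeployed -lt (Get-SecondTuesday) }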

I just didn't do it by hand anymore.
