2018-04-14

Ansible Week - Part 0

Reading up on Ansible delegation and its claims to support multi-tiered administrative actions ("I'm in charge, and I'm telling you five machines to tell five more machines to run this script....") reminded me of my first exposure to scalable remote server management software. Before Ansible. Before Puppet or Chef or Salt or DSC or Rex (or Docker or Kubernetes or whatever the new buzzword is this month). Back when men were men and so were some of the women. It was a bizarre time for everyone.

Years ago I worked on a hosted e-mail service, something along the lines of your run-of-the-mill Gmail or Hotmail/Outlook.com online offerings. We had a number of datacenters, hundreds of mail hosts, and at times a staggering number of deployment changes to stay on top of in any given week.

The team had a number of tools to accomplish this. Some new, some old, some sane, some not. The oldest and least sane was a bespoke shell/Perl fusion script that would recursively execute arbitrary commands on a regex-designated list of hosts and report the results back through a multi-tiered push/pull methodology that would make everyone's eyes cross the first time I explained how it worked. It was a maddeningly difficult thing to use, and it was ridiculously easy to shoot yourself in the foot with if you weren't careful.

I loved it dearly.

The program was called mailcom and I still don't exactly know why. It was used on more than just our mail hosts and did far more than merely talk to SMTP servers. Whoever the Ghostbusters-like Ivo Shandor was that architected this thing was a delicious combination of lunatic and genius. He or she may even have literally worshipped Gozer, and I wouldn't be surprised if they did.

mailcom, in its infinite wisdom, was datacenter-aware. Rather than copy data from your jumpbox/bastion host/admin node to dozens of machines across the globe, mailcom would take your payload and send it to one host in each datacenter, and from there replicate it to every machine in that datacenter in parallel. Have 800 boxes in four DCs? You "pay" the price of copying it to four places, and each of those hosts would disseminate the file to the rest of its DC for you, eliminating unnecessary trans-continental transfer times. At each level of your tree of machines, mailcom would pick a machine to be the parent, assign it a list of children, and instruct the parent to remote into each child and tell it "I have a script/file/command for you. Take it from me, execute it, and give me the results." It wasn't exactly pull, it wasn't exactly push. It was both. Simultaneously, like its creator: lunatic and genius.
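
If you squint, the shape is just a two-level fan-out, and you can sketch the idea in a few lines of shell. This is not mailcom itself; the relay hostnames, the children list, and every path below are invented stand-ins for illustration.

    # Hypothetical two-stage fan-out, not mailcom itself.
    # Stage 1: pay the trans-continental copy cost exactly once per datacenter.
    for relay in relay.dc1 relay.dc2 relay.dc3 relay.dc4; do
        scp my_script.sh "$relay:/tmp/my_script.sh"
    done

    # Stage 2: each relay remotes into its own children in parallel,
    # hands them the payload, runs it, and collects the output locally.
    for relay in relay.dc1 relay.dc2 relay.dc3 relay.dc4; do
        ssh -n "$relay" '
            while read child; do
                ( scp /tmp/my_script.sh "$child:/tmp/my_script.sh" &&
                  ssh -n "$child" sh /tmp/my_script.sh > "/tmp/$child.log" 2>&1 ) &
            done < /tmp/children.txt
            wait
        ' &
    done
    wait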

mailcom would fail. Oh lordy would it ever fail. So much so that its "never-saw-the-light-of-day" replacement was ironically dubbed "failcom". A staggering number of replacements for mailcom were proposed, and it outlived them all. Every single one. This, more than anything, is a testament to the tenacity of a very sharp razor with an indisputable track record of cutting through every damned thing thrown at it.

I became an expert at crafting mailcom-safe scripts and inherited the duty of peer-reviewing junior admins' proposed changes to eliminate scripting habits that weren't "safe" and "restartable" from a mailcom perspective. If a mailcom task failed partway through execution, the indeterminate meta-state a host could be left in was a horrendous botch that often needed to be investigated and repaired by hand. mailcom logs were recursive: each machine would report its own logs plus the sum of the logs of all its children. The final mailcom.log file for an update of several hundred machines would be about as verbose as a Michener novel. So you had to be smart about not accidentally doing dumb things.

Things like non-atomically editing a file. mailcom's fickle nature meant that you could never just append data to a file like "echo 'unique new line' >> /etc/file.conf". If the deployment failed and you reran mailcom, you could potentially wind up with your unique new line in /etc/file.conf twice. A "good" mailcom script (so says me) would always make a backup of the original file, preferably with the change number in it for easy blameability ("/etc/file.conf.changenum_12345.backup"), then make a temp copy of the file to edit, edit it, set the temp file's ownership and mode, and move it to its intended destination if and only if the destination's checksum didn't already match the final intended version and the temp file's checksum matched a known good version. Otherwise, delete the temp file.
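
In shell, that ritual had roughly this shape. The change number, the checksum value, and the ownership/mode are stand-in examples here, not the real scripts, but the pattern is the one described above: bail out if the work is already done, edit a copy, and only swap it into place if it checks out.

    # Hypothetical sketch of a "mailcom-safe" edit to /etc/file.conf.
    CHANGE=changenum_12345
    TARGET=/etc/file.conf
    WANT_SUM="<md5-of-the-intended-final-file>"   # known-good checksum (placeholder value)

    # Already done? Then leave well enough alone.
    if [ "$(md5sum "$TARGET" | awk '{print $1}')" = "$WANT_SUM" ]; then
        exit 0
    fi

    # Keep a backup named after the change for easy blameability.
    cp -p "$TARGET" "$TARGET.$CHANGE.backup"

    # Edit a temp copy, never the live file.
    TMP="$TARGET.$CHANGE.tmp"
    cp -p "$TARGET" "$TMP"
    echo 'unique new line' >> "$TMP"
    chown root:root "$TMP"
    chmod 644 "$TMP"

    # Move it into place only if the edit produced exactly what we expect.
    if [ "$(md5sum "$TMP" | awk '{print $1}')" = "$WANT_SUM" ]; then
        mv "$TMP" "$TARGET"
    else
        rm -f "$TMP"
        exit 1
    fi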

In other words, writing a mailcom-friendly script was an exercise in paranoia. Whatever could go wrong ultimately would, and I spent more than a few spare minutes educating the newer folks on the team to never, ever assume your mailcom change would "just work". Nothing ever "just works" at scale. So every step had to be punctuated with conditions: "Has this already been done? Can we do this in a way that doesn't leave a file in an unknown state? Should the service restart? How do I check that the service has restarted successfully?" And so forth. With mailcom's schizophrenic approach towards parallel execution and a whimsical, devil-may-care attitude towards failure states, you had to be damn DAMNED sure that your script wasn't going to put anything anywhere it didn't belong, because mailcom was never going to swoop in and save you.
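
A restart step got the same paranoid treatment. Here's a rough sketch of the kind of guard that would pass review; the service name, init script, port, and marker file are all invented stand-ins, but the conditions are the point: don't restart a host twice for the same change, and verify the daemon actually came back before recording success.

    # Hypothetical sketch: restart once per change, then verify the comeback.
    if [ -f /var/tmp/changenum_12345.restarted ]; then
        exit 0                      # already done on this host; leave it alone
    fi

    /etc/init.d/smtpd restart

    # Give it a moment, then confirm it is running and listening before
    # declaring victory.
    sleep 5
    if /etc/init.d/smtpd status && netstat -ln | grep -q ':25 '; then
        touch /var/tmp/changenum_12345.restarted
    else
        echo "smtpd failed to come back on $(hostname)" >&2
        exit 1
    fi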

I loved it dearly.

I didn't love mailcom because it was so damned finicky, or because in order to wield it safely one had to be so damned pedantic. I loved it because mailcom let me make relatively safe changes to a broad swath of machines. With a little bit of caution, I could run one command:

mailcom -h "mail[1-100]-(dc1|dc2|dc3)" -s my_script.sh

and efficiently update 300 machines. My change requests were renowned (or perhaps feared) for their detail, literally "copy these lines exactly and run them (1) against one machine and monitor it for x hours. If successful, run against (2) one machine per DC and monitor for x hours. If successful then run against (3) ten machines per DC and monitor for x hours...." The only change to make was the regex of the target hosts. Hosts that were already updated could run the same script a hundred times and only apply the change once because of the "only if this then precisely that" design of the mailcom script. mailcom was oblivious to which machines it had touched previously, so a mailcom-friendly script needed to carefully check whether its job was already completed on a machine and then leave well enough alone. When doing hundreds of machines at once, it was easy for communications between engineers to get muddled, even when working side by side, about which boxes were already updated and which weren't. So a mailcom script full of careful conditional steps meant multiple people could run it over and over again throughout the day, against a diverse set of machines, and not break anything.
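
Because the script itself was idempotent, the staged rollout was just the same invocation with a progressively wider host regex, something along these lines (the host patterns follow the example above and are illustrative only):

    mailcom -h "mail1-dc1" -s my_script.sh                  # stage 1: one machine
    mailcom -h "mail1-(dc1|dc2|dc3)" -s my_script.sh        # stage 2: one machine per DC
    mailcom -h "mail[1-10]-(dc1|dc2|dc3)" -s my_script.sh   # stage 3: ten machines per DC
    mailcom -h "mail[1-100]-(dc1|dc2|dc3)" -s my_script.sh  # stage 4: everything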

In 2008, this was a little bit revolutionary. Now we have Ansible.

Next time: Key matters.
