2018-05-01

A "review" of The Carrier (1988)

Your touch is sweet and kind
But there's something on your mind
I can't see your eyes, I can't see your eyes....

My Blu-ray copy of The Carrier came in the mail today. I tell you this because now that I have my copy, you should get yours.

I like good movies, but I loooove bad ones. I thought The Carrier would be one of the latter and discovered it to be closer to the former. I dismissed it on its face as a conventional horror movie: low budget, no-name cast, shot over two weekends with a small town of extras who had a camera, a microphone on a stick, and a lot of "can do" enthusiasm.

Boy, was I wrong. So, so wrong. If you choose to watch The Carrier, you should not make the same mistake I did. It is not your run-of-the-mill late-80s horror film, not by any stretch of the imagination.

The movie starts with a square dance scene and an off-brand Rebel Without a Cause/The Outsiders-lookin' young man crashing the party. A group of locals is casually talking about "the black thing" that has begun appearing on the edge of town and debating whether to fear it or doubt its existence. "Oh," I thought. "This is going to be a by-the-numbers horror movie with a not-so-subtle social message about bigotry in it." I lost interest. I let the movie play and went off into the other room to prepare dinner, occasionally poking my head in to see if the creature feature special effects would be goofy enough to make me chuckle.

Do not do this. Sit down. Watch the film. There's a lot going on in this movie that is easy to miss, even if you think you're paying close attention.

I first saw The Carrier last month, and I haven't stopped thinking about it since. I half-heartedly watched it out of the side of my eye at first and then became so engaged and so haunted by its imagery that I sat down and watched it again the next day. Then I kicked myself for missing so much of it the first time around.

Then I went out and bought it on Blu-ray.

I don't even own a Blu-ray player.

At first glance, I considered The Carrier to be a fun, low-budget horror flick, something MST3K-worthy in the same vein as The Bloodwaters of Dr. Z or Boggy Creek (I or II, take your pick), or even The Giant Spider Invasion. Especially The Giant Spider Invasion. All of these films are so-bad-they're-good in the "get drunk with some friends and make fun of the hillbillies" way.

I was wrong. Very, very wrong.

The Carrier is not a perfect film, and it certainly contains the flaws that plague every low-budget movie. But if you give it a chance, you will bear witness to an intelligent story of alienation and fear. Part The Outsiders, part The Crucible, part The Purge, this film is the story of a young man estranged from his community who becomes, very literally, toxic to everyone around him. I dare not say more. Perhaps I've already said too much. Don't spoil this movie for yourself. Just push play and observe.

The film is a delight to watch. It was shot on that grainy, sorta-expired film stock from the mid 1980s that makes it look like it could have been filmed in the 1970s. The acting is poor but heartfelt and earnest. The special effects are exemplary for the budget of the movie, and just when you start thinking this is going to be a fun, cheesy "guy in a rubber suit" movie, it veers down a sharp left turn into paranoia and old-fashioned, home-grown, utterly volatile clan warfare. What?

Compared to The Carrier, the Hatfields and the McCoys were having a spat over a Scrabble game. This movie gets bonkers. Full-on mob mentality, red scare, wrap-yourself-in-plastic-to-stay-pure, murder-your-neighbor-for-his-stuff, end-of-days mania. And it is engrossing and frightening and bizarre and wonderful. Every so often you find a gem like The Carrier, where a humble band of folks just try to make a little cinematic enterprise for funsies and end up with a secret masterpiece of dread and creeping, amorphous, "fear of the unknown" terror. The film underscores the horrifying notion that anything can kill you at any time and you can't know what or how, but you have to do something, anything, to protect yourself and your loved ones at any cost.

And in that mad rush to find your own tentative safety, what are you willing to sacrifice to reach it?

10 out of 10. Make sure your cats are safely upstairs, wrap yourself in a big plastic sheet, and watch The Carrier. This movie knocked my socks off and that was when I was only half paying attention to it. The full experience is shocking, bizarre, and will stay with you for days.

The Blu-ray contains two cuts of the movie, a director's cut and a theatrical cut. The theatrical cut contains a 2-second error at the halfway point that throws the video and audio unwatchably out of sync, and I still bought it anyway because it's just that good. If you want to see a small town of hillbillies lose their god-damned minds over barn cats, it's on YouTube in its entirety.

2018-04-27

Ansible Week - Bonus - Automating OpenBSD Installs

Manually setting up your OpenBSD VMs is for chumps.

The recent release of OpenBSD 6.3 gave me an excuse to finally sit down and start teaching myself how to use the OS's built-in autoinstall feature.

OpenBSD has supported installation templates for a few years now, but I was always mired in the artisanal mindset. I believed that setting up a new machine was a labor of love, and in spite of the simplicity of the install wizard, I felt one needed to spend at least that long minute or two crafting the hard and fast rules by which the system would live forever.

ZFS sure would be nice to have on the platform, but no. There's no way in hell that's gonna happen.

You may want to take that time thinking about the disk layout of your next mail server or firewall or whatever, but when it comes to a VM image you want to run at scale in the cloud, there are advantages to finding ways to streamline the process after you've made those decisions the first time.

The central tool of OpenBSD autoinstallation is the "install.conf" file, which contains answers to every question that the install wizard would normally ask you interactively.

An example install.conf would look like this:

System hostname = mymachine
Which network interface do you wish to configure = hvn0
IPv4 address for hvn0 = dhcp
IPv6 address for hvn0 = none
Which network interface do you wish to configure = done
Password for root = $2b$08$sjHcRpZW2Jg7ryPxeHEBNu7DsyA3Fg8FrDvqLSqkx7TFmbUST9z/C
Public ssh key for root account = none
Start sshd(8) by default = no
Do you expect to run the X Window System = no
Do you want the X Window System to be started by xenodm(1) = no
Change the default console to com0 = no
Setup a user = no
Allow root ssh login = no
What timezone are you in = UTC
Which disk is the root disk = sd0
Use (W)hole disk MBR, whole disk (G)PT, (O)penBSD area or (E)dit = Whole
Use (A)uto layout, (E)dit auto layout, or create (C)ustom layout = A
URL to autopartitioning template for disklabel = none
Location of sets = cd
Set name(s) = done
Directory does not contain SHA256.sig. Continue without verification = yes
Location of sets = done

This is enough to set up a machine in short order. You can customize it to your wishes, and there's even a disklabel template format you can provide in a separate file:

/		250M 
swap		80M-256M 10% 
/tmp		120M-4G	8%

This is really nice, because you can put this disklabel template online, point the "URL to autopartitioning template for disklabel" line of install.conf at it, and get a very-close-to-hands-free OpenBSD install using just two config files on a trusted internal webserver and the default OpenBSD installXX.iso.
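
For example, if that template lives on a hypothetical internal webserver at 10.0.0.1, the disklabel line of install.conf changes from "none" to:

URL to autopartitioning template for disklabel = http://10.0.0.1/disklabel.mymachine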

You can even embed the install.conf into custom install media to make it totally automated if you want.

So, in conclusion: OpenBSD's autoinstallation features, an Ansible system setup playbook, and scriptable Azure utilities can combine to create a very nice cloud service platform. Reshape the world as you see fit.

2018-04-25

Ansible Week - Part 4

Ansible is powerful, so let's put together a real-world example of how to use it for installing some software.

We have recently been blessed with a new crypto library, libpqcrypto, aimed at maintaining strong security that doesn't depend on the difficulty of factoring composite numbers.

It's so new, there isn't a package for it yet, and its installation steps are quirky, to say the least. They boil down to:

  1. Install some pre-requisite tools (OpenSSL, GMP, Python3, gcc)
  2. Create a new user, libpqcrypto
  3. Fetch the software as that user
  4. Create some symlinks
  5. Compile the library

Ansible was made to handle all of this, and to demonstrate the power of Ansible roles in our playbook, we're going to split the prerequisite installation steps out into their own separate pieces, or "roles". Roles are useful when we want a set of commands we can use and reuse to set up a build environment, even if we end up not using that environment to build this exact libpqcrypto library again in the future.

By putting your work into roles, Ansible allows you to group your tasks into distinct phases and sort them based on the dependencies you define between them. In other words, if you have a role called "fasten-seatbelt", you can define other roles as dependencies for it, maybe ones called "sit-in-seat", "have-keys-in-hand", and "buy-a-car". Each of these roles is generalizable, so if you ever write "ride-rollercoaster", you can reuse the tasks of "sit-in-seat", though perhaps without the car-buying dependency.
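
In Ansible terms, and as a purely hypothetical sketch using the meta file format we'll meet properly below, "fasten-seatbelt" would declare its dependencies in ./roles/fasten-seatbelt/meta/main.yml:

---
dependencies:
  - { role: buy-a-car }
  - { role: sit-in-seat }
  - { role: have-keys-in-hand }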

Maybe we should just build the role and show you.

First, we set up a target host. We've done this before, but for this exact example we're going to install our OS (a Debian-based Linux in this case), patch it to current, add our ansible user, and create our OpenSSH key:

sudo apt install -y openssh-server
sudo groupadd ansible
sudo useradd -g ansible -m ansible
sudo su -l ansible
ssh-keygen -t ed25519 -N '' -q -f ~/.ssh/id_ed25519
cd .ssh
cp -p ./id_ed25519.pub ./authorized_keys.new
chmod 0600 ./authorized_keys.new
mv ./authorized_keys.new ./authorized_keys
exit
sudo visudo
#add line:
# ansible ALL=(ALL:ALL) NOPASSWD: SETENV: ALL

Sync this key, ~ansible/.ssh/id_ed25519, to your ansible host and build your inventory file:

[libpqcrypto]
10.0.0.4

[libpqcrypto:vars]
ansible_become_method=sudo
ansible_ssh_user=ansible
ansible_ssh_port=22
ansible_ssh_private_key_file=/path/to/libpqcrypto/id_ed25519

We run ansible -i ./hosts.libpqcrypto -m ping libpqcrypto, our ping gets ponged, and now we can write our first role. It's a .YML file outlining which packages we want to install before we go about doing anything else with our machine:

---
- name: install compiler and libpqcrypto pre-reqs (apt)
  become: yes
  apt:
  args:
    name: "{{item}}"
    state: present
    cache_valid_time: 86400
    update_cache: yes
  with_items:
    - build-essential
    - gcc
    - libssl-dev
    - libgmp-dev
    - make
    - python3
  when: ansible_pkg_mgr == "apt"

This is just a normal Ansible playbook. We turn it into a role by putting it in a specific location on our Ansible machine: ./roles/libpqcrypto-prereqs/tasks/main.yml.

This creates a role called "libpqcrypto-prereqs" that we can reference from a second role: one that will fetch the libpqcrypto source, create a user, make some symbolic links, and compile the code per the instructions on the website. Let's make that second role now. If the prereqs role has run successfully, we know our compiler and dev libraries are on the target host, so this role only has to do the remaining steps. We make a role called "libpqcrypto-build" and put this into "./roles/libpqcrypto-build/tasks/main.yml":

---
- name: create group
  become: yes
  group:
  args:
    name: libpqcrypto
    state: present

- name: create user
  become: yes
  user:
  args:
    name: libpqcrypto
    createhome: yes
    group: libpqcrypto
    home: /home/libpqcrypto
    shell: /bin/false
    state: present

- name: fetch latest version string
  become: yes
  become_user: libpqcrypto
  get_url:
  args:
    url: https://libpqcrypto.org/libpqcrypto-latest-version.txt
    dest: /home/libpqcrypto/libpqcrypto-latest-version.txt
    validate_certs: false # ouch

- name: read latest version string
  shell: cat /home/libpqcrypto/libpqcrypto-latest-version.txt
  register: version

- name: stat libpqcrypto file
  stat:
  args:
    path: /home/libpqcrypto/libpqcrypto-{{version.stdout}}.tar.gz
  register: st

- name: fetch libpqcrypto
  become: yes
  become_user: libpqcrypto
  get_url:
  args:
    url: https://libpqcrypto.org/libpqcrypto-{{version.stdout}}.tar.gz
    dest: /home/libpqcrypto/libpqcrypto-{{version.stdout}}.tar.gz
    validate_certs: false
  when: st.stat.exists == False

# never use unarchive
- name: untar libpqcrypto
  become: yes
  become_user: libpqcrypto
  shell: tar -xzf /home/libpqcrypto/libpqcrypto-{{version.stdout}}.tar.gz
  args:
    chdir: /home/libpqcrypto/
    creates: /home/libpqcrypto/libpqcrypto-{{version.stdout}}/

- name: create symlinks
  become: yes
  become_user: libpqcrypto
  file:
  args:
    src: /home/libpqcrypto
    dest: /home/libpqcrypto/libpqcrypto-{{version.stdout}}/{{item}}
    owner: libpqcrypto
    group: libpqcrypto
    force: yes
    state: link
  with_items:
    - link-build
    - link-install

- name: remove clang compiler option
  become: yes
  become_user: libpqcrypto
  lineinfile:
  args:
    path: /home/libpqcrypto/libpqcrypto-{{version.stdout}}/compilers/c
    regexp: "^clang.*"
    state: absent
  
- name: timestamp
  shell: date
  register: timestamp

- name: start compile libpqcrypto
  debug:
  args:
    msg: "{{timestamp.stdout}}"

- name: compile libpqcrypto
  become: yes
  become_user: libpqcrypto
  shell: ./do
  args:
    chdir: /home/libpqcrypto/libpqcrypto-{{version.stdout}}

- name: timestamp
  shell: date
  register: timestamp

- name: end compile libpqcrypto
  debug:
  args:
    msg: "{{timestamp.stdout}}"

There's a lot going on here, but you can pretty much tease out what each of these steps is doing to your target host. Many Ansible tasks have an argument called "state" that can be either "present" or "absent". A task won't necessarily re-perform work that has already been done, so what we're really writing is a "configuration outlining the desired state of the system", or a "desired state configuration" for short. This is a term I just now invented all by myself. You're welcome.

We ensure there's a group called "libpqcrypto" and a user of the same name in that group. We fetch the libpqcrypto version string and then fetch that particular version of the software only if its tarball doesn't already exist on the target host. We check for the file with the "stat" module and use a "when:" conditional to tell Ansible to run the download task only when that condition is satisfied.

Then we create some symlinks with the "file" module, and then we go off script for a second to make a one-line change to the compilers/c file to remove the clang line. This can be skipped if the host has clang installed. We could tailor this task with a "when:" conditional, either by checking for "/usr/bin/clang" on the machine or by comparing against what Ansible determines the machine's OS to be. You can pull the list of values that Ansible gathers by running the "setup" module: "ansible -i ./hosts.file -m setup hosts-group-name".
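
Trimmed way, way down, that output looks something like this (the values will vary by host, naturally):

10.0.0.4 | SUCCESS => {
    "ansible_facts": {
        "ansible_distribution": "Debian",
        "ansible_os_family": "Debian",
        "ansible_pkg_mgr": "apt",
        ...
    }
}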

We capture the output of the date command in a "register" called "timestamp" and print it with a "debug" task. We do this again after the compile task runs, which tells us how long the task in between took.

Finally (except for that last timestamp task), we compile the libpqcrypto software by running ./do (with the "shell" module), as the libpqcrypto user (with become_user), in a specific directory (set with the "chdir" argument).

Great! But how do we actually run this role? We need to tell Ansible that this role depends on the pre-requisites being installed, so we list that dependency in "./roles/libpqcrypto-build/meta/main.yml":

---
dependencies:
  - { role: libpqcrypto-prereqs }

Then we put our "libpqcrypto-build" role, complete with its listed pre-req role(s), into a new playbook file, libpqcrypto.yml:

---
- hosts: libpqcrypto
  roles:
    - { role: libpqcrypto-build }
  tasks:
    - group_by:
      args:
        key: "{{ansible_distribution}}"
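
At this point, the full set of files on the Ansible machine looks like this:

./hosts.libpqcrypto
./libpqcrypto.yml
./roles/libpqcrypto-prereqs/tasks/main.yml
./roles/libpqcrypto-build/tasks/main.yml
./roles/libpqcrypto-build/meta/main.yml

And, using the same ansible-playbook invocation style we used in Part 3, a single command kicks the whole thing off:

ansible-playbook --inventory=./hosts.libpqcrypto ./libpqcrypto.yml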

Note that we never call the libpqcrypto-prereqs role directly. We call one role in the "roles:" section and, thanks to its dependency file in .../meta/main.yml, Ansible figures out what to do and in which order to do it.

Naturally, you can make this a fairly complicated web of dependencies: "car" requires "tires", "tires" requires "hubcap", and so on, as I explained earlier. I haven't seen Ansible have a problem sorting a dependency chain so long as it can all eventually be collapsed into a linear sequence per host.

Note also that we execute our tasks grouped by distribution. I started doing this in order to avoid having to write specific conditionals for various target hosts, like this one:

- hosts: webservers
  roles:
     - { role: debian_stock_config, when: ansible_os_family == 'Debian' }

We can apply different roles to different machines based on how we know they'll need to execute our playbook tasks. If all your machines are homogeneous, you can skip grouping your tasks, since you won't have variants. "group_by" is much more powerful than this, since you can use it to create groups on the fly by adjusting the value of "key".
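
For example (a sketch), keying on two facts at once creates finer-grained groups on the fly, which you can then target just like any other inventory group:

- group_by:
  args:
    key: "{{ansible_distribution}}_{{ansible_distribution_major_version}}"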

Ansible Week - Part 3

We have a target machine and we've put its access credentials into our inventory. We can ping it with Ansible, but it's time for us to actually do something. You define these steps in a "playbook", which is just a specially-formatted file that outlines your chosen actions and the machines you want to act upon. This file is in YAML, which Isn't Terrible so you Shouldn't Be Afraid of It:

$ cat ./playbook.yml
---
- hosts: mymachines
  tasks:
  - name: Create a new file
    shell: echo "Hello world" >> {{ansible_ssh_user}}/hw.txt
    args:
      creates: "{{ansible_ssh_user}}/hw.txt"

I'm fairly certain no one actually understands the technical syntax of YAML. Everyone simply takes an existing valid YAML file and edits it as they desire into a new YAML file. When they need another change, they edit the previous one and make a new YAML file, and so on. The YAML specification, then, is just a matter for people writing YAML parsers. YAML writers can just steal old .YML files and go on with their lives, which is nice.

This is a simple playbook that just creates a new file on the target host. It defines your target hosts ("mymachines") and defines a task to run on them. Tasks have a name and a module type; in this case the type is "shell", and we give it an argument called "creates", which tells Ansible that this task has a defined end result: it creates a specific file. Therefore, if this file already exists, Ansible will skip the task.

Now we run it:

ansible-playbook --inventory=./hosts.my ./playbook.yml
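
The output will look roughly like this (Ansible gathers facts as an implicit first task):

PLAY [mymachines] *************************************************

TASK [Gathering Facts] ********************************************
ok: [10.0.0.5]

TASK [Create a new file] ******************************************
changed: [10.0.0.5]

PLAY RECAP ********************************************************
10.0.0.5                   : ok=2    changed=1    unreachable=0    failed=0

Run it a second time and the task reports "ok" instead of "changed", because "creates" sees that the file already exists.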

Note that the actual action of this task is echo "Hello world" appended to a file. Without the "creates" line, running this playbook multiple times would keep adding "Hello world" to that file over and over again. These are the kinds of tricks and gotchas that will occupy your mind as you create bigger and more complicated playbooks, and which I began learning (and correcting) in my own sysadminning back in the days of mailcom scripting.

Note, too, that we are using a variable in the playbook called "ansible_ssh_user", which we defined in our inventory file. This means we can reuse this playbook against any number of inventories, across any number of environments, anywhere we want this new "hw.txt", without having to customize a playbook for every environment. Ansible strives to be modular, reusable, and composable, which are fancy words that really just mean "easy to mix and match pieces so I don't end up reinventing the wheel all the time".

This modularity really starts to take effect when you stop putting your actions into playbooks and evolve into putting them into roles. What's a role? It's tasks on steroids.

Next time: Role playing.

2018-04-22

Ansible Week - Part 2

We have previously added a service account with a new SSH key to a target host. With that key, we can start using our Ansible setup on it to make administrative changes.

Remember that Ansible is agentless. The service account and SSH key you created are the only elements needed to authenticate your changes against your target host from a central command machine. Because this machine literally holds the keys to your proverbial kingdom, you want to take extra-special precautions so it doesn't fall into the wrong hands. I would run my Ansible node in a dedicated VM, and I'd use full disk encryption on it so its data is protected when not in use.

The first thing to do when you're setting up your Ansible config is to test that you can remote into your machines. Start with a simple hosts file to describe your environment. Ansible calls this an "inventory":

cat ./hosts.my
[mymachines]
10.0.0.5

[mymachines:vars]
ansible_become_method=doas
ansible_ssh_user=ansible
ansible_ssh_port=22
ansible_ssh_private_key_file=/home/master/ansible/keys/id_ed25519.mymachines

You can see that there is a "[mymachines]" section with an IP address listed, and a "[mymachines:vars]" section with some values defined for it. The IP address is my first target host, and the vars describe how to connect: using the "ansible" user, via SSH on port 22/tcp, with a specific SSH key called "id_ed25519.mymachines". That key was created on the target host in the last post; its corresponding public key lives on the target machine in "~ansible/.ssh/authorized_keys".

You can tell that this box is an OpenBSD machine because the "become" method is "doas", a system-specific replacement for "sudo". When the "ansible_ssh_user" account needs root permissions, Ansible supports a number of built-in privilege escalation options, including "doas" and "sudo", and calls whichever you designate the "become method". In other words, this variable answers the question, "Which method do I use to become root?"

Now we test:

ansible --inventory=./hosts.my --module-name=ping mymachines

Run ansible with the inventory file that defines your target host and the credentials used to remote into it. We are using the "ping" module, and we are targeting the "mymachines" section of the inventory. You can infer from this that an inventory file can be enormous, with multiple sections in one file that you can then reference individually as needed, like so:

$ cat ./hosts.full-inventory
[mail]
10.0.0.9
10.0.0.10
10.0.0.11

[dns]
ns1.mydns.host
ns2.mydns.host

[web]
webserver.cloudapp.net

If we just wanted to do something with our [web] machine among the full inventory, we could single it out:

$ ansible --inventory=./hosts.full-inventory -m ping web

Or just the [dns] hosts:

$ ansible --inventory=./hosts.full-inventory -m ping dns

The module we're using, "ping", is the Ansible version of a network test. Can you remote into your targets? Ansible-ping them. The results should be simple:

ansible --inventory=./hosts.my --module-name=ping mymachines
10.0.0.5 | SUCCESS => {
  "changed": false,
  "ping": "pong"
}

Ansible reports back (1) the host it reached, (2) success or failure of the operation, and (3) the result: no changes were made to the target machine, and the response to "ping" was "pong". You're all set to manage this machine with Ansible now.

But just pinging machines is boring. You want to add software, add and remove users, and make configuration changes. Ansible is good for setting up new machines and pushing changes to existing boxes, and the art of crafting a new Ansible configuration lies in how well you write a "playbook": a sequence of instructions you define, run against one or more hosts in your inventory.

Are you getting a sense of how powerful Ansible can be now?

Next time: Yet Another Yet Another Markup Language post.

Ansible Week - Part 1

Let's get the bad news out of the way right now: Ansible is built on Python.

I know, I know. We'll get through it, guys... somehow.

I found that a number of other things I like have Python as a pre-requisite, including but not limited to the Microsoft-blessed WALinuxAgent you need on non-OpenBSD/LibreSSL Azure VM images. I have also found that I am, like, the only guy on the planet who hates Python with a bright, fiery passion. So adopting Ansible is probably going to be easier for you than it was for me.

Back to mailcom for a sec. mailcom was very much a sysadmin's tool, written by a sysadmin for sysadmins. It was sharp, unbalanced, and unapologetic. You needed to learn how to "read" mailcom's logs to see if your plan worked as expected, and there was zero chance it would hold your hand or ask you for clarification before obliterating your service like it was so many of your hopes and dreams. There was a skill in "seeing" the mailcom activity in your mailcom logs.

When I started looking at kinder, gentler mailcom replacements, I started with Puppet. Puppet does really interesting things, but Puppet's design requires an agent to be present on the host. As I recall from years back when I last worked with Puppet, you needed to install the agent, run it as root, and have it phone home regularly. From a security perspective, this could be A Bad Idea if you don't do it right, and as a neophyte Puppeteer, not doing it right was very likely.

Ansible's one big advantage, in my opinion, is that it is totally agentless. You still need a superuser account with sudo or doas permissions, but the remote access is managed entirely through SSH. (Ansible apparently also supports Windows via WinRM. Whatever.)

So maintaining your Ansible army is an exercise in SSH key management. Modern security experts wring their hands over what they call "lateral movement", where an attacker who has compromised one machine on your network can access more machines with the same set of credentials. This is why people tell you not to use the same password on multiple accounts, and yet when it comes to SSH keys, it's unbelievably easy to create one key on your base image and let that be the key used on every host cloned from that image.

In general, you want to do as little as root as you can. So to prep your target machine(s), create the following service account:

groupadd ansible
useradd -g ansible -m ansible

You may need to add your new local service account to the root or wheel group on your host. Add it to your /etc/sudoers file or, on OpenBSD, your doas.conf:

cp -p /etc/doas.conf /etc/doas.conf.bak
cp -p /etc/doas.conf /etc/doas.conf.new
echo permit nopass ansible as root >> /etc/doas.conf.new
mv /etc/doas.conf.new /etc/doas.conf

Once you have your service account, configure an SSH key to use to authenticate into that host.

# su -l ansible
$ ssh-keygen -t ed25519 -N '' -q -f ~/.ssh/id_ed25519
$ cd ~/.ssh
$ cat ./id_ed25519.pub >> ./authorized_keys
$ exit

You may need to fix permissions on your authorized_keys file (chmod 0600) if they are not correct. Ensure that your sshd service is running and get ready to hook up this new SSH key to your Ansible machine.
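
sshd checks the directory as well as the file, so if it's being picky, the fixes look like:

$ chmod 0700 ~/.ssh
$ chmod 0600 ~/.ssh/authorized_keys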

Next time: Verifying your distance to the target. One ping only.

2018-04-14

Ansible Week - Prologue - Binary Exponential File Copying

Sometimes when I'm bored or trying to procrastinate I get problems in my head that I try to work out to no real advantage. The other night I began wondering if there was a good way to copy one file to an arbitrary number of hosts efficiently.

I'm not really sure what this is called, and naïve web searches for efficient file copying produced the expected pages comparing rsync to robocopy, or running tar over netcat, and such. I'm not interested in distributed file sharing, though it would be easy enough to just set up your own BitTorrent swarm and have each host share pieces of the file amongst the others. Problem solved!

I dismissed this idea because it contains a lot of overhead and requires some kind of a tracker to help hosts meet each other. (Yes, I know about DHT.) BitTorrent also has problems with updating content. Changing a file requires a change to the torrent, so a new swarm would have to spin up every time a file gets revised. I vaguely recall a project that ibiblio.org was working on to use BitTorrent as an archival mechanism, but I can no longer find it online.

I've looked at Murder, the Twitter-developed file synchronization utility, which seems to work out OK for them, but I don't want to have to use Capistrano, and it's still BitTorrent at the core. It still needs the overhead of a tracker. And Capistrano. No thanks.

If I wanted to avoid a lot of overhead traffic and already had a list of hosts in mind, how would I find a pattern by which each host copies to other hosts without leaving gaps or overlaps?

I had a similar problem a few months back where I had a large, properly-formatted data file that I needed to distribute to a number of machines, so I just copied the file from the source machine to another machine. Then I copied it to a third machine, and from the second machine started copying it to a fourth. Once a box had a copy of the file, I set about copying it to a machine that didn't have it yet. This was faster, albeit more complex to keep in mind, than just copying from the source box to the second machine, then to the third, then to the fourth, and so on in sequence.

The first host has the source copy of the file (in BitTorrent parlance, this is the "seeder") and the remaining hosts need to retrieve it. First host copies to second. When that is completed, first host immediately starts copying to third host. In parallel, second host copies to fourth. Each machine other than the seeder needs to wait until it has a copy of the file, and then it needs a list of all the machines with which it needs to share the file. Another way to put it:

host0: copies to host1, host2, host4, host8

host1: waits, copies to host3, host5, host9

host2: waits, waits, copies to host6, host10

host3: waits, waits, copies to host7, host11

And so on.

In each "round" of file copying, the number of hosts that can copy a file doubles: the one seeder with the source file, then itself and one other machine, then four machines, and so on.

What I've figured out is that as each round of file copying progresses, a host will necessarily wait until the start of round x, where x is 1+floor(log2(h)) and h is the host's number in the sequence. So host1 will get its copy of the file in the first round. host5 won't get a copy of the file until the third round. host16 would get a copy in the fifth round.

If a machine lacks the file, it waits. Once it has the file, it shares it, but only with the set of machines for which it's considered "responsible". For any host h, that's every higher-numbered host whose sequence number is h+2^x, for each x where 2^x is greater than h.

So host0 will copy to hosts 2^0+0, 2^1+0, and 2^2+0, which are host1, host2, and host4.

host1 will copy to hosts 2^1+1 and 2^2+1, which are host3 and host5.

host2 will copy to hosts 2^2+2 and 2^3+2, which are host6 and host10.
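
Here's a quick shell sketch of that responsibility rule (a hypothetical little script; pass it a host's number and the total host count, and it prints that host's copy targets):

#!/bin/sh
# targets.sh: which hosts is host $1 responsible for, out of $2 total?
h=$1
n=$2
# Find the smallest power of two greater than h; that's the first
# increment host h uses once it has its copy of the file.
x=1
while [ "$x" -le "$h" ]; do x=$((x * 2)); done
# From then on, copy to h + 2^x, doubling x for each successive round,
# for as long as that target host actually exists.
while [ $((h + x)) -lt "$n" ]; do
    echo "host$((h + x))"
    x=$((x * 2))
done

Running "sh ./targets.sh 2 12" prints host6 and host10, matching the list above.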

This fits a simple pattern where every machine that has a copy of the file sets about sharing it until all hosts have received it. The source machine performs the most copy operations, and this algorithm doesn't account for real-world problems like failover or weighting a host based on its performance. A big, fast machine with a good network card should do more copying than a flaky old box on, say, a bad wireless network, so if you implement this like I intend to do, make sure you order your hosts appropriately. Depending on your network conditions, you can run an individual host's list of copy operations in parallel, "host1: copy to host3 and host5 simultaneously, then host9 and host17 simultaneously....", without worrying about two machines trying to copy to the same host. This also makes it easy to identify failures: the pattern of hosts missing their new file will let you easily compute which machine dropped the ball.

This method of file distribution does not seem particularly complicated, new, or innovative, so I'd be surprised if there isn't already an implementation of it out there somewhere. I just don't know what it's called or how to find it. Ansible could presumably handle this approach in an OS-agnostic way with its delegation keyword, in which case I'd assume the playbook would need to be dynamically generated based on the number and sequence of hosts. Ansible supports copying, synchronizing, parallel execution, and throttling, but I don't know if it's quote-unquote "optimized" to distribute files efficiently in a manner similar to this.

It's certainly possible that this method of file synchronization has been obsoleted by distributed copying methods like BitTorrent, but I feel this approach has a certain advantage: by distributing the file in rounds, you get the complete file to at least one machine right away, as opposed to a large number of hosts having most of, but not all of, the file until the swarm is seeded. Sometimes you just need "close enough" file copying, where an update needs to go live ASAP; if some percentage of your machines have it sooner, your service is in a better position than if all of your machines "almost" have it for a much longer length of time. We're not going to solve this problem here, because this is really just a long-winded way of saying that the focus of this blog this week will be an overview of the Ansible management utility.

Next time: My favorite, crazy-complicated multi-datacenter service management tool of yore, or, "bananas for scale".

Ansible Week - Part 0

Reading up on Ansible delegation and its claims to support multi-tiered administrative actions ("I'm in charge, and I'm telling you five machines to tell five more machines to run this script....") reminded me of my first exposure to scalable remote server management software. Before Ansible. Before Puppet or Chef or Salt or DSC or Rex (or Docker or Kubernetes or whatever the new buzzword is this month). Back when men were men and so were some of the women. It was a bizarre time for everyone.

Years ago I worked on a hosted e-mail service, something along the lines of your run-of-the-mill Gmail or Hotmail/Outlook.com online offerings. We had a number of datacenters, hundreds of mail hosts, and at times a staggering number of deployment changes to stay on top of in any given week.

The team had a number of tools to accomplish this. Some new, some old, some sane, some not. The oldest and least sane was a bespoke shell/Perl fusion script that would recursively execute arbitrary commands on a regex-designated list of hosts and report the results back in a multi-tiered push/pull methodology that would make everyone's eyes cross the first time I explained how it worked. It was a maddeningly difficult thing to use, and it was ridiculously easy to shoot yourself in the foot with it if you weren't careful.

I loved it dearly.

The program was called mailcom, and I still don't exactly know why. It was used on more than just our mail hosts and was far more powerful than anything needed to merely communicate with an SMTP server. Whoever the Ghostbusters-like Ivo Shandor was that architected this thing was a delicious combination of lunatic and genius. He or she may even have literally worshipped Gozer, and I wouldn't be surprised if they did. mailcom, in its infinite wisdom, was datacenter-aware. Rather than copy data from your jumpbox/bastion host/admin node to dozens of machines across the globe, mailcom would take your payload and send it to one host in each datacenter, and from there replicate it to the machines in that datacenter in parallel. Have 800 boxes in four DCs? You "pay" the price of copying it to four places, and from there mailcom disseminates the file to the hosts in each DC, eliminating unnecessary transcontinental transfer times. At each level of your tree of machines, mailcom would pick a machine to be the parent, assign it a list of children, and instruct the parent to remote into each child and tell it "I have a script/file/command for you. Take it from me, execute it, and give me the results." It wasn't exactly pull, it wasn't exactly push. It was both. Simultaneously, like its creator: lunatic and genius.

mailcom would fail. Oh lordy, would it ever fail. So much so that its never-saw-the-light-of-day replacement was ironically dubbed "failcom". A staggering number of replacements for mailcom were proposed, and it outlived them all. Every single one. This, more than anything, is a testament to the tenacity of a very sharp razor with an indisputable track record of cutting through every damned thing thrown at it.

I became an expert at crafting mailcom-safe scripts and inherited the duty of peer-reviewing junior admins' proposed changes to eliminate scripting habits that weren't "safe" and "restartable" from a mailcom perspective. If a mailcom task failed partway through execution, the indeterminate meta-state a host could be left in was a horrendous botch that often needed to be investigated and repaired by hand. mailcom logs were recursive: each machine would report its own logs plus the sum of the logs of all its children. The final mailcom.log file for an update of several hundred machines would be about as verbose as a Michener novel. So you had to be smart about not accidentally doing dumb things.

Things like non-atomically editing a file. mailcom's fickle nature meant that you could never just append data to a file like "echo 'unique new line' >> /etc/file.conf". If the deployment failed and you reran mailcom, you could wind up with your unique new line in /etc/file.conf twice. A "good" mailcom script (so says me) would always make a backup of the original file, preferably with the change number in its name for easy blameability ("/etc/file.conf.changenum_12345.backup"), then make a temp copy of the file, edit it, set the temp file's ownership and mode, and move it to its intended destination if and only if the destination's checksum didn't already match the final intended version and the temp file's checksum matched a known good version. Otherwise, delete the temp file.
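
Something like this, as a sketch (hypothetical names, and a simple grep guard standing in for the checksum comparisons):

#!/bin/sh
# Append 'unique new line' to /etc/file.conf, mailcom-safely.
CONF=/etc/file.conf
BACKUP=$CONF.changenum_12345.backup
TMP=$CONF.tmp.$$

# Bail out early if the change has already been applied.
grep -q '^unique new line$' "$CONF" && exit 0

# Back up the original, named for the change for easy blameability.
cp -p "$CONF" "$BACKUP" || exit 1

# Edit a temp copy, never the live file.
cp -p "$CONF" "$TMP" || exit 1
echo 'unique new line' >> "$TMP"
chown root:wheel "$TMP"
chmod 0644 "$TMP"

# mv within one filesystem is atomic; the live file is never half-edited.
mv "$TMP" "$CONF" || { rm -f "$TMP"; exit 1; }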

In other words, writing a mailcom-friendly script was an exercise in paranoia. Whatever could go wrong ultimately would, and I spent more than a few spare minutes educating the newer folks on the team to never, ever assume your mailcom change would "just work". Nothing ever "just works" at scale. So every step had to be punctuated with conditions: "Has this already been done? Can we do this in a way that doesn't leave a file in an unknown state? Should the service restart? How do I check that the service has restarted successfully?" And so forth. With mailcom's schizophrenic approach to parallel execution and its whimsical, devil-may-care attitude towards failure states, you had to be damn DAMNED sure that your script wasn't going to put anything anywhere it didn't belong, because mailcom was never going to swoop in and save you.

I loved it dearly.

I didn't love mailcom because it was so damned finicky, or because in order to wield it safely one had to be so damned pedantic, but because mailcom let me make relatively safe changes to a broad number of machines. With a little bit of caution, I could run one command:

mailcom -h "mail[1-100]-(dc1|dc2|dc3)" -s my_script.sh

and efficiently update 300 machines. My change requests were renowned (or perhaps feared) for their detail: literally, "copy these lines exactly and run them (1) against one machine and monitor it for x hours. If successful, run against (2) one machine per DC and monitor for x hours. If successful, then run against (3) ten machines per DC and monitor for x hours...." The only change at each stage was the regex of the target hosts. Hosts that were already updated could run the same script a hundred times and only apply the change once, because of the "only if this, then precisely that" design of the mailcom script. mailcom was oblivious to which machines it had touched previously, so a mailcom-friendly script needed to carefully check whether its job was already completed on a machine and then leave well enough alone. When touching hundreds of machines at once, it was easy for communications between engineers to get muddled, even when working side by side, as to which boxes were already updated and which weren't. A mailcom script full of careful conditional steps meant multiple people could run it over and over again throughout the day against a diverse set of machines and not break something.

In 2008, this was a little bit revolutionary. Now we have Ansible.

Next time: Key matters.