2018-04-14

Ansible Week - Prologue - Binary Exponential File Copying

Sometimes when I'm bored or trying to procrastinate I get problems in my head that I try to work out to no real advantage. The other night I began wondering if there was a good way to copy one file to an arbitrary number of hosts efficiently.

I'm not really sure what this is called, and naïve web searches for efficient file copying produced the expected pages comparing rsync to robocopy, running tar over netcat, and the like. I'm not interested in distributed file sharing, though it would be easy enough to just set up your own BitTorrent swarm and have the hosts share pieces of the file amongst each other. Problem solved!

I dismissed this idea because it contains a lot of overhead and requires some kind of a tracker to help hosts meet each other. (Yes, I know about DHT.) BitTorrent also has problems with updating content. Changing a file requires a change to the torrent, so a new swarm would have to spin up every time a file gets revised. I vaguely recall a project that ibiblio.org was working on to use BitTorrent as an archival mechanism, but I can no longer find it online.

I've looked at Murder, the Twitter-developed file synchronization utility, which seems to work out OK for them, but I don't want to have to use Capistrano, and it's still BitTorrent at its core. It still needs the overhead of a tracker. And Capistrano. No thanks.

If I wanted to avoid a lot of overhead traffic and already had a list of hosts in mind, how would I go about finding a pattern where each host copies to other hosts without leaving gaps or overlaps?

I had a similar problem a few months back where I had a large, properly formatted data file that I needed to distribute to a number of machines, so I copied the file from the source machine to a second machine. Then I copied it to a third machine while the second machine started copying it to a fourth. Once a box had a copy of the file, I set about copying it to a machine that didn't have it yet. This was faster, albeit more complex to keep track of, than just copying from the source box to the second machine, then to the third, then to the fourth, and so on in sequence.

The first host has the source copy of the file (in BitTorrent parlance, this is the "seeder") and the remaining hosts need to retrieve it. First host copies to second. When that is completed, first host immediately starts copying to third host. In parallel, second host copies to fourth. Each machine other than the seeder needs to wait until it has a copy of the file, and then it needs a list of all the machines with which it needs to share the file. Another way to put it:

host0: copies to host1, host2, host4, host8

host1: waits, copies to host3, host5, host9

host2: waits, waits, copies to host6, host10

host3: waits, waits, copies to host7, host11

And so on.

In each "round" of file copying, the number of hosts holding the file doubles: first just the seeder with the source copy, then the seeder plus one other machine, then four machines, and so on.
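To make this concrete, here's a minimal Python sketch (the function name and host count are mine, purely for illustration) that simulates the schedule: in each round, every host that already has the file copies it to the host numbered one "current population" higher. For what it's worth, this pattern is essentially a binomial tree broadcast, if you want a search term.

    def simulate_rounds(num_hosts):
        # Hosts 0..have-1 hold the file; in each round, host j copies to
        # host j + have, doubling the population of hosts with the file.
        have, rnd = 1, 1
        while have < num_hosts:
            copies = [(j, j + have) for j in range(have) if j + have < num_hosts]
            print(f"round {rnd}: " + ", ".join(f"host{s} -> host{d}" for s, d in copies))
            have, rnd = have * 2, rnd + 1

    simulate_rounds(12)
    # round 1: host0 -> host1
    # round 2: host0 -> host2, host1 -> host3
    # round 3: host0 -> host4, host1 -> host5, host2 -> host6, host3 -> host7
    # round 4: host0 -> host8, host1 -> host9, host2 -> host10, host3 -> host11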

What I've figured out is that as each round of file copying progresses, a host will necessarily wait until the start of round x, where x is 1 + floor(log2(h)) and h is the host's number in the sequence. So host1 will get its copy of the file in the first round. host5 won't get a copy of the file until the third round. host16 would get a copy in the fifth round.
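A quick sanity check of that formula in Python: for an integer h >= 1, h.bit_length() is exactly 1 + floor(log2(h)), so the examples above fall right out.

    for h in (1, 5, 16):
        # h.bit_length() == 1 + floor(log2(h)) for any integer h >= 1
        print(f"host{h} gets its copy in round {h.bit_length()}")
    # host1 -> round 1, host5 -> round 3, host16 -> round 5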

If a machine lacks the file, it waits. Once it has the file it shares it, but only with the set of machines for which it's considered "responsible". For any host h, that's every higher-numbered host whose sequence number is 2^x + h, for each x where 2^x > h.

So host0 will copy to hosts 2^0+0, 2^1+0, and 2^2+0, which are host1, host2, and host4.

host1 will copy to hosts 2^1+1 and 2^2+1, which are host3 and host5.

host2 will copy to hosts 2^2+2 and 2^3+2, which are host6 and host10.
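The responsibility rule translates to a few lines of Python (a sketch; targets is my own name for it): start at the smallest x with 2^x > h, which is h.bit_length(), and keep going until the host numbers run past the end of the list.

    def targets(h, num_hosts):
        # Host h is responsible for hosts 2**x + h for every x where
        # 2**x > h; h.bit_length() is the smallest such x.
        x = h.bit_length()
        result = []
        while 2**x + h < num_hosts:
            result.append(2**x + h)
            x += 1
        return result

    print(targets(0, 12))  # [1, 2, 4, 8]
    print(targets(1, 12))  # [3, 5, 9]
    print(targets(2, 12))  # [6, 10]
    print(targets(3, 12))  # [7, 11]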

This fits a simple pattern where every machine that has a copy of the file sets about sharing it until all hosts have received it. The source machine performs the most copy operations, and this algorithm doesn't account for real-world problems like failover or weighting a host based on its performance. A big, fast machine with a good network card should do more copying than a flaky old box on, say, a bad wireless network, so if you implement this like I intend to, make sure you order your hosts appropriately. Depending on your network conditions, you can run an individual host's list of copy operations in parallel ("host1: copy to host3 and host5 simultaneously, then host9 and host17 simultaneously...") without worrying about two machines trying to copy to the same host. This also makes it easy to identify failures: a contiguous run of hosts missing the new file lets you easily compute which machine dropped the ball.
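That failure computation can be written down directly, too (again a sketch, with a made-up name): under this numbering, the machine responsible for copying to host h is h with its highest set bit cleared.

    def responsible_for(h):
        # The host that should have copied to host h (h >= 1) is
        # h - 2**floor(log2(h)), i.e. h with its top bit cleared.
        return h - (1 << (h.bit_length() - 1))

    print(responsible_for(6))   # 2 -- host2 dropped the ball
    print(responsible_for(11))  # 3 -- host3 dropped the ball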

This method of file distribution does not seem particularly complicated, new, or innovative, so I'd be surprised if there isn't already an implementation of it out there somewhere. I just don't know what it is or how to find it. Ansible could handle this approach in an OS-agnostic way with its delegation keyword (delegate_to), though I'd assume the playbook would need to be dynamically generated based on the number and ordering of the hosts. Ansible supports copying, synchronizing, parallel execution, and throttling, but I don't know if it's quote-unquote "optimized" to distribute files efficiently in a manner similar to this.

It's certainly possible that this method of file synchronization has been obsoleted by distributed file copying methods like BitTorrent, but I feel this approach has a certain advantage: in distributing a file in rounds, complete copies land on some machines right away, as opposed to a large number of hosts each having most of, but not all of, the file until the swarm is seeded. Sometimes you just need "close enough" file copying, where an update needs to go live ASAP, and if some percentage of your machines have it sooner, your service is in a better position than if all of your machines "almost" have it for much longer. We're not going to solve this problem here, because this is really just a long-winded way of saying that the focus of this blog this week will be an overview of the Ansible management utility.

Next time: My favorite, crazy-complicated multi-datacenter service management tool of yore, or, "bananas for scale".
