2012-12-29

Caveats of Experimenting with SparkleShare

Everything of value I have, I keep in a couple of places. I learned to do this from an episode of The X-Files, wherein a Native American witness is being pursued by the shadowy government agencies that secretly rule the world. To extricate himself he uses the oral tradition of his people: he has told his secret to twenty men. If he "disappears", any one of them can still reveal the information to the world. Unable to find out which twenty people also know the secret, the agencies let him go. My data works the same way: even if one of my data stores is lost, damaged, or destroyed, there are geographically separated replicas upon which I can still rely.

That being said, there are certain gotchas to watch out for when using SparkleShare to sync your data. SparkleShare is a lot of things, but perfect is not one of them. It's billed as a collaboration utility: I have documents, I want us both to be able to change them. You give me your SSH public key, I give you the URL of the store. Done. While SparkleShare is great for that, beware that neither Git nor SSH is interested in the particulars of your choice of file system. You still have to watch out for file names that differ only in case: Git treats "MyFile.txt" and "myfile.txt" as two separate files, so on a case-insensitive file system they are going to create a conflict.
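If you want to catch those case clashes before they bite, something like the following could do it. This is only a rough sketch of mine, not anything SparkleShare ships: the function names case_collisions and walk_paths are made up for illustration, and it assumes you run it on a case-sensitive file system where both spellings can actually exist side by side.

```python
import os
from collections import defaultdict


def case_collisions(paths):
    """Group paths that are identical once letter case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() is a more aggressive lower() meant for caseless matching.
        groups[path.casefold()].append(path)
    # Only groups with more than one spelling are a problem.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


def walk_paths(root):
    """Yield every file and directory under root, relative to root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Leave Git's own bookkeeping directory out of the check.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in dirnames + filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)


if __name__ == "__main__":
    for group in case_collisions(walk_paths(".")):
        print("case clash:", ", ".join(group))
```

Run it from the top of the synced folder on the machine with the case-sensitive file system, before the other side ever pulls.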

SparkleShare is kinda lame when it comes to handling conflicts. But then again, every synchronization tool does it differently and none of them can yet read minds.

Another problem I have with SparkleShare is the lack of preservation of modification times. Git stores content, not files. So all of the content metadata that is so useful to have with a file, like when it was last changed, is irrelevant to Git. There are some hacks that exist to keep this metadata intact through Git, but there's a lot of it from which to choose: do you care about file permissions? How about user and group ownership? Last access time? Creation time? What about arbitrary file system parameters that you defined in your neato BeOS BFS setup? Porting file systems' metadata is tricky.

You can easily enough write hooks for your SparkleShare Git repo. Pre-commit and post-receive hooks exist in the Git plumbing. You can draft a simple utility to walk your directory tree and store stat information about each file inside a file of your choosing and force those stat values for those files when you extract them from the repository. Git stores the data for you, but it's not designed to care about preserving the metadata around it. If Git doesn't care, SparkleShare has nothing it can use to figure it out for you. That's a bummer because I like sorting some of my directories by Last Modified Date to always see the newest files at the top.
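The "simple utility" I have in mind could look roughly like this. It is a minimal sketch that only handles modification times; the manifest name .mtimes.json and the function names save_mtimes and restore_mtimes are my own inventions, and I am assuming the manifest file is committed alongside the data it describes.

```python
import json
import os

MANIFEST = ".mtimes.json"  # hypothetical manifest name, stored at the repo root


def save_mtimes(root, manifest=MANIFEST):
    """Record the modification time (in nanoseconds) of every file under root."""
    times = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Store forward-slash relative paths so the manifest is portable.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel == manifest:
                continue  # the manifest does not need to describe itself
            times[rel] = os.stat(full).st_mtime_ns
    with open(os.path.join(root, manifest), "w", encoding="utf-8") as fh:
        json.dump(times, fh, indent=0, sort_keys=True)
    return times


def restore_mtimes(root, manifest=MANIFEST):
    """Force the recorded modification times back onto the files that exist."""
    with open(os.path.join(root, manifest), encoding="utf-8") as fh:
        times = json.load(fh)
    restored = 0
    for rel, mtime_ns in times.items():
        full = os.path.join(root, *rel.split("/"))
        if os.path.isfile(full):
            # Keep the current access time; only the modification time is forced.
            atime_ns = os.stat(full).st_atime_ns
            os.utime(full, ns=(atime_ns, mtime_ns))
            restored += 1
    return restored
```

One plausible wiring, untested against SparkleShare itself, is to call save_mtimes from a pre-commit hook and restore_mtimes from a client-side hook that runs after new content lands, such as post-merge or post-checkout. Permissions, ownership, and the rest of the metadata zoo would each need their own handling.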

SparkleShare also has to rename files inside any Git repository you store in a folder synced with SparkleShare. This makes SparkleShare suboptimal for use in sharing Git repositories, which is kind of meta. Because it needs to rename any file it finds called "HEAD" in a .git directory to "HEAD.backup" (the ones it can find, anyway), it appears you can't use a Git repo directly from a SparkleShare directory. Turns out it's not turtles all the way down.

It's also rubbish on enormous repositories. I blame Git for this one, because Git really isn't meant to be used to track every byte inside 50,000 files spanning 25 GB or so. It can, it just takes a while and sometimes segfaults if you don't have the compute resources. I'm still wondering how I'm going to have it handle my media without involving Amazon S3.

Last night I took a look at the code to figure out from where the identifier for a new hosted project is coming. It's a SHA-1 string for sure, but a SHA-1 string of what, exactly, I couldn't guess. It turns out it's the SHA-1 digest of a randomly-generated filename using .NET/Mono to call System.IO.Path::GetRandomFileName(). This means that there's a 1 in 131,621,703,842,267,136 chance of a collision. While this seems a rather huge pool from which to draw project names, it fits comfortably inside 2^57, ignoring the vast majority of the potential namespaces available inside 160 bits. It's about 9 x 10^-30 percent of the total amount of possible namespaces that SHA-1 can support. In other words, imagine all the grains of sand on all the beaches and in all of the deserts on all of planet Earth. Now fill a thimble full of sand from one of them and go, "this ought to do it". This is fixable of course, since a subscription is really just an arbitrary string that cannot contain the characters '!', ' ', or '\n'. Better still, collisions can only occur per notification service instance, so if you're using a private SparkleShare notification instance like I am, you're practically free and clear. At least until you get close to your 132nd quadrillionth SparkleShare project.
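For anyone who wants to check my arithmetic, here is a quick sketch. The big assumption is the shape of the random name: my collision figure works out to exactly 36^11, so I am modelling the name as 11 random characters drawn from 36 symbols (lowercase letters and digits) laid out 8.3 style. That alphabet, and the helpers random_name and project_id, are illustrative stand-ins of mine, not the actual Mono or SparkleShare code.

```python
import hashlib
import secrets
import string

# Assumed alphabet: 26 lowercase letters plus 10 digits = 36 symbols.
ALPHABET = string.ascii_lowercase + string.digits
NAME_LENGTH = 11  # 8 characters, a dot, then 3 more; the dot adds no randomness


def random_name():
    """Stand-in for GetRandomFileName(): an 8.3 style random name."""
    chars = [secrets.choice(ALPHABET) for _ in range(NAME_LENGTH)]
    return "".join(chars[:8]) + "." + "".join(chars[8:])


def project_id(name):
    """SHA-1 hex digest of the name, 40 hex characters (160 bits)."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


pool = len(ALPHABET) ** NAME_LENGTH   # number of distinct random names
sha1_space = 2 ** 160                 # number of distinct SHA-1 digests
percent = 100 * pool / sha1_space     # share of the digest space in use

if __name__ == "__main__":
    print("pool size:", f"{pool:,}")
    print("fits below 2^57:", pool < 2 ** 57)
    print("percent of SHA-1 space:", f"{percent:.1e}")
    print("sample id:", project_id(random_name()))
```

Hashing the name does not add any randomness: however wide the digest is, there can never be more distinct identifiers than there are distinct names going in.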

So SparkleShare isn't a perfect file synchronization and distribution utility. It's still quite good though, all things considered. It's relatively fast, easy enough for my feeble mind to understand it, and built on top of established, industry-standard technologies.
