2012-12-27

On SparkleShare

In my search for a better file synchronization solution I was not completely opposed to cloud-based storage, but the ideal solution needs to support an arbitrarily large amount of data. Instead of a basic tiering model or a monthly cap, I want to be able to push as much data between my machines as I have space available on them. If an intermediary also holds a copy of that data, that's fine, but no one makes money handing out free disk space online. I would still be at the mercy of whichever company saw fit to have me, and completely vulnerable if it ever chose to discontinue its service, switch to a different billing model, or forcibly urge users onto a new service by burning down the old one.

Eventually, in the middle of this fruitless search, a friend pointed out SparkleShare, an open source project that aims to provide a transparent synchronization and collaboration engine built on top of an existing version control system and glued together with SSH. It runs on multiple platforms and, even in its infancy, is a pretty fully-featured application. In a short time I was able to install and configure one of my machines as a SparkleShare client and sync it to a Git backend running on an OpenBSD VM on the same host.

One benefit of the SparkleShare architecture is that the data store is not an active service. It's treated as an inert backend and your clients enact changes to it. There's no service to install and no additional software to run. Other than SSH for raw connectivity to the server, you don't need any more holes poked into your firewall. If you've got Git and SSH installed, you've got the tools to run your backend.

If you're really lazy, you don't even need to configure a dedicated user account to own the repository. If you're unbelievably lazy, you can use GitHub or another online repository host to hold the store for you and then you don't even need your own server at all. I prefer the "Everything under my own control" option. It's a little more work, but it all belongs to me.

With a repository set up and SSH configured, SparkleShare is really just a Git synchronizing service that runs continuously on your client. When you modify a file in one of your SparkleShare hosted projects, it gets checked into the local Git repository, pushed to the backend, and then a notification message is sent to the other clients informing them that they need to fetch and merge the change from the backend. By default, SparkleShare uses a notification endpoint that the admins have set up for you: notifications.sparkleshare.org:443. Despite running on port 443 it doesn't use HTTPS and is in fact not an HTTP-based service at all. It's a basic pubsub service that just keeps client connections open, tracks who's subscribed to which projects, and sends broadcast notifications to the proper subscribers when an update occurs. If you've ever used IRC, you get the idea. Did I mention it's unencrypted?
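The bookkeeping such a service does can be sketched in a few lines of Python. This is not the SparkleShare notification server and it says nothing about the real wire protocol: the class and method names (NotificationHub, subscribe, announce) are made up for illustration, clients are modeled as plain callbacks instead of open sockets, and skipping the sender on broadcast is my own assumption.

```python
from collections import defaultdict


class NotificationHub:
    """Toy in-memory pubsub hub. Names and structure are illustrative only."""

    def __init__(self):
        # project identifier -> set of subscriber callbacks
        self._subscribers = defaultdict(set)

    def subscribe(self, project_id, callback):
        """Register callback to hear announcements for project_id."""
        self._subscribers[project_id].add(callback)

    def unsubscribe(self, project_id, callback):
        """Drop a subscription; unknown subscriptions are ignored."""
        self._subscribers[project_id].discard(callback)

    def announce(self, project_id, message, sender=None):
        """Pass message to every subscriber of project_id except sender.

        Returns the number of subscribers that were notified.
        """
        delivered = 0
        for callback in list(self._subscribers.get(project_id, ())):
            if callback is sender:
                continue  # assumption: the announcing client is not echoed
            callback(project_id, message)
            delivered += 1
        return delivered


# Two clients watching the same (hypothetical) project.
hub = NotificationHub()
laptop_inbox, desktop_inbox = [], []


def laptop(project_id, message):
    laptop_inbox.append(message)


def desktop(project_id, message):
    desktop_inbox.append(message)


hub.subscribe("project-a", laptop)
hub.subscribe("project-a", desktop)
hub.announce("project-a", "new head available", sender=laptop)
```

After the last call only the desktop has something in its inbox. A real service would replace the callbacks with long-lived client connections, but the subscription table is the whole trick.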

To their credit, the SparkleShare folks have made the notification service simple and relatively anonymous. You never transmit your data to the notification service. You never transmit your files, or any identifying information about you. It's so simple it's actually kind of clever. When you connect, you subscribe to your hosted project by giving a 160-bit hexadecimal identifier determined at project creation time. When a change is made, your client announces the subscription identifier of the project and the SHA-1 hash of the current head of the Git repository. If I were to randomly guess someone else's identifier, I'd know when a change was announced to that project, but not what the change was or even where the repo is kept. Other attacks are possible, but identifying a specific target when blind is hard. The best an attacker can do is trick a client into checking for a phony update when there isn't one, which is the online version of ringing someone's doorbell and then running away. In theory it's possible, but in reality it just doesn't happen often enough to be a concern.
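To put a number on "randomly guess", here is a small Python sketch. A 160-bit value written in hexadecimal is 40 digits long, the same size as a SHA-1 hash. How SparkleShare actually derives the identifier is not something I'm describing here; random_identifier() is only a stand-in, and both function names are hypothetical.

```python
import re
import secrets

# 160 bits / 4 bits per hex digit = 40 hex digits.
HEX_160 = re.compile(r"[0-9a-f]{40}")

# Number of distinct 160-bit identifiers, roughly 1.46e48.
KEYSPACE = 2 ** 160


def is_160_bit_hex(value):
    """True if value is exactly 40 hexadecimal digits (case-insensitive)."""
    return HEX_160.fullmatch(value.lower()) is not None


def random_identifier():
    """Stand-in generator: 20 random bytes rendered as 40 hex digits."""
    return secrets.token_hex(20)


def chance_of_hit(active_projects, guesses=1):
    """Rough chance that blind guesses land on some active project."""
    return active_projects * guesses / KEYSPACE
```

Even if you assume a million active projects and a billion guesses, chance_of_hit(10**6, 10**9) is still vanishingly small, which is why guessing blind gets an attacker nowhere.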

Using notifications.sparkleshare.org for your projects is great because it means you can get a working project syncing between two machines in about ten minutes, perhaps less if you use an online source control provider and eschew running your own SSH daemon and storage server. Still, I wasn't thrilled with relying on a third-party notification service, no matter how anonymous the data sent and received may be. Just as I am beholden to Microsoft Corporation to keep running Live Mesh out of the kindness of their hearts, I'd also be beholden to notifications.sparkleshare.org to keep running in order to make my SparkleShare projects sync. To be fair, even if notifications went down I could still sync manually. Even better, SparkleShare looks to have a regular polling interval built into it. Without a quick notification method though, SparkleShare lacks the seamless experience I get from Live Mesh: add a file, watch it pop up on the other machines. If the file is small, it usually distributes faster than the eye can detect. If I had to wait ten or fifteen minutes for any change to propagate, I wouldn't be so smitten with Live Mesh.

The SparkleShare notification service program is open source. Despite a couple of typos it's really very easy to understand and easier still to rewrite in Perl. (The author wrote the program using epoll() for event handling, which is Linux-specific, so it does not port well to BSD. Given the number of subscriptions notifications.sparkleshare.org supports at any one time, I can't say I blame him.) Perl is not always the best solution for running an Internet service, but I enjoy writing it and it prototypes pretty well. Better still, I was able to make a SparkleShare notification service clone in less than a day by taking the chat server tutorial out of the POE Cookbook and changing fewer than 90 lines. And that includes going off spec and adding a couple of features I felt the service could benefit from having.

It's also quite easy to write from scratch in C# in an afternoon. The advantage of doing it this way is that you don't even need Perl+POE to handle events. Since ActivePerl's Perl Package Manager doesn't include POE, this allows you to have a native (albeit managed) Windows-based notification service to run on your network.

SparkleShare has an odd quirk: a global announcements URL trumps any folder-specific one. That means if you want to use two different URLs, you can't have a default URL at all; you'd have to set an <announcements_url> for every folder. It seems more logical to me that the global announcements URL should be the default, and that each folder can override it by putting its own announcements URL in the folder's config. This is fixed easily enough by fetching the source, extracting msysgit from Google into the designated location in the source tree, editing CreateListener() in SparkleListenerFactory.cs and compiling your own binary, so it's more of a nuisance than an actual problem with the software. Still, if you can have more than one listener, it shouldn't be difficult to do so.
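The precedence I'd prefer is easy to state in code. The sketch below is Python, not the C# in SparkleListenerFactory.cs, and it is not a patch: the function name, the parameter names and the example URLs (including their tcp:// scheme) are all hypothetical and only illustrate "most specific setting wins".

```python
def resolve_announcements_url(folder_url, global_url, builtin_url):
    """Return the most specific announcements URL that is set.

    Order of preference: the folder's own URL, then the global URL,
    then the client's built-in default. Unset values are None or "".
    """
    for candidate in (folder_url, global_url, builtin_url):
        if candidate:
            return candidate
    raise ValueError("no announcements URL configured")


# Hypothetical values for illustration only.
BUILTIN = "tcp://notifications.example.org:443"
GLOBAL = "tcp://notify.home.example:8000"
WORK = "tcp://notify.work.example:8000"

# A folder with its own URL overrides the global default...
work_folder = resolve_announcements_url(WORK, GLOBAL, BUILTIN)
# ...while a folder without one falls back to it.
home_folder = resolve_announcements_url(None, GLOBAL, BUILTIN)
```

With this ordering you can keep one default for most folders and still point a single folder at a different listener, which is the case the stock behavior makes awkward.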

If I want to put my own private notification service behind a firewall, I can. Proper authentication is a little trickier, but it's easy enough to use an SSH tunnel. Now not only do I have a way to sync from PC to PC, I don't have a middleman I need to worry about who could shut off the lights at any point between now and when the service becomes unprofitable. It's powered by SSH and uses a Git backend, two open source technologies that have wide adoption and support critical environments. I have added one basic (and optional) service to my network that can be locked down and sandboxed if I want to expose it to the web. Best of all, I won't have Microsoft telling me to upgrade it to an inferior replacement.
