things are as bad as they seem: Harder Than It Has to Be

OK. I have lamented, at length, about the eternal fault of all languages anywhere that think a "200 OK" HTTP server response deserves data but a "304 Not Modified" deserves a thrown exception. To hell with that, I say.

But this is neither here nor there. Language designers who do this should be shot. In the face. Until they are dead. At the very least, one should be able to work around this problem, and it just so happens that I had a Python script that I didn't feel like bloating up to six times its original size just because I want to be a good netizen and not suck a bunch of bandwidth every time I run it refetching the same data.

So instead, I bloated it up to a mere three times its original size working around Python's boneheaded 304-exception behavior. This isn't particularly hard to do, but every program has some kind of catch, and this one's catch is differentiating between when to use UTC and when to use local timestamps. (Python currently has god-awful date- and time-handling semantics and the documentation is no fuckin' help whatsoever.) Someday, I think I'll write a quick UTC-to-local and local-to-UTC wrapper. I really don't give a damn right now, because I'm tired and it took me way longer than it should have to get correct 304-smart behavior by avoiding the 304 entirely.

You'll only get a 304 if you use the If-Modified-Since header. This isn't entirely true (If-None-Match can trigger one too), but let's start simply. If you do a normal GET you can read the Last-Modified header and decide from that whether you actually want to retrieve the data. Python uses the urllib2 library to create requests and open URLs. You can parse the response's headers and compare the Last-Modified date against your local copy's mtime. (Interestingly, Python offers a full-on stat() and a much more obvious os.path.getmtime().) All of this is a lot more work than should be necessary.
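For the curious, the compare-against-mtime step might look something like the sketch below. This is not the script from this post: it assumes Python 3, where email.utils.parsedate_to_datetime exists (it did not in the urllib2 days), and remote_is_newer is a name made up for illustration.

```python
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified, local_mtime):
    """Return True if the Last-Modified header value is later than local_mtime.

    last_modified: header string such as "Sun, 14 Aug 2005 12:00:00 GMT", or None.
    local_mtime: seconds since the epoch, e.g. from os.path.getmtime().
    """
    if last_modified is None:
        # Assumption: no header means we can't tell, so fetch to be safe.
        return True
    remote = parsedate_to_datetime(last_modified)
    # An HTTP date ending in "GMT" parses to a timezone-aware datetime,
    # so timestamp() yields epoch seconds without any local-time guessing.
    return remote.timestamp() > local_mtime
```

With a urllib.request response object, this would be called roughly as remote_is_newer(response.headers.get("Last-Modified"), os.path.getmtime("file.txt")).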

Should be:

1. Check for a local copy of file.txt.
2. If it exists, get the mtime.
3. Generate your URL: <http://website/file.txt>.
4. Do a conditional HTTP GET with the If-Modified-Since header set to the mtime of file.txt.
5. If 200, fetch.
6. If 304, do nothing. The local copy is current.
7. Else bummer. Spit out an error and move on.
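Here is roughly what those steps look like if you swallow the 304 exception yourself. Treat it as a sketch under assumptions: it uses Python 3's urllib.request and urllib.error (the successors to urllib2), and the function names build_conditional_request and fetch_if_modified are invented for this example.

```python
import os
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, mtime=None):
    """Build a GET request, adding If-Modified-Since when an mtime is known."""
    headers = {}
    if mtime is not None:
        # formatdate(..., usegmt=True) renders epoch seconds as an HTTP date.
        headers["If-Modified-Since"] = formatdate(mtime, usegmt=True)
    return urllib.request.Request(url, headers=headers)


def fetch_if_modified(url, local_path):
    """Download url to local_path unless the server answers 304.

    Returns True if the file was (re)written, False if the copy is current.
    """
    mtime = os.path.getmtime(local_path) if os.path.exists(local_path) else None
    request = build_conditional_request(url, mtime)
    try:
        with urllib.request.urlopen(request) as response:
            data = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified still arrives as an exception; treat it as "no news".
            return False
        raise  # anything else really is an error
    with open(local_path, "wb") as out:
        out.write(data)
    return True
```

Usage would be something like fetch_if_modified("http://website/file.txt", "file.txt"). The with block also closes the connection, which takes care of the dangling TCP stream.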

Easy, no? Instead, Python makes us jump through these hoops:

1. Check for a local copy of file.txt.
2. If it exists, get the mtime.
3. Generate your URL: <http://website/file.txt>.
4. GET the headers for the URL.
5. Parse the Last-Modified header information.
6. Do some timezone manipulation to juggle between EDT and GMT.
   - Python doesn't have a uniform timing data structure: some functions create a struct_time type, others use a floating-point value for seconds-since-the-UNIX-Epoch. Have fun fumbling through syntax errors.
   - The reverse of the operation performed by the time library's gmtime() function is, inexplicably, in the calendar library, as calendar.timegm(). The calendar documentation refers to it as "unrelated but handy". Great job, guys. If you ever design a car, the steering wheel will be in the trunk, I'm sure.
   - Where was I? Oh, yeah.
7. Compare file.txt's mtime to the Last-Modified date.
8. If the Last-Modified date is more recent than the mtime, fetch the file with read(). This is, technically, the same GET as the first.
9. Error-checking? Surely you jest. I don't even bother closing the TCP stream. I'm unsure if Python does it for me. I mean, why would this language suddenly start trying to be helpful now?
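The struct_time-versus-float juggling from the timezone step can be sketched as a pair of tiny helpers. These are illustrative only, not the code from the script: http_date_to_epoch and epoch_to_http_date are made-up names, and the format string assumes the server sends the usual "Sun, 14 Aug 2005 12:00:00 GMT" style of date.

```python
import calendar
import time

# Assumption: the common HTTP date layout, always expressed in GMT.
HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def http_date_to_epoch(value):
    """Convert an HTTP date string to seconds since the epoch (a number)."""
    # strptime() hands back a struct_time with no timezone attached.
    # Day and month names are matched per the process's time locale,
    # which is English unless the program changes it.
    parsed = time.strptime(value, HTTP_DATE_FORMAT)
    # calendar.timegm() reads the struct_time as UTC. time.mktime() would
    # read it as local time and be off by the local UTC offset.
    return calendar.timegm(parsed)


def epoch_to_http_date(seconds):
    """Convert seconds since the epoch to an HTTP date string in GMT."""
    # time.gmtime() is the inverse of calendar.timegm().
    return time.strftime(HTTP_DATE_FORMAT, time.gmtime(seconds))
```

The result of http_date_to_epoch() is directly comparable with os.path.getmtime(), since both count seconds from the same UTC epoch.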

So yeah. Python. Some people love it. I have no idea why. It can be hammered, heavily, into something that sort of works. Kind of. At least I have a script that functions and is cross-platform. That was the plan all along, especially since all of this code I've written was meant to replace this line:

  os.system('curl -ORs ' + url)

Yep. Sixty lines of code to replace one. Should've just copied curl.

2005-08-14
