2004-10-14

Yet Another Open Letter

Dear The Internet,

I appreciate that people are flocking to new RSS syndication protocols. RSS is a very easy way to stay up-to-date on a wide variety of subjects, from web comics to some anonymous shmoe's opinion on U.S. foreign policy.

But we have a teensy weensy little problem with RSS: certain people who implement it for their websites are idiots. Microserf (and inexplicable open-source advocate) Dare Obasanjo has begun illustrating a very small percentage of these people in his new series, "RSS Feeds That Suck". Dare is a very bright man, and you should value his opinion, as he is very much interested in the whole RSS thing. He should be, since he's written a very competent feed aggregator called RSS Bandit.

Well, I haven't written an aggregator, so you have no prima facie reason to value my opinion on said topic. Still, I hope to convince you that a great number of feeds suck for exactly three very simple reasons. It is inexcusable for an RSS feed to suck for any of the following reasons:

  1. Lack of full feed support.

    This is not rocket science, people. You think that perhaps a little teaser of your new content will be enough to entice folks to click through and read your swag from your webpage. Wrong! Most people who run an aggregator do so because they don't fucking feel like expending the energy it would take to use their browser. Aggregators are, above all else, designed for convenience. Opening a browser is just one more middleman I don't want to have to deal with. Full feeds are simply better, because then they can be read in their entirety without the need to establish a second connection to the same Internet source. Even aside from that, everybody understandably gets pissed off when they're trying to read something you've written but it arbitrarily gets cut off right in the middl...

    Doesn't that just drive you bonkers?

  2. Inaccurate timestamping. I loathe the fact that half of my favorite feeds don't timestamp correctly. Meaning if a feed's newest item was added at, say, 4 PM on 2004-10-03, but I don't check it until 2 AM on 2004-10-06, my aggregator is going to say that the newest item was added at 2 AM on 2004-10-06. This is clearly a lie. That item has been around for over two full days. This wouldn't be so bad by itself, but consider what happens if there have been several updates since I last checked it. Which one is newest? I don't know. I don't know because all of the new items have the exact same timestamp: 2 AM, when I checked them, not when they were published. Proper timestamping is easy to do overall, and it's apparently ridiculously simple in RSS 2.0 and Atom 0.3 (I'm not certain how. Ask Sam Ruby.). I prefer Atom over RSS 2.0, and RSS 2.0 over pretty much everything that remains. The reason why has nothing to do with bandwidth, or XML support, or anything other than accurate timestamping.

  3. Utter disregard for charsets. A character set is like an interpreter. It tells your computer exactly what the stream of ones and zeroes you're receiving is supposed to mean. Without the correct charset, you're just downloading garbage: babble that may or may not convey meaning.

    And guess what? Nobody seems to care about this. Time and again, I see people using absolutely the wrong charset on their feed. I'm outraged by this! I will concede that it is my responsibility to teach my computer to understand any and all far-out, funky, never-heard-of-that-one-before charset you may wish to use to publish your content, but you'd fucking well better stick to it yourself! I see people sending UTF-8 bytes in a feed that declares ISO-8859-1. Wrong! The problem here is pretty subtle in most English-speaking blogs. But what about the French? In the French language, the 128 ASCII characters get augmented with graves and acutes pretty damn quickly, and every one of those accented letters takes two bytes in UTF-8, so a reader that has been told to expect ISO-8859-1 mangles each of them. Take Belle de Jour's entire list of entries for the month of August as an example. She's English, but titles her posts with the date in French. This looks fine on her page, but the German guy who does her RSS feed chose ISO-8859-1 and thus screwed up everything. Viewing UTF-8 under the wrong charset is a mess. Take a look:

    should be: vendredi 27 août
    looks like: vendredi 27 aoÃ»t

    Only thanks to the forethought of the Unicode designers can you make any sense out of this at all. Your computer does what it's supposed to do when it's told it's seeing ISO-8859-1: convert every byte it gets into one character. Guess what? It gets it all wrong because UTF-8 uses both single-byte and multi-byte sequences. But there's a way to tell them apart, and your computer could handle it if the webmaster would just rub two brain cells together and change the declared charset from ISO-8859-1 to UTF-8. It's a one-line change to the feed; that's hardly a massive undertaking.

    Incorrect encodings are also hell on the people who think that "smart quotes" are universal. I assure you: they are not. This is somehow much worse than turning "août" into "aoÃ»t" because now you're not getting multiple junky characters, you're just getting little boxes where punctuation should be:

    ▯I▯ve seen this before and there▯s an easy fix for it▯▯ you▯re saying to yourself. But you▯re forgetting that a lot of people just don▯t understand that what they▯re seeing on their computer when they click ▯Submit▯ is not what others who visit their page are seeing▯ too.

    See what I mean? This isn't exactly an incorrect charset problem, because on Pornblography, for example, Carly doesn't declare any character set at all. Thus, the server has to assign the default, which is obviously wrong. So her punctuation could be in Swahili for all we know. All we do know is that it isn't proper ISO-8859-1. To fix it, we would have to guess at what it's supposed to be. Should Carly's readers all try to be mindreaders?

    Newsflash: this is the 21st century, folks. You need to stop breaking your feeds with this crap. These are elementary mistakes that should not happen. There's no real fix for this problem: each feed must conform to a proper character set, and only someone competent in understanding character sets can determine what's considered proper. There's a safe workaround for English-speaking feeds, though, and that is to unilaterally switch to UTF-8. UTF-8 contains ASCII as a strict subset. In other words, the first 128 characters of UTF-8 just so happen to be perfectly identical to ASCII. This was to make conversions easy. In order to take advantage of this fact, folks, you have to convert.
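
For anyone who wants to see the mismatch for themselves, here is a minimal Python sketch of the two failures described above. The variable names and the `to_utf8` helper are made up for illustration, and the Windows-1252 guess for the smart quotes is only an assumption, since nobody but the publisher knows what those bytes were meant to be.

```python
# Illustrative sketch; the names here are hypothetical, not from any feed library.

title = "vendredi 27 août"

# UTF-8 stores "û" as two bytes (0xC3 0xBB); the ASCII letters stay one byte each.
utf8_bytes = title.encode("utf-8")

# A reader told "this is ISO-8859-1" turns every byte into one character.
mislabeled = utf8_bytes.decode("iso-8859-1")   # "vendredi 27 aoÃ»t"

# Decoding with the charset that was really used gives the title back.
correct = utf8_bytes.decode("utf-8")           # "vendredi 27 août"

# Plain ASCII text is byte-for-byte identical in all three encodings,
# which is why the damage is subtle on English-only feeds.
plain = "vendredi 27"
ascii_is_subset = (
    plain.encode("ascii") == plain.encode("iso-8859-1") == plain.encode("utf-8")
)

# ASSUMPTION: the smart quote was typed on a Windows-1252 system, which stores
# the right single quote at byte 0x92. ISO-8859-1 maps 0x92 to an invisible
# control character, hence the little boxes.
smart_bytes = "don\u2019t".encode("cp1252")    # b"don\x92t"
boxed = smart_bytes.decode("iso-8859-1")       # "don\x92t"


def to_utf8(raw: bytes, actual_charset: str) -> bytes:
    """Re-encode feed bytes as UTF-8, given the charset they were really written in."""
    return raw.decode(actual_charset).encode("utf-8")


fixed = to_utf8(smart_bytes, "cp1252")         # b"don\xe2\x80\x99t"
```

The catch is the second argument to `to_utf8`: converting only works if you know what the bytes really are, which is exactly the guessing game a feed without a declared charset forces on its readers.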
