Yet Another Reason Why I Like C
A user was complaining about duplicate messages in her inbox, a side effect of the old mbox-style storage format that I'm slowly moving my users away from. The problem is that one cannot simply hash every message in the inbox and remove the duplicates, since every single message, even when identical in every other way to another message or six, has a unique "X-UID:" header.
The fix? Remove the "X-UID:" header, then hash and hack. The first step was splitting each message into its own file. Easy. Use formail:
cd /home/jimsmith
chmod +t /home/jimsmith
sudo -u jimsmith /var/qmail/bin/maildirmake ./Maildir/
formail -I'From ' -s maildir ./Maildir/ < /var/mail/jimsmith
Once there, you can open each file and throw out any line that begins with "X-UID:". It's a little dangerous to remove every single line that matches the pattern, but I'm incredibly confident that the user has never heard "X-UID" before, doesn't use it in conversation, and probably couldn't spell it if asked to do so.
The first thing that came to my mind was to write a Perl script that uses Tie::File. The Tie::File module turns a text file into an array, with the first line at $array[0], the second at $array[1], and so forth. I Googled around for a sane Tie::File example and ultimately hacked up something that looked like this:
#!/usr/bin/env perl
use Tie::File;
my @a;
my $fn = shift @ARGV;
tie (@a, 'Tie::File', $fn) or die "$!";
@a = grep{$_!~/^X-UID:/g;}@a;
untie @a or die "$!";
print $fn . "\n";
And Perl hackers get offended when people say their code looks like a typewriter just sneezed. That grep looks like line noise! Don't even ask me how Python would handle this. I don't know Python's expression handling syntax very well, but I'm guessing it's pretty shitty. Perl was just designed for this kind of thing.
Too bad it was slow as hell.
I couldn't run this code on the mail server itself: I never edit messages in place on the mail server, and besides, it didn't have the Tie::File module installed anyway. So, optimistically, I pushed the messages (140 MB worth, mind you. What was I thinking?!) over to su.bze.ro. su.bze.ro, for those of you who don't know this yet, is an old workstation from 1999 that I salvaged from the scrap heap. It's a Pentium 133MHz with, I think, 32MB of RAM.
su.bze.ro had torn through exactly one message in five seconds. It spent the next five minutes chewing on the second message. I gave up.
I took an old Celeron that wasn't doing anything and I put an OpenBSD snapshot on it. It took about an hour and a half to go through all 528 messages stripping out the "X-UID:" header, but it did get the job done.
Today, another user has the same complaint. Second user has 483 messages. It would take another eternity for my tiny little Perl script to get through them all. There's no way I'm going through this again, so I hack up a C program to do it faster than Perl ever could:
#include "buffer.h"
#include "exit.h"
#include "getln.h"
#include "strerr.h"
#include "stralloc.h"
#define FATAL "un-xuid: fatal: "
void die_nomem(){strerr_die2x(111,FATAL,"out of memory");}
stralloc line = {0};
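int main(void) {
  int match = 1;
  /* Copy stdin to stdout one line at a time, dropping X-UID: lines. */
  while (match) {
    if (getln(buffer_0,&line,&match,'\n') == -1)
      strerr_die2sys(111,FATAL,"unable to read input: ");
    if (!line.len) break; /* end of input */
    if (stralloc_starts(&line,"X-UID:"))
      continue;
    if (buffer_putflush(buffer_1,line.s,line.len) == -1)
      strerr_die2sys(111,FATAL,"unable to write output: ");
  }
  _exit(0); return 0;
}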
This program does not edit the file in place, so it is not a drop-in replacement for the Perl script. Instead of
for i in *; do un-xuid.pl "$i"; done
you have to handle the atomic replacement of each file yourself:
for i in *; do un-xuid < "$i" > "$i.new" && mv "$i.new" "$i"; done
This runs at least an order of magnitude faster: it finished while I was still watching it, rather than later in the day after a lunch hour and some idle web surfing.