2004-07-24

Doing webstats without Apache::ParseLog

Perl is good. The Apache::ParseLog module is good. But apparently it chokes on Apache logs once they hit 1.3 GB. Screw you, Apache::ParseLog. You fucking suck. After three days of trying to wheedle webstats out of a 1.3GB file using just about every trick I could think of, I was still on square one, with no answers and 1.3 billion questions.

Enter text processing. By stripping the log file down to its barest elements and running it through a gauntlet of awk, sed, uniq, and sort, I eventually got it to a size that Perl, sans Apache module, could handle.
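The exact gauntlet depends on what you want out of the log, but the stripping step might look something like this (a sketch, assuming Apache's common log format; the file names access.log and date-ip.txt are mine, not anything official):

    # Field 1 of a common-format log line is the client IP;
    # field 4 starts with "[dd/Mon/yyyy:...", so grab chars 2-12 for the date.
    # Boil each hit down to a "date IP" pair, then collapse duplicates.
    awk '{ print substr($4, 2, 11), $1 }' access.log \
        | sort \
        | uniq \
        > date-ip.txt

After this, date-ip.txt has one line per distinct date/IP pair, which is a tiny fraction of the original log.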

I'll spare you the details and tell you what I learned. sort can use a mergesort if you give it the -H flag, which works much better on large files. Plus, here's a fun little Perl script:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my %out;
    
    while (<STDIN>) {
        chomp;
        my @a = split / /;    # break the line on single spaces
        next unless @a;       # skip blank lines
        $out{"$a[0]"}++;      # tally the first field
    }
    
    for my $i (keys %out) {
        print "$i: $out{$i}\n";
    }
    
    exit(0);
    

I stole the meat of this little hack from perlmail, a little something I wrote a while back to tell me about new mail whenever I'd log in. Basically, $out{"$a[0]"}++; creates a key in the hash %out named whatever the value of $a[0] is. By incrementing it for each line in which it appears, you get how many times that key occurs in a text file. If the key is a date string, and you are counting IP addresses, well, that's calculating hits to a website, isn't it? Make sure you uniq the text file first. There's a way to do this all in Perl, but uniq is just more convenient.
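Putting it together: assuming a file of stripped-down "date IP" pairs (I'm calling it date-ip.txt; that name is my invention), the same counting trick fits in a one-liner, with the sort/uniq pass done inline first so each IP only counts once per date:

    # sort | uniq collapses duplicate "date IP" pairs, then the Perl
    # tallies the first field ($F[0], the date) exactly like the script
    # above: one count per distinct IP per date.
    sort date-ip.txt | uniq \
        | perl -lane '$out{$F[0]}++; END { print "$_: $out{$_}" for keys %out }'

The -a switch autosplits each line into @F for you, so $F[0] plays the role $a[0] plays in the full script.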

In summary, fuck you Apache::ParseLog.
