2004-07-24

Doing webstats without Apache::ParseLog

Perl is good. The Apache::ParseLog module is good. But apparently it chokes on Apache logs once they hit 1.3 GB. Screw you, Apache::ParseLog. You fucking suck. After three days of trying to wheedle webstats out of a 1.3GB file using just about every trick I could think of, I was still on square one, with no answers and 1.3 billion questions.

Enter text processing. By stripping the log file down to its barest elements and running it through a gauntlet of awk, sed, uniq, and sort, I eventually got it to a size that Perl, sans Apache module, could handle.
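The exact gauntlet depends on what you want out of the log, but the stripping step might look something like this (a sketch, assuming Apache's common log format; the file names access.log and date-ip.txt are mine, not anything official):

    # Field 1 of a common-format log line is the client IP;
    # field 4 starts with "[dd/Mon/yyyy:...", so grab chars 2-12 for the date.
    # Boil each hit down to a "date IP" pair, then collapse duplicates.
    awk '{ print substr($4, 2, 11), $1 }' access.log \
        | sort \
        | uniq \
        > date-ip.txt

After this, date-ip.txt has one line per distinct date/IP pair, which is a tiny fraction of the original log.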

I'll spare you the details and tell you what I learned. sort can use a mergesort if you give it the -H flag, which works much better on large files. Plus, here's a fun little Perl script:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my %out;
    
    while (<STDIN>) {
        chomp;
        my @a = split / /;    # break the line on single spaces
        next unless @a;       # skip blank lines
        $out{"$a[0]"}++;      # tally the first field
    }
    
    for my $i (keys %out) {
        print "$i: $out{$i}\n";
    }
    
    exit(0);
    

I stole the meat of this little hack from perlmail, a little something I wrote a while back to tell me about new mail whenever I'd log in. Basically, $out{"$a[0]"}++; creates a key in the hash %out named whatever the value of $a[0] is. By incrementing it for each line in which it appears, you get how many times that key occurs in a text file. If the key is a date string, and you are counting IP addresses, well, that's calculating hits to a website, isn't it? Make sure you uniq the text file first. There's a way to do this all in Perl, but uniq is just more convenient.
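Putting it together: assuming a file of stripped-down "date IP" pairs (I'm calling it date-ip.txt; that name is my invention), the same counting trick fits in a one-liner, with the sort/uniq pass done inline first so each IP only counts once per date:

    # sort | uniq collapses duplicate "date IP" pairs, then the Perl
    # tallies the first field ($F[0], the date) exactly like the script
    # above: one count per distinct IP per date.
    sort date-ip.txt | uniq \
        | perl -lane '$out{$F[0]}++; END { print "$_: $out{$_}" for keys %out }'

The -a switch autosplits each line into @F for you, so $F[0] plays the role $a[0] plays in the full script.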

In summary, fuck you Apache::ParseLog.
