2004-09-22

I call it "log.pl"

For posterity's sake, I'm posting my Perl script for parsing Apache HTTP server log files here. This was written in an afternoon: it took about 12 minutes to write the basic loop and a few hours of testing and tweaking to make it useful. I never claimed to be a Perl god, but this script has saved my butt when webalizer wouldn't even compile. The Apache::ParseLog Perl module is better for generic Apache logs. I wrote this because ParseLog would fold on a 1.3GB log file. Since my script reads text from stdin, you can feed it a file of any size. If you so desired, you could split your log into pieces and feed each of them through separately. That's the power of stdin.

Some caveats:

  • This script was written in a day: it was supposed to give reasonably accurate answers quickly, not parse "127.0.0.1: iNvAlId tExT _$_$_ j00 g0T hAcKeD!!!". It makes some primitive attempts to skim over invalid log entries, including "newsyslog: log file turned over"-type messages. For reasons I still don't comprehend, webalizer barfs when the first line is a newsyslog message.

  • This script only reports unique hits (as determined by IP address) to any resource per day. If you host multiple sites or only care about stats for certain pages, this script won't help you.

  • IP addresses are assumed to be any four integers separated by three dots. 1.2.3.4 is considered an IP address. 999.999.999.999 is considered an IP address. 192.168.1.55 is considered an IP address, but it is thrown out, along with every other address beginning with 192.168.1, so you can't skew your own popularity. You may just as easily add 127.0.0.1, or 10.0.0.1, or any other IP you wish to disregard.

  • This script only works with the access_log file format. referer_log and error_log formats are not handled.

  • I don't run this script on my webserver. I rsync the log to a test system and run it from there. There's no error detection. That probably has something to do with it.

  • Positions of log entities are hard-coded: the IP address is always assumed to be in the first position in the line and separated from the rest by a space (' '). The date is assumed to be in the fourth position and preceded by a left square bracket ('['). If you're getting gibberish, check whether a CustomLog directive in httpd.conf is giving you a nonstandard log format.
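
To make those positional assumptions concrete, here is a minimal sketch of how a single line gets taken apart. The sample log line is invented for illustration, and the names $line, @fields, $stamp and %month_num are mine rather than anything from the script.

```perl
use strict;
use warnings;

# Hypothetical access_log line (invented for illustration).
my $line = '203.0.113.7 - - [22/Sep/2004:13:45:10 -0500] "GET /index.html HTTP/1.1" 200 2326';

# "Four integers separated by three dots": no range check,
# so 999.999.999.999 would pass as well.
my $looks_like_ip = $line =~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/;

# Split on single spaces: index 0 is the IP address,
# index 3 is "[dd/Mon/yyyy:hh:mm:ss".
my @fields = split / /, $line;
my $ipaddr = $fields[0];

# Drop the leading '[' and keep only the part before the first ':'.
my ($stamp) = split /:/, substr($fields[3], 1);
my ($day, $mon, $year) = split m{/}, $stamp;

my %month_num = (
    Jan => '01', Feb => '02', Mar => '03', Apr => '04',
    May => '05', Jun => '06', Jul => '07', Aug => '08',
    Sep => '09', Oct => '10', Nov => '11', Dec => '12',
);

# yyyy/mm/dd, so the dates sort chronologically as plain strings.
my $date = "$year/$month_num{$mon}/$day";

print "ip=$ipaddr date=$date\n" if $looks_like_ip;
```

If your log format puts anything ahead of the IP address, or adds a field before the timestamp, index 3 no longer holds the date and everything downstream goes wrong.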

The more I analyze this script the more embarrassed by it I am. Still, it can be really useful in the right circumstances. Now that most of you have been scared off, here's the script!

#!/usr/bin/env perl

use strict;
use warnings;

my @in;     # fields of the current log line
my %out;    # date (yyyy/mm/dd) => list of IP addresses seen that day

while (<STDIN>) {
    chomp($_);

    # Skip anything that doesn't start with an IPv4-looking address:
    # four integers separated by dots, with no range check.
    next unless /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/;
    # Skip hits from the local 192.168.1.x network.
    next if /^192\.168\.1\./;

    push(@in,split(/ /,$_));

    my $ipaddr = $in[0];

    $in[3] = substr($in[3],1);
    my @s = split(/\:/,$in[3]);
    my @t = split(/\//,$s[0]);

    if ($t[1] eq "Jan") { $t[1] = "01"; }
    if ($t[1] eq "Feb") { $t[1] = "02"; }
    if ($t[1] eq "Mar") { $t[1] = "03"; }
    if ($t[1] eq "Apr") { $t[1] = "04"; }
    if ($t[1] eq "May") { $t[1] = "05"; }
    if ($t[1] eq "Jun") { $t[1] = "06"; }
    if ($t[1] eq "Jul") { $t[1] = "07"; }
    if ($t[1] eq "Aug") { $t[1] = "08"; }
    if ($t[1] eq "Sep") { $t[1] = "09"; }
    if ($t[1] eq "Oct") { $t[1] = "10"; }
    if ($t[1] eq "Nov") { $t[1] = "11"; }
    if ($t[1] eq "Dec") { $t[1] = "12"; }

    # yyyy/mm/dd, so the keys of %out sort chronologically
    my $date = "$t[2]/$t[1]/$t[0]";

    push(@{ $out{$date} },$ipaddr);

    undef @in;
    undef @s;
    undef @t;
}

for my $i (sort keys %out) {

    my %uniq;
    @uniq{ @{ $out{$i}}} = ();
    my @ipaddrs = sort keys %uniq;
    
    # number of distinct IP addresses seen on this date
    print "$i: " . scalar(@ipaddrs) . "\n";

    undef %uniq; 
    undef @ipaddrs;
}

__END__
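
The per-day unique count leans on a hash slice: assigning an empty list to a slice creates one key per distinct IP, so duplicates collapse on their own. Here is a minimal sketch of just that step with made-up data. The unique_count helper and the %hits hash are hypothetical names for illustration; %hits is shaped like the script's %out.

```perl
use strict;
use warnings;

# Hypothetical helper: count the distinct values in a list.
sub unique_count {
    my @ips = @_;
    my %uniq;
    @uniq{@ips} = ();           # hash slice: repeated IPs land on the same key
    return scalar keys %uniq;   # number of distinct keys
}

# Made-up per-day lists of IPs.
my %hits = (
    '2004/09/21' => [ '203.0.113.7', '203.0.113.7', '198.51.100.2' ],
    '2004/09/22' => [ '198.51.100.2' ],
);

for my $day (sort keys %hits) {
    print "$day: " . unique_count(@{ $hits{$day} }) . "\n";
}
```

With this data the loop prints 2 for 2004/09/21 and 1 for 2004/09/22.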
