I call it "log.pl"
For posterity's sake, I'm posting my Perl script for parsing Apache HTTP server log files here. It was written in an afternoon: about 12 minutes for the basic loop, then a few hours of testing and tweaking to make it useful. I never claimed to be a Perl god, but this script has saved my butt when webalizer wouldn't even compile. The Apache::ParseLog Perl module is better for generic Apache logs; I wrote this because ParseLog would fold on a 1.3GB log file. Since my script reads text line by line from stdin, you can feed it a file of any size. If you so desired, you could split your log into pieces and feed each of them through separately. That's the power of stdin.
Some caveats:
- This script was written in a day: it was supposed to give
reasonably accurate answers quickly, not parse "127.0.0.1: iNvAlId
tExT _$_$_ j00 g0T hAcKeD!!!". It makes some primitive attempts to
skim over invalid log entries, including "newsyslog: log file turned
over"-type messages. For reasons I still don't comprehend, webalizer
barfs when the first line is a newsyslog message.
- This script only reports unique hits (as determined by IP
address) to any resource per day. If you host multiple sites or only
care about stats for certain pages, this script won't help you.
- IP addresses are assumed to be any four integers separated
by three dots. 1.2.3.4 is considered an IP
address. 999.999.999.999 is considered an IP address. 192.168.1.55 is
considered an IP address, but is thrown out so you can't skew your own
popularity. You may just as easily add 127.0.0.1, or 10.0.0.1, or
any other IP you wish to disregard.
- This script only works with the access_log file
format. referer_log and error_log formats are not handled.
- I don't run this script on my webserver. I rsync the log
to a test system and run it from there. There's no error
detection. That probably has something to do with it.
- Positions of log entities are hard-coded: the IP address is always assumed to be in the first position in the line and separated from the rest by a space (' '). The date is assumed to be in the fourth position and preceded by a left square bracket ('['). If you're getting gibberish, check whether a CustomLog directive in httpd.conf defines a non-standard format.
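To make the positional assumptions in the caveats concrete, here is a minimal sketch of pulling the IP address and date out of one access_log line with a single regex. The parse_line name and the %ignore list are hypothetical and not part of log.pl, and the pattern is just as loose about octet ranges as the script is (999.999.999.999 still passes).

```perl
use strict;
use warnings;

# Hypothetical ignore list: addresses whose hits should not be counted.
my %ignore = map { $_ => 1 } ('192.168.1.55', '127.0.0.1');

# Month abbreviations as they appear in access_log timestamps.
my %month;
@month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
    map { sprintf '%02d', $_ } 1 .. 12;

# Returns (ip, "YYYY/MM/DD") for a usable line, or nothing otherwise.
sub parse_line {
    my ($line) = @_;
    # IP in the first field, two fields skipped, then "[DD/Mon/YYYY:"
    # at the start of the fourth field.
    my ($ip, $day, $mon, $year) =
        $line =~ m{^(\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[(\d{2})/([A-Za-z]{3})/(\d{4}):}
        or return;
    return if $ignore{$ip};
    return unless exists $month{$mon};
    return ($ip, "$year/$month{$mon}/$day");
}
```

Lines that don't fit the pattern, such as newsyslog rotation messages, simply come back empty, so the caller can skip them.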
The more I analyze this script the more embarrassed by it I
am. Still, it can be really useful in the right circumstances. Now
that most of you have been scared off, here's the script!
#!/usr/bin/env perl
use strict;
use warnings;
my @in;
my %out;
while (<STDIN>) {
    chomp($_);
    # skip lines that don't start with something shaped like an IPv4 address
    if (!($_ =~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) { next; }
    # skip hits from the local 192.168.1.x network
    if ($_ =~ /^192\.168\.1\./) { next; }
    push(@in,split(/ /,$_));
    # skip truncated lines that have no date field
    if (@in < 4) { undef @in; next; }
    my $ipaddr = $in[0];
    # the fourth field looks like "[10/Oct/2000:13:55:36"; drop the bracket
    $in[3] = substr($in[3],1);
    my @s = split(/\:/,$in[3]);
    my @t = split(/\//,$s[0]);
    if ($t[1] eq "Jan") { $t[1] = "01"; }
    if ($t[1] eq "Feb") { $t[1] = "02"; }
    if ($t[1] eq "Mar") { $t[1] = "03"; }
    if ($t[1] eq "Apr") { $t[1] = "04"; }
    if ($t[1] eq "May") { $t[1] = "05"; }
    if ($t[1] eq "Jun") { $t[1] = "06"; }
    if ($t[1] eq "Jul") { $t[1] = "07"; }
    if ($t[1] eq "Aug") { $t[1] = "08"; }
    if ($t[1] eq "Sep") { $t[1] = "09"; }
    if ($t[1] eq "Oct") { $t[1] = "10"; }
    if ($t[1] eq "Nov") { $t[1] = "11"; }
    if ($t[1] eq "Dec") { $t[1] = "12"; }
    # year/month/day, so the dates sort chronologically as strings
    my $date = "$t[2]\/$t[1]\/$t[0]";
    push(@{ $out{$date} },$ipaddr);
    undef @in;
    undef @s;
    undef @t;
}
for my $i (sort keys %out) {
    # collapse the day's hits into a set of distinct addresses
    my %uniq;
    @uniq{ @{ $out{$i}}} = ();
    my @ipaddrs = sort keys %uniq;
    print "$i: " . scalar(@ipaddrs) . "\n";
}
__END__
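The script above keeps one array entry per hit and only removes duplicates at the end. As a sketch of an alternative (not what log.pl does), a hash of hashes stores each address once per day as it is read, which should use less memory on a large log. The count_unique_per_day name is hypothetical, and the regex assumes the same field positions described in the caveats.

```perl
use strict;
use warnings;

# Month abbreviations as they appear in access_log timestamps.
my %month;
@month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
    map { sprintf '%02d', $_ } 1 .. 12;

# Reads access_log lines from a filehandle and returns a hash ref
# mapping "YYYY/MM/DD" to the number of distinct IPs seen that day.
sub count_unique_per_day {
    my ($fh) = @_;
    my %seen;    # date => { ip => 1 }
    while (my $line = <$fh>) {
        # Lines that don't look like access_log entries are skipped.
        my ($ip, $day, $mon, $year) =
            $line =~ m{^(\d+\.\d+\.\d+\.\d+) \S+ \S+ \[(\d+)/(\w+)/(\d+):}
            or next;
        next unless exists $month{$mon};
        $seen{"$year/$month{$mon}/$day"}{$ip} = 1;
    }
    my %count;
    $count{$_} = scalar keys %{ $seen{$_} } for keys %seen;
    return \%count;
}
```

Calling count_unique_per_day(\*STDIN) and printing the sorted keys with their counts would give the same kind of report as the script, minus the 192.168.1.x filter, which this sketch leaves out.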