Doing webstats without Apache::ParseLog
Perl is good. The Apache::ParseLog module is good. But it chokes on Apache logs once they hit 1.3 GB, apparently. Screw you, Apache::ParseLog. You fucking suck. After three days of trying to wheedle webstats out of a 1.3 GB file using just about every trick I could think of, I was still on square one, with no answers and 1.3 billion questions.
Enter text processing. By stripping the log file down to its barest elements and running it through a gauntlet of awk, sed, uniq, and sort, I eventually got it down to a size that Perl, sans Apache module, could handle.
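Reconstructed from memory, one pass through that gauntlet looked something like this (a rough sketch, not the exact incantation: the field positions assume Apache's common log format, where the client IP is field 1 and the bracketed timestamp starts field 4, and access_log is a stand-in name):

# Boil every request down to a "date IP" pair, then collapse duplicates.
# substr($4, 2, 11) peels "10/Oct/2005" out of "[10/Oct/2005:13:55:36".
awk '{ print substr($4, 2, 11), $1 }' access_log | sort | uniq > date-ip.txt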
I'll spare you the rest of the details and tell you what I learned. sort can use a mergesort if you pick the -H flag (that's BSD sort; GNU sort doesn't have it), which is much better when using it on large files. Plus, here's a fun little Perl script:
#!/usr/bin/perl
use strict;
use warnings;

my %out;
while (<STDIN>) {
    chomp;
    # Split on spaces; the first field is the key we're tallying.
    my @a = split / /;
    $out{$a[0]}++;
}
for my $i (keys %out) {
    print "$i: $out{$i}\n";
}
exit(0);
I stole the meat of this little hack from perlmail, a little something I wrote a while back to tell me about new mail whenever I logged in. Basically, $out{$a[0]}++; creates a key in the hash %out named whatever the value of $a[0] is, and increments it for every line where that value shows up, so you end up with a count of how many times each key occurs in the input. If the key is a date string and you're counting IP addresses, well, that's calculating daily hits to a website, isn't it? Make sure you uniq the text file first, though: collapsing duplicate date-IP pairs means each address only counts once per day, so what you're really tallying is unique visitors rather than raw requests. There's a way to do all of this in Perl, but uniq is just more convenient.
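Glued together, the whole run looks roughly like this (count.pl is just my name for the script above saved to a file, and the field positions again assume the common log format):

# Unique date-IP pairs in, "date: count" lines out.
awk '{ print substr($4, 2, 11), $1 }' access_log | sort -H | uniq | perl count.pl

Since keys %out comes back in no particular order, pipe the output through sort one more time if you care about the order of the days.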
In summary: fuck you, Apache::ParseLog.