HTML Parsing in Perl
I really like Ruby's Hpricot parser, but I figured I'd give Perl another go. I tied one hand behind my back by refusing to install HTML::TreeBuilder, if only because it's not in the default Perl install that comes with Cygwin.
But HTML::Parser is.
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;
use Data::Dumper;

# Called for every start tag; receives the tag name and a hashref of attributes.
sub start_handler {
    my ($tagname, $h_attr_ref) = @_;
    return unless $tagname eq 'a';    # only anchors are of interest
    print Data::Dumper->Dump([ $h_attr_ref->{href} ]);
}

my $file = 'index.html';
if (!(-e $file)) { die("can't find index file"); }

# api_version 3 enables the handler/argspec interface used here.
my $p = HTML::Parser->new(api_version => 3,
    handlers => { start => [ \&start_handler, "tagname,attr" ] });
$p->parse_file($file);
Works like a charm against a copy of CNN.com's home page saved as index.html.
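If you'd rather have the links in a list than dumped to the screen, here is a rough sketch of the same idea. It assumes the HTML is already in a string, and the extract_links name and the sample markup are made up for illustration. It skips anchors that have no href, which the script above would print as undef.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the href of every <a> tag in an HTML string.
sub extract_links {
    my ($html) = @_;
    my @links;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'a';
                # Named anchors (<a name="...">) carry no href; skip them.
                push @links, $attr->{href} if defined $attr->{href};
            },
            'tagname,attr',
        ],
    );
    $p->parse($html);
    $p->eof;    # signal end of document so buffered text is flushed
    return @links;
}

# Made-up sample input; tag and attribute names are lowercased by default.
my $sample = '<p><a href="/one">One</a> <a name="x">none</a> '
           . '<A HREF="http://example.com/">Two</A></p>';
print "$_\n" for extract_links($sample);
```

The closure keeps @links private to each call, so the handler doesn't need a global variable.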
1 comment:
If your goal is just to extract links, take a look at WWW::Mechanize, which gives you all the links from a page with a single handy method call.