2008-09-06

HTML Parsing in Perl

I really like Ruby's Hpricot parser, but I figured I'd give Perl another go. I tied one hand behind my back by refusing to install HTML::TreeBuilder, if only because it isn't in the default Perl install that comes with Cygwin.

But HTML::Parser is.

#!/usr/bin/env perl

use strict;
use warnings;
use HTML::Parser;
use Data::Dumper;

# Called for every start tag; we only care about <a> elements.
sub start_handler {
  my ($tagname, $h_attr_ref) = @_;
  return unless $tagname eq 'a';
  print Data::Dumper->Dump([ $h_attr_ref->{href} ]);
}

my $file = 'index.html';
die "can't find $file\n" unless -e $file;

my $p = HTML::Parser->new(api_version => 3,
    handlers => { start => [ \&start_handler, "tagname,attr"] });
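# parse_file returns undef (and sets $!) if the file can't be opened.
$p->parse_file($file) or die "can't parse $file: $!\n";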

Works like a charm against a saved copy of CNN.com's home page: it dumps the href of every <a> tag.
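
The script above prints each href through Data::Dumper as it goes. If you'd rather have the links back as a plain list, here is a minimal sketch of one way to do it. The extract_links helper and the sample markup are my own inventions for illustration, not part of the original script. It feeds HTML::Parser a string with parse and eof instead of reading a file, and it skips anchors that have no href.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not in the original script): return the href of
# every <a> tag found in the given HTML string.
sub extract_links {
    my ($html) = @_;
    my @links;

    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => {
            start => [
                sub {
                    my ($tagname, $attr) = @_;
                    return unless $tagname eq 'a';
                    # Skip anchors without an href, e.g. <a name="top">.
                    push @links, $attr->{href} if defined $attr->{href};
                },
                'tagname,attr',
            ],
        },
    );

    $p->parse($html);
    $p->eof;    # signal end of document so buffered input is flushed
    return @links;
}

# Made-up sample markup, for illustration only.
my $html = <<'HTML';
<p><a href="/world">World</a> <a name="top">anchor</a>
<A HREF="http://example.com/">Example</A></p>
HTML

print "$_\n" for extract_links($html);
```

With the sample markup this should print /world and http://example.com/, one per line. By default HTML::Parser lowercases tag and attribute names, so the upper-case <A HREF=...> is picked up as well.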

1 comment:

Andy Lester said...

If your goal is just to extract links, take a look at WWW::Mechanize, which gives you all the links from a page as a handy function call.