HTML Parsing in Perl
I really like Ruby's Hpricot parser, but I figured I'd give Perl another go. I tied one hand behind my back by refusing to install HTML::TreeBuilder, if only because it's not in the default Perl install that comes with Cygwin.
But HTML::Parser is.
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;
use Data::Dumper;

# Called for every start tag; dumps the href of each <a> tag.
sub start_handler {
    return unless shift eq 'a';
    my $h_attr_ref = shift;
    print Data::Dumper->Dump([ $h_attr_ref->{href} ]);
}

my $file = 'index.html';
if (!(-e $file)) {
    die("can't find index file");
}

# api_version 3 lets us name the arguments passed to each handler.
my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => { start => [ \&start_handler, "tagname,attr" ] },
);
$p->parse_file($file);
Works like a charm against CNN.com's home page.
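If you'd rather collect the links than dump them, here's a rough sketch of the same idea working on an in-memory string. The helper name extract_hrefs and the sample markup are my own inventions for illustration, and it skips anchors with no href, which the script above does not.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the href of every <a> tag found in $html.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'a';
                # Skip anchors that carry no href (e.g. <a name="top">).
                push @hrefs, $attr->{href} if defined $attr->{href};
            },
            'tagname,attr',
        ],
    );

    $p->parse($html);
    $p->eof;    # tell the parser the document is complete
    return @hrefs;
}

# Made-up sample markup, not taken from any real page.
my $sample = '<p><a href="/world">World</a> <a name="top">x</a> '
           . '<a href="http://example.com/">Ex</a></p>';
print "$_\n" for extract_hrefs($sample);
```

Returning a list instead of printing inside the handler makes it easier to filter or de-duplicate the links afterwards.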
1 comment:
If your goal is just to extract links, take a look at WWW::Mechanize, which gives you all the links from a page with a single method call.
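For the curious, a minimal sketch of what that suggestion might look like, assuming WWW::Mechanize is installed (it isn't part of core Perl). The helper name page_links is hypothetical, and the demo reads a throwaway local file rather than a live site so it runs without a network connection.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical helper: fetch $url and return the href of every <a> link.
sub page_links {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on fetch errors
    $mech->get($url);
    # find_all_links returns WWW::Mechanize::Link objects; url() is the raw href.
    return map { $_->url } $mech->find_all_links( tag => 'a' );
}

# Demo against a temporary local page (made-up markup).
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} '<a href="/world">World</a> <a href="http://example.com/">Ex</a>';
close $tmp;

print "$_\n" for page_links( URI::file->new( $tmp->filename ) );
```

Pointing page_links at an http:// URL instead should work the same way, at the cost of pulling in a much bigger dependency than HTML::Parser.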