Continuing with my fever for mashups built with Yahoo Pipes, I reached the point where I decided to write an HTML-to-RSS converter for Globovision. As you may recall, Globovision does not offer an RSS feed for its national news, which is a real shame.
So, with a bit of imagination, I decided to write this program in Perl:
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parser;
use XML::RSS;

my $rss = XML::RSS->new( version => '0.9' );

my $rssFile = "$ENV{HOME}/globovision.rss";

$rss->channel(
    title       => "Globovision.com Venezuelan News",
    link        => "http://globovision.com/",
    description => "Globovision.com news - Brought to you by http://KodeGeek.com"
);

# Be careful with this one as nested elements can be ignored and
# HTML normally is not well formed!
my @ignore_tags = (
    "head",
    "h1",
    "strong",
    "form"
);

my $baseUrl = "http://globovision.com/";

# We are only interested in the news from the first page; as more news comes up it will push older news out
my $newsUrl = $baseUrl . "history.php?cha=1&pag=1";

use constant DEFAULT_TIMEOUT => 180;

# LWP::UserAgent->new takes a plain key/value list, not a hash reference
my $agent = LWP::UserAgent->new(
    agent   => 'GlobovisionHtml2Rss.pl/kodegeek 0.1',
    timeout => DEFAULT_TIMEOUT
);
my $response = $agent->get($newsUrl);
if (! $response->is_success) {
    die sprintf "[ERROR]: Unable to retrieve the HTML from '%s', Status: '%s'", $newsUrl, $response->status_line;
}
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_a, "tagname, attr" ],
    text_h      => [ \&get_headline, "dtext" ]
);
$parser->ignore_tags(@ignore_tags);
my $headlineFlag = 0;
my $currUrl = undef;
$parser->parse($response->decoded_content());
# Signal end of document so any buffered text is flushed
$parser->eof;

$rss->save($rssFile);

# ****** Functions used on the script *******

# Get the headline: the first text seen after a news link becomes the item title
sub get_headline {
    my $headline = $_[0];
    if ($headlineFlag) {
        $rss->add_item( title => $headline, link => $currUrl);
        $headlineFlag = 0;
        $currUrl = undef;
    }
}

# Identify news items
sub start_a {
    my $tagname = $_[0];
    my %attr = %{$_[1]};
    # The dot and question mark are escaped so they match literally;
    # anchors without an href are skipped
    if ( ($tagname eq "a") && defined $attr{href} && ($attr{href} =~ /^news\.php\?nid=\d+/) ) {
        # $baseUrl already ends with a slash
        my $url = $baseUrl . $attr{href};
        $url =~ s/&amp;/&/g;
        $currUrl = $url;
        $headlineFlag = 1;
    }
}
__END__

=head1 NAME

GlobovisionHtml2Rss.pl - Script to convert Globovision.com Venezuela local news from HTML to RSS.

=head1 DESCRIPTION

I use a combination of Yahoo Pipes and Google Reader to keep me updated about news of any kind. However, some websites like
Globovision.com still don't have a proper RSS feed, so one day I decided to create my own mashup :).

=head1 AUTHOR

Jose Vicente Nunez Zuleta (josevnz@kodegeek.com)

=head1 BLOG

KodeGeek - http://kodegeek.com

=head1 LICENSE

GPL

=cut
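To see how the two handlers in the script cooperate without hitting the network, here is a minimal sketch that runs the same idea over an inline HTML string. The sample markup is hypothetical (the post does not show the real Globovision page), and the names @items and $pending_url are mine, used only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample markup; the real page layout is an assumption.
my $html = <<'HTML';
<ul>
<li><a href="news.php?nid=101&amp;cha=1">First headline</a></li>
<li><a href="about.php">About us</a></li>
<li><a href="news.php?nid=102">Second headline</a></li>
</ul>
HTML

my @items;          # collected { title, link } pairs
my $pending_url;    # set by the start handler, consumed by the text handler

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start handler: remember the URL only when the anchor points to a news item
    start_h => [ sub {
        my ($tagname, $attr) = @_;
        return unless $tagname eq 'a' && defined $attr->{href};
        $pending_url = $attr->{href} =~ /^news\.php\?nid=\d+/
            ? "http://globovision.com/$attr->{href}"
            : undef;
    }, 'tagname, attr' ],
    # Text handler: the first non-blank text after a news link is the headline
    text_h => [ sub {
        my ($text) = @_;
        return unless defined $pending_url && $text =~ /\S/;
        push @items, { title => $text, link => $pending_url };
        undef $pending_url;
    }, 'dtext' ],
);
$parser->unbroken_text(1);   # deliver each text run as a single chunk
$parser->parse($html);
$parser->eof;

printf "%s -> %s\n", $_->{title}, $_->{link} for @items;
```

Only the two news.php links should end up in @items; the "About us" link is ignored because its href does not match the pattern.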
The most annoying part of this exercise was installing EXPAT (for processing the XML of the RSS feed) and the module that creates the RSS file (I find it a huge bore to learn what the resulting format looks like).
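For anyone curious about what the module takes off your hands, this is a small sketch of XML::RSS building a feed in memory and printing it, so you can look at the generated format without writing it yourself. The channel and item values are made up for illustration, and example.com is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# Same RSS version the script uses
my $rss = XML::RSS->new( version => '0.9' );

# Placeholder channel data, not the real feed
$rss->channel(
    title       => 'Example feed',
    link        => 'http://example.com/',
    description => 'Hypothetical channel used only for illustration',
);

# One item per headline: a title and the link it points to
$rss->add_item(
    title => 'Example headline',
    link  => 'http://example.com/news.php?nid=1',
);

# as_string returns the XML instead of writing it to disk like save() does
my $xml = $rss->as_string;
print $xml;
```

Printing the string is handy while testing; the script itself calls save() to write the file.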
My hosting provider gladly installed the XML::RSS module. After testing it a bit, here is the Globovisión news for you to enjoy (I plan to update the news reader every 10 minutes so I don't kill my server).
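The post does not say how the 10-minute refresh is scheduled (a cron job would be the usual choice). As one possible safeguard, here is a sketch of a freshness check that could run before the HTTP request, so the feed is never rebuilt more often than that. The helper is_stale and the constant MAX_AGE_SECONDS are hypothetical names, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed refresh interval: 10 minutes, as mentioned in the post
use constant MAX_AGE_SECONDS => 10 * 60;

# Returns 1 when the file is missing or at least $max_age seconds old, 0 otherwise
sub is_stale {
    my ($file, $max_age) = @_;
    return 1 unless -e $file;
    my $mtime = (stat $file)[9];    # last modification time, epoch seconds
    return (time - $mtime) >= $max_age ? 1 : 0;
}

my $rssFile = "$ENV{HOME}/globovision.rss";
if (is_stale($rssFile, MAX_AGE_SECONDS)) {
    print "Feed is missing or older than 10 minutes; regenerate it\n";
} else {
    print "Feed is fresh; skip the HTTP request\n";
}
```

With a guard like this, running the script more often than planned costs nothing, because it only goes to the network when the saved file is old enough.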