Continuing with my fever for mashups built with Yahoo Pipes, I reached the point where I decided to write an HTML-to-RSS converter for Globovision. As you may recall, Globovision does not offer an RSS feed for its national news, which is a real shame.
So, with a little imagination, I decided to write this program in Perl:
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parser;
use XML::RSS;

my $rss = XML::RSS->new( version => '0.9' );

my $rssFile = "$ENV{HOME}/globovision.rss";

$rss->channel(
    title       => "Globovision.com Venezuelan News",
    link        => "http://globovision.com/",
    description => "Globovision.com news - Brought to you by http://KodeGeek.com",
);

# Be careful with this one, as nested elements can be ignored and
# HTML normally is not well formed!
my @ignore_tags = (
    "head",
    "h1",
    "strong",
    "form",
);

my $baseUrl = "http://globovision.com/";

# We are only interested in the news from the first page;
# as more news comes in, it pushes the older items out.
my $newsUrl = $baseUrl . "history.php?cha=1&pag=1";

use constant DEFAULT_TIMEOUT => 180;

# LWP::UserAgent->new expects a key/value list, not a hash reference
my $agent = LWP::UserAgent->new(
    agent   => 'GlobovisionHtml2Rss.pl/kodegeek 0.1',
    timeout => DEFAULT_TIMEOUT,
);
my $response = $agent->get($newsUrl);
if ( !$response->is_success ) {
    die sprintf "[ERROR]: Unable to retrieve the HTML from '%s', Status: '%s'\n",
        $newsUrl, $response->status_line;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_a,      "tagname, attr" ],
    text_h      => [ \&get_headline, "dtext" ],
);
$parser->ignore_tags(@ignore_tags);

# State shared by the two handlers below
my $headlineFlag = 0;
my $currUrl      = undef;

$parser->parse( $response->decoded_content() );
$parser->eof;    # flush any text still buffered by the parser

$rss->save($rssFile);

# ****** Functions used in the script *******

# Get the headline: the first text chunk after a news link becomes the title
sub get_headline {
    my ($headline) = @_;
    if ($headlineFlag) {
        $rss->add_item( title => $headline, link => $currUrl );
        $headlineFlag = 0;
        $currUrl      = undef;
    }
}

# Identify news items: anchors whose href points to news.php?nid=<number>
sub start_a {
    my ( $tagname, $attr ) = @_;
    my $href = $attr->{href};
    if ( $tagname eq "a" && defined $href && $href =~ /^news\.php\?nid=\d+/ ) {
        # $baseUrl already ends with a slash
        my $url = $baseUrl . $href;
        $url =~ s/&amp;/&/g;
        $currUrl      = $url;
        $headlineFlag = 1;
    }
}

__END__

=head1 NAME

GlobovisionHtml2Rss.pl - Script to convert Globovision.com Venezuela
local news from HTML to RSS.

=head1 DESCRIPTION

I use a combination of Yahoo Pipes and Google Reader to keep me
updated about news of any kind. However, some websites like
Globovision.com still don't have a proper RSS feed, so one day
I decided to create my own mashup :).

=head1 AUTHOR

Jose Vicente Nunez Zuleta (josevnz@kodegeek.com)

=head1 BLOG

KodeGeek - http://kodegeek.com

=head1 LICENSE

GPL

=cut
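The link-matching step in start_a can be tried on its own, without touching the network or installing any module. What follows is only a minimal sketch: it assumes the news links look like news.php?nid=<number>, and the helper name news_url and the sample hrefs are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same base URL as the script; note that it already ends with a slash.
my $base_url = "http://globovision.com/";

# Hypothetical helper: returns the absolute URL for a news link,
# or nothing (undef in scalar context) when the href is not a news item.
sub news_url {
    my ($href) = @_;
    return unless defined $href;
    return unless $href =~ /^news\.php\?nid=\d+/;
    my $url = $base_url . $href;
    $url =~ s/&amp;/&/g;    # undo HTML-escaped ampersands, if any are left
    return $url;
}

# Made-up sample hrefs, for illustration only.
for my $href ( "news.php?nid=123&amp;cha=1", "history.php?cha=1&pag=2", "index.php" ) {
    my $url = news_url($href);
    printf "%-28s => %s\n", $href, defined $url ? $url : "(skipped)";
}
```

Escaping the dot and the question mark in the pattern keeps the match limited to the literal news.php?nid= prefix, so unrelated links on the page are skipped.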
The most tedious part of this exercise was installing EXPAT (for processing the XML of the RSS feed) and the module that creates the RSS file (I find it a huge pain to learn what the resulting format looks like).
My hosting provider gladly installed the XML::RSS module. After testing it a bit, here is the Globovisión news for you to enjoy (I plan to update the news reader every 10 minutes so as not to kill my server).