{"id":830,"date":"2005-08-13T23:20:00","date_gmt":"2005-08-14T06:20:00","guid":{"rendered":"http:\/\/kodegeek.com\/blog\/?p=830"},"modified":"2005-08-13T23:20:00","modified_gmt":"2005-08-14T06:20:00","slug":"echando-codigo-%c2%bfcomo-hacer-web-scrapping-con-perl","status":"publish","type":"post","link":"http:\/\/kodegeek.com\/blog\/2005\/08\/13\/echando-codigo-%c2%bfcomo-hacer-web-scrapping-con-perl\/","title":{"rendered":"Echando c\u00f3digo: how to do web scraping with Perl"},"content":{"rendered":"<p><img decoding=\"async\" src=\"http:\/\/photos23.flickr.com\/34053239_03b6c1ade7_o.gif\" \/><\/p>\n<p>Web scraping is not a new term. It simply means processing a web page with a programming language: the idea is to strip away the HTML (<span style=\"font-style: italic;\">presentation<\/span>) that we do not care about and then work with the data contained in the page.<\/p>\n<p>This was one of the usual alternatives for developers before the arrival of &#8220;web services&#8221; and XML feeds such as <span style=\"font-style: italic;\">Atom or RSS<\/span>. It is still widely used, because not every website offers its information in a program-friendly form, for a variety of reasons.<\/p>\n<p>Perl shines in this area, because pulling the information out with regular expressions is almost trivial. If the page is complex, however, not even the best regular expression will help.<\/p>\n<p>Let's write a small program that does the following:<\/p>\n<ul>\n<li>Goes to the page listing the 100 most recently updated blogs on <a href=\"http:\/\/www.veneblogs.com\/\"><span style=\"font-weight: bold; font-style: italic;\">VeneBlogs<\/span><\/a> and fetches its content<\/li>\n<li>Prints each blog's <span style=\"font-style: italic;\">title and real URL<\/span>. 
Veneblogs links to each blog through a redirect that hides the real URL (and which they surely use to measure traffic), but we want the actual URL of the site.<\/li>\n<\/ul>\n<p>Sounds interesting? It is a simple application, but it will give you an idea of what can be done with web scraping and Perl. The strategy here is to identify the piece of HTML where our information can be located unambiguously, and then pull it out of there.<\/p>\n<p>Looking at the <a href=\"http:\/\/www.veneblogs.com\/feeds\/ultimos100.php\">URL of the &#8220;top100&#8221;<\/a> on VeneBlogs, we can see that the HTML is very clean and that it is quite easy to identify where each blog's information is:<\/p>\n<style type=\"text\/css\"><!-- .syntax0 { color: #000000; } .syntax1 { color: #cc0000; } .syntax2 { color: #ff8400; } .syntax3 { color: #6600cc; } .syntax4 { color: #cc6600; } .syntax5 { color: #ff0000; } .syntax6 { color: #9966ff; } .syntax7 { background: #ffffcc; color: #ff0066; } .syntax8 { color: #006699; font-weight: bold; } .syntax9 { color: #009966; font-weight: bold; } .syntax10 { color: #0099ff; font-weight: bold; } .syntax11 { color: #66ccff; font-weight: bold; } .syntax12 { color: #02b902; } .syntax13 { color: #ff00cc; } .syntax14 { color: #cc00cc; } .syntax15 { color: #9900cc; } .syntax16 { color: #6600cc; } .syntax17 { color: #0000ff; } .syntax18 { color: #000000; font-weight: bold; } .gutter { background: #dbdbdb; color: #000000; } .gutterH { background: #dbdbdb; color: #990066; } --><br \/><\/style>\n<p><\/p>\n<pre><span class=\"gutter\">   1:<\/span>&lt;div id=\"innercontent\"&gt;<br \/><span class=\"gutter\">   2:<\/span>        &lt;h2&gt;&amp;Uacute;ltimos 100 Blogs Actualizados&lt;\/h2&gt;<br \/><span class=\"gutter\">   3:<\/span>        &lt;p&gt;&lt;strong&gt;NOTA:&lt;\/strong&gt; Todas las horas est\u00e1n en el huso<br \/>horario venezolano (UTC 
- 4.00 h).&lt;\/p&gt;<br \/><span class=\"gutter\">   4:<\/span>        &lt;dl&gt;<br \/><span class=\"gutterH\">   5:<\/span>        &lt;dd&gt;&lt;a title=\"Veneblogs - Directorio de Blogs de Venezuela - Enlace<br \/>Externo hacia Diario de un poeta gris\" href=\"\/?redidblog=1580\" onmouseover=\" window.status='<br \/>http:\/\/poetagris.blogspot.com'; return true\" onmouseout=\"window.status=''; return true\"&gt;<br \/>Diario de un poeta gris&lt;\/a&gt;&lt;\/dd&gt;<br \/><span class=\"gutter\">   6:<\/span>&lt;dt&gt;Fecha de Actualizaci\u00f3n :  08\/13\/05 -- 11:16:45 am&lt;\/dt&gt;<br \/><span class=\"gutter\">   7:<\/span><br \/><span class=\"gutter\">   8:<\/span>&lt;dd&gt;&lt;a <span style=\"font-weight: bold;\">title<\/span>=\"Veneblogs - Directorio de Blogs de Venezuela - Enlace Externo<br \/>hacia El Tecnorrante\" href=\"\/?redidblog=584\" <span style=\"font-weight: bold;\">onmouseover<\/span>=\" window.status='<br \/>http:\/\/atorrante.blogspot.com'; return true\" onmouseout=\"window.status=''; return true\"<br \/>&gt;El Tecnorrante&lt;\/a&gt;&lt;\/dd&gt;<br \/><span class=\"gutter\">   9:<\/span>&lt;dt&gt;Fecha de Actualizaci\u00f3n :  08\/13\/05 -- 11:06:52 am&lt;\/dt&gt;<br \/><span class=\"gutterH\">  10:<\/span>&lt;dd&gt;&lt;a title=\"Veneblogs - Directorio de Blogs de Venezuela - Enlace Externo<br \/>hacia MagicVenezuela.com\" href=\"\/?redidblog=2660\" onmouseover=\" window.status='<br \/>http:\/\/www.magicvenezuela.com'; return true\" onmouseout=\"window.status=''; return true\"<br \/>&gt;MagicVenezuela.com&lt;\/a&gt;&lt;\/dd&gt;<br \/><span class=\"gutter\">  11:<\/span>&lt;dt&gt;Fecha de Actualizaci\u00f3n :  08\/13\/05 -- 10:55:55 am&lt;\/dt&gt;<br \/><\/pre>\n<p>Which tools do we need in Perl? We will use an HTML parser called <span style=\"font-style: italic;\">HTML::Parser<\/span>. 
In my case it was not installed on Fedora Core 4, so let's install it (I will use CPAN):<\/p>\n<blockquote><p>perl -MCPAN -e 'install HTML::Parser'<\/p><\/blockquote>\n<p>The other module we need is LWP::UserAgent, which is what lets us download the URL. In my case it was already installed on Fedora Core 4; if you do not have it, repeat the previous step to install it from CPAN.<\/p>\n<p>The technique has its problems: if the HTML is complicated, badly written, or changes frequently, extracting data this way can be a nightmare (there is no free lunch). Anyway, here is the code:<\/p>\n<style type=\"text\/css\"><!-- .syntax0 { color: #000000; } .syntax1 { color: #cc0000; } .syntax2 { color: #ff8400; } .syntax3 { color: #6600cc; } .syntax4 { color: #cc6600; } .syntax5 { color: #ff0000; } .syntax6 { color: #9966ff; } .syntax7 { background: #ffffcc; color: #ff0066; } .syntax8 { color: #006699; font-weight: bold; } .syntax9 { color: #009966; font-weight: bold; } .syntax10 { color: #0099ff; font-weight: bold; } .syntax11 { color: #66ccff; font-weight: bold; } .syntax12 { color: #02b902; } .syntax13 { color: #ff00cc; } .syntax14 { color: #cc00cc; } .syntax15 { color: #9900cc; } .syntax16 { color: #6600cc; } .syntax17 { color: #0000ff; } .syntax18 { color: #000000; font-weight: bold; } .gutter { background: #dbdbdb; color: #000000; } .gutterH { background: #dbdbdb; color: #990066; } --><br \/><\/style>\n<p><\/p>\n<pre><span class=\"gutter\">   1:<\/span><span class=\"syntax1\">#<\/span><span class=\"syntax1\">!\/usr\/bin\/perl<\/span><br \/><span class=\"gutter\">   2:<\/span><br \/><span class=\"gutter\">   3:<\/span><span class=\"syntax8\">use<\/span> strict;<br \/><span class=\"gutter\">   4:<\/span><span class=\"syntax8\">use<\/span> LWP<span class=\"syntax18\">:<\/span><span class=\"syntax18\">:<\/span>UserAgent;<br \/><span class=\"gutterH\">   5:<\/span><span class=\"syntax8\">use<\/span> HTML<span class=\"syntax18\">:<\/span><span class=\"syntax18\">:<\/span>Parser;<br \/><span 
class=\"gutter\">   6:<\/span><span class=\"syntax1\">#<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">Be<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">careful<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">with<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">this<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">one<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">as<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">nested<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">elements<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">can<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">be<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">ignored<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">and<\/span><br \/><span class=\"gutter\">   7:<\/span><span class=\"syntax1\">#<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">HTML<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">normally<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">is<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">not<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">well<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">formed!<\/span><br \/><span class=\"gutter\">   8:<\/span><span class=\"syntax8\">my<\/span> <span class=\"syntax9\">@ignore_tags<\/span> <span class=\"syntax18\">=<\/span> (<br \/><span class=\"gutter\">   9:<\/span>        <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">head<\/span><span class=\"syntax13\">\"<\/span>,<br \/><span class=\"gutterH\">  10:<\/span>        <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">h1<\/span><span class=\"syntax13\">\"<\/span>,<br \/><span class=\"gutter\">  11:<\/span>        <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">strong<\/span><span 
class=\"syntax13\">\"<\/span>,<br \/><span class=\"gutter\">  12:<\/span>        <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">form<\/span><span class=\"syntax13\">\"<\/span><br \/><span class=\"gutter\">  13:<\/span>   );<br \/><span class=\"gutter\">  14:<\/span><span class=\"syntax8\">my<\/span> (<span class=\"syntax9\">$title<\/span>, <span class=\"syntax9\">$url<\/span>);<br \/><span class=\"gutterH\">  15:<\/span><br \/><span class=\"gutter\">  16:<\/span><span class=\"syntax1\">#<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">Define<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">the<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">\"format\"<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">header<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">we<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">will<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">use<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">for<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">printing:<\/span><br \/><span class=\"gutter\">  17:<\/span><span class=\"syntax10\">format<\/span> STDOUT_TOP <span class=\"syntax18\">=<\/span><br \/><span class=\"gutter\">  18:<\/span>@<span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span 
class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span> @<span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span 
class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><br \/><span class=\"gutter\">  19:<\/span><span class=\"syntax13\">\"<\/span><span class=\"syntax13\">Blog<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">title<\/span><span class=\"syntax13\">:<\/span><span class=\"syntax13\">\"<\/span>,                                                   <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">URL<\/span><span class=\"syntax13\">:<\/span><span class=\"syntax13\">\"<\/span><br \/><span class=\"gutterH\">  20:<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span 
class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span> <span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span 
class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">-<\/span><br \/><span class=\"gutter\">  21:<\/span>.<br \/><span class=\"gutter\">  22:<\/span><span class=\"syntax10\">format<\/span> STDOUT <span class=\"syntax18\">=<\/span><br \/><span class=\"gutter\">  23:<\/span>@<span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span 
class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span> @<span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span 
class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><span class=\"syntax18\">&lt;<\/span><br \/><span class=\"gutter\">  24:<\/span><span class=\"syntax9\">$title<\/span>,                                                          <span class=\"syntax9\">$url<\/span><br \/><span class=\"gutterH\">  25:<\/span>.<br \/><span class=\"gutter\">  26:<\/span><br \/><span class=\"gutter\">  27:<\/span><span class=\"syntax8\">use<\/span> constant URL_TOP100_VENEBLOGS<br \/><span class=\"gutter\">  28:<\/span>        <span class=\"syntax18\">=<\/span><span class=\"syntax18\">&gt;<\/span> <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">http<\/span><span class=\"syntax13\">:<\/span><span class=\"syntax13\">\/<\/span><span class=\"syntax13\">\/<\/span><span class=\"syntax13\">www<\/span><span class=\"syntax13\">.<\/span><span class=\"syntax13\">veneblogs<\/span><span class=\"syntax13\">.<\/span><span class=\"syntax13\">com<\/span><span class=\"syntax13\">\/<\/span><span class=\"syntax13\">feeds<\/span><span class=\"syntax13\">\/<\/span><span class=\"syntax13\">ultimos100<\/span><span class=\"syntax13\">.<\/span><span class=\"syntax13\">php<\/span><span class=\"syntax13\">\"<\/span>;<br \/><span class=\"gutter\">  29:<\/span><span class=\"syntax8\">use<\/span> constant DEFAULT_TIMEOUT<br \/><span class=\"gutterH\">  30:<\/span>        <span class=\"syntax18\">=<\/span><span class=\"syntax18\">&gt;<\/span> <span class=\"syntax5\">180<\/span>;<br \/><span class=\"gutter\">  31:<\/span><br \/><span class=\"gutter\">  32:<\/span><span 
class=\"syntax8\">my<\/span> <span class=\"syntax9\">$agent<\/span> <span class=\"syntax18\">=<\/span> LWP<span class=\"syntax18\">:<\/span><span class=\"syntax18\">:<\/span>UserAgent<span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span><span class=\"syntax8\">new<\/span>(<br \/><span class=\"gutter\">  33:<\/span>                agent <span class=\"syntax18\">=<\/span><span class=\"syntax18\">&gt;<\/span> <span class=\"syntax13\">'<\/span><span class=\"syntax13\">get_veneblogs_top100.plx\/kodegeek<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">0.1<\/span><span class=\"syntax13\">'<\/span>,<br \/><span class=\"gutter\">  34:<\/span>                timeout <span class=\"syntax18\">=<\/span><span class=\"syntax18\">&gt;<\/span> DEFAULT_TIMEOUT<br \/><span class=\"gutterH\">  35:<\/span>            );<br \/><span class=\"gutter\">  36:<\/span><span class=\"syntax8\">my<\/span> <span class=\"syntax9\">$response<\/span> <span class=\"syntax18\">=<\/span> <span class=\"syntax9\">$agent<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span>get(URL_TOP100_VENEBLOGS);<br \/><span class=\"gutter\">  37:<\/span><span class=\"syntax8\">if<\/span> (<span class=\"syntax18\">!<\/span> <span class=\"syntax9\">$response<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span>is_success) <span class=\"syntax18\">{<\/span><br \/><span class=\"gutter\">  38:<\/span>        <span class=\"syntax8\">die<\/span>     <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">[<\/span><span class=\"syntax13\">ERROR<\/span><span class=\"syntax13\">]<\/span><span class=\"syntax13\">:<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">Unable<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">to<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">retrieve<\/span><span class=\"syntax13\"> <\/span><span 
class=\"syntax13\">the<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">HTML<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">from<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">'<\/span><span class=\"syntax13\">\"<\/span> . URL_TOP100_VENEBLOGS .<br \/><span class=\"gutter\">  39:<\/span>                <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">'<\/span><span class=\"syntax13\">,<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">Status<\/span><span class=\"syntax13\">:<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">'<\/span><span class=\"syntax13\">\"<\/span> . <span class=\"syntax9\">$response<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span>status_line . <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">'<\/span><span class=\"syntax13\">\"<\/span>;<br \/><span class=\"gutterH\">  40:<\/span><span class=\"syntax18\">}<\/span><br \/><span class=\"gutter\">  41:<\/span><span class=\"syntax8\">my<\/span> <span class=\"syntax9\">$parser<\/span> <span class=\"syntax18\">=<\/span> HTML<span class=\"syntax18\">:<\/span><span class=\"syntax18\">:<\/span>Parser<span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span><span class=\"syntax8\">new<\/span>(<br \/><span class=\"gutter\">  42:<\/span>                        api_version <span class=\"syntax18\">=<\/span><span class=\"syntax18\">&gt;<\/span> <span class=\"syntax5\">3<\/span>,<br \/><span class=\"gutter\">  43:<\/span>                        start_h <span class=\"syntax18\">=<\/span><span class=\"syntax18\">&gt;<\/span> [ \\&start_a, <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">tagname<\/span><span class=\"syntax13\">,<\/span><span class=\"syntax13\"> <\/span><span class=\"syntax13\">attr<\/span><span class=\"syntax13\">\"<\/span> ]<br \/><span class=\"gutter\">  44:<\/span>                 );<br \/><span class=\"gutterH\">  45:<\/span><span 
class=\"syntax9\">$parser<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span>ignore_tags(<span class=\"syntax9\">@ignore_tags<\/span>);<br \/><span class=\"gutter\">  46:<\/span><span class=\"syntax9\">$parser<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span>parse(<span class=\"syntax9\">$response<\/span><span class=\"syntax18\">-<\/span><span class=\"syntax18\">&gt;<\/span>decoded_content());<br \/><span class=\"gutter\">  47:<\/span><br \/><span class=\"gutter\">  48:<\/span><span class=\"syntax1\">#<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">******<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">Functions<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">used<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">on<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">the<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">script<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">*******<\/span><br \/><span class=\"gutter\">  49:<\/span><span class=\"syntax1\">#<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">Callback<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">function<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">used<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">to<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">print<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">the<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">contents<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">of<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">the<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">HTML<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">file<\/span><br \/><span class=\"gutterH\">  50:<\/span><span class=\"syntax8\">sub<\/span> start_a <span 
class=\"syntax18\">{<\/span><br \/><span class=\"gutter\">  51:<\/span>        <span class=\"syntax8\">my<\/span> <span class=\"syntax9\">$tagname<\/span> <span class=\"syntax18\">=<\/span> <span class=\"syntax9\">$_<\/span>[<span class=\"syntax5\">0<\/span>];<br \/><span class=\"gutter\">  52:<\/span>        <span class=\"syntax8\">my<\/span> <span class=\"syntax9\">%attr<\/span> <span class=\"syntax18\">=<\/span> <span class=\"syntax9\">%{<\/span><span class=\"syntax9\">$<\/span><span class=\"syntax9\">_<\/span><span class=\"syntax9\">[<\/span><span class=\"syntax9\">1<\/span><span class=\"syntax9\">]<\/span><span class=\"syntax9\">}<\/span>;<br \/><span class=\"gutter\">  53:<\/span>        <span class=\"syntax8\">if<\/span> ( (<span class=\"syntax9\">$tagname<\/span> <span class=\"syntax18\">eq<\/span> <span class=\"syntax13\">\"<\/span><span class=\"syntax13\">a<\/span><span class=\"syntax13\">\"<\/span>) <span class=\"syntax18\">&<\/span><span class=\"syntax18\">&amp;<\/span> (<span class=\"syntax9\">$attr<\/span><span class=\"syntax18\">{<\/span>title<span class=\"syntax18\">}<\/span> <span class=\"syntax18\">=<\/span><span class=\"syntax18\">~<\/span> <span class=\"syntax17\">m#veneblogs#i<\/span>) ) <span class=\"syntax18\">{<\/span><br \/><span class=\"gutter\">  54:<\/span>                <span class=\"syntax9\">$attr<\/span><span class=\"syntax18\">{<\/span>title<span class=\"syntax18\">}<\/span> <span class=\"syntax18\">=<\/span><span class=\"syntax18\">~<\/span> <span class=\"syntax17\">m#enlace<\/span><span class=\"syntax17\"> <\/span><span class=\"syntax17\">externo<\/span><span class=\"syntax17\"> <\/span><span class=\"syntax17\">hacia<\/span><span class=\"syntax17\"> <\/span><span class=\"syntax17\">(.*)#i<\/span>;<br \/><span class=\"gutterH\">  55:<\/span>                <span class=\"syntax9\">$title<\/span> <span class=\"syntax18\">=<\/span> <span class=\"syntax9\">$1<\/span>;<br \/><span class=\"gutter\">  56:<\/span>                <span 
class=\"syntax9\">$attr<\/span><span class=\"syntax18\">{<\/span>onmouseover<span class=\"syntax18\">}<\/span> <span class=\"syntax18\">=<\/span><span class=\"syntax18\">~<\/span> <span class=\"syntax17\">m#http:\/\/(.*)';#<\/span>;<br \/><span class=\"gutter\">  57:<\/span>                <span class=\"syntax9\">$url<\/span> <span class=\"syntax18\">=<\/span> <span class=\"syntax9\">$1<\/span>;<br \/><span class=\"gutter\">  58:<\/span>                <span class=\"syntax8\">if<\/span> ( (<span class=\"syntax10\">length<\/span>(<span class=\"syntax9\">$title<\/span>) <span class=\"syntax18\">=<\/span><span class=\"syntax18\">=<\/span> <span class=\"syntax5\">0<\/span>) <span class=\"syntax18\">|<\/span><span class=\"syntax18\">|<\/span> (<span class=\"syntax10\">length<\/span>(<span class=\"syntax9\">$url<\/span>) <span class=\"syntax18\">=<\/span><span class=\"syntax18\">=<\/span> <span class=\"syntax5\">0<\/span>) ) <span class=\"syntax18\">{<\/span><br \/><span class=\"gutter\">  59:<\/span>                        <span class=\"syntax8\">return<\/span>;<br \/><span class=\"gutterH\">  60:<\/span>                <span class=\"syntax18\">}<\/span><br \/><span class=\"gutter\">  61:<\/span>                <span class=\"syntax1\">#<\/span><span class=\"syntax1\">printf<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">\"%s,<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">%s\\n\",<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">$title,<\/span><span class=\"syntax1\"> <\/span><span class=\"syntax1\">$url;<\/span><br \/><span class=\"gutter\">  62:<\/span>                <span class=\"syntax10\">write<\/span>;<br \/><span class=\"gutter\">  63:<\/span>        <span class=\"syntax18\">}<\/span><br \/><span class=\"gutter\">  64:<\/span><span class=\"syntax18\">}<\/span><br \/><span class=\"gutterH\">  65:<\/span>__END__<br \/><span class=\"gutter\">  66:<\/span><span class=\"syntax12\">=head1<\/span><span class=\"syntax2\"> 
<\/span><span class=\"syntax2\">NAME<\/span><br \/><span class=\"gutter\">  67:<\/span><br \/><span class=\"gutter\">  68:<\/span><span class=\"syntax2\">get_veneblogs_top100<\/span><span class=\"syntax2\">.<\/span><span class=\"syntax2\">plx<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">-<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Get<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">the<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">VeneBlogs<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Top<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">100<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">list<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">as<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">text<\/span><span class=\"syntax2\">.<\/span><br \/><span class=\"gutter\">  69:<\/span><br \/><span class=\"gutterH\">  70:<\/span><span class=\"syntax12\">=<\/span><span class=\"syntax12\">head1<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">DESCRIPTION<\/span><br \/><span class=\"gutter\">  71:<\/span><br \/><span class=\"gutter\">  72:<\/span><span class=\"syntax2\">This<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">script<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">will<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">download<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">and<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">show<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">the<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Blog<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">name<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">and<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">the<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">URL<\/span><span class=\"syntax2\"> <\/span><span 
class=\"syntax2\">for<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">the<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">\"<\/span><span class=\"syntax2\">100<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">TOP<\/span><span class=\"syntax2\">\"<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Blogs<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">of<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Veneblogs<\/span><span class=\"syntax2\">;<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Each<\/span><br \/><span class=\"gutter\">  73:<\/span><span class=\"syntax2\">blog<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">title<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">will<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">be<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">shown<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">along<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">with<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">its<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">real<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">URL<\/span><span class=\"syntax2\">.<\/span><br \/><span class=\"gutter\">  74:<\/span><br \/><span class=\"gutterH\">  75:<\/span><span class=\"syntax12\">=<\/span><span class=\"syntax12\">head1<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">AUTHOR<\/span><br \/><span class=\"gutter\">  76:<\/span><br \/><span class=\"gutter\">  77:<\/span><span class=\"syntax2\">Jose<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Vicente<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Nunez<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">Zuleta<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">(<\/span><span 
class=\"syntax2\">josevnz<\/span><span class=\"syntax2\">@<\/span><span class=\"syntax2\">yahoo<\/span><span class=\"syntax2\">.<\/span><span class=\"syntax2\">com<\/span><span class=\"syntax2\">)<\/span><br \/><span class=\"gutter\">  78:<\/span><br \/><span class=\"gutter\">  79:<\/span><span class=\"syntax12\">=<\/span><span class=\"syntax12\">head1<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">BLOG<\/span><br \/><span class=\"gutterH\">  80:<\/span><br \/><span class=\"gutter\">  81:<\/span><span class=\"syntax2\">KodeGeek<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">-<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">http<\/span><span class=\"syntax2\">:<\/span><span class=\"syntax2\">\/<\/span><span class=\"syntax2\">\/<\/span><span class=\"syntax2\">kodegeek<\/span><span class=\"syntax2\">.<\/span><span class=\"syntax2\">com<\/span><br \/><span class=\"gutter\">  82:<\/span><br \/><span class=\"gutter\">  83:<\/span><span class=\"syntax12\">=<\/span><span class=\"syntax12\">head1<\/span><span class=\"syntax2\"> <\/span><span class=\"syntax2\">LICENSE<\/span><br \/><span class=\"gutter\">  84:<\/span><br \/><span class=\"gutterH\">  85:<\/span><span class=\"syntax2\">GPL<\/span><br \/><span class=\"gutter\">  86:<\/span><br \/><span class=\"gutter\">  87:<\/span><span class=\"syntax12\">=<\/span><span class=\"syntax12\">cut<\/span><br \/><\/pre>\n<p>Sample output:<\/p>\n<pre><br \/>Blog title:                                                      URL:<br \/>---------------------------------------------------------------- ----------------------------------<br \/>>> Rozanel                                                       www.rozanel.blogspot.com<br \/>Brea...                                                          
www.carlosbrea.com<br \/>El Blog De Juancho                                               juan-casanas.blogspot.com<br \/>Hedonista reprimida en busca del nirvana                         canelita.blogspot.com<br \/>rubenologia.net                                                  www.rubenologia.net<br \/>BlogaCine | Blog Venezolano sobre Cine                           www.blogacine.com<br \/>quieto                                                           quieto.motocine.com\/<br \/>La Taguarita                                                     taguarita.blogspot.com<br \/>unocontodo light                                                 unocontodo.blogspot.com\/<br \/>Da Vinci's Element. Artes & Tecnologia                           dvinci.outnloud.net<br \/><\/pre>\n<p>A few things in the code may look obscure, among them:<\/p>\n<ul>\n<li>Regular expressions, used to finish extracting the values we want from the &#8216;<span style=\"font-weight: bold;\">title<\/span>&#8217; and &#8216;<span style=\"font-weight: bold;\">onmouseover<\/span>&#8217; attributes of the &#8216;<span style=\"font-weight: bold;\">a<\/span>&#8217; tag in the HTML.<\/li>\n<li>Perl report formats, used to produce formatted output.<\/li>\n<\/ul>\n<p>In short, it's time to study ;). I recommend reading the following man pages: <span style=\"font-style: italic;\">LWP::UserAgent, HTML::Parser, perlre, perlform<\/span>.<\/p>\n<p>In an upcoming article I'll show how to do the same thing <span style=\"font-style: italic;\">but with Java <\/span>and a somewhat less traditional approach (which can also be applied in Perl, but I felt like playing with Java this time :D). 
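<\/p>\n<p>As a minimal sketch of the first point in the list above (pulling the real URL out of the &#8216;onmouseover&#8217; attribute with a regular expression), here is a small standalone Perl snippet. The attribute values and the helper name real_url are assumptions made up for illustration, not the actual VeneBlogs markup or code taken from the script, and it prints with sprintf instead of a Perl format:<\/p>\n
```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical attribute values, shaped like what the post describes:
# the real URL is assumed to sit inside the onmouseover handler.
my %attr = (
    title       => 'Rozanel',
    onmouseover => q{window.status='http://www.rozanel.blogspot.com'; return true;},
);

# Return whatever follows 'http://' up to the closing quote and
# semicolon, or undef when the handler is missing or does not match.
sub real_url {
    my ($handler) = @_;
    return unless defined $handler;
    return $handler =~ m{http://(.*?)';} ? $1 : undef;
}

my $url = real_url($attr{onmouseover});

# Print only when both values are present.
if (defined $url and length $attr{title}) {
    say sprintf('%-40s %s', $attr{title}, $url);
}
```
<p>Returning undef on a failed match keeps a stale $1 from an earlier successful match out of the output.<\/p>\n<p>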
You can <a href=\"http:\/\/prdownloads.sourceforge.net\/elangelnegro\/get_veneblogs_top100.plx?download\">download the code here<\/a>.<\/p>\n<p>Search on Technorati: <a href=\"http:\/\/technorati.com\/tag\/perl\" rel=\"tag\">perl<\/a>, <a href=\"http:\/\/technorati.com\/tag\/perl\" rel=\"cpan\">cpan<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web scraping is not a new term; it simply means processing a web page with a programming language. The idea is to strip out the HTML (presentation) we don't care about and then process the data contained in the page. It was typically one of the alternatives developers used before the arrival of <a class=\"read-more\" href=\"http:\/\/kodegeek.com\/blog\/2005\/08\/13\/echando-codigo-%c2%bfcomo-hacer-web-scrapping-con-perl\/\">[&hellip;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/posts\/830"}],"collection":[{"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/comments?post=830"}],"version-history":[{"count":0,"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/posts\/830\/revisions"}],"wp:attachment":[{"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/media?parent=830"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/categories?post=830"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/kodegeek.com\/blog\/wp-json\/wp\/v2\/tags?post=830"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\
/{rel}","templated":true}]}}