Parsing XML with Perl



TaterSalad
10-05-2009, 06:49 PM
Hello,

First off, I am by no means a seasoned Perl coder, but I am hoping you guys can point me in the right direction. I am trying to parse an XML file from an RSS feed, extract the link from a certain item, and have a subroutine return that link as a string. For example, my XML looks something like this:



<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
<title>RSS Feed Example</title>
<link>http://www.rss.com</link>
<language>en-usde</language>
<copyright>Copyright © 2008</copyright>
<webMaster>support@rss.com</webMaster>
<image>
<title>RSS Feed</title>
<url>http://www.rss.com/favicon.ico</url>
<link>http://www.rss.com</link>
<width>16</width>
<height>16</height>
</image>
<item>
<title>String.To.Match</title>
<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
<link>http://www.rss.com/download.php?id=178283</link>
<guid>http://www.rss.com/download.php?id=178283</guid>
</item>
<item>
<title>String.NOT.To.Match</title>
<pubDate>Fri, 02 Jan 1970 00:00:00 +0000</pubDate>
<link>http://www.rss.com/download.php?id=178284</link>
<guid>http://www.rss.com/download.php?id=178284</guid>
</item>
</channel>
</rss>


So I want the script to find the item titled "String.To.Match", then have the routine return the string in <link> for that item. Any help would be appreciated. I just need to be pointed in the right direction; I am not expecting someone to write the code for me. (Feel free though, ha!)

Thanks!

oesxyl
10-05-2009, 07:21 PM
you can use XML::LibXML



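use strict;
use warnings;
use XML::LibXML;

# path to the saved RSS file, taken from the command line
my $rssfilename = shift or die "usage: $0 feed.xml\n";

my $parser = XML::LibXML->new();
my $feed = $parser->parse_file($rssfilename);
# every <link> element anywhere in the document
my $links = $feed->findnodes('//link');
foreach my $link ($links->get_nodelist){
# textContent returns the text inside the element
print $link->textContent,"\n";
}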

It's not tested.

There are many ways to do that; this is only one of them.

I focused on parsing and extracting because searching is easy once you know that.
best regards

TaterSalad
10-05-2009, 07:33 PM
Thanks for the speedy reply, oesxyl!

Correct me if I am wrong, but wouldn't that display all <link> nodes? How would I filter only <link> nodes that have "String.To.Match" in <item>?

oesxyl
10-05-2009, 07:41 PM
Yes, you are right, this will display all link nodes.
item is a container.
In which element do you want to search?
For example, if you search for a given string in title you can use


//item[contains(title,$mystring)]/link

instead of //link, where $mystring stands for the text you are searching for. You can use any valid XPath expression.
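
A minimal sketch of how that filter could be wrapped in a subroutine that returns the link as a string. The helper name link_for_title and the cut-down sample feed are made up for illustration. The title is compared in Perl instead of being interpolated into the XPath string, which avoids quoting problems when the search text contains an apostrophe; that is a design choice of this sketch, not something the XPath above requires.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the thread): returns the <link> text of the
# first <item> whose <title> contains $wanted, or undef when nothing matches.
sub link_for_title {
    my ($xml, $wanted) = @_;
    my $feed = XML::LibXML->load_xml(string => $xml);
    for my $item ($feed->findnodes('//item')) {
        # findvalue() gives the string value of the XPath result
        my $title = $item->findvalue('title');
        return $item->findvalue('link') if index($title, $wanted) >= 0;
    }
    return;
}

# Cut-down sample feed based on the one in the question
my $xml = <<'XML';
<rss version="2.0"><channel>
<item><title>String.To.Match</title><link>http://www.rss.com/download.php?id=178283</link></item>
<item><title>String.NOT.To.Match</title><link>http://www.rss.com/download.php?id=178284</link></item>
</channel></rss>
XML

my $link = link_for_title($xml, 'String.To.Match');
print defined $link ? "$link\n" : "no match\n";
```

The match is case-sensitive, just like contains() in XPath, so "String.to.Match" would not find the item.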

best regards

TaterSalad
10-05-2009, 11:50 PM
Thanks again oesxyl!

But now I have a new problem. I am trying to download a file from a link. A file downloads, but it is not the file I want. Instead, its contents state that I need to log in before I can download the file.

For example, I have:




use LWP::Simple;

$link = "http://www.website.com/download.php?id=178452";
# save the response body as <$filepath>download.txt
LWP::Simple::getstore($link, $filepath . "download.txt");



The file downloads, but its contents are the HTML of Website.com's login page. I guess my question is: is there any way I can 'log in' with Perl so I can download the file properly?

I am logged in with Firefox, but I am guessing it is not pulling the cookies from there. Any ideas?

oesxyl
10-06-2009, 12:09 AM
I guess it is a better idea to use the LWP::UserAgent module instead of LWP::Simple; see the man page. It's also a good idea to look over the lwpcook man page.
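
A rough sketch of what that could look like, assuming the site keeps the login in a session cookie. The helper names make_agent and fetch_with_login are invented here, and the login URL and form field names in the usage comment are guesses that have to be checked against the site's real login form. The point is that one LWP::UserAgent object with a cookie jar is used for both the login and the download, so whatever cookie the login sets is sent again with the second request.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Hypothetical helper: builds a user agent that remembers cookies between
# requests, so a session cookie set by the login response is sent again
# when the download is requested.
sub make_agent {
    my $ua = LWP::UserAgent->new(timeout => 30);
    $ua->cookie_jar(HTTP::Cookies->new);   # in-memory jar, nothing saved to disk
    return $ua;
}

# Logs in, then saves the download to $outfile with the same agent.
sub fetch_with_login {
    my ($ua, $login_url, $fields, $download_url, $outfile) = @_;
    my $login = $ua->post($login_url, $fields);
    # Login scripts often answer a POST with a redirect, so a 3xx status
    # is not treated as a failure here.
    die 'login failed: ' . $login->status_line . "\n"
        unless $login->is_success || $login->is_redirect;
    my $res = $ua->get($download_url, ':content_file' => $outfile);
    die 'download failed: ' . $res->status_line . "\n" unless $res->is_success;
    return $res;
}

# Usage (performs real network requests; URL and field names are assumptions):
# my $ua = make_agent();
# fetch_with_login($ua, 'http://www.website.com/login.php',
#     [ username => 'my_username', password => 'my_pass' ],
#     'http://www.website.com/download.php?id=178452', 'download.txt');
```

A successful HTTP status does not prove the login worked, since a site may return its login page again with status 200, so the saved file still needs to be checked.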

In my opinion it is a bad idea to hide an RSS feed behind a login page, but this is probably because of the bad habit of programmers who put scripts on their websites and fetch the feed on each page request.

best regards

TaterSalad
10-06-2009, 02:34 AM
I've tried a few different things but I can't get it to work. Any suggestions? Here is what I have. The script runs but doesn't seem to authenticate.



# GET REQUEST

use URI::URL;
use LWP::Simple;   # provides get()

my $url = url('http://www.website.com/login.php');
$url->query_form(username => 'my_username', password => 'my_pass');
my $content = get($url);

#-----------------------------

# POST REQUEST

use HTTP::Request::Common qw(POST);
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $req = POST 'http://www.website.com/login.php',
[ username => 'my_username', password => 'my_pass' ];
$content = $ua->request($req)->as_string;



Then the HTML of the login page I am trying to authenticate against:




<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>


<title>Welcome to the Website</title>
<link rel="stylesheet" type="text/css" href="templates/default/common.css">
<link rel="shortcut icon" href="pic/roundcube.ico">
<style type="text/css">

#login-form {
margin-left: auto;
margin-right: auto;
margin-top: 50px;
width: 350px;
}

</style>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">


</head><body>


<img src="pic/roundcube_logo.png" id="rcmbtn104" alt="RoundCube Webmail" border="0" height="55" hspace="10" width="165">
<br /><br /><br />

<form method="post" action="takelogin.php">
<br />
<table align="center"><tbody>
<tr>
<td class="title">Username</td><td><input type="text" size="26" name="username" style="width: 200px; border: 1px solid gray" /></td>
</tr>
<tr>
<td class="title">Password</td><td><input type="password" size="26" name="password" style="width: 200px; border: 1px solid gray"/></td>

</tr>
<tr>
<td colspan="2" align=left><input type="checkbox" name="logout" value="yes"><h0>Log me out after 15 minutes inactivity<h0></td>
</tr>
<tr>
<td colspan="2" align=left><input type="checkbox" name="securelogin" value="yes" /><h0>Secure Login <h0></td>
</tr>
<tr>
<td colspan="2" align="center"><input type="submit" value="Log in!" class="button"> <input type="reset" value="Reset" class="button"></td>
</tr>
<tr>

<td colspan="2" align="center"><br /><br /></td>
</tr>
</tbody></table>
<center><h4>Forget your password? Recover <a href="recover.php"><b>via email</b></a></h4><nobr><center>
<center><h4>Need help? <a href="http://embed.site.com/?server=irc.site.net&channel=%23gft-support&noServerNotices=true&noServerMotd=true"><b>Click here</a></h4></center>
</form>

</body></html>



Thanks!

oesxyl
10-06-2009, 03:14 AM
You must post a valid username, password and submit value to the script named in the action attribute of the form, takelogin.php.


'username' => ..., 'password' => ...., 'submit' => 'Log in!'
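
As a sketch of the idea, the request could be built like this. The helper name build_login_request is made up, and the page URL is the login.php address from the earlier attempt. The relative action is resolved against the page URL the way a browser would do it. Note that the submit button in the posted HTML has no name attribute, so a browser would probably not send a submit field at all; adding one is likely harmless, but it is left out here.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common qw(POST);

# Hypothetical helper: builds the login POST for a form found on $page_url
# whose action attribute is $action (possibly relative, as in the HTML above).
# $fields is an array reference of name => value pairs, kept in order.
sub build_login_request {
    my ($page_url, $action, $fields) = @_;
    # Resolve the action against the page URL, the way a browser would
    my $target = URI->new_abs($action, $page_url);
    return POST $target, $fields;
}

my $req = build_login_request(
    'http://www.website.com/login.php',
    'takelogin.php',
    [ username => 'my_username', password => 'my_pass' ],
);
# Show where the form data would be sent and how it is encoded
print $req->method, ' ', $req->uri, "\n";
print $req->content, "\n";
```

The request object still has to be sent with a user agent, for example $ua->request($req), and that agent needs a cookie jar if the site tracks the login with a cookie.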


best regards

TaterSalad
10-06-2009, 04:06 AM
So...



# GET REQUEST

use URI::URL;
use LWP::Simple;   # provides get()

my $url = url('http://www.website.com/login.php');
$url->query_form(username => 'my_username', password => 'my_pass');
my $content = get($url);

#-----------------------------

# POST REQUEST

use HTTP::Request::Common qw(POST);
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $req = POST 'http://www.website.com/takelogin.php',
[ username => 'my_username', password => 'my_pass', submit => 'Log in!' ];
$content = $ua->request($req)->as_string;


Tried that, but it didn't work. I'm a bit lost. I'm not really sure how to debug this, or how to tell if it's working, other than my script returning a file that is a "not logged in" web page.

oesxyl
10-06-2009, 04:37 AM
Print $content; maybe you can see why it didn't work.
It could be a proxy, cookies or something else.
Usually webmasters try to stop this since it can be used by spam bots.
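
Beyond printing the whole response, a small summary can make the cause easier to spot. This is only a sketch: describe_response is a made-up helper, and the check for a password field is a heuristic based on the login form posted earlier in the thread, not a reliable test. It is demonstrated on a hand-built HTTP::Response so it runs without contacting any server.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: summarises a response so a failed login is easier to spot.
sub describe_response {
    my ($res) = @_;
    my @notes = ('status: ' . $res->status_line);
    # A Location header means the server redirected after the request
    my $location = $res->header('Location');
    push @notes, "redirect to: $location" if defined $location;
    # Count Set-Cookie headers; none at all suggests no session was started
    my @cookies = $res->header('Set-Cookie');
    push @notes, 'cookies set: ' . scalar @cookies;
    # Heuristic only: a password field usually means the login page came back
    my $body = $res->decoded_content;
    $body = '' unless defined $body;
    push @notes, 'body still contains a password field'
        if $body =~ /type="password"/i;
    return @notes;
}

# Offline demonstration with a hand-built response instead of a real request
my $res = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ],
    '<form method="post" action="takelogin.php"><input type="password" name="password" /></form>',
);
print "$_\n" for describe_response($res);
```

With a real request, the same helper can be called on the object returned by $ua->request($req) before as_string is applied.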

best regards


