Get Data With Perl Regex


netroact
04-06-2008, 08:30 PM
I am using the following code to snatch the title out of a web page:


# Read file
open(FILE,"<$home_dir/$main_category/$sub_category/$file") || die "Could not open file for reading! $!";

while(<FILE>)
{

#grab the title from web page
$_ =~ s/^<title>(.*)<\/title>$/$1/ if ($_ =~ /^<title>(.*)<\/title>$/);
$title = $1;

}

close(FILE) || die "Could not close file! $!";


This works, but I would like to learn something, and I've lost patience.

How can I read the title into the $title variable without another line and without using $1?
I would think I could just do this:
$_ =~ s/^<title>(.*)<\/title>$/$title/ if ($_ =~ /^<title>(.*)<\/title>$/);

But, it doesn't work.
Thanks!

netroact
04-06-2008, 09:37 PM
Well, I made it a one-liner, but I still don't understand why the other way doesn't work.
$_ =~ s/^<title>(.*)<\/title>$/$1/ && ($title = $1) if ($_ =~ /^<title>(.*)<\/title>$/);

KevinADC
04-06-2008, 10:37 PM
The substitution is totally unnecessary if all you need to do is find the title:


# Read file
open(FILE,"<$home_dir/$main_category/$sub_category/$file") || die "Could not open file for reading! $!";

while(<FILE>)
{

if (/^<title>(.*)<\/title>$/) {
$title = $1;
last; # break out of file since there should only be one title
}
}
close(FILE) || die "Could not close file! $!";
print $title;
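
As for why putting $title in the replacement did not work: the replacement part of s/// is interpolated like a double-quoted string, so it only reads whatever $title holds at that moment and never assigns to it. A minimal sketch, assuming a made-up sample line in place of the real file:

```perl
use strict;
use warnings;

my $title = '';                          # still empty when s/// runs
my $line  = '<title>My Page</title>';    # hypothetical sample line

# The replacement part is interpolated, so the empty $title is
# inserted into the copy; the captured text is not stored anywhere.
(my $copy = $line) =~ s/^<title>(.*)<\/title>$/$title/;
print "after s///: [$copy]\n";

# A plain match with a capture is all that is needed to fill $title.
if ($line =~ /^<title>(.*)<\/title>$/) {
    $title = $1;
}
print "after match: [$title]\n";
```

The substitution leaves the copy empty, while the plain match puts the captured text into $title.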


You don't need to bind a regexp to $_ with =~, although you can if you prefer that style. The binding operators are only required when you match against a variable other than $_.
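
A small sketch of the difference, using made-up strings and variable names ($pattern, $from_default, $from_named are only for illustration):

```perl
use strict;
use warnings;

my $pattern = qr/^<title>(.*)<\/title>$/;

# With no binding operator, the match is applied to $_.
my $from_default;
for ('<title>Default</title>') {
    $from_default = $1 if /$pattern/;
}

# Any other variable must be bound to the match with =~.
my $line = '<title>Named</title>';
my $from_named;
$from_named = $1 if $line =~ $pattern;

print "$from_default, $from_named\n";
```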

FishMonger
04-07-2008, 12:13 AM
while(<FILE>)
{

if (/^<title>(.*)<\/title>$/) {
$title = $1;
last; # break out of file since there should only be one title
}
}
You don't need to use $1.
while(<FILE>)
{
($title) = m!^<title>(.*)</title>$!;   # list context returns the capture
last if defined $title;               # a title of "0" would fail a plain truth test
}
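
To unpack that a little: in list context a match returns its captured groups, so the assignment replaces $1. A rough sketch with invented sample strings ($hit, $miss and $zero are illustrative names):

```perl
use strict;
use warnings;

# In list context a successful match returns its captures;
# a failed match returns an empty list, leaving the variable undef.
my ($hit)  = '<title>Home</title>' =~ m!^<title>(.*)</title>$!;
my ($miss) = '<h1>Home</h1>'       =~ m!^<title>(.*)</title>$!;

print defined($hit)  ? "hit: $hit\n"   : "hit: none\n";
print defined($miss) ? "miss: $miss\n" : "miss: none\n";

# defined() is safer than a truth test: a title of "0" is false.
my ($zero) = '<title>0</title>' =~ m!^<title>(.*)</title>$!;
print "zero is defined but false\n" if defined($zero) && !$zero;
```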

netroact
04-07-2008, 02:33 AM
Thanks for your help. Wonder how long it will take me to figure that code out.

netroact
04-07-2008, 02:49 AM
What if I want to get another tag's content out of the same file? It doesn't work more than once.

netroact
04-07-2008, 04:35 AM
Oh well, I used this:


if (/^<title>(.*)<\/title>$/i)
{
$title = $1;
}


Works perfectly. Thank you very much!
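
The likely reason this version works for more than one tag is that it has no "last", so the loop keeps reading after the first match. A sketch of collecting two tags in one pass, assuming each tag sits alone on its line as in the title case; the sample page, the in-memory filehandle and the %found hash are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample page; the original reads a real file instead.
my $html = <<'END';
<html>
<title>My Page</title>
<h1>Welcome</h1>
</html>
END

open(my $fh, '<', \$html) or die "Could not open in-memory file: $!";

my %found;
while (my $line = <$fh>) {
    # Same pattern shape as in the thread, generalized to two tag names.
    # \1 makes the closing tag match the opening one.
    if ($line =~ /^<(title|h1)>(.*)<\/\1>$/i) {
        $found{lc $1} = $2;    # no "last", so later tags are still seen
    }
}
close($fh) or die "Could not close file: $!";

print "title: $found{title}\n";
print "h1: $found{h1}\n";
```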

awatson
04-08-2008, 03:30 PM
You may want to look into using a module if you're looking to parse lots of HTML in different ways. This may be useful:
http://www.perl.com/pub/a/2006/01/19/analyzing_html.html