extracting using Perl


varunkg
10-03-2003, 09:36 AM
Hi, I need a Perl script to extract content from an HTML web page. I want to get everything between the tags ' <table class="article" ' and </table>. How can I go about doing this?

Also, I am using the environment variable PATH_INFO to pass the URL of the web page to the script. Is this a good idea?

Thanks,
varunkg

YUPAPA
10-05-2003, 10:58 AM
Hi!~

Do you mean the stuff between the table tags, like this?

<table>get me this</table>

Or the one you stated above?
<table class="article" get me this</table>

I just want to make sure before I give you the code. Also, will the HTML page be hosted elsewhere, meaning not on the same server as the script? :)

varunkg
10-06-2003, 10:27 AM
Hi,

I need it to extract the content from something like <table class="article" - </table>. The script will be on the same server as the HTML. Even if your script cannot handle the above, I would like to take a look at it in any case.

Thanks.

varunkg
10-08-2003, 09:54 AM
Yupapa,

I would be grateful if you could tell me how your script extracts the content between HTML tags like <table> and </table>.

Thanks,
Varun

YUPAPA
10-08-2003, 07:05 PM
Sorry, I forgot about it~ :o Too much work.


#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock);
use CGI;

my $run_this_from_browser = 'no'; # Set to 'yes' when run as a CGI script, 'no' when run from a shell (SSH / Telnet)
my $html_file = 'page.html'; # Full path to the HTML file

# ---------- No need to edit anything below this line ---------- #
my $cgi = CGI->new;
my $line = 0;

print $cgi->header(-type => 'text/html') if $run_this_from_browser eq 'yes';

# die reports the message itself; "die print ..." would die with the return value of print
sysopen(my $fh, $html_file, O_RDONLY) or die "Fail to sysopen $html_file: $!";
flock($fh, LOCK_SH) or die "Fail to SH-LOCK $html_file: $!";
while (<$fh>) {
    $line++;
    # $1 is everything after class="article" up to the first </table> on this line
    if (/<table\b[^>]*\bclass="article"(.*?)<\/table>/i) {
        print "Line Number $line has $1";
        print '<BR>' if $run_this_from_browser eq 'yes';
        print "\n";
    }
}
flock($fh, LOCK_UN) or die "Fail to UN-LOCK $html_file: $!";
close($fh);

__END__


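This will print out what comes between <table class="article" and </table>. It also matches when other attributes come before the class, like: <table cellpadding="0" blah, etc class="article" </table>. The match is case-insensitive, so it works for <TABLE ... <tAbLe... <table, etc... Note that the file is read one line at a time, so the opening and closing tags have to be on the same line for a match.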
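
If the table spans several lines, which is common in real pages, a line-by-line match will not find it. Below is a minimal sketch of one way around that: run the regex over the whole document as a single string. The helper name extract_article_tables and the sample HTML are made up for illustration, and the sketch assumes the opening tag is well formed (it ends with >) and that article tables are not nested inside each other.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns the contents
# of every <table ... class="article" ...> ... </table> block in a string.
sub extract_article_tables {
    my ($html) = @_;
    my @found;
    # /s lets . match newlines, /i ignores case, /g walks through every match.
    while ($html =~ m{<table\b[^>]*\bclass="article"[^>]*>(.*?)</table>}sig) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample input: one table to skip, one multi-line table to extract.
my $sample = <<'HTML';
<table class="other"><tr><td>skip me</td></tr></table>
<TABLE cellpadding="0" class="article">
<tr><td>get me this</td></tr>
</TABLE>
HTML

my @tables = extract_article_tables($sample);
print "$_\n" for @tables;
```

To use it on a file, read the whole file into one string first, for example with my $html = do { local $/; <$fh> }; on an already opened filehandle, and pass that string to the helper. The non-greedy (.*?) stops at the first </table>, so a table nested inside an article table would cut the match short. For pages like that, an HTML parser module from CPAN such as HTML::Parser or HTML::TreeBuilder is likely a safer choice than a regex.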