Extracting using Perl
varunkg
10-03-2003, 09:36 AM
Hi, I need a Perl script to extract content from an HTML web page. I want to get everything between the tags ' <table class="article" ' and </table>. How can I go about doing this?
Also, I am using the environment variable PATH_INFO to pass the URL of the web page to the script. Is this a good idea?
Thanks,
varunkg
YUPAPA
10-05-2003, 10:58 AM
Hi!~
Do you mean the stuff between the table tags? Like this?
<table>get me this</table>
or the one you stated above?
<table class="article" get me this</table>
I just want to make sure before I give you the code. Also, will the HTML page be hosted elsewhere, meaning not on the same server as the script? :)
varunkg
10-06-2003, 10:27 AM
Hi,
I need it to extract stuff from something like <table class="article" - </table>. The script will be on the same server as the HTML. Even if your script cannot handle the above, I would like to take a look at it in any case.
Thanks.
varunkg
10-08-2003, 09:54 AM
Yupapa,
I will be grateful if you can tell me how your script extracts the stuff between HTML tags like <table> and </table>.
Thanks,
Varun
YUPAPA
10-08-2003, 07:05 PM
Sorry, I forgot about it~ :o Too much work.
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock);
use CGI qw(:standard);
my $run_this_from_browser = 'no'; # Set to 'yes' when running from a browser, 'no' when running from the command line (SSH / Telnet)
my $html_file = 'page.html'; # Full path to the HTML file
# ---------- No need to edit anything below this line ---------- #
my $cgi = CGI->new;
my $line = 0;
print $cgi->header(-type=>'text/html') if($run_this_from_browser eq 'yes');
# Open read-only and take a shared lock; die with the error message itself
sysopen(FILE, $html_file, O_RDONLY) or die "Fail to sysopen $html_file: $!";
flock(FILE, LOCK_SH) or die "Fail to SH-LOCK $html_file: $!";
while(<FILE>) {
    $line++;
    # $2 is everything after class="article" up to the first </table> on this line
    if(/<table\s*(.*)\s*class="article"(.*?)<\/table>/i) {
        print "Line Number $line has $2";
        print '<BR>' if($run_this_from_browser eq 'yes');
        print "\n";
    }
}
flock(FILE, LOCK_UN) or die "Fail to UN-LOCK $html_file: $!";
close(FILE);
__END__
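This will print out what is between <table class="article" and </table>, and it also matches tables that carry other attributes, like: <table cellpadding="0" blah, etc class="article" </table>. The match is case-insensitive, so it works for <TABLE ..., <tAbLe ..., <table, etc. Note that the file is read one line at a time, so the opening tag and </table> must be on the same line for a match.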
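If the table spans several lines, as most real pages do, a line-by-line match will miss it. Here is a minimal sketch of one way around that, assuming the whole page fits in memory as a single string. The helper name extract_article_tables and the sample HTML are made up for illustration and are not from the posts above. It also assumes the class attribute is written exactly as class="article" with double quotes, and that article tables are not nested inside each other, since the match stops at the first </table>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the inner content of
# every <table ... class="article" ...> ... </table> found in $html.
sub extract_article_tables {
    my ($html) = @_;
    my @found;
    # /i: case-insensitive, /s: "." also matches newlines, /g: find every match.
    # [^>]* keeps the attribute match inside the opening tag.
    while ($html =~ m{<table\b[^>]*\bclass="article"[^>]*>(.*?)</table>}gis) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample input. To use a real file, read it into one string first,
# for example: my $html = do { local $/; <$fh> };
my $sample = <<'HTML';
<table class="plain"><tr><td>skip me</td></tr></table>
<TABLE cellpadding="0" class="article">
<tr><td>get me this</td></tr>
</TABLE>
HTML

my $n = 0;
for my $content (extract_article_tables($sample)) {
    $n++;
    print "Table $n contains:$content\n";
}
```

Unlike the script above, this captures only what sits between the end of the opening tag and </table>, so leftover attributes and the closing > are not part of the result. For messy or nested markup a real HTML parser is likely a safer choice than a regex.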