Meta tag extractor
I have downloaded a spider script from a site.
I want to change this script so that it extracts all the meta tags from a page and prints them to the screen.
E.g.:
<meta name="keywords" content="javascript forum, dhtml, cgi, php, html, css, xml, xls, mysql, asp, flash, java, forums">
will print on the screen:
javascript forum, dhtml, cgi, php, html, css, xml, xls, mysql, asp, flash, java, forums
I want to make my own search engine! Thanks in advance -
Tim Groeneveld
I forgot: the script is attached.
mercurus
11-09-2002, 04:24 PM
G'day
This could be a toughie...
------------------------------------------------------------------------------------
<meta name="keywords" content="javascript forum, dhtml, cgi, php, html, css, xml, xls, mysql, asp, flash, java, forums">
will print on the screen:
javascript forum, dhtml, cgi, php, html, css, xml, xls, mysql, asp, flash, java, forums
------------------------------------------------------------------------------------
Since you can be confident of the format the input will take, this should do the trick:
(In this code snippet, $line currently holds the above line)
$line =~ s/<meta name="keywords"\scontent="//;
$line =~ s/">//;
print STDOUT "$line\n";
This will simply take out the surrounding html and leave you with the text...
This is the power of regular expressions...
It might be better if you were to read that little lot into an array as follows:
my @keywords = split(/, /, $line);
and THEN print them out:
foreach my $element (@keywords) {
    print "$element\n";
}
and that way you can use that structure to better control what's going on...
You might also like to think about what would happen if someone maliciously put something into their meta headers and what they could do to your search engine...
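As a rough illustration of that last point: if the search engine ever echoes keywords back into an HTML results page (an assumption, the thread does not say how results are shown), escaping the special characters first stops a hostile meta tag from injecting markup. The helper name escape_html is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the characters that are special in HTML
# with their entities so untrusted text is displayed, not interpreted.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

print escape_html('<script>alert("hi")</script>'), "\n";
# prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

This only covers output into HTML; keywords that end up in a database query or a shell command would need their own kind of quoting.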
Hope that helps, I fear it will have generated more questions than it provided answers, but feel free to post away and we'll try and help you to answer them :)
Cheers
mercurus
The script is nearly done.
I need 2 more things:
1) That it can download the URL: microsoft.com to I:\temp.htm
2) That it will search the array @LINES for the <meta name="keywords" and if found then print the results
Thanks again -
Tim
mercurus
11-10-2002, 06:48 AM
G'day again
Part 1 involves a little socket programming or at least a module I'm not familiar with... but it shouldn't be too difficult...
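For what it's worth, one way Part 1 might be done without hand-written socket code is the HTTP::Tiny module, which ships with Perl 5.14 and later. This is only a sketch: the sub name save_page is made up, and it assumes the URL is given with its http:// scheme (the post just says microsoft.com).

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Download $url and save the response body to $path.
# Dies with the HTTP status and reason if the request fails.
sub save_page {
    my ($url, $path) = @_;
    my $response = HTTP::Tiny->new->mirror($url, $path);
    die "Could not fetch $url: $response->{status} $response->{reason}\n"
        unless $response->{success};
    return $path;
}

# Only touch the network when a URL and a file name are given on the command line.
save_page(@ARGV) if @ARGV == 2;
```

Saved under a hypothetical name such as fetch.pl, it could be run as: perl fetch.pl http://microsoft.com/ I:\temp.htm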
Part 2 isn't too bad though...
> 2) That it will search the array @LINES for the <meta name="keywords" and if found then print the results
I assume this is because the script is reading the contents of @LINES in from a web page?
foreach my $line (@LINES) {
    if ($line =~ /<meta\s+name="keywords"/i) {
        # Strip the opening of the tag, up to and including content="
        $line =~ s/<meta\s+name="keywords"\s+content="//i;
        # Strip the closing "> of the tag
        $line =~ s/">//;
        print "$line\n";
    }
}
This will search through each line of input, and if it contains '<meta name="keywords"' then it will remove that opening part (up to and including content=") and the closing part of the HTML tag, and then print what is left.
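The original question asked for all the meta tags, not just keywords, and real pages do not always put name before content or use double quotes. A slightly more forgiving sketch is below. It is still regex-based rather than a proper HTML parser, so it will trip over a '>' inside an attribute value, and the sub name extract_meta is made up for the example.

```perl
use strict;
use warnings;

# Return a hash of meta name => content for every <meta> tag in $html
# that carries both a name and a content attribute.
sub extract_meta {
    my ($html) = @_;
    my %meta;
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        my %attr;
        # Collect attribute pairs; values may be double-quoted,
        # single-quoted or bare, and may appear in any order.
        while ($attrs =~ /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g) {
            my $value = defined $2 ? $2 : defined $3 ? $3 : $4;
            $attr{lc $1} = $value;
        }
        next unless defined $attr{name} && defined $attr{content};
        $meta{lc $attr{name}} = $attr{content};
    }
    return %meta;
}

my $page = '<meta name="keywords" content="javascript forum, dhtml, cgi">';
my %meta = extract_meta($page);
print "$_: $meta{$_}\n" for sort keys %meta;
# prints: keywords: javascript forum, dhtml, cgi
```

If the page is held in @LINES, join('', @LINES) gives the single string this expects, which also copes with a tag that is split across two lines.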
Hope that helps
I'd be interested to see the completed code :) PM me if I can have a look when it's done?
Cheers
mercurus
Sure. It is mainly for anyone who wants their very own independent search engine. IT WILL USE A LOT OF BANDWIDTH
mercurus
11-11-2002, 04:43 AM
G'day
I can't see me using it as such, more just for personal interest :)
Any more problems so far ?
Cheers
mercurus
NONE! can you believe that??
;)