
Why does this simple PHP spider script output a bunch of junk?



agfre44_9873
09-07-2009, 12:17 AM
Trying to learn this spider scraping thing. Why does this script output a bunch of junk, and how do I clean it up?

<?php
$original_file = file_get_contents("http://www.domain.com");

$stripped_file = strip_tags($original_file, "<a>");

preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

//DEBUGGING
//$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
//$matches[1] now contains only the HREFs in the A tags; ex: link

//Set the content type to plain text so the print below is easy to read!
header("Content-type: text/plain");

print_r($matches); //View the array to see if it worked
?>

SKDevelopment
09-07-2009, 01:25 AM
Could you explain what you would like to see as the result of the script? I have run it. For me it outputs an array with $matches[0] containing the complete <a> tags and $matches[1] containing the values of the href attributes. Which result would you like to get? Could you explain the problem a little more, please?
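
As a minimal sketch of the array layout described here, the same two steps can be run on a short hard-coded string instead of a live page. The sample markup and the names $html, $fullTags and $hrefs are assumptions made for illustration; they are not part of the original script.

```php
<?php
// Hypothetical sample markup, used here instead of fetching a live page.
$html = '<p><a href="/domains/">Domains</a> and <a class="x" href="http://example.com/">Example</a></p>';

// Same steps as the script above: keep only the <a> tags, then match them.
$stripped = strip_tags($html, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped, $matches);

// $matches[0] holds the complete <a> tags, $matches[1] only the href values.
$fullTags = $matches[0];
$hrefs = $matches[1];

// Print one href per line rather than dumping the whole nested array.
echo implode("\n", $hrefs), "\n";
```

Printing only $matches[1] leaves out the link text held in $matches[0], which may already be close to the "just the links" output asked for.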

agfre44_9873
09-07-2009, 03:52 AM
Apparently it's supposed to go to domain.com and pull the links... correct me if I'm wrong.
What I want it to print is the links. Below is what I'm getting. Is this what you're getting?

Array (
[0] => Array ( [0] => English [1] => Español [2] => Domain.com - It All Starts with a great Domain [3] => [4] => [5] => [6] => Sign up for Domain.com Offers [7] => Support [8] => Domains [9] => Web Hosting [10] => VPS Hosting [11] => Hosting [12] => Web Design [13] => Hosting [14] => Marketing [15] => Start Your Website, Scorching Fast Rock Solid Hosting [16] => Learn More! [17] => RENEW [18] => TRANSFER TO DOMAIN.COM [19] => WHOIS Lookup [20] => See all domain extensions [21] => Included FREE: ■ Total DNS Management ■ URL Forwarding ■ Email Forwarding ■ Transfer Lock [22] => Domain names [23] => Host your site [24] => VPS hosting [25] => *Terms [26] => [27] => [28] => [29] => [30] => Home [31] => Domain Names [32] => Web Hosting [33] => Website Builder [34] => Professional Web Design [35] => Email [36] => VPS Hosting [37] => eCommerce Hosting [38] => eCommerce Web Design [39] => Online Marketing [40] => Email Marketing [41] => PPC Marketing [42] => SEO Services [43] => SSL Certificates [44] => Private Domain Registration [45] => About Us [46] => Customer Support [47] => Blog [48] => Login [49] => WHOIS [50] => About [51] => Support [52] => FAQ [53] => Affiliate Program [54] => Legal Notices [55] => Privacy Policy [56] => Registration Agreement )
[1] => Array ( [0] => javascript:void(0) [1] => ?lang=es [2] => / [3] => https://secure.domain.com/order/usc/index.php?siteid=42566 [4] => /account [5] => https://secure.domain.com/webmail/?siteid=42566 [6] => /newsletter/subscribe.php [7] => /contact [8] => /domains/ [9] => /hosting/ [10] => /vps/ [11] => /email/ [12] => /designstudio/ [13] => /ssl/ [14] => /marketing/ [15] => /hosting/ [16] => /hosting/ [17] => /domains/renewal.php [18] => /domains/transfer.php [19] => https://secure.domain.com/services/whois.php?siteid=42566 [20] => /domains/search.php [21] => /domains/tools.php [22] => /domains/ [23] => /hosting/ [24] => /vps/ [25] => javascript:void(0) [26] => /ssl/ [27] => /domains/tld_us.php [28] => http://twitter.com/domaindotcom [29] => /vps/ [30] => / [31] => /domains/ [32] => /hosting/ [33] => /sitebuilder/ [34] => /designstudio/ [35] => /email/ [36] => /vps/ [37] => /hosting/ [38] => /designstudio/ecommerce.php [39] => /marketing/ [40] => /marketing/email.php [41] => /marketing/promotion.php [42] => /marketing/seo.php [43] => /ssl/ [44] => /domains/whoisprivacy.php [45] => /about/ [46] => /contact/ [47] => /blog/ [48] => /account [49] => https://secure.domain.com/services/whois.php?siteid=42566 [50] => /about [51] => /contact [52] => https://secure.domain.com/KM/script_login.php?account_name=42566 [53] => /affiliate/ [54] => https://secure.domain.com/common/agreements/otherlegal.php?siteid=42566 [55] => https://secure.domain.com/order/register/privacy.php?siteid=42566 [56] => https://secure.domain.com/order/register/agreement.php?siteid=42566 )
)

I see links inside all this junk. How can I clean this stuff out and show just the links?

SKDevelopment
09-07-2009, 01:30 PM
You mean you would like to get rid of anything except absolute or relative URLs? I have modified your regular expression a little bit:


<?php
$original_file = file_get_contents("http://www.domain.com");

$stripped_file = strip_tags($original_file, "<a>");

preg_match_all("/<a(?:[^>]*)href=([\"']?)(?=http|\/|\.)([^\"' >]*?)\\1(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

echo '<pre>' . print_r($matches[2],true) . '</pre>';
?>

Now it returns absolute URLs which start with "http" or relative URLs which start with "/" or "." (e.g. "/example.php" or "../example.php").

Please run it and look at the result.

Also, I would suggest changing domain.com to some other URL for tests. I hope domain.com, which you initially used for the test, does not mind the free advertising, but they could be unhappy if we experiment too much. Another reason to use a different domain is to check different HTML and different situations. E.g. all URLs at domain.com are surrounded with double quotes. This could be different if you run your script on some other site. I would even suggest creating a special page with very bad HTML for testing.
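
The "bad HTML" test suggested above can be sketched without a separate page by feeding the modified pattern an inline string. The sample markup and the name $links are made up for illustration. One thing such a test seems to show is that an unquoted href is matched but captured as an empty string: with no quote to anchor on, the lazy group can stop immediately, so the URL itself is not returned.

```php
<?php
// Hypothetical "bad HTML" sample: mixed quoting styles plus links that should be skipped.
$html = <<<'HTML'
<a href='/single.php'>single quotes</a>
<a href=/unquoted.php>no quotes</a>
<a class="nav" href="../up.php" id="x">relative</a>
<a href="http://example.com/page">absolute</a>
<a href="javascript:void(0)">script link</a>
<a href="?lang=es">query only</a>
HTML;

$stripped = strip_tags($html, "<a>");

// Same pattern as in the post above, unchanged.
preg_match_all("/<a(?:[^>]*)href=([\"']?)(?=http|\/|\.)([^\"' >]*?)\\1(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped, $matches);

// $matches[2] holds the captured URLs; the unquoted link shows up as ''.
$links = $matches[2];
echo implode("\n", $links), "\n";
```

The javascript: and ?lang=es links are dropped by the lookahead, as intended. The empty entry for the unquoted link is the kind of case a deliberately messy test page is useful for catching.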

agfre44_9873
09-07-2009, 07:38 PM
Hey!!! That's a lot better. Thank you very much! One thing I don't understand is what's inside the preg_match_all(). My intention is to go into http://www.dol.gov and look for info on migrant workers. If I use their search engine, I get the following link...

http://www.dol.gov/search/AdvSearch.aspx?search_term=migrant

I put this in my small script and it pulls out all the links from the above search. I want to get the small amount of info under each link with the spider. For example,

/*this is 1st link under search result*/
Compliance Assistance By Law - The Migrant and Seasonal Agricultural Worker Protection Act
/*end link*/

/*content under link*/
This Page E-mail This Page The Migrant and Seasonal Agricultural Worker ProtectionReturn to By Law Menu OVERVIEW The Migrant and Seasonal Agricultural Worker Protectionprovides employment-related protections to migrant and seasonal agricultural workers and

I want to get the content also. Every link and its content is inside a table. The following format is used by DOL to display links and content:


<tr>
<td>
<p><a href="http://www.dol.gov/dol/compliance/comp-msawpa.htm">Compliance Assistance By Law - The Migrant and Seasonal Agricultural Worker Protection Act</a></p>
</td>
<td>
<p>34k</p>
</td>
</tr>
<tr>
<td>
<p>This Page E-mail This Page The Migrant and Seasonal Agricultural Worker ProtectionReturn to By Law Menu OVERVIEW The Migrant and Seasonal Agricultural Worker Protectionprovides employment-related protections to migrant and seasonal agricultural workers and</p>
</td>
<td>
</td>
</tr>

I know I have to look at the tags. My question is, where do I start? How do I set up preg_match_all() to grab both the link and the content? Also, how can I get rid of the array and just have the links and content?
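
One possible starting point, as a rough sketch only: it uses PHP's DOM extension (DOMDocument and DOMXPath) rather than preg_match_all(), since pairing a link with the text in the following table row is awkward with a single regex. It assumes the markup is exactly as pasted above, wrapped in a <table>, and that each summary sits in the row right after its link; the real search page may differ. The names $results, url, title and summary are hypothetical.

```php
<?php
// Sample markup in the format quoted above, wrapped in a <table> (an assumption).
$html = <<<'HTML'
<table>
<tr>
<td>
<p><a href="http://www.dol.gov/dol/compliance/comp-msawpa.htm">Compliance Assistance By Law - The Migrant and Seasonal Agricultural Worker Protection Act</a></p>
</td>
<td>
<p>34k</p>
</td>
</tr>
<tr>
<td>
<p>This Page E-mail This Page The Migrant and Seasonal Agricultural Worker ProtectionReturn to By Law Menu OVERVIEW The Migrant and Seasonal Agricultural Worker Protectionprovides employment-related protections to migrant and seasonal agricultural workers and</p>
</td>
<td>
</td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = [];
// Rows that contain a link; the summary is assumed to be in the next row.
foreach ($xpath->query('//tr[td/p/a]') as $row) {
    $link = $xpath->query('td/p/a', $row)->item(0);
    $summary = $xpath->query('following-sibling::tr[1]/td[1]/p', $row)->item(0);
    $results[] = [
        'url' => $link->getAttribute('href'),
        'title' => trim($link->textContent),
        'summary' => $summary ? trim($summary->textContent) : '',
    ];
}

// Plain output, one block per result, instead of a print_r() array dump.
foreach ($results as $r) {
    echo $r['title'], "\n", $r['url'], "\n", $r['summary'], "\n\n";
}
```

For a live page, the inline sample would be replaced by the string returned from file_get_contents(), as in the scripts earlier in the thread.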


