10-30-2005, 01:08 AM | #1
Riboflavin
Script To Automatically Follow Links

OK, I want to make a script that takes a URL, reads all the links on that page, follows them, reads all the links on those pages, follows those, and so on. I also want it to record each link it follows.

Is there a way to do so in PHP, or maybe another language?
11-04-2005, 10:01 PM | #2
VortexCortex
You need more functionality than JavaScript has.

LOL... That's the exact description of the first Java application I ever made. I used it to make an offline copy of Sun's Java API specifications. (I know now that I could have downloaded the .zip file, right?)

JavaScript running in a browser cannot save the results to disk, so try Java, C, or VB...

I used Sun's website to learn how to develop Java apps for free...

Java JDK 5.0
Java 1.5.0 API Specs
Java Tutorial

Good Luck
11-06-2005, 05:43 PM | #3
hyperbole
Here is an example of a script in Perl that will walk the links of a given URL. You could use it to create a script that will download the pages connected to a specified URL.
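
For readers who prefer PHP, walking links can be sketched as a breadth-first traversal with a queue and a "visited" set. This is only a sketch and is not the Perl script referred to above. The names crawl(), $fetchLinks and $maxPages are made up for illustration, and the in-memory $pages array stands in for real HTTP fetching.

```php
<?php
// Breadth-first link walk. $fetchLinks is a hypothetical callback that
// returns the array of absolute links found on a page.
function crawl($startUrl, callable $fetchLinks, $maxPages = 50)
{
    $queue   = array($startUrl);   // pages waiting to be read
    $visited = array();            // url => true, so lookups are one isset()

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;              // already followed
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);   // every link followed, in visiting order
}

// Stand-in for real HTTP fetching: a tiny in-memory "web" (assumption).
$pages = array(
    'http://a.example/' => array('http://b.example/', 'http://c.example/'),
    'http://b.example/' => array('http://a.example/', 'http://c.example/'),
    'http://c.example/' => array(),
);
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : array();
};

$followed = crawl('http://a.example/', $fetch);
print_r($followed);
```

The $maxPages cap is there because an unbounded crawl of the real web never finishes. In a real script the callback would download the page and extract its links.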


11-07-2005, 03:58 PM | #4
Riboflavin
PHP Code:
<?php

/*
*#################################################################
*    // VERSION 0.2
*
*    RETRIEVES LINKS FROM A URL.
*    1) Include this file.
*    2) $links = GetLinks('http://www.somesite.com', '');
*    3) You now have an array of all unique links at the URL specified.
*    4) You can now filter the links you want returned in the second parameter.
*    5) Added error checking if the page happened to have no links, in which case
*       the array value doesn't exist.
*
*    // Ignores javascript, links with email addresses, etc.
*    // Filters are case sensitive and must use web escape characters to match
*       i.e.  & = &amp;
*
*#################################################################
*/

// Download a page. Needs allow_url_fopen; returns false if the URL cannot be opened.
function GetInput($url) {
    $handle = @fopen($url, 'r');
    if (!$handle) {
        return false;
    }
    $contents = '';
    while (!feof($handle)) {
        $contents .= fread($handle, 8192);
    }
    fclose($handle);
    return $contents;
}

function GetLinks($url, $filter)
{
    // $url="http://server.com/somefile.html";
    $links = array();

    // Pull out the host name so relative links can be made absolute.
    $match_domain = '_[hH][tT][tT][pP]:\/\/(.*?)(/|$)_';
    if (!preg_match($match_domain, $url, $res) || !$res[1])
        return false;
    $domain = $res[1];

    if (!$input = GetInput($url))
        return false;

    $lookfor = '/<[aA]\s.*?[hH][rR][eE][fF]=[ "\']{0,}([-.,\%_\(\)|=~;+:\?\&\/a-zA-Z0-9]+)[ "\'>]/';
    preg_match_all($lookfor, $input, $data);
    foreach ($data[1] as $v)
    {
        // filter by
        if (strlen($filter) && strpos($v, $filter) === false)
            continue;

        if (stristr($v, 'javascript:'))
        {
            // ignore - contains javascript
        }
        elseif (strpos($v, '//') === 0)
        {
            // protocol-relative link such as //host/path
            $links[] = 'http:' . $v;
        }
        elseif (stripos($v, 'http://') !== 0)
        {
            // relative link: prefix it with the domain of the page
            $sep = (strpos($v, '/') !== 0) ? '/' : '';
            $links[] = 'http://' . $domain . $sep . $v;
        }
        else
            $links[] = $v;
    }

    if (count($links))
    {
        // flipping twice drops duplicate links
        $links = array_keys(array_flip($links));
    }
    else
        $links[] = 'No Data';
    return $links;
}

$noLinkCount = 0;
$myFile = 'links.txt';

// Keeps going until the script is stopped or links.txt has no entries.
while (true) {
    // Pick a random URL that is already stored in the file.
    $data = file($myFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if (!$data)
        break;
    $place = $data[rand(0, count($data) - 1)];

    $links4 = GetLinks($place, '');

    if (!is_array($links4))
    {
        // the page could not be read
        $noLinkCount++;
    } else {
        $fh = fopen($myFile, 'a');
        foreach ($links4 as $value) {
            // 'No Data' is a placeholder, not a link
            if ($value != 'No Data')
                fwrite($fh, $value . "\n");
        }
        fclose($fh);
    }
}

echo "Done!";

?>


OK, this is code I modified to find all the links on a page and write them to a text file. It then randomly picks one of the stored links and searches that page for more links.

To make this a little more advanced, how could I make it store only the domain? For example, if the link is www.example.com/folder/page.html, it should be stored as just www.example.com.

Also, how could I efficiently check that a URL is not already in the file?

Thanks for all your help!
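
Both questions can be approached with built-in PHP functions. The sketch below is one possible way, not a tested drop-in for the script above: parse_url() reduces a link to its host, and the lines of the file are flipped into array keys so each "already stored?" check is a single isset(). The function names hostOf() and storeNewHosts() and the temporary file are made up for illustration, and prepending "http://" to links that lack a scheme is an assumption.

```php
<?php
// Reduce a link to its host name, e.g. www.example.com.
// Links without a scheme get "http://" prepended (an assumption) so that
// parse_url() can find the host.
function hostOf($link)
{
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        $link = 'http://' . $link;
    }
    $host = parse_url($link, PHP_URL_HOST);
    return $host ? strtolower($host) : false;
}

// Append hosts to the file, skipping any that are already stored.
// Returns the hosts that were actually added.
function storeNewHosts($file, array $links)
{
    $known = array();
    if (is_file($file)) {
        // Flip lines into keys so each membership test is one isset().
        $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        $known = array_flip($lines ? $lines : array());
    }
    $added = array();
    foreach ($links as $link) {
        $host = hostOf($link);
        if ($host !== false && !isset($known[$host])) {
            $known[$host] = true;
            $added[] = $host;
        }
    }
    if ($added) {
        file_put_contents($file, implode("\n", $added) . "\n", FILE_APPEND);
    }
    return $added;
}

$file = tempnam(sys_get_temp_dir(), 'links');   // hypothetical scratch file
$first = storeNewHosts($file, array(
    'www.example.com/folder/page.html',
    'http://www.example.com/other.html',
    'http://example.org/',
));
$second = storeNewHosts($file, array('http://example.org/about'));
print_r($first);
print_r($second);
unlink($file);
```

Reading the whole file on every call is fine for a few thousand lines. Beyond that, a database with a unique index is likely the better fit.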
11-08-2005, 09:20 AM | #5
Jack Daniels
I think it would be simpler to store the URLs in a database and then check whether a URL is already present with a query like this:

SELECT * FROM urls WHERE url LIKE '%google.com%'

This will return every URL that contains "google.com"...
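
As a rough sketch of that idea, the table can also be given a unique key so the database itself refuses duplicates. The example uses PDO with an in-memory SQLite database purely so it runs without a server, which is an assumption and not something from this thread. With MySQL the DSN would look something like 'mysql:host=localhost;dbname=crawler' (hypothetical names) and the statement would be INSERT IGNORE instead of INSERT OR IGNORE. The function names storeUrl() and hasUrl() are made up.

```php
<?php
// In-memory SQLite is an assumption so the sketch runs without a server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// PRIMARY KEY makes the database itself reject duplicate URLs.
$pdo->exec('CREATE TABLE urls (url VARCHAR(255) PRIMARY KEY)');

// Returns true if the URL was new and has been stored,
// false if it was already present.
function storeUrl(PDO $pdo, $url)
{
    $stmt = $pdo->prepare('INSERT OR IGNORE INTO urls (url) VALUES (?)');
    $stmt->execute(array($url));
    return $stmt->rowCount() === 1;
}

// Exact-match lookup. Unlike LIKE '%...%' it matches only the full URL
// and can use the primary-key index.
function hasUrl(PDO $pdo, $url)
{
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM urls WHERE url = ?');
    $stmt->execute(array($url));
    return (int) $stmt->fetchColumn() > 0;
}

$a = storeUrl($pdo, 'http://www.google.com/');          // first time
$b = storeUrl($pdo, 'http://www.google.com/');          // duplicate
$c = hasUrl($pdo, 'http://www.google.com/');
$d = hasUrl($pdo, 'http://google.com.example.net/');    // LIKE '%google.com%' would match this
var_dump($a, $b, $c, $d);
```

The LIKE pattern is still handy for browsing everything stored under one site. For the "is it already stored?" check, an exact comparison avoids false matches on URLs that merely contain the same text.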

Bye
11-09-2005, 02:53 AM | #6
Riboflavin
I am still very unfamiliar with MySQL. Could anyone help me modify my code to store to a MySQL DB rather than a flat file? Then, rather than just using the root URL, we could use the entire link and search the DB before storing it, to avoid storing it twice.
11-11-2005, 10:16 AM | #7
anshul

Well, Snoopy is fairly easy to use...

I'm wondering myself how to implement a full-fledged spider.
01-02-2007, 04:21 PM | #8
phoenix1215
I'm looking for something somewhat similar to these, but I'm a total newbie to a lot of this.

First, I have a fan site for Neopets and Neopets has tons of sponsor links... all links like this:

http://www.neopets.com/process_click.phtml?item_id=

and a lot of them actually pay out in neopoints (their currency) just for visiting the sponsor. I have a section of my site dedicated to the links that currently pay out, and how much they pay per click and in total (thankfully none need more than 3 clicks). However, I have to follow each link manually, then wait 15 to 20 minutes to see whether it pays out. If it does, I have to follow it 2 more times, waiting 15 to 20 minutes in between, to see how many times it can be clicked and what the total payout is.

Since they have roughly 10,000 links, this takes me a month or more, and by the time I've finished I have to start over to verify which links are still paying out and whether anything has changed.

There's stiff competition and since I'm just one person it's taking way too long for me to find the links. I'm trying to find something that might help me speed this up some.

I know I won't get something that will take care of all of this for me, but something that might at least verify all the links as either linking to external sponsor pages, redirecting to the home page, or as invalid would be a great help. That would certainly narrow it down at least.

Any ideas how I could do this?

I have a database that I can use to store the links but I wouldn't know how to set that up.
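
The three-way check described above (external sponsor page, redirect to the home page, or invalid) could be sketched roughly like this. It is only a sketch under some assumptions: the links can be fetched without being logged in, the cURL extension is available, and the names classifyLink() and probeLink() plus the sample URLs are made up. It says nothing about whether a link pays out, which would still need checking by hand.

```php
<?php
// Classify a link from its HTTP status and the URL reached after redirects.
// $homeUrl is the page the site falls back to for dead links (assumption).
function classifyLink($status, $finalUrl, $homeUrl)
{
    if ($status < 200 || $status >= 400) {
        return 'invalid';
    }
    if (rtrim($finalUrl, '/') === rtrim($homeUrl, '/')) {
        return 'home';
    }
    $finalHost = strtolower((string) parse_url($finalUrl, PHP_URL_HOST));
    $homeHost  = strtolower((string) parse_url($homeUrl, PHP_URL_HOST));
    return $finalHost === $homeHost ? 'internal' : 'external';
}

// Follow redirects with cURL and report where the link ends up.
// A failed request leaves the status at 0, which classifies as 'invalid'.
function probeLink($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    curl_exec($ch);
    return array(
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final'  => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    );
}

// Offline demonstration with made-up results instead of real requests.
$home = 'http://www.neopets.com/';
$samples = array(
    array(200, 'http://sponsor.example/landing'),
    array(200, 'http://www.neopets.com'),
    array(404, 'http://www.neopets.com/missing'),
);
$labels = array();
foreach ($samples as $s) {
    $labels[] = classifyLink($s[0], $s[1], $home);
}
print_r($labels);
```

In real use, each stored link would go through probeLink() and the result would be passed to classifyLink(), with a pause between requests so the site is not flooded.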