  1. #1: Regular Coder (joined Sep 2004, USA, 314 posts)

    Script To Automatically Follow Links

    OK, I want to make a script that I can give a URL; it will read all the links on that page, follow them, read all the links on those pages, follow those, and so on. I also want it to record each link it follows.

    Is there a way to do this in PHP, or maybe another language?

  • #2: Regular Coder (joined May 2005, 142 posts)

    You need more functionality than JavaScript has.

    LOL... that's the exact description of the first Java application I ever made. I used it to make an offline copy of Sun's Java API specifications. (I know now that I could have just downloaded the .zip file, right?)

    JavaScript is not capable of storing the results; try Java, C, or VB...

    I used Sun's website to learn how to develop Java apps for free...

    Java JDK 5.0
    Java 1.5.0 API Specs
    Java Tutorial

    Good Luck

  • #3: Regular Coder (joined Sep 2005, Madison, Indiana, USA, 166 posts)
    Here is an example of a script in Perl that will walk the links of a given URL. You could use it to create a script that will download the pages connected to a specified URL.



  • #4: Regular Coder (joined Sep 2004, USA, 314 posts)
    PHP Code:
    <?php

    /*
    *#################################################################
    *    // VERSION 0.2
    *
    *    RETRIEVES LINKS FROM A URL.
    *    1) Include this file.
    *    2) $links = GetLinks('http://www.somesite.com', '');
    *    3) You now have an array of all unique links at the URL specified.
    *    4) You can filter which links are returned with the second parameter.
    *    5) Added error checking for pages that happen to have no links, in which
    *       case the array value doesn't exist.
    *
    *    // Ignores javascript: links, links with email addresses, etc.
    *    // Filters are case sensitive and must use web escape characters to match,
    *       i.e.  & = &amp;
    *
    *#################################################################
    */

    // Read the entire document at $url and return it as a string.
    function GetInput($url) {
        $handle = fopen($url, 'r');
        $contents = '';
        while (!feof($handle)) {
            $contents .= fread($handle, 8192);
        }
        fclose($handle);
        return $contents;
    }

    // Return an array of unique, absolute links found at $url,
    // optionally keeping only links that contain $filter.
    function GetLinks($url, $filter)
    {
        // $url = "http://server.com/somefile.html";
        $links = array();
        $match_domain = '_[hH][tT][tT][pP]:\/\/(.*?)(/|$)_';
        preg_match($match_domain, $url, $res);
        $domain = isset($res[1]) ? $res[1] : '';
        if (!$domain)
            return false;
        if (!($input = GetInput($url)))
            return false;

        $lookfor = '/<[aA]\s.*?[hH][rR][eE][fF]=[ \t"\']{0,}([-.,\%_\(\)|=~;+:\?\&\/a-zA-Z0-9]+)[ "\'>]/';
        preg_match_all($lookfor, $input, $data);
        while (list($k, $v) = each($data[1]))
        {
            // filter by the second parameter, if one was given
            if (strlen($filter))
            {
                if (!strstr($v, $filter))
                {
                    continue;
                }
            }
            if (stristr($v, 'javascript:'))
            {
                // ignore - contains javascript
            }
            elseif (stristr($v, '//') == $v)
            {
                // protocol-relative link: prepend http:
                $v = 'http:' . $v;
                $links[] = $v;
            }
            elseif (stristr($v, 'http://') != $v)
            {
                // relative link: prepend the page's domain
                if (stristr($v, '/') != $v)
                    $sep = '/';
                else
                    $sep = '';
                $v = 'http://' . $domain . $sep . $v;
                $links[] = $v;
            }
            else
                $links[] = $v;
        }
        if (count($links))
        {
            // flip twice to drop duplicate entries
            $links = array_flip($links);
            $links = array_keys($links);
        }
        else
            $links[] = 'No Data';
        return $links;
    }

    $noLinkCount = 0;

    while (1 == 1) {

        // pick a random line (URL) from the file of links collected so far
        $filename = 'links.txt';
        $lines = file($filename);
        $count = count($lines);
        $count--;

        $lineNumber = rand(0, $count);

        $myFile = "links.txt";
        $handle = fopen($myFile, "r");
        $contents = fread($handle, filesize($myFile));
        fclose($handle);
        $data = explode("\n", $contents);

        $place = $data[$lineNumber];

        // fetch that page's links and append them to the file
        $links4 = GetLinks($place, '');

        $fh = fopen($myFile, 'a');
        if (!is_array($links4))
        {
            $noLinkCount = 0;
        } else {
            while (list($key, $value) = each($links4)) {
                $stringData = $value . "\n";
                fwrite($fh, $stringData);
            }
        }

        fclose($fh);

    }

    echo "Done!";

    ?>


    OK, this is code I modified to find all the links on a page and write them to a text file. It then randomly picks one of the links from that file and searches it for more links.

    To make this a little more advanced, how could I make it store only the domain? For example, if the link is www.example.com/folder/page.html, it would store just www.example.com.

    Also, how could I efficiently check that a URL is not already in the file?

    Thanks for all your help!

  • #5: New to the CF scene (joined Nov 2005, 5 posts)
    Quote Originally Posted by Riboflavin
    Also, how could I efficiently check that a URL is not already in the file?
    I think it would be simpler to store the URLs in a database and then check whether a URL is already present with a query like this:

    SELECT * FROM urls WHERE url LIKE '%google.com%'

    This will return every URL that contains "google.com"...
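
    A rough sketch of that check from PHP, using the old mysql_* functions that were current at the time. The table name urls, the column name url, the connection details, and the $link variable (whatever URL you just pulled out of GetLinks) are all just assumptions, and I've used an exact match rather than LIKE so different pages on the same site stay distinct:

    PHP Code:
    <?php
    // connection details are placeholders
    mysql_connect('localhost', 'user', 'password');
    mysql_select_db('mydb');

    // returns true if $url has already been stored
    function url_seen($url) {
        $safe   = mysql_real_escape_string($url);
        $result = mysql_query("SELECT COUNT(*) FROM urls WHERE url = '$safe'");
        return mysql_result($result, 0) > 0;
    }

    // store a newly found link only if it isn't in the table yet
    if (!url_seen($link)) {
        mysql_query("INSERT INTO urls (url) VALUES ('" . mysql_real_escape_string($link) . "')");
    }
    ?>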

    Bye

  • #6: Regular Coder (joined Sep 2004, USA, 314 posts)
    I am still very unfamiliar with MySQL. Could anyone help me modify my code to store to a MySQL database rather than a flat file? Then, rather than just using the root URL, we could use the entire link and search the database before storing it, so nothing gets stored twice.
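
    Not a drop-in patch, but a sketch of how the fwrite() loop in the script above might change if the links go into MySQL instead of links.txt. It assumes a table urls with a url VARCHAR(255) column carrying a UNIQUE index, plus placeholder connection details; with the UNIQUE index, INSERT IGNORE lets MySQL skip duplicates, so no separate lookup is needed:

    PHP Code:
    <?php
    // placeholders for the real connection details
    mysql_connect('localhost', 'user', 'password');
    mysql_select_db('mydb');

    foreach ($links4 as $value) {
        // keep only the host part (www.example.com), as discussed above;
        // drop this block to store the full URL instead
        $parts = parse_url($value);
        if (isset($parts['host'])) {
            $value = $parts['host'];
        }

        $safe = mysql_real_escape_string($value);
        mysql_query("INSERT IGNORE INTO urls (url) VALUES ('$safe')");
    }
    ?>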

  • #7: Regular Coder (joined Jul 2004, 133 posts)

    Well, Snoopy is fairly easy to use...

    I'm wondering myself how to implement a full-fledged spider.
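
    For what it's worth, grabbing a page's links with Snoopy looks roughly like this, if I remember its API right (Snoopy.class.php is assumed to be in the include path, and www.example.com is just a stand-in URL):

    PHP Code:
    <?php
    require "Snoopy.class.php";

    $snoopy = new Snoopy;

    // fetchlinks() fetches the page and puts the extracted links in $snoopy->results
    if ($snoopy->fetchlinks("http://www.example.com/")) {
        foreach ($snoopy->results as $link) {
            echo $link . "\n";
        }
    } else {
        echo "Fetch failed: " . $snoopy->error . "\n";
    }
    ?>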
    PHP, SEO | anshul | mediasworks.org | FB

  • #8: New Coder (joined Oct 2006, Pittsburg, KS, 19 posts)
    I'm looking for something somewhat similar to these, but I'm a total newbie to a lot of this.

    First, I have a fan site for Neopets and Neopets has tons of sponsor links... all links like this:

    http://www.neopets.com/process_click.phtml?item_id=

    and a lot of them actually pay out in Neopoints (their currency) just for visiting the sponsor. I have a section of my site dedicated to the links that currently pay out, how much they pay per click, and the total (thankfully none take more than 3 clicks). However, I have been having to manually follow each and every link, then wait 15 to 20 minutes to see whether it pays out. Then, if it does pay out, I have to follow it 2 more times, waiting 15 to 20 minutes in between, to see how many times I have to click it and what the total payout is.

    Since there are roughly 10,000 links to follow, I've been spending over a month straight, if not longer, doing this, and by the time I'm finished I'm starting over to verify once again which ones still pay out and which have changed.

    There's stiff competition, and since I'm just one person it's taking way too long for me to find the links. I'm trying to find something that might help me speed this up some.

    I know I won't get something that will take care of all of this for me, but something that could at least classify each link as leading to an external sponsor page, redirecting back to the home page, or being invalid would be a great help. That would certainly narrow it down.

    Any ideas how I could do this?

    I have a database that I can use to store the links but I wouldn't know how to set that up.
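
    One way to triage links like that without a full spider is to request each URL without crawling further and look at where it redirects. A rough sketch using PHP's get_headers() (it needs allow_url_fopen enabled); the classification rules are only guesses at how the sponsor links behave, and the item_id in the example call is a made-up placeholder:

    PHP Code:
    <?php
    // Classify a sponsor link as invalid, non-redirecting, bouncing back to
    // Neopets, or pointing at an external sponsor page.
    function classify_link($url) {
        $headers = @get_headers($url, 1);          // 1 = associative array of headers
        if ($headers === false) {
            return 'invalid';                      // DNS failure, timeout, etc.
        }
        $location = isset($headers['Location']) ? $headers['Location'] : '';
        if (is_array($location)) {                 // several redirects: take the last one
            $location = end($location);
        }
        if ($location === '') {
            return 'no redirect';
        }
        if (strpos($location, 'neopets.com') !== false) {
            return 'redirects back to Neopets';
        }
        return 'external sponsor page: ' . $location;
    }

    echo classify_link('http://www.neopets.com/process_click.phtml?item_id=123');
    ?>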

