
Script To Automatically Follow Links



Riboflavin
10-30-2005, 01:08 AM
Ok, I want to make a script that I can give a URL; it will read all the links on that page, follow them, read all the links on those pages, follow those, and keep going. I also want it to record each link it follows.

Is there a way to do this in PHP, or maybe another language?

VortexCortex
11-04-2005, 11:01 PM
LOL... That's the exact description of the first Java application I ever made. I used it to make an offline copy of Sun's Java API specifications. (I know now that I could have just downloaded the .zip file, right?)

JavaScript is not capable of storing the results; try Java, C, or VB...

I used Sun's website to learn how to develop Java apps for free...

Java JDK 5.0 (http://java.sun.com/j2se/1.5.0/download.jsp)
Java 1.5.0 API Specs (http://java.sun.com/j2se/1.5.0/docs/api/index.html)
Java Tutorial (http://java.sun.com/docs/books/tutorial/index.html)

Good Luck :thumbsup:

hyperbole
11-06-2005, 06:43 PM
Here (http://www.stonehenge.com/merlyn/WebTechniques/col27.html) is an example of a Perl script that walks the links of a given URL. You could use it as the basis for a script that downloads the pages connected to a specified URL.

Riboflavin
11-07-2005, 04:58 PM
<?php

/*
 *#################################################################
 * // VERSION 0.2
 *
 * RETRIEVES LINKS FROM A URL.
 * 1) Include this file.
 * 2) $links = GetLinks('http://www.somesite.com', '');
 * 3) You now have an array of all unique links at the URL specified.
 * 4) The second parameter is an optional filter: only links containing
 *    that string are returned.
 * 5) Added error checking for pages that happen to have no links, in which
 *    case the array would otherwise not exist.
 *
 * // Ignores javascript: links, links with email addresses, etc.
 * // Filters are case sensitive and must use web escape characters to match,
 *    i.e. & = &amp;
 *
 *#################################################################
 */


// Read an entire page into one string; returns false if the URL can't be opened.
function GetInput($url) {
    $handle = @fopen($url, 'r');
    if (!$handle) {
        return false;
    }
    $contents = '';
    while (!feof($handle)) {
        $contents .= fread($handle, 8192);
    }
    fclose($handle);
    return $contents;
}

function GetLinks($url, $filter)
{
    // $url = "http://server.com/somefile.html";
    $links = array();

    // Pull the domain out of the URL; give up if there isn't one.
    $match_domain = '_[hH][tT][tT][pP]:\/\/(.*?)(/|$)_';
    if (!preg_match($match_domain, $url, $res))
        return false;
    $domain = $res[1];
    if (!$domain)
        return false;
    if (!$input = GetInput($url))
        return false;

    // Grab the href of every <a> tag on the page.
    $lookfor = '/<[aA]\s.*?[hH][rR][eE][fF]=[ "\']{0,}([-.,\%_\(\)|=~;+:\?\&\/a-zA-Z0-9]+)[ "\'>]/';
    preg_match_all($lookfor, $input, $data);
    foreach ($data[1] as $v)
    {
        // Skip links that don't contain the filter string.
        if (strlen($filter))
        {
            if (!strstr($v, $filter))
            {
                continue;
            }
        }
        if (stristr($v, 'javascript:'))
        {
            // ignore - contains javascript
        }
        elseif (stristr($v, '//') == $v)
        {
            // Protocol-relative link: just add the scheme.
            $v = 'http:' . $v;
            $links[] = $v;
        }
        elseif (stristr($v, 'http://') != $v)
        {
            // Relative link: prepend the domain of the page we fetched.
            if (stristr($v, '/') != $v)
                $sep = '/';
            else
                $sep = '';
            $v = 'http://' . $domain . $sep . $v;
            $links[] = $v;
        }
        else
            $links[] = $v;
    }
    if (count($links))
    {
        // Drop duplicates while keeping a plain indexed array.
        $links = array_flip($links);
        $links = array_keys($links);
    }
    else
        $links[] = 'No Data';
    return $links;
}

$noLinkCount = 0;

// Crawl forever: pick a random URL out of links.txt, fetch its links and
// append them to the same file. Stop the script by hand when you have enough.
while (1 == 1) {

    $myFile = 'links.txt';
    $lines = file($myFile);
    $count = count($lines) - 1;          // rand() is inclusive and the array is 0-indexed

    $lineNumber = rand(0, $count);
    $place = rtrim($lines[$lineNumber]); // strip the trailing newline

    $links4 = GetLinks($place, '');

    $fh = fopen($myFile, 'a');
    if (!is_array($links4))
    {
        $noLinkCount++;                  // page had no usable domain or couldn't be fetched
    } else {
        foreach ($links4 as $value) {
            fwrite($fh, $value . "\n");
        }
    }

    fclose($fh);

}

echo "Done!";

?>



Ok, this is code I modified to find all the links on a page and write them to a text file. It then randomly chooses one of those links and searches that page for links in turn.

To make this a little bit more advanced, how could I make it store only the domain? For example, if the link is www.example.com/folder/page.html, it would store just www.example.com.
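For that part I was wondering whether parse_url() might do the trick; maybe something like this (just a guess on my part, untested):

<?php
// Untested guess: use parse_url() to keep only the host part of a link.
$link = 'http://www.example.com/folder/page.html';
$parts = parse_url($link);
$domain = $parts['host'];   // gives "www.example.com"
echo $domain;
?>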

Also, how could I efficiently check that a URL is not already in the file?

Thanks for all your help!

Jack Daniels
11-08-2005, 10:20 AM
I think it would be simpler to store the URLs in a database and then check whether a URL is already present with a query like this:

SELECT * FROM urls WHERE url LIKE '%google.com%'

This will return every URL that contains "google.com"...
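
If you go the database route, a rough and untested sketch of the PHP side could look something like this; it assumes a table called urls with a single url column, uses the mysql_* functions, and the database name and credentials are just placeholders for your own:

<?php
// Rough, untested sketch: only insert a URL if it isn't already in the table.
// Assumes the table was created with something like:
//   CREATE TABLE urls (url VARCHAR(255) NOT NULL, UNIQUE KEY (url));
mysql_connect('localhost', 'user', 'password');
mysql_select_db('spider');

function StoreUrl($url) {
    $safe = mysql_real_escape_string($url);
    $result = mysql_query("SELECT url FROM urls WHERE url = '$safe'");
    if (mysql_num_rows($result) == 0) {
        mysql_query("INSERT INTO urls (url) VALUES ('$safe')");
        return true;   // new URL, stored
    }
    return false;      // already in the table
}

StoreUrl('http://www.example.com/');
?>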

Bye ;)

Riboflavin
11-09-2005, 03:53 AM
I am still very unfamiliar with MySQL. Could anyone help me modify my code to store to a MySQL DB rather than a flat file? Then, rather than just using the root URL, we can use the entire link and search the DB before storing it, to avoid storing the same link twice.

anshul
11-11-2005, 11:16 AM
Well, Snoopy is fairly easy to use...

I'm wondering myself how to implement a full-fledged spider ;)
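
From memory (so double-check it against the Snoopy docs), grabbing the links on a page looks roughly like this:

<?php
// Rough sketch from memory: Snoopy's fetchlinks() downloads the page and
// extracts its links; the URLs end up in $snoopy->results as an array.
include 'Snoopy.class.php';

$snoopy = new Snoopy;
if ($snoopy->fetchlinks('http://www.example.com/')) {
    foreach ($snoopy->results as $link) {
        echo $link . "\n";
    }
} else {
    echo 'error: ' . $snoopy->error . "\n";
}
?>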

phoenix1215
01-02-2007, 05:21 PM
I'm looking for something somewhat similar to these, but I'm a total newbie to a lot of this.

First, I have a fan site for Neopets and Neopets has tons of sponsor links... all links like this:

http://www.neopets.com/process_click.phtml?item_id=

and a lot of them actually pay out in Neopoints (their currency) just for visiting the sponsor. I have a section of my site dedicated to the links that currently pay out, how much they pay per click, and the total (thankfully none are over 3 clicks). However, I have been having to manually follow each and every link, then wait 15 to 20 minutes to see whether it pays out. Then, if it does pay out, I have to follow it 2 more times, waiting 15 to 20 minutes in between, to see how many times I have to click it and what the total payout is.

Since they have roughly 10,000 links that I'm having to follow, I've been spending over a month straight, if not longer, doing this, and by the time I'm finished I'm starting over again to verify that they are or are not paying out, or that they haven't changed.

There's stiff competition and since I'm just one person it's taking way too long for me to find the links. I'm trying to find something that might help me speed this up some.

I know I won't get something that will take care of all of this for me, but something that could at least classify each link as leading to an external sponsor page, redirecting to the home page, or being invalid would be a great help. That would certainly narrow it down at least.

Any ideas how I could do this?
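
The only idea I've had so far is something like the sketch below using get_headers() to see where a link redirects, but it's untested and I don't know if I'm on the right track; the item_id in it is made up, and in practice the process_click.phtml links probably need my Neopets login cookies anyway:

<?php
// Untested idea: classify a link by where (if anywhere) it redirects.
function ClassifyLink($url) {
    $headers = @get_headers($url, 1);   // 1 = return an associative array
    if (!$headers) {
        return 'invalid';                        // no response at all
    }
    $location = isset($headers['Location']) ? $headers['Location'] : '';
    if (is_array($location)) {
        $location = end($location);              // several redirects: take the last one
    }
    if ($location == '') {
        return 'no redirect';
    }
    if (strpos($location, 'neopets.com') !== false) {
        return 'redirects back to neopets';
    }
    return 'external sponsor page';
}

// item_id=123 is a made-up example value
echo ClassifyLink('http://www.neopets.com/process_click.phtml?item_id=123');
?>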

I have a database that I can use to store the links but I wouldn't know how to set that up.


