OK, I want to make a script that I can give a URL. It will read all the links on that page, follow them, read all the links on those pages and follow those too, and keep going like that. I also want it to record each link it follows.
Is there a way to do so in PHP, or maybe another language?
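For what it's worth, the loop described above can be sketched in PHP as a queue of pages still to visit plus a record of pages already followed. This is only a rough sketch: crawl(), $fetchLinks and $maxPages are names made up for the example, and $fetchLinks stands in for whatever code actually downloads a page and returns the links on it.

```php
<?php
// Breadth-first crawl sketch.
// $fetchLinks is a hypothetical callable: it takes a URL and returns an
// array of the URLs found on that page.
function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $visited = [];          // URL => true, one entry per link followed
    $queue = [$startUrl];   // URLs waiting to be followed

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;       // already followed, skip it to avoid loops
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    // The followed links, in the order they were visited.
    return array_keys($visited);
}
```

The $maxPages limit is there because a crawl like this otherwise has no natural stopping point on the open web.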
LOL... That's the exact description of the first Java application I ever made. I used it to make an offline copy of Sun's Java API specifications. (I know now that I could have downloaded the .zip file, right?)
JavaScript running in a browser is not capable of storing the results. Try Java, C, or VB...
I used Sun's website to learn how to develop Java apps for free...
Here is an example of a script in PHP that will walk the links of a given URL. You could use it to create a script that will download the pages connected to a specified URL.
/*
*#################################################################
* // VERSION 0.2
*
* RETRIEVES LINKS FROM A URL.
* 1) Include this file.
* 2) $links=GetLinks('http://www.somesite.com', '');
* 3) You now have an array of all unique links at the URL specified.
* 4) You can now filter the links you want returned in the second parameter.
* 5) Added error checking if the page happened to have no links, in which case
* the array value doesn't exist.
*
* // Ignores javascript, links with email addresses, etc.
* // Filters are case sensitive and must use web escape characters to match
* i.e. & = &amp;
*
*#################################################################
*/
function GetInput($url) {
    // Returns the page contents, or false if the URL cannot be opened.
    $handle = @fopen($url, 'r');
    if (!$handle) {
        return false;
    }
    $contents = '';
    while (!feof($handle)) {
        $contents .= fread($handle, 8192);
    }
    fclose($handle);
    return $contents;
}

function GetLinks($url, $filter)
{
    // $url="http://server.com/somefile.html";
    $links = array();
    // Capture the host part of an http:// or https:// URL.
    $match_domain = '_https?://(.*?)(/|$)_i';
    if (!preg_match($match_domain, $url, $res))
        return false;
    $domain = $res[1];
    if (!$domain)
        return false;
    if (!$input = GetInput($url))
        return false;
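The listing above stops before the part that pulls the links out of the page. One possible way to do that step, not necessarily how the original script did it, is PHP's DOMDocument class. The sketch below is an assumption on my part: extractLinks() is a made-up name, and the filter here simply keeps links that contain the given text. Because DOMDocument decodes entities, a filter would be written with a plain & rather than &amp; in this version.

```php
<?php
// Sketch: return the unique href values found in an HTML string.
// extractLinks() and its parameters are illustrative names.
function extractLinks(string $html, string $filter = ''): array
{
    if (trim($html) === '') {
        return [];                       // nothing to parse
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue;                    // no target, or same-page anchor
        }
        if (preg_match('/^(javascript|mailto):/i', $href)) {
            continue;                    // ignore javascript and email links
        }
        if ($filter !== '' && strpos($href, $filter) === false) {
            continue;                    // case-sensitive substring filter
        }
        $links[] = $href;
    }

    return array_values(array_unique($links));
}
```

Relative links such as /folder/page.html come back as written, so they would still need to be joined with the page's domain before they can be followed.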
OK, this is some code I modified to find all the links on a page and write them to a text file. It then randomly chooses one of those links and searches that page for links.
To make this a little more advanced, how could I make it store only the domain? For example, if the link is www.example.com/folder/page.html, it would just store www.example.com.
Also, how could I efficiently check that the URL is not already in the file?
Thanks for all your help!
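For the two questions above, here is a rough sketch. It assumes the text file holds one domain per line, and hostOf(), loadSeen() and recordHost() are names invented for the example. The idea is to read the file once into an array keyed by domain, so each "is it already there?" check is a single isset() instead of a scan through the file.

```php
<?php
// Reduce a link to its domain, e.g.
// www.example.com/folder/page.html becomes www.example.com
function hostOf(string $link): ?string
{
    // parse_url() only reports a host when a scheme is present,
    // so add one for bare links that start with the domain.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $link)) {
        $link = 'http://' . ltrim($link, '/');
    }
    $host = parse_url($link, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// Load the domains already stored in the file as domain => true.
function loadSeen(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return array_fill_keys($lines, true);
}

// Append the domain to the file unless it was seen before.
// Returns true if it was new, false if it was a duplicate.
function recordHost(array &$seen, string $file, string $host): bool
{
    if (isset($seen[$host])) {
        return false;
    }
    $seen[$host] = true;
    file_put_contents($file, $host . "\n", FILE_APPEND | LOCK_EX);
    return true;
}
```

Typical use would be to call loadSeen() once at startup, then call recordHost($seen, $file, hostOf($link)) for each link whose hostOf() result is not null.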
I think it would be simpler to store the URLs in a database and then check whether a URL is already present with a query like this:
SELECT * FROM urls WHERE url LIKE '%google.com%'
This will return every url that contains "google.com"...
I am still very unfamiliar with MySQL. Could anyone help me modify my code to store to a MySQL DB rather than a flat file? Then, rather than just using the root URL, we could use the entire link and search the DB before storing it, to avoid storing it twice.
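A minimal sketch of that idea using PDO follows. The table name urls comes from the query earlier in the thread; the column definition and the function names are assumptions for the example. It checks with an exact match (url = ?) rather than LIKE '%...%', because LIKE with wildcards on both sides also matches longer URLs that merely contain the text, and a leading wildcard generally prevents the database from using an index.

```php
<?php
// Create the table if it is not there yet. Making url the primary key
// also lets the database itself refuse duplicates.
function createUrlTable(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS urls (url VARCHAR(255) NOT NULL PRIMARY KEY)');
}

// True if this exact URL is already stored.
function urlExists(PDO $db, string $url): bool
{
    $stmt = $db->prepare('SELECT 1 FROM urls WHERE url = ?');
    $stmt->execute([$url]);
    return $stmt->fetchColumn() !== false;
}

// Store the URL unless it is already present.
// Returns true if a row was added, false if it was a duplicate.
function storeUrl(PDO $db, string $url): bool
{
    if (urlExists($db, $url)) {
        return false;
    }
    $stmt = $db->prepare('INSERT INTO urls (url) VALUES (?)');
    $stmt->execute([$url]);
    return true;
}
```

For MySQL the connection would look something like new PDO('mysql:host=localhost;dbname=crawler', $user, $pass), where the database name and credentials are placeholders. To try the functions without a server, new PDO('sqlite::memory:') should also work if the SQLite driver is installed.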
and a lot of them actually pay out in neopoints (their currency) just for visiting the sponsor. I have a section of my site dedicated to the links that currently pay out, how much they pay per click, and the total (thankfully none need more than 3 clicks). However, I have had to manually follow each and every link, then wait 15 to 20 minutes to see whether it pays out. If it does, I have to follow it 2 more times, waiting 15 to 20 minutes in between, to see how many times I have to click it and what the total payout is.
Since they have roughly 10,000 links that I have to follow, this takes me over a month straight, if not longer, and by the time I'm finished I have to start over to verify once again whether they still pay out and whether they have changed.
There's stiff competition and since I'm just one person it's taking way too long for me to find the links. I'm trying to find something that might help me speed this up some.
I know I won't get something that will take care of all of this for me, but something that might at least verify all the links as either linking to external sponsor pages, redirecting to the home page, or as invalid would be a great help. That would certainly narrow it down at least.
Any ideas how I could do this?
I have a database that I can use to store the links but I wouldn't know how to set that up.
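One way the sorting step could look in PHP is sketched below. Everything here is an assumption rather than a tested solution for that site: classifyLink() and followLink() are made-up names, "home" is taken to mean the final address is the site's own host with an empty or / path, and followLink() needs the curl extension.

```php
<?php
// Decide what a followed link turned out to be, given the final HTTP
// status, the final URL after redirects, and the site's own host name.
function classifyLink(int $status, string $finalUrl, string $homeHost): string
{
    if ($status < 200 || $status >= 400) {
        return 'invalid';               // error status, or no response at all
    }
    $host = strtolower((string) parse_url($finalUrl, PHP_URL_HOST));
    if ($host === '') {
        return 'invalid';
    }
    if ($host !== strtolower($homeHost)) {
        return 'external';              // ended up on another site
    }
    $path = (string) parse_url($finalUrl, PHP_URL_PATH);
    return ($path === '' || $path === '/') ? 'home' : 'internal';
}

// Follow a link, including redirects, and report where it ended up.
// Returns [status code, final URL]; the status is 0 if the request failed.
function followLink(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_NOBODY         => true, // headers only, no page body
        CURLOPT_TIMEOUT        => 15,
    ]);
    curl_exec($ch);
    return [
        (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
}
```

A run over the list would call followLink() on each link and pass the result to classifyLink(), then store the label next to the link. This only narrows the list down. It cannot tell whether a link pays out, and some servers answer header-only requests differently from normal ones, so the CURLOPT_NOBODY line may need to be removed.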