
View Full Version : Crawler issue



ckfoxtrot
04-25-2012, 09:36 PM
Hey all,

I am still learning PHP and toying with a web crawler that uses curl. For some reason followlocation doesn't seem to be working.

Here is the code



<?php

function LoadCURLPage($url, $agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0', $cookie = '', $referer = '', $post_fields = '', $return_transfer = TRUE, $follow_location = TRUE, $ssl = '', $curlopt_header = 0)
{
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);

if($ssl)
{
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
}

curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);

if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}

if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}

curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}

if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}

$result = curl_exec($ch);

curl_close($ch);

return $result;
}


Any ideas on what might be wrong?

Fou-Lu
04-25-2012, 10:17 PM
Setting looks right. I'll test it out when I get home and see what it does.


This works fine for me. Chained like so:


// curlfollow1.php
<?php

header('Location: http://localhost:14055/curlfollow2.php');
print __FILE__ . PHP_EOL;
phpinfo();
?>



// curlfollow2.php
<?php

print __FILE__ . PHP_EOL;
phpinfo();
?>

And called with


print LoadCURLPage('http://localhost:14055/curlfollow1.php');

My results are:


D:\TestingProject\curlfollow2.php
// PHPinfo follows


So there appears to be nothing wrong with the setup.
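
If you want to double-check what happened on your end, you could dump the effective URL from the handle before closing it. Untested against your setup, but CURLINFO_EFFECTIVE_URL reports the final URL after any redirects:


// Inside LoadCURLPage(), after curl_exec() and before curl_close():

$result = curl_exec($ch);

// Final URL after any Location redirects were followed
echo 'Effective URL: ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . PHP_EOL;

curl_close($ch);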

ckfoxtrot
04-26-2012, 01:31 AM
Strange, I'll have to play with it more.

ckfoxtrot
04-26-2012, 01:37 AM
Doh, I realize the problem now (I think). The links in my test site are set up using JavaScript.

Fou-Lu
04-26-2012, 03:12 AM
Whoa whoa. This won't follow links. It follows a Location header redirect only.
If you want to follow links, you need to parse them out of the page and run a new curl request on each one.
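
Something like DOMDocument would do the parsing for you. Untested sketch, assuming $data holds the HTML that your LoadCURLPage() returned and using a made-up example.com URL:


$data = LoadCURLPage('http://www.example.com/');

$dom = new DOMDocument();

// Suppress warnings from real-world malformed HTML
@$dom->loadHTML($data);

foreach ($dom->getElementsByTagName('a') as $anchor)
{
$href = $anchor->getAttribute('href');

if ($href !== '')
{
// Each of these would get its own LoadCURLPage() call
echo $href . "\n";
}
}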

ckfoxtrot
04-26-2012, 07:25 AM
ah, back to the learning board

ckfoxtrot
04-26-2012, 08:49 AM
Well, making progress. I am able to parse links out of a site and store them in a file after checking whether they already exist. However, I'm pulling out the whole <a href=""></a> tag, so I need to cut just the URL out of it and, for internal site links, add the domain back onto the front.

ckfoxtrot
04-26-2012, 08:58 AM
Well, extracting just the URL from the HTML was easy enough. Now I just need to write something that checks whether the domain is at the start of the link or not... will have to think about that one, as it isn't jumping out at me right away.

Any suggestions?

Fou-Lu
04-26-2012, 04:42 PM
Use parse_url (http://ca2.php.net/manual/en/function.parse-url.php) to split it up, then compare it to the site you curled to.
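
Untested illustration with a made-up example.com URL; the point is that a relative href comes back with no host key at all:


$site = parse_url('http://www.example.com/index.php');
$link = parse_url('http://www.example.com/about.php');

if (isset($link['host']) && $link['host'] === $site['host'])
{
echo 'internal link';
}
else
{
// Relative links like 'about.php' have no host, only a path
echo 'external or relative link';
}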

ckfoxtrot
04-26-2012, 11:16 PM
I ended up going a slightly different route:



$pos0 = stripos($info, $findslash);
$pos1 = stripos($info, $findhttp);
$pos2 = stripos($info, $findwww);
$pos3 = stripos($info, $findmail);
$pos4 = stripos($info, $findjs);

if ($pos0 === 0)
{
$info = $url . $info;

$s_file = readfile('sites.txt');

if (preg_match('#' . $info . '#', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen($s_file, 'a+') or die("can't open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow link

$data = LoadCURLPage($info);

}
}


Now it is ending (it seems) after the echo without writing anything to sites.txt and I am not sure why.

I also know that I need to set it up to follow the new link differently.

Validating the rest of the URL follows the same format as above, e.g.:



else {

if ($pos1 !== true && $pos2 !== true && $pos3 !== true && $pos4 !== true)
{
$info = "http://www." . $info;

// Checking for URL

$s_file = readfile('sites.txt');

if (preg_match('#' . $info . '#', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen($s_file, 'a+') or die("can't open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow link

$data = LoadCURLPage($info);

}
}

else {

if ($pos2 == true)
{
$info = "http://www" . $info;

// Checking for URL

$s_file = readfile('sites.txt');

if (preg_match('#' . $info . '#', $s_file, $match))
{
}

else {

echo $info;

$s_fh = fopen($s_file, 'a+') or die("can't open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow link

$data = LoadCURLPage($info);

}
}

else {

if ($pos3 == true)
{
}

else {
}
}
}
}

Fou-Lu
04-27-2012, 12:27 AM
For the write problem, check whether the newlines are added. If so, it's the $info that's no good (in the first block I don't see $url or $info defined).

As for parse_url, I think you may be better off doing what you're doing for your approach. The problem with parse_url is that it doesn't parse properly unless it has all the parts. I.e.: codingforums.com/showthread.php will just come back as the path. It's not until http:// is added that it splits properly. So relative links won't work properly.

readfile itself isn't right. That should likely be file_get_contents. readfile works the same as an fopen() followed by fpassthru() and finally fclose(): it dumps the file straight to the screen and returns the number of bytes read, not a string. So your pattern match is simply comparing a URL to an integer. I'd use a database instead. Then just apply a constraint that doesn't allow duplicates, and insert away. Given that, I'd say you don't even need to check for success; just let it fail silently if it creates a PK violation.
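
Rough sketch of what I mean, untested; sites.db and the table name are just made up, and $info is your link variable. INSERT OR IGNORE is SQLite-specific and silently skips duplicates:


$db = new PDO('sqlite:sites.db');

// URL as the primary key, so duplicates can't get in
$db->exec('CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)');

// Skips any row that would violate the primary key -- no duplicate check needed
$stmt = $db->prepare('INSERT OR IGNORE INTO sites (url) VALUES (?)');
$stmt->execute(array($info));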

ckfoxtrot
04-27-2012, 05:12 AM
New lines aren't being added. $url and $info are defined before what I posted on here. $info should be fine, as it echoes just fine (including showing that my attempt to prepend http://www. to relative links works).

I'll change my use of readfile and see what happens, though your database suggestion sounds pretty good.

ckfoxtrot
04-27-2012, 08:42 AM
I have it working better now but am still having a problem.

main.php



<?php

error_reporting (E_ALL ^ E_NOTICE);

include 'functions.php';

// Connect to this url using CURL

$url = '<url here>';

$data = LoadCURLPage($url);

//Site Function

$sites = sites($data, $url);

?>


functions.php



<?php

function LoadCURLPage($url, $agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0', $cookie = '', $referer = '', $post_fields = '', $return_transfer = TRUE, $follow_location = TRUE, $ssl = '', $curlopt_header = 0)
{

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);

if($ssl)
{
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
}

curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);

if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}

if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}

curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}

if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}

$result = curl_exec($ch);

curl_close($ch);

return $result;
}

// Find links

function sites($s_string, $url)
{
preg_match_all('/<a href="(.*)">(.*)<\/a>/', $s_string, $s_matches);
foreach ($s_matches as $s_value)
{

$string_one = '<a href="';
$string_two = '"';
$new_string = $s_value[0];


$pos = stripos($new_string, $string_one);

$str = substr($new_string, $pos);

$str_two = substr($str, strlen($string_one));

$second_pos = stripos($str_two, $string_two);

$str_three = substr($str_two, 0, $second_pos);

$info = $str_three;

// Validating URL

$findslash = '/';
$findhttp = 'http://';
$findwww = 'www';
$findmail = 'mailto';
$findjs = 'javascript';

$pos0 = stripos($info, $findslash);
$pos1 = stripos($info, $findhttp);
$pos2 = stripos($info, $findwww);
$pos3 = stripos($info, $findmail);
$pos4 = stripos($info, $findjs);

if ($pos0 === 0)
{
$info = $url . $info;

$s_file = file_get_contents('sites.txt');

if (preg_match('#' . $info . '#', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen('sites.txt', 'a+') or die("can't open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

}
}

else {

if ($pos1 !== true && $pos2 !== true && $pos3 !== true && $pos4 !== true)
{
$info = "http://www." . $info;

// Checking for URL

$s_file = file_get_contents('sites.txt');

if (preg_match('#' . $info . '#', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen('sites.txt', 'a+') or die("can't open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

}
}

else {

if ($pos2 == true)
{
$info = "http://www" . $info;

// Checking for URL

$s_file = file_get_contents('sites.txt');

if (preg_match('#' . $info . '#', $s_file, $match))
{
}

else {

echo $info;

$s_fh = fopen('sites.txt', 'a+') or die("can't open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

}
}

else {

if ($pos3 == true)
{
}

else {
}
}
}
}
}
}

?>


It works right the first time through for me, with the first link it finds being a relative link. However, on the second pass it butchers the first URL and tacks http://www. onto the front of it. Then it appears to fail on the third pass.

So I'm guessing the internal array pointer isn't advancing for some reason?

ckfoxtrot
04-27-2012, 11:15 AM
Ah, problem lies in my pattern and array :/

ckfoxtrot
04-27-2012, 11:58 AM
Works with this!



function sites($s_string, $url)
{
preg_match_all("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU", $s_string, $s_matches);
foreach ($s_matches[1] as $s_value)
{

$string_one = '<a href="';
$string_two = '"';
$info = $s_value;

/* Code not needed with above pattern

$pos = stripos($new_string, $string_one);

$str = substr($new_string, $pos);

$str_two = substr($str, strlen($string_one));

$second_pos = stripos($str_two, $string_two);

$str_three = substr($str_two, 0, $second_pos);

$info = $str_three;

*/


But it does throw a few errors:

PHP Warning: preg_match(): Unknown modifier 'c' in /var/www/projects/functions.php on line 121

Shows up 4 times, with the unknown modifier as c, p, p, and d.

Line 121 is the first instance of

if (preg_match('#' . $info . '#', $s_file, $match))

that I have.

Maybe some weird link formatting throwing it off? I'll have to compare the array to the output.

Now back to figuring out how to start following links and repeating this process.

Fou-Lu
04-27-2012, 02:45 PM
Yeah, it's the creation of the pattern.
Use ~ instead of # for the preg delimiters, and make sure you run $info through preg_quote before you concatenate it in. Chances are the c, p, p, and d are the characters right after a # in a URL: the # in the link gets read as the end of the pattern, and whatever follows it as modifiers.
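
Something like this is what I mean; pass ~ as the second argument so a ~ inside the link gets escaped as well:


if (preg_match('~' . preg_quote($info, '~') . '~', $s_file, $match))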

ckfoxtrot
04-27-2012, 09:44 PM
Update: This seems to be working pretty well (with one exception)



<?php

function LoadCURLPage($url, $agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0', $cookie = '', $referer = '', $post_fields = '', $return_transfer = TRUE, $follow_location = TRUE, $ssl = '', $curlopt_header = 0)
{

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);

if($ssl)
{
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
}

curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);

if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}

if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}

curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}

if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}

$result = curl_exec($ch);

curl_close($ch);

return $result;
}

// Find links

function sites($s_string, $url)
{
preg_match_all("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU", $s_string, $s_matches);
foreach ($s_matches[1] as $s_value)
{

// preg_quote() is applied where the pattern is built below
$info = $s_value;

/* Code not needed with above pattern

$string_one = '<a href="';
$string_two = '"';

$pos = stripos($new_string, $string_one);

$str = substr($new_string, $pos);

$str_two = substr($str, strlen($string_one));

$second_pos = stripos($str_two, $string_two);

$str_three = substr($str_two, 0, $second_pos);

$info = $str_three;

*/

// Validating URL

$findslash = '/';
$findhttp = 'http://';
$findwww = 'www';
$findmail = 'mailto';
$findjs = 'javascript';
$findhttps = 'https://';
$findhash = '#';

$pos0 = stripos($info, $findslash);
$pos1 = stripos($info, $findhttp);
$pos2 = stripos($info, $findwww);
$pos3 = stripos($info, $findmail);
$pos4 = stripos($info, $findjs);
$pos5 = stripos($info, $findhttps);
$pos6 = stripos($info, $findhash);

if ($pos0 === 0)
{
$info = $url . $info;

$s_file = file_get_contents('sites.txt');

if (preg_match('~' . preg_quote($info, '~') . '~', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen('sites.txt', 'a+') or die("cannot open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow New Link

$newdata = LoadCURLPage($info);

//Site Function

$newsites = sites($newdata, $url);

}
}

else {

if ($pos1 === false && $pos2 === false && $pos3 === false && $pos4 === false && $pos5 === false && $pos6 !== 0)
{
$info = "http://www." . $info;

// Checking for URL

$s_file = file_get_contents('sites.txt');

if (preg_match('~' . preg_quote($info, '~') . '~', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen('sites.txt', 'a+') or die("cannot open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow New Link

$newdata = LoadCURLPage($info);

//Site Function

$newsites = sites($newdata, $url);

}
}

else {

if ($pos2 === 0 && $pos3 === false && $pos4 === false)
{
$info = "http://" . $info;

// Checking for URL

$s_file = file_get_contents('sites.txt');

if (preg_match('~' . preg_quote($info, '~') . '~', $s_file, $match))
{
}

else {

echo $info . "\n";

$s_fh = fopen('sites.txt', 'a+') or die("cannot open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow New Link

$newdata = LoadCURLPage($info);

//Site Function

$newsites = sites($newdata, $url);
}
}

else {
if ($pos3 === false && $pos4 === false && $pos5 === false && $pos6 === false)
{
// Checking for URL

$s_file = file_get_contents('sites.txt');

if (preg_match('~' . preg_quote($info, '~') . '~', $s_file, $match))
{
}

else {
echo $info . "\n";

$s_fh = fopen('sites.txt', 'a+') or die("cannot open file");

fwrite($s_fh, $info . "\n");

fclose($s_fh);

// Follow New Link

$newdata = LoadCURLPage($info);

//Site Function

$newsites = sites($newdata, $url);
}
}

else
{
if ($pos3 != false)
{
}

else {
if ($pos4 != false)
{
}
}
}
}
}
}
}
}

?>


The only problem now is I'm getting stuck in an infinite loop at one point.

Fou-Lu
04-27-2012, 10:04 PM
Yes, if you keep following links it will eventually time out.
Give it a maximum number it's allowed to follow. I don't want to go through that second function too much (it looks like you could compact it quite a bit or split some of it out into functions to clean it up). The easiest thing to do is add this at the top of the function:


function sites($s_string, $url, $iMaximumFollows = 10)
{
static $iFollowed = 0;
if ($iFollowed > $iMaximumFollows)
{
return;
}
//...


Surround each call to the function with an increment / decrement:


++$iFollowed;
$newites = sites($newdata, $url);
--$iFollowed;

That will keep it from the cyclical recursion you are seeing now by allowing only 10 (in this example) maximum hops. There are a few places you'll need to change it if you don't compact the existing sites function.

ckfoxtrot
04-27-2012, 11:01 PM
I should probably set a max number to follow.

Right now, I run into a link that requires users to be logged in; if they aren't, it redirects to the login page, which my script keeps reloading over and over.

I also need to fix how I prepend the URL to relative links if/when the crawl gets away from the original domain.
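
Something like this helper is what I have in mind (untested sketch; resolveLink is just a name I'm using): resolve each href against the page it was found on rather than always prepending the start URL.


function resolveLink($pageUrl, $href)
{
// Already absolute? Leave it alone.
if (stripos($href, 'http://') === 0 || stripos($href, 'https://') === 0)
{
return $href;
}

$parts = parse_url($pageUrl);
$base = $parts['scheme'] . '://' . $parts['host'];

// Root-relative link: /foo.php
if (strpos($href, '/') === 0)
{
return $base . $href;
}

// Document-relative link: foo.php, resolved against the current directory
$dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/\\') : '';

return $base . $dir . '/' . $href;
}

echo resolveLink('http://www.example.com/a/index.php', 'page2.php');
// http://www.example.com/a/page2.php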


