
Thread: Crawler issue

#1 ckfoxtrot (New Coder)

    Hey all,

I am still learning PHP and toying with a web crawler that uses cURL. For some reason CURLOPT_FOLLOWLOCATION doesn't seem to be working.

Here is the code:

Code:
<?php

function LoadCURLPage($url, $agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0', $cookie = '', $referer = '', $post_fields = '', $return_transfer = TRUE, $follow_location = TRUE, $ssl = '', $curlopt_header = 0)
{
	$ch = curl_init();

	curl_setopt($ch, CURLOPT_URL, $url);

	if ($ssl)
	{
		// 2 = verify that the certificate's name matches the host
		curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
	}

	// Include response headers in the output only when asked to
	curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);

	if ($agent)
	{
		curl_setopt($ch, CURLOPT_USERAGENT, $agent);
	}

	if ($post_fields)
	{
		// CURLOPT_POST must be 1 (enabled) for a POST request, not 0
		curl_setopt($ch, CURLOPT_POST, 1);
		curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
	}

	// Use the parameters rather than hardcoding TRUE, so callers can turn these off
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

	if ($referer)
	{
		curl_setopt($ch, CURLOPT_REFERER, $referer);
	}

	if ($cookie)
	{
		curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
		curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
	}

	$result = curl_exec($ch);

	curl_close($ch);

	return $result;
}
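To check whether the redirect is even being attempted, I also tried dumping curl_getinfo() after a request (not sure this is the right way to test it, and example.com here is just a stand-in for my test site):

Code:
// Standalone check, separate from the function above
$ch = curl_init('http://example.com/some-redirecting-page');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$result = curl_exec($ch);

// REDIRECT_COUNT should be > 0 if FOLLOWLOCATION kicked in;
// EFFECTIVE_URL shows where the request finally landed
echo 'Redirects followed: ' . curl_getinfo($ch, CURLINFO_REDIRECT_COUNT) . "\n";
echo 'Final URL: ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . "\n";

curl_close($ch);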
    Any ideas on what might be wrong?

  • #2
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,979
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
    Setting looks right. I'll test it out when I get home and see what it does.

    Edit:
    This works fine for me. Chained as so:
PHP Code:
<?php
// curlfollow1.php

header('Location: http://localhost:14055/curlfollow2.php');
print __FILE__ . PHP_EOL;
phpinfo();
?>
PHP Code:
<?php
// curlfollow2.php

print __FILE__ . PHP_EOL;
phpinfo();
?>
    And called with
    PHP Code:
    print loadCurlPage('http://localhost:14055/curlfollow1.php'); 
    My results are:
    Code:
    D:\TestingProject\curlfollow2.php 
    // PHPinfo follows
    So there appears to be nothing wrong with the setup.

  • #3
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
Strange, I'll have to play with it more.

  • #4
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
D'oh, I realize the problem now (I think). The links in my test site are set up using JavaScript.

  • #5
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,979
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
Whoa whoa. This won't follow links; it only follows a Location header redirect.
If you want to follow links, you need to parse them out of the page and execute a new curl request for each one.
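The rough shape of that (untested sketch; I'm using DOMDocument here instead of a regex, and your LoadCURLPage() from post #1):

PHP Code:
// Fetch a page, pull every href out of it, and give each link its own request.
$html = LoadCURLPage('http://www.example.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html); // real-world HTML is rarely valid; suppress the warnings

$links = array();
foreach ($dom->getElementsByTagName('a') as $anchor)
{
	$href = $anchor->getAttribute('href');
	if ($href != '')
	{
		$links[] = $href;
	}
}

foreach ($links as $link)
{
	// Each link is a brand new curl request; FOLLOWLOCATION won't do this part.
	$page = LoadCURLPage($link);
}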

  • #6
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
Ah, back to the learning board.

  • #7
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
Well, making progress. I am able to parse links out of a site and store them in a file after checking whether each one already exists. However, I'm pulling out the whole <a href=""></a> tag, so I need to cut just the URL out of that and, for internal site links, add the domain back onto the front.

  • #8
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
Well, extracting just the URL from the HTML was easy enough. Now I just need to write something that checks whether the domain is at the start of the link or not... I'll have to think about that one, as it isn't jumping out at me right away.

Any suggestions?

  • #9
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,979
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
Use parse_url() to split it up, then compare the host to the site you curled.
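Something like this (sketch only; $info is the extracted href, $url is the page you curled, and $url is assumed to be absolute):

PHP Code:
$site  = parse_url($url);  // the page you curled
$parts = parse_url($info); // the href you extracted

if (!isset($parts['host']))
{
	// No host at all: treat it as a relative link and prefix the site.
	$info = $site['scheme'] . '://' . $site['host'] . '/' . ltrim($info, '/');
}
elseif ($parts['host'] == $site['host'])
{
	// Internal absolute link; fine as-is.
}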

  • #10
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
    I ended up going a slightly different route:

Code:
$pos0 = stripos($info, $findslash);
$pos1 = stripos($info, $findhttp);
$pos2 = stripos($info, $findwww);
$pos3 = stripos($info, $findmail);
$pos4 = stripos($info, $findjs);

if ($pos0 === 0)
{
	$info = $url . $info;

	$s_file = readfile('sites.txt');

	if (preg_match('#' . $info . '#', $s_file, $match))
	{
	}
	else {

		echo $info . "\n";

		$s_fh = fopen($s_file, 'a+') or die("can't open file");

		fwrite($s_fh, $info . "\n");

		fclose($s_fh);

		// Follow link

		$data = LoadCURLPage($info);
	}
}
Now it seems to end after the echo without writing anything to sites.txt, and I am not sure why.

I also know that I need to set it up to follow the new link differently.

Validating the rest of the URL follows the same format as above, e.g.:

    Code:
    else {
    
    	if ($pos1 !== true && $pos2 !== true && $pos3 !== true && $pos4 !== true)
    	{
    		$info = "http://www." . $info;
    	
    		// Checking for URL
    
    		$s_file = readfile('sites.txt');
    	
    		if (preg_match('#' . $info . '#', $s_file, $match))
    		{
    		}
    
    		else {
    			
    			echo $info . "\n";
    		
    			$s_fh = fopen($s_file, 'a+') or die("can't open file");
    
    			fwrite($s_fh, $info . "\n");
    
    			fclose($s_fh);
    			
    			// Follow link
    
    			$data = LoadCURLPage($info);
    		
    		}
    	}
    
    	else {
    		
    		if ($pos2 == true)
    		{
    			$info = "http://www" . $info;
    		
    			// Checking for URL
    
    			$s_file = readfile('sites.txt');
    	
    			if (preg_match('#' . $info . '#', $s_file, $match))
    			{
    			}
    
    			else {
    			
    				echo $info;
    			
    				$s_fh = fopen($s_file, 'a+') or die("can't open file");
    
    				fwrite($s_fh, $info . "\n");
    
    				fclose($s_fh);
    			
    				// Follow link
    
    				$data = LoadCURLPage($info);
    				
    			}
    		}
    
    		else {
    		
    			if ($pos3 == true)
    			{
    			}
    		
    			else {
    			}
    		}
    	}
    }

  • #11
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,979
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
For the write problem, check whether the newlines are being added. If they are, it's the $info that's no good (in the first block I don't see $url or $info defined).

As for parse_url(), I think you may be better off doing what you're already doing for your approach. The problem with parse_url() is that it doesn't parse properly unless the URL has all its parts. I.e., codingforums.com/showthread.php just comes back as the path; it's not until http:// is added that it splits properly. So relative links won't work right.
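You can see it with a quick test:

PHP Code:
print_r(parse_url('codingforums.com/showthread.php'));
// Array ( [path] => codingforums.com/showthread.php )

print_r(parse_url('http://codingforums.com/showthread.php'));
// Array ( [scheme] => http [host] => codingforums.com [path] => /showthread.php )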

readfile() itself isn't right; that should likely be file_get_contents(). readfile() works like an fopen() followed by fpassthru() and finally fclose(): it dumps the file to output, not into a string, and returns the number of bytes read. So your pattern match is comparing a URL against an integer. I'd use a database instead: apply a constraint that doesn't allow duplicates, and insert away. Given that, you don't even need to check for success; just let the insert fail silently when it hits a key violation.
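As a sketch of the database route (assuming SQLite through PDO, but any engine with a unique key works the same way):

PHP Code:
$db = new PDO('sqlite:crawler.db');
$db->exec('CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)');

// INSERT OR IGNORE is SQLite's way of letting a duplicate fail silently.
$stmt = $db->prepare('INSERT OR IGNORE INTO sites (url) VALUES (?)');
$stmt->execute(array($info));

if ($stmt->rowCount() > 0)
{
	// First time this url has been seen, so it's safe to crawl.
	$data = LoadCURLPage($info);
}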

  • #12
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
New lines aren't being added. $url and $info are defined before what I posted here, and $info should be fine since it echoes correctly (including showing that my attempt to prepend http://www. to relative links works).

I'll change my use of readfile and see what happens, though your database suggestion sounds pretty good.

  • #13
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
    I have it working better now but am still having a problem.

    main.php

    Code:
    <?php
    
    error_reporting (E_ALL ^ E_NOTICE);
    
    include 'functions.php';
    
    // Connect to this url using CURL
    
$url = '<url here>';
    
    $data = LoadCURLPage($url);
    
    //Site Function
    
    $sites = sites($data, $url);
    
    ?>
    functions.php

Code:
<?php

// LoadCURLPage() is exactly as posted in #1, so it is omitted here
    
    // Find links
    
    function sites($s_string, $url)
    {
    	preg_match_all('/<a href="(.*)">(.*)<\/a>/', $s_string, $s_matches);
    	foreach ($s_matches as $s_value)
    	{
    		
    		$string_one = '<a href="';
    		$string_two = '"';
    		$new_string = $s_value[0];
    		
    		
    		$pos = stripos($new_string, $string_one);
    
    		$str = substr($new_string, $pos);
    
    		$str_two = substr($str, strlen($string_one));
    
    		$second_pos = stripos($str_two, $string_two);
    
    		$str_three = substr($str_two, 0, $second_pos);
    
    		$info = $str_three;
    
    		// Validating URL
    
    		$findslash = '/';
    		$findhttp = 'http://';
    		$findwww = 'www';
    		$findmail = 'mailto';
    		$findjs = 'javascript';
    
    		$pos0 = stripos($info, $findslash);
    		$pos1 = stripos($info, $findhttp);
    		$pos2 = stripos($info, $findwww);
    		$pos3 = stripos($info, $findmail);
    		$pos4 = stripos($info, $findjs);
    
    		if ($pos0 === 0)
    		{
    			$info = $url . $info;
    	
    			$s_file = file_get_contents('sites.txt');
    	
			if (preg_match('#' . $info . '#', $s_file, $match))
			{
			}
			else {

				echo $info . "\n";

				$s_fh = fopen('sites.txt', 'a+') or die("can't open file");

				fwrite($s_fh, $info . "\n");

				fclose($s_fh);

			}
    		}
    
    		else {
    
    			if ($pos1 !== true && $pos2 !== true && $pos3 !== true && $pos4 !== true)
    			{
    				$info = "http://www." . $info;
    	
    				// Checking for URL
    
    				$s_file = file_get_contents('sites.txt');
    	
    				if (preg_match('#' . $info . '#', $s_file, $match))
    				{
    				}
    
    				else {
    			
    					echo $info . "\n";
    		
    					$s_fh = fopen('sites.txt', 'a+') or die("can't open file");
    
    					fwrite($s_fh, $info . "\n");
    
    					fclose($s_fh);
    
    				}
    			}
    
    			else {
    		
    				if ($pos2 == true)
    				{
    					$info = "http://www" . $info;
    	
    					// Checking for URL
    
    					$s_file = file_get_contents('sites.txt');
    	
    					if (preg_match('#' . $info . '#', $s_file, $match))
    					{
    					}
    
    					else {
    			
    						echo $info;
    			
    						$s_fh = fopen('sites.txt', 'a+') or die("can't open file");
    
    						fwrite($s_fh, $info . "\n");
    
    						fclose($s_fh);
    				
    					}
    				}
    
    				else {
    		
    					if ($pos3 == true)
    					{
    					}
    		
    					else {
    					}
    				}
    			}
    		}
    	}
    }
    
    ?>
It works right the first time through, with the first link it finds being a relative link. However, on the second pass it butchers the first URL and tacks http://www. onto the front of it. Then it appears to fail on the third pass.

So I'm guessing the internal array pointer isn't advancing for some reason?

  • #14
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
Ah, the problem lies in my pattern and array :/

  • #15
    New Coder
    Join Date
    May 2011
    Posts
    19
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Works with this!

    Code:
    function sites($s_string, $url)
    {
    	preg_match_all("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU", $s_string, $s_matches);
    	foreach ($s_matches[1] as $s_value)
    	{
    		
    		$string_one = '<a href="';
    		$string_two = '"';
    		$info = $s_value;
    		
    		/*  Code not needed with above pattern
    		
    		$pos = stripos($new_string, $string_one);
    
    		$str = substr($new_string, $pos);
    
    		$str_two = substr($str, strlen($string_one));
    
    		$second_pos = stripos($str_two, $string_two);
    
    		$str_three = substr($str_two, 0, $second_pos);
    
    		$info = $str_three;
    		
    		*/
But it does throw a few errors:

PHP Warning: preg_match(): Unknown modifier 'c' in /var/www/projects/functions.php on line 121

That shows up four times, with the unknown modifier being c, p, p, and d. Line 121 is the first instance of

if (preg_match('#' . $info . '#', $s_file, $match))

that I have.

Maybe some weird link formatting is throwing it off? I'll have to compare the array to the output.
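Actually, it might just be links that contain a # (the same character I'm using as the pattern delimiter), since anything after a second # gets read as pattern modifiers. Running $info through preg_quote() before building the pattern should cover that:

Code:
// preg_quote() escapes regex metacharacters; the second argument
// also escapes the delimiter character
if (preg_match('#' . preg_quote($info, '#') . '#', $s_file, $match))
{
	// already stored
}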

    Now back to figuring out how to start following links and repeating this process.

