01-08-2013, 10:47 PM | #1
jotfish
PHP web crawler

I have a basic PHP web crawler script and need to expand its functionality. The problem is that I'm a total noob at PHP and my knowledge is very basic, so I'm coming here for some help.

My goal is a basic user input (a text box). When the user types in a phrase, let's say "Red Apples", and hits Enter, the script should start crawling the web for that phrase and store the plain-text results, along with the URL they originated from, in a database.

Here is what I've got so far:

Code:
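error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

// Number of pages crawled so far, keyed by host name.
$domains = array();

// Every URL that has already been crawled.
$urls = array();

function crawl( $url )
{
  global $domains, $urls;

  echo "Crawling $url... ";

  $parse = parse_url( $url );
  if ( $parse === false || !isset( $parse['host'] ) )
  {
    echo "Invalid URL.\n";
    return;
  }
  $host = $parse['host'];

  // Initialize the counter before incrementing it.
  if ( !isset( $domains[ $host ] ) )
  {
    $domains[ $host ] = 0;
  }
  $domains[ $host ]++;
  $urls[] = $url;

  $content = file_get_contents( $url );
  if ( $content === FALSE )
  {
    echo "Error.\n";
    return;
  }

  // Keep only the part of the page from the first "body" onward, if present.
  $body = stristr( $content, "body" );
  if ( $body !== FALSE )
  {
    $content = $body;
  }
  preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );

  echo 'Found ' . count( $matches[0] ) . " urls.\n";

  foreach( $matches[0] as $crawled_url )
  {
    $parse = parse_url( $crawled_url );
    if ( $parse === false || !isset( $parse['host'] ) )
    {
      continue;
    }

    // Compare the counter itself; count() on an integer does not return its value.
    $crawled = isset( $domains[ $parse['host'] ] ) ? $domains[ $parse['host'] ] : 0;

    if ( $crawled < CRAWL_LIMIT_PER_DOMAIN
        && !in_array( $crawled_url, $urls ) )
    {
      sleep( 1 );
      crawl( $crawled_url );
    }
  }
}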
If anybody could point me in the right direction that would be awesome.
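For reference, the two pieces the script above is missing (checking a page for the phrase, and storing the plain text with its URL) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper names `html_to_text`, `page_matches`, and `store_result`, the `results` table and its columns, the in-memory SQLite database (which needs the `pdo_sqlite` extension), and the example.com URL. It is a starting point under those assumptions, not a finished crawler.

```php
<?php
// Hypothetical helpers; none of these exist in the original script.

// Reduce an HTML page to plain text: drop script/style blocks, strip the
// remaining tags, decode entities, and collapse runs of whitespace.
function html_to_text( $html )
{
  $html = preg_replace( '/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html );
  // Put a space before every tag so adjacent elements do not run together.
  $html = str_replace( '<', ' <', $html );
  $text = html_entity_decode( strip_tags( $html ), ENT_QUOTES, 'UTF-8' );
  return trim( preg_replace( '/\s+/', ' ', $text ) );
}

// Case-insensitive check for the search phrase in the page text.
function page_matches( $text, $phrase )
{
  return $phrase !== '' && stripos( $text, $phrase ) !== false;
}

// Store one match. Assumes a table named "results" (phrase, url, content).
// The prepared statement keeps user input out of the SQL string.
function store_result( PDO $db, $phrase, $url, $text )
{
  $stmt = $db->prepare(
    'INSERT INTO results ( phrase, url, content ) VALUES ( ?, ?, ? )' );
  $stmt->execute( array( $phrase, $url, $text ) );
}

// Usage with an in-memory SQLite database and a hard-coded sample page.
$db = new PDO( 'sqlite::memory:' );
$db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
$db->exec( 'CREATE TABLE results ( phrase TEXT, url TEXT, content TEXT )' );

$phrase = 'Red Apples';
$html = '<html><body><h1>Fruit</h1><p>We sell red apples.</p>'
      . '<script>var x = 1;</script></body></html>';
$text = html_to_text( $html );

if ( page_matches( $text, $phrase ) )
{
  store_result( $db, $phrase, 'http://example.com/fruit', $text );
}
```

In the crawler itself, the same check would go right after the `file_get_contents()` call in `crawl()`, with `$phrase` coming from the text box (for example via `$_POST`) instead of being hard-coded.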