12-28-2005, 04:55 PM   #1
firepages
basic site spider

I have just rehashed an old site spider script that I used to use for building a page index for an internal search engine. I don't use it for that anymore, but it has proven useful over the last few days, so I thought I would post it here. As usual it is fairly pointless on its own, but it can be useful combined with other classes or functions.

e.g. I recently tied this into a pspell routine which spell-checks an entire site, including dynamic and aggregated content (a very slow process, but that's another story).

Anyway, the basic class:
PHP Code:
<?php
/*spider class , only spiders within $root !!

TODO
variable spider depth
mime-types to ignore
static calls to external methods
multiple filters (this is slow enough already so perhaps/perhaps not)
prefilters (rules to see if it's even worth passing to the filter?)
*/
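class spider{
    var $start = '';                #initial page to spider
    var $root = '';                 #domain root
    var $caught = array();          #pages found (no root)
    var $_curr_idx = '';            #the current page we are spidering
    var $is_filter = false;         #flag if we have set a filter object
    var $filter_res = array();      #if we used an external filter the results of that call are stored in here
    var $filter_callback = null;    #the filter object or function name, see set_filter()
    var $filter_method = false;     #method name when the filter is an object callback

    #named __construct() rather than spider() so it still runs as the constructor on current PHP
    function __construct($root, $start, $crawl_now=false){
        $this->root = $root;
        $this->start = $this->_curr_idx = $start;
        $this->caught = array($start);
        if($crawl_now === true){
            $this->crawl_now();
        }
    }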

/**you can start the spidering in the constructor or here , whatever stirs your bucket*/
    function crawl_now(){
        $this->crawl($this->root.'/'.$this->start);
    }

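/**regex could probably be better but this works*/
    function get_links($str){
        $rets = array();
        #preg_quote() stops dots etc. in $root being read as regex metacharacters
        $root = preg_quote($this->root, '|');
        preg_match_all('|<a href="'.$root.'/(.*)".*>.*</a>|Uis', $str, $rets);
        return $rets[1];
    }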

    function crawl($page){
        $cnt = file_get_contents($page);
        if($cnt){
            if($this->is_filter === true){
                $this->call_filter($cnt);
            }
            $links = $this->get_links($cnt);
            foreach($links as $l){
                if(!in_array($l, $this->caught)){
                    $this->_curr_idx = $this->caught[] = $l;
                    $this->crawl($this->root.'/'.$l);
                }
            }
        }
    }
/***********************************FILTERS***********************************************/
/**
if you want to work on the content of spidered pages, here is a good place to
do so since we already have the page content to hand. how you store or process
the external object's results will vary wildly so we don't bother much here except
to store any results in $this->filter_res

set the callback
*/
    function set_filter(&$obj){
        $this->is_filter = true;
        $this->filter_method = false; #flag if object callback or a regular function
        if(is_array($obj)){
            $this->filter_callback = $obj[0];
            $this->filter_method = $obj[1];
            return;
        }
        $this->filter_callback = $obj;
    }

/**
call the callback , be it a class method or a regular function
only called if set_filter has been called first
*/
    function call_filter(&$cnt){
        if($this->filter_method === false){
            $filter = $this->filter_callback;
            $this->filter_res[$this->_curr_idx] = $filter($cnt);
        }else{
            $filter = $this->filter_callback;
            $meth = $this->filter_method;
            $this->filter_res[$this->_curr_idx] = $filter->$meth($cnt);
        }
    }
}
?>
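
To see what the get_links() pattern actually picks up, here is a minimal standalone sketch of it run against a scrap of markup. extract_links() and the sample HTML are made up for illustration and are not part of the class; the root is passed through preg_quote() here so that dots in the domain match literally.

```php
<?php
# hypothetical standalone version of the get_links() pattern
function extract_links($root, $html){
    $rets = array();
    # quote the root so '.' and friends match literally
    $quoted = preg_quote($root, '|');
    # U = ungreedy, i = case-insensitive, s = dot also matches newlines
    preg_match_all('|<a href="'.$quoted.'/(.*)".*>.*</a>|Uis', $html, $rets);
    return $rets[1];
}

# made-up markup: two links under the root, one external, one relative
$html = '<a href="http://example.com/about.htm">About</a>'
      . '<a href="http://other.example.org/x.htm">Elsewhere</a>'
      . '<a href="http://example.com/docs/index.htm" class="nav">Docs</a>'
      . '<a href="contact.htm">Contact</a>';

$links = extract_links('http://example.com', $html);
print_r($links);
```

Only the two absolute links under the root come back. The external link is skipped, which is the point, but so is the relative one, so a site that links to its own pages with relative hrefs would need a looser pattern.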

Example :: crawl and print the list of pages crawled
PHP Code:
<?php
    $yaks = new spider('http://localhost/pixelpushers', 'index.htm', true);
    print_r($yaks->caught);
?>
Example :: pass crawled pages to an external object method for munging
PHP Code:
<?php
class test_filter{
    function filter($str){
        return strlen($str);
    }
}

$yaks = new spider('http://localhost/pixelpushers', 'index.htm', false);
# uses std PHP callback syntax for objects
$obj = array(new test_filter(), 'filter');
//#or regular functions (here a PHP function)
//$obj = 'strlen'; 
//$yaks->set_filter($obj);
$yaks->set_filter($obj);
# in this case we now have to start the crawl ourselves
$yaks->crawl_now();
# see what we got
print_r($yaks->filter_res);
?>
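
The two branches in call_filter() could probably be folded into one with PHP's built-in call_user_func(), which accepts both callback forms. This is a rough sketch of that alternative, not how the class does it; run_filter() and word_counter are names invented here.

```php
<?php
# hypothetical filter object, same shape as test_filter
class word_counter{
    function filter($str){
        # count the words left once the markup is stripped
        return str_word_count(strip_tags($str));
    }
}

# hypothetical helper: one code path for both callback styles
function run_filter($callback, $cnt){
    if(!is_callable($callback)){
        return null;   # nothing sensible to call
    }
    return call_user_func($callback, $cnt);
}

$page = '<p>Hello <b>spider</b> world</p>';

# object callback, array(object, 'method')
$as_method   = run_filter(array(new word_counter(), 'filter'), $page);
# regular function, passed by name
$as_function = run_filter('strlen', $page);

var_dump($as_method, $as_function);
```

The is_callable() check also catches a mistyped method or function name before the crawl gets going, rather than failing on the first page fetched.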

That's about it. There is much more that could be done, but I like small, compact classes where possible; I only added the filter functionality because it is the most common and (I think) most useful addition to the basic idea.
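
One of the TODO items, variable spider depth, could look something like the sketch below. It is only an outline of the idea under stated assumptions: crawl_depth(), the $fetch callback and the fake in-memory site are all invented here so the example runs without a web server, and it walks the pages with a queue instead of the recursion used in the class.

```php
<?php
# hypothetical: breadth-first crawl that stops $max_depth links away from $start
# $fetch is any callable that takes a page name and returns the pages it links to
function crawl_depth($start, $max_depth, $fetch){
    $caught = array($start);
    $queue  = array(array($start, 0));   # pairs of page, depth
    while($queue){
        list($page, $depth) = array_shift($queue);
        if($depth >= $max_depth){
            continue;   # deep enough, do not follow this page's links
        }
        foreach(call_user_func($fetch, $page) as $l){
            if(!in_array($l, $caught)){
                $caught[] = $l;
                $queue[]  = array($l, $depth + 1);
            }
        }
    }
    return $caught;
}

# fake site standing in for file_get_contents() + get_links()
$site = array(
    'index.htm' => array('a.htm', 'b.htm'),
    'a.htm'     => array('c.htm', 'index.htm'),
    'b.htm'     => array(),
    'c.htm'     => array('d.htm'),
);
$fetch = function($page) use ($site){
    return isset($site[$page]) ? $site[$page] : array();
};

print_r(crawl_depth('index.htm', 1, $fetch));
print_r(crawl_depth('index.htm', 2, $fetch));
```

A queue is used because it visits pages in order of distance from the start page, so the depth recorded for each page is the shortest route to it.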

Note that no validation is done on paths etc., and I have not considered security at all: given the snail's pace at which it runs, this script is not much use for production work and is better suited to admin backends and the like.
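
As a starting point for the missing path validation, a check along these lines could sit in front of the in_array() test in crawl(). is_crawlable() and its rules are only guesses at what would be worth rejecting, not anything the class does at the moment.

```php
<?php
# hypothetical: decide whether a captured link is safe to append to $root
function is_crawlable($l){
    if($l === '' || $l[0] === '#'){
        return false;   # empty or fragment-only
    }
    if(preg_match('#^[a-z][a-z0-9+.-]*:#i', $l)){
        return false;   # carries its own scheme (mailto:, javascript:, http: ...)
    }
    if(strpos($l, '..') !== false){
        return false;   # refuse anything that could climb out of $root
    }
    return true;
}

# made-up candidates, roughly what get_links() might hand back
$candidates = array(
    'index.htm',
    'docs/faq.htm?page=2',
    '../secret.htm',
    'mailto:me@example.com',
    '#top',
    '',
);
$ok = array_values(array_filter($candidates, 'is_crawlable'));
print_r($ok);
```

The '..' rule is deliberately blunt and would also reject a harmless name containing two dots in a row; for an admin-only tool that trade-off seems acceptable.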

Any suggestions/critiques etc. welcomed.
Last edited by firepages; 12-28-2005 at 04:59 PM.