
    basic site spider

I have just rehashed an old site spider script that I used for building an internal search engine page index. I don't use it for that anymore; nonetheless it has proven useful in the last few days, so I thought I would post it here. As per usual it's quite pointless on its own, but it can be useful combined with other classes or functions.

e.g. I recently tied this into a pspell routine which spell-checks all of a given site, including dynamic and aggregated content (a very slow process, but that's another story).
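For the curious, that tie-in amounts to a filter object in the shape set_filter() expects (see below). This is only a rough sketch of the idea, not the original code: the class name spell_filter, the sample page, and the hard-coded word list are all mine; a real version would use pspell_new()/pspell_check() from the pspell extension instead of the stand-in dictionary.
PHP Code:

```php
<?php
// Hedged sketch of a spell-checking filter; the tiny word list below stands in
// for pspell (swap in pspell_new()/pspell_check() if the extension is loaded).
class spell_filter{
    var $dict = array('hello', 'from', 'the', 'spider'); // stand-in dictionary

    function filter(&$cnt){
        $bad = array();
        // strip markup, then split the remaining text into words
        $words = preg_split('/[^A-Za-z]+/', strip_tags($cnt), -1, PREG_SPLIT_NO_EMPTY);
        foreach($words as $w){
            if(!in_array(strtolower($w), $this->dict)){
                $bad[] = $w; // collect words the dictionary does not know
            }
        }
        return array_unique($bad);
    }
}

// you would normally feed this to the spider:
// $yaks->set_filter(array(new spell_filter(), 'filter'));
$f = new spell_filter();
$page = "<p>Hello from teh spider</p>";
print_r($f->filter($page));
```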

Anyway, the basic class...
    PHP Code:
/* spider class , only spiders within $root !!

todo ::
variable spider depth
mime-types to ignore
static calls to external methods
multiple filters (this is slow enough already so perhaps/perhaps not)
prefilters (rules to see if its even worth passing to filter ?)
*/
class spider{
    var $start       = '';       # initial page to spider
    var $root        = '';       # domain root
    var $caught      = array();  # pages found (no root)
    var $_curr_idx   = '';       # the current page we are spidering
    var $is_filter   = false;    # flag if we have set a filter object
    var $filter_res  = array();  # if we used an external filter the results of that call are stored in here

    function spider($root, $start, $crawl_now=false){
        $this->root = $root;
        $this->start = $this->_curr_idx = $start;
        $this->caught = array($start);
        if($crawl_now === true){
            $this->crawl($this->root.'/'.$this->start);
        }
    }

    /** you can start the spidering in the constructor or here, whatever stirs your bucket */
    function crawl_now(){
        $this->crawl($this->root.'/'.$this->start);
    }

    /** regex could probably be better but this works */
    function get_links($str){
        $rets = '';
        preg_match_all("|<a href=\"".$this->root."\/(.*)\".*>.*\<\/a>|Uis", $str, $rets);
        return $rets[1];
    }

    function crawl($page){
        $cnt = file_get_contents($page);
        if($this->is_filter === true){
            $this->call_filter($cnt);
        }
        $links = $this->get_links($cnt);
        foreach($links as $l){
            if(!in_array($l, $this->caught)){
                $this->_curr_idx = $this->caught[] = $l;
                $this->crawl($this->root.'/'.$l);
            }
        }
    }

    /*
    if you want to work on the content of spidered pages, here is a good place to
    do so since we already have the page content to hand; how you store or process
    the external object's results will vary wildly so we don't bother much here except
    to store any results in $this->filter_res
    */

    /** set the callback */
    function set_filter(&$obj){
        $this->is_filter = true;
        if(is_array($obj)){ # std PHP object callback: array($object, 'method')
            $this->filter_callback = $obj[0];
            $this->filter_method = $obj[1];
        }else{ # a regular function name
            $this->filter_callback = $obj;
            $this->filter_method = false; # flag if object callback or a regular function
        }
    }

    /**
    call the callback, be it a class method or a regular function;
    only called if set_filter has been called first
    */
    function call_filter(&$cnt){
        if($this->filter_method === false){
            $filter = $this->filter_callback;
            $this->filter_res[$this->_curr_idx] = $filter($cnt);
        }else{
            $filter = $this->filter_callback;
            $meth = $this->filter_method;
            $this->filter_res[$this->_curr_idx] = $filter->$meth($cnt);
        }
    }
}

    Example :: crawl and print the list of pages crawled
    PHP Code:
$yaks = new spider('http://localhost/pixelpushers', 'index.htm', true);
print_r($yaks->caught);
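Incidentally, the get_links() regex can be exercised on its own, which makes its behaviour easier to see: only anchors under $root match, and the root prefix is stripped from the capture. The scrap of HTML below is mine, for demonstration only.
PHP Code:

```php
<?php
// Exercising the same regex get_links() uses against a scrap of sample HTML.
$root = 'http://localhost/pixelpushers';
$str = '<a href="http://localhost/pixelpushers/about.htm">about</a>
<a href="http://other.example/x.htm">elsewhere</a>';
preg_match_all("|<a href=\"".$root."\/(.*)\".*>.*\<\/a>|Uis", $str, $rets);
// only the link under $root is captured, minus the root prefix
print_r($rets[1]);
```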
    Example :: pass crawled pages to an external object method for munging
    PHP Code:
class test_filter{
    /** example filter method: just return the length of the page content */
    function filter(&$cnt){
        return strlen($cnt);
    }
}

$yaks = new spider('http://localhost/pixelpushers', 'index.htm', false);
# uses std PHP callback syntax for objects
$obj = array(new test_filter(), 'filter');
//# or regular functions (here a PHP function)
//$obj = 'strlen';
$yaks->set_filter($obj);
# in this case we now have to start the crawl ourselves
$yaks->crawl_now();
# see what we got
print_r($yaks->filter_res);

That's about it. There is much more that could be done, but I like small compact classes where possible; I only added the filter functionality because it is the most common and (I think) most useful addition to the basic idea.

Note that there is no validation done on paths etc., nor is security something I have considered, since this script (due to the snail's pace at which it runs) is not really much use for production work, more for admin backends etc.
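If you do want a minimal guard, the kind of path validation the class skips might look like the sketch below: compare the link's host against $root's before crawling. same_host() is a name I made up for illustration; it is not part of the class above.
PHP Code:

```php
<?php
// Hypothetical sketch: reject links whose host differs from $root's.
function same_host($root, $link){
    $r = parse_url($root);
    $l = parse_url($link);
    // relative links have no host component, so they stay on the root host
    if(!isset($l['host'])){
        return true;
    }
    return strcasecmp($l['host'], $r['host']) === 0;
}

var_dump(same_host('http://localhost/pixelpushers', 'about.htm'));
var_dump(same_host('http://localhost/pixelpushers', 'http://example.com/evil'));
```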

    Any suggestions/critiques etc welcomed
    Last edited by firepages; 12-28-2005 at 04:59 PM.
    resistance is...

    MVC is the current buzz in web application architectures. It comes from event-driven desktop application design and doesn't fit into web application design very well. But luckily nobody really knows what MVC means, so we can call our presentation layer separation mechanism MVC and move on. (Rasmus Lerdorf)

