  1. #1
    New Coder
    Join Date
    Oct 2008
    Posts
    13
    Thanks
    3
    Thanked 0 Times in 0 Posts

    Help with a function to crawl for links across a whole website

    I'm trying to crawl a specific website for links and show them all at the end. The problem I'm facing is that it only shows the links from the one page I give it, not from every page on the website. I have tried several loops with no success. Please give me some advice.
    Here is the code:
    Code:
    <?php
    if (isset($_POST['Submit'])) {
    
    
        function getLinks($link)
        {
            /*** return array ***/
            $ret = array();
    
            /*** a new dom object ***/
            $dom = new DOMDocument;
    
            /*** remove silly white space (must be set before loading) ***/
            $dom->preserveWhiteSpace = false;
    
            /*** get the HTML (suppress errors) ***/
            @$dom->loadHTML(file_get_contents($link));
    
            /*** get the links from the HTML ***/
            $links = $dom->getElementsByTagName('a');
        
            /*** loop over the links ***/
            foreach ($links as $tag)
            {
                /*** textContent is safe even when the <a> tag has no child nodes ***/
                $ret[$tag->getAttribute('href')] = $tag->textContent;
            }
    
            return $ret;
        }
    
        /*** a link to search ***/
        $link = $_POST['address'];
    
        /*** get the links ***/
        $urls = getLinks($link);
    
        /*** check for results ***/
        if(sizeof($urls) > 0)
        {
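            /*** host of the submitted page, used to tell internal from external ***/
            $host = parse_url($link, PHP_URL_HOST);

            foreach ($urls as $key => $value)
            {
                /*** absolute links start with http:// or https:// ***/
                $isAbsolute = preg_match('#^https?://#i', $key);

                if ($isAbsolute && parse_url($key, PHP_URL_HOST) !== $host) {
                    echo '<span style="color:RED;">' . htmlspecialchars($key) . ' - external</span><br />';
                } else {
                    /*** relative links are shown appended to the submitted address ***/
                    $shown = $isAbsolute ? $key : $link . $key;
                    echo '<span style="color:BLUE;">' . htmlspecialchars($shown) . ' - internal</span><br />';
                }
            }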
        }
        else
        {
            echo "No links found at $link";
        }
    }
    ?>
    <br /><br />
    <form action="" method="post" enctype="multipart/form-data" name="link">
    <input name="address" type="text" value="" />
    <input name="Submit" type="Submit" />
    </form>

  • #2
    Regular Coder adarshakb's Avatar
    Join Date
    Jun 2009
    Location
    Silicon valley of india
    Posts
    247
    Thanks
    11
    Thanked 1 Time in 1 Post
    Call the getLinks() function recursively.

    After getting all the links on the page, do the following:
    1. Store the current page's URL in a global array (or any other data structure, such as a linked list).
    2. Call getLinks() for each link on the current page, but only if it is not already in the global array (i.e. not already crawled) and it belongs to the same website. If you follow every link, you may end up crawling other websites as well.
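
    The two steps above could be sketched roughly as follows. This is only an illustration under some assumptions: crawlSite(), extractLinks(), resolveUrl() and the $fetch callback are made-up names that are not in the original code, the visited list is passed by reference instead of being a true global, relative links are resolved in a simplified way (no "../" handling), and a page limit is added so the recursion cannot run away.

```php
<?php
// Sketch only: crawlSite(), extractLinks(), resolveUrl() and $fetch are
// hypothetical names introduced for this example.

/** Return the href of every <a> element found in an HTML string. */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];                       // nothing fetched, nothing to parse
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);              // suppress warnings from malformed HTML
    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = trim($tag->getAttribute('href'));
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

/** Simplified resolver: absolute, protocol-relative, root-relative and plain relative links. */
function resolveUrl(string $base, string $href): ?string
{
    $href = explode('#', $href, 2)[0];   // drop any #fragment
    if ($href === '') {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;                    // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return null;                     // mailto:, javascript:, etc.
    }
    $p = parse_url($base);
    if (!isset($p['scheme'], $p['host'])) {
        return null;
    }
    if (substr($href, 0, 2) === '//') {
        return $p['scheme'] . ':' . $href;
    }
    $root = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($href[0] === '/') {
        return $root . $href;
    }
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);   // directory of the base page
    return $root . $dir . $href;
}

/** Recursively visit every same-host page reachable from $url. */
function crawlSite(string $url, callable $fetch, array &$visited, int $maxPages = 50): void
{
    if (isset($visited[$url]) || count($visited) >= $maxPages) {
        return;                          // already crawled, or limit reached
    }
    $visited[$url] = true;               // step 1: remember this page
    $host = parse_url($url, PHP_URL_HOST);

    foreach (extractLinks((string) $fetch($url)) as $href) {
        $next = resolveUrl($url, $href);
        // step 2: recurse only into links on the same website
        if ($next !== null && parse_url($next, PHP_URL_HOST) === $host) {
            crawlSite($next, $fetch, $visited, $maxPages);
        }
    }
}

// Possible usage with real pages (file_get_contents() needs allow_url_fopen):
// $visited = [];
// crawlSite($_POST['address'], function ($u) { return @file_get_contents($u); }, $visited);
// print_r(array_keys($visited));
```

    Passing the fetch step in as a callback is just a convenience so the function can be tried against a small in-memory set of pages before pointing it at a live site.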