  1. #1
    New to the CF scene
    Join Date
    Sep 2012
    Posts
    4
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Scraping and Comparing Page Similarity

    A client of mine has multiple websites that contain rampant duplicate content (30,000+ pages). Does anyone have any advice on how to go about narrowing down duplicate content across many sites? For instance, the procedure could execute like this:
    -Scrape all pages on sites A, B, C, and D, strip out all HTML tags, and cache/save the raw text content.
    -Compare every single page to every other page across all sites and write a content-similarity percentage to a CSV file for every pair of pages with a similarity over 60%.
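    A minimal stdlib-only Python sketch of those two steps (the page-fetching part is left out; `pages` is assumed to be a dict of URL → raw HTML you have already downloaded, and the function and file names are just placeholders):

    ```python
    import csv
    import difflib
    from html.parser import HTMLParser

    class TagStripper(HTMLParser):
        """Collects text content, discarding all HTML tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

        def text(self):
            # Collapse runs of whitespace left behind by removed tags.
            return " ".join(" ".join(self.chunks).split())

    def strip_html(html):
        parser = TagStripper()
        parser.feed(html)
        return parser.text()

    def similarity(a, b):
        # Ratio in [0, 1] based on difflib's matching-blocks heuristic.
        return difflib.SequenceMatcher(None, a, b).ratio()

    def report_duplicates(pages, threshold=0.60, out_path="duplicates.csv"):
        """pages: dict of url -> raw HTML. Writes pairs above threshold to CSV."""
        texts = {url: strip_html(html) for url, html in pages.items()}
        urls = sorted(texts)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["page_a", "page_b", "similarity_pct"])
            for i, a in enumerate(urls):
                for b in urls[i + 1:]:
                    score = similarity(texts[a], texts[b])
                    if score >= threshold:
                        writer.writerow([a, b, round(score * 100, 1)])
    ```

    One caveat: comparing every page to every other page is O(n²), and with 30,000+ pages that is over 400 million comparisons. In practice you would want a near-duplicate technique such as shingling with MinHash to cut the candidate pairs down first, and only run a full comparison on the survivors.
    
    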

    Is there a specific programming language that would be well suited to this? And roughly how many hours should I expect a programmer to need to develop the code?

    Thanks.

  • #2
    New Coder
    Join Date
    Apr 2011
    Location
    California
    Posts
    52
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Use any file-comparison tool (there are a lot of them out there), run it from the terminal, and process the output in PHP:

    http://php.net/manual/en/function.exec.php
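    Following that suggestion, here is a minimal sketch using the standard `diff` tool to get a rough percent-similarity for one pair of pages (it assumes `page_a.txt` and `page_b.txt` already hold the tag-stripped page text; from PHP you would run these commands through exec()):

    ```shell
    # Total lines across both stripped-text files.
    total=$(cat page_a.txt page_b.txt | wc -l)
    # Lines diff marks as differing (added or removed).
    changed=$(diff page_a.txt page_b.txt | grep -c '^[<>]')
    # Rough similarity: the share of lines that did not change.
    awk -v t="$total" -v c="$changed" 'BEGIN { printf "%.1f%%\n", 100 * (1 - c / t) }'
    ```

    This line-based measure is crude (reordered or reflowed text counts as different), but it is cheap enough to script over many file pairs from PHP and then filter on the percentage.
    
    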

