Scraping and Comparing Page Similarity

10-03-2012, 04:56 PM
A client of mine has multiple websites with rampant duplicate content (30,000+ pages). Does anyone have advice on how to narrow down duplicate content across many sites? For instance, the procedure could run like this:
- Scrape all pages on sites A, B, C, and D, strip out all HTML tags, and cache/save the raw content.
- Compare every page to every other page across all sites and output a content similarity percentage in CSV format for every pair of pages with a similarity over 60%.
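For what it's worth, the two steps above could be sketched roughly like this in Python, using only the standard library. This is a minimal sketch, not a finished tool: it assumes the pages have already been fetched into a dict of URL to HTML (the fetching itself is left out), and the names `strip_html`, `similarity`, `similar_pairs`, and `write_report` are made up for illustration. The similarity measure here is `difflib.SequenceMatcher` on word lists, which is just one possible choice.

```python
import csv
import difflib
import itertools
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space so text from adjacent tags does not run together.
    return " ".join(" ".join(parser.parts).split())


def similarity(text_a, text_b):
    """Word-level similarity as a percentage (0 to 100)."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return 100.0 * matcher.ratio()


def similar_pairs(pages, threshold=60.0):
    """Yield (url_a, url_b, percent) for every pair scoring above threshold.

    `pages` is an assumed input: a dict mapping URL -> raw HTML.
    """
    texts = {url: strip_html(html) for url, html in pages.items()}
    for url_a, url_b in itertools.combinations(sorted(texts), 2):
        score = similarity(texts[url_a], texts[url_b])
        if score > threshold:
            yield url_a, url_b, round(score, 1)


def write_report(pages, out, threshold=60.0):
    """Write the similar pairs as CSV to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["page_a", "page_b", "similarity_percent"])
    writer.writerows(similar_pairs(pages, threshold))
```

Calling `write_report(pages, open("report.csv", "w", newline=""))` would produce the CSV. Keep in mind that comparing every page to every other page grows quadratically: 30,000 pages means roughly 450 million pairs, so a plain loop like this is only practical on a small sample.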

Is there a specific programming language that would be good for this? How many hours should I expect a programmer to need to develop this code?


10-05-2012, 08:31 AM
Use any comparison software (there are a lot of them out there) and run it from the terminal so you can process its output in PHP.
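As a rough illustration of that approach, here is a sketch that shells out to a command-line comparison tool and turns its output into a percentage. The reply names no specific tool, so the standard `diff` utility is assumed here as a stand-in, and the helper names `write_words` and `diff_similarity` are hypothetical. It is written in Python rather than PHP; the same idea carries over to PHP's `shell_exec`. The percentage is a simple made-up measure (share of words that `diff` does not report as added or removed), not an established metric.

```python
import subprocess
from pathlib import Path


def write_words(path, text):
    """Save text one word per line so diff compares word by word."""
    Path(path).write_text("\n".join(text.split()) + "\n", encoding="utf-8")


def count_lines(path):
    return len(Path(path).read_text(encoding="utf-8").splitlines())


def diff_similarity(file_a, file_b):
    """Run `diff` on two word-per-line files and return a rough percentage."""
    result = subprocess.run(
        ["diff", str(file_a), str(file_b)],
        capture_output=True,
        text=True,
    )
    # diff exits with 0 (identical), 1 (differences found), or 2 (trouble).
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip() or "diff failed")

    # In diff's default output, "< " marks lines only in the first file
    # and "> " marks lines only in the second.
    changed = sum(
        line.startswith(("< ", "> ")) for line in result.stdout.splitlines()
    )
    total = count_lines(file_a) + count_lines(file_b)
    if total == 0:
        return 100.0
    return 100.0 * (total - changed) / total
```

Starting a new process for every pair of pages is slow, so this is better suited to spot-checking a handful of suspect pages than to a full all-against-all run.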