Scraping and Comparing Page Similarity
A client of mine has multiple websites with rampant duplicate content (as in 30,000+ pages). Does anyone have advice on how to go about narrowing down duplicate content across many sites? For instance, the procedure could execute like this:
-Scrape all pages on sites A, B, C, and D, strip out all HTML tags, and cache/save the raw text content.
-Compare every single page against every other page across all sites and output a content-similarity percentage, in CSV format, for every pair of pages with similarity over 60%.
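A minimal sketch of those two steps in Python, assuming the pages have already been fetched and cached (the URLs, sample HTML, and the crude regex tag stripper below are all placeholders for illustration):

```python
import csv
import difflib
import io
import re
from itertools import combinations

def strip_tags(html):
    # Crude tag stripper for illustration; a real crawl should use a proper
    # HTML parser, since regexes mishandle comments, scripts, etc.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical cached pages keyed by URL (a real run would scrape sites A-D first).
pages = {
    "a.example/page1": "<html><body><p>Widgets are great. Buy widgets today.</p></body></html>",
    "b.example/page1": "<html><body><p>Widgets are great. Buy widgets now.</p></body></html>",
    "c.example/other": "<html><body><p>Completely unrelated article about gardening.</p></body></html>",
}

texts = {url: strip_tags(html) for url, html in pages.items()}

# Compare every page against every other page; keep pairs above 60% similar.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["url_a", "url_b", "similarity_pct"])
for (u1, t1), (u2, t2) in combinations(texts.items(), 2):
    pct = difflib.SequenceMatcher(None, t1, t2).ratio() * 100
    if pct > 60:
        writer.writerow([u1, u2, round(pct, 1)])

print(out.getvalue())
```

One caveat: all-pairs comparison is O(n²), so at 30,000+ pages that is roughly 450 million comparisons, and character-level diffing like `SequenceMatcher` will be far too slow. The usual approach at that scale is shingling plus MinHash/locality-sensitive hashing, which finds near-duplicate candidates without comparing every pair.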
Is there a specific programming language that would be well suited to this? How many hours could I expect a programmer to spend developing this code?
Use any file-comparison tool (there are plenty of them out there) and run it from the terminal, so you can process its output in PHP.
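As a sketch of that idea, here is how you might shell out to the standard Unix `diff` tool and post-process its output (shown in Python for brevity; the equivalent in PHP would use `shell_exec()` or `proc_open()`). The similarity heuristic here, counting differing lines, is just an illustration:

```python
import os
import subprocess
import tempfile

def diff_changed_lines(text_a, text_b):
    """Write two texts to temp files, run `diff`, and count differing lines."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as fa, \
         tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as fb:
        fa.write(text_a)
        fb.write(text_b)
        paths = (fa.name, fb.name)
    try:
        # diff exits 0 for identical files and 1 when they differ;
        # both are expected, so we don't check the return code.
        result = subprocess.run(["diff", *paths], capture_output=True, text=True)
        # In diff's default output, lines starting with '<' or '>'
        # are the lines unique to one file or the other.
        return sum(1 for line in result.stdout.splitlines() if line[:1] in "<>")
    finally:
        for p in paths:
            os.remove(p)

changed = diff_changed_lines("one\ntwo\nthree\n", "one\nTWO\nthree\n")
print(changed)  # 2: one removed line ('< two') plus one added line ('> TWO')
```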