Scraping and Comparing Page Similarity
A client of mine has multiple websites that contain rampant duplicate content (as in 30,000+ pages). Does anyone have any advice on how to go about narrowing down duplicate content across many sites? For instance, the procedure could run like this:
- Scrape all pages on sites A, B, C, and D, strip out all HTML tags, and cache/save the raw content.
- Compare every single page to every other page across all sites, and write a content similarity percentage to a CSV file for every pair of pages that comes out more than 60% similar.
Is there a specific programming language that would be good for this? How many hours should I expect a programmer to need to develop this code? Thanks.
Use any comparison software (there are really a lot of them out there) that can be run from the terminal, so you can call it from PHP with exec() and process its output:
http://php.net/manual/en/function.exec.php
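For what it's worth, the two steps described in the question (strip the HTML, then compare every pair and keep anything over 60%) can be sketched in a few dozen lines. This is only a rough illustration in Python using the standard library, not a finished tool: it assumes the pages have already been scraped into a dict of URL to raw HTML, it uses difflib's word-sequence ratio as the "similarity percentage" (one possible definition among many), and the names strip_html, similarity_percent, similar_pairs and write_csv are made up for this example.

```python
import csv
from difflib import SequenceMatcher
from html.parser import HTMLParser
from itertools import combinations


class _TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the text of an HTML page with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def similarity_percent(text_a, text_b):
    """Word-level similarity from 0 to 100 (assumed metric: difflib ratio)."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return 100.0 * matcher.ratio()


def similar_pairs(pages, threshold=60.0):
    """Yield (url_a, url_b, percent) for every pair scoring above threshold.

    `pages` is a hypothetical dict of URL -> raw HTML, already scraped.
    """
    texts = {url: strip_html(html) for url, html in pages.items()}
    for url_a, url_b in combinations(sorted(texts), 2):
        score = similarity_percent(texts[url_a], texts[url_b])
        if score > threshold:
            yield url_a, url_b, round(score, 1)


def write_csv(rows, out):
    """Write the matching pairs to an open text file as CSV."""
    writer = csv.writer(out)
    writer.writerow(["page_a", "page_b", "similarity_percent"])
    writer.writerows(rows)
```

To use it, you would build the pages dict from your scraper's cache and call write_csv(similar_pairs(pages), f) on a file opened with newline="". Keep in mind that comparing every page against every other page grows quadratically: 30,000 pages is roughly 450 million pairs, so a brute-force pass like this is probably only practical on a subset, and a real job would likely need some kind of pre-filtering (for example hashing or shingling) before the detailed comparison.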