CodingForums.com


bewildebeest 10-03-2012 04:56 PM

Scraping and Comparing Page Similarity
 
A client of mine has multiple websites with rampant duplicate content (30,000+ pages). Does anyone have advice on how to narrow down the duplicate content across the sites? For instance, the procedure could run like this:
-Scrape all pages on sites A, B, C, and D, strip out all HTML tags, and cache/save the raw content.
-Compare every page to every other page across all sites, and write a content similarity percentage to a CSV file for every pair of pages that scores over 60%.
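
The two steps above could be sketched roughly as follows in Python. This is only a minimal illustration, not a finished tool: it assumes the pages have already been downloaded into a dict of URL to raw HTML, and the function names (`strip_html`, `shingles`, `jaccard`, `similar_pairs`, `write_csv`) are made up for this example. Similarity is measured here as the Jaccard overlap of 5-word shingles, which is one common choice among several; the 60% threshold is the one from the list above.

```python
import csv
import io
import re
from html.parser import HTMLParser
from itertools import combinations


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def shingles(text, k=5):
    """Return the set of k-word sequences (shingles) found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(pages, threshold=0.6, k=5):
    """Yield (url_a, url_b, score) for every pair scoring above the threshold.

    `pages` is assumed to be a dict mapping URL -> raw HTML.
    """
    sets = {url: shingles(strip_html(html), k) for url, html in pages.items()}
    for (url_a, set_a), (url_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score > threshold:
            yield url_a, url_b, score


def write_csv(rows, out):
    """Write the similar pairs to an open text file as CSV."""
    writer = csv.writer(out)
    writer.writerow(["page_a", "page_b", "similarity_percent"])
    for url_a, url_b, score in rows:
        writer.writerow([url_a, url_b, f"{score * 100:.1f}"])
```

Note that comparing every page to every other page grows quadratically: 30,000 pages means 30,000 × 29,999 / 2 = 449,985,000 pairs, so a brute-force loop like this is probably only practical on a sample, and a real job would likely need some way to narrow the candidate pairs first.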

Is there a specific programming language that would be good for this? How many hours should I expect a programmer to need to develop this code?

Thanks.

kevin619 10-05-2012 08:31 AM

Use any file comparison software (there are a lot of them out there) and run it from the terminal, so you can call it from PHP and process its output there:

http://php.net/manual/en/function.exec.php
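
The shell-out idea might look something like the sketch below, written here in Python with `subprocess` rather than PHP's `exec`. It assumes a `diff` command is on the PATH and relies on its usual exit codes (0 = no differences, 1 = differences, anything else = trouble). The function name `files_differ` and its `tool` parameter are made up for this example. It only answers "identical or not", so a similarity percentage would need a different tool or further processing of the output.

```python
import subprocess
import sys


def files_differ(path_a, path_b, tool=("diff",)):
    """Run an external comparison tool on two files.

    Assumes the tool follows diff's exit-code convention:
    0 = files identical, 1 = files differ, anything else = error.
    """
    result = subprocess.run(
        [*tool, str(path_a), str(path_b)],
        capture_output=True,  # keep the tool's output off our own terminal
    )
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    message = result.stderr.decode(errors="replace").strip()
    raise RuntimeError(f"comparison tool failed: {message}")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python compare.py page_a.txt page_b.txt
    print("differ" if files_differ(sys.argv[1], sys.argv[2]) else "identical")
```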


All times are GMT +1.