10-03-2012, 04:56 PM   #1
bewildebeest
Scraping and Comparing Page Similarity

A client of mine has multiple websites with rampant duplicate content (30,000+ pages). Does anyone have advice on how to narrow down the duplicated content across many sites? For instance, the procedure could run like this:
-Scrape all pages on sites A, B, C, and D, strip out all HTML tags, and cache/save the raw content.
-Compare every page to every other page across all sites and write a content similarity percentage to a CSV file for every pair of pages with a similarity over 60%.
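
The two steps above can be sketched in Python using only the standard library. This is a rough illustration, not a finished tool: it assumes the pages have already been downloaded into a dict mapping URL to HTML, it uses difflib's word-level ratio as a stand-in for "content similarity" (the post does not say which measure is wanted), and the names strip_tags, similarity, similar_pairs, and write_csv are made up for this example.

```python
import csv
from difflib import SequenceMatcher
from html.parser import HTMLParser
from itertools import combinations


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML page with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def similarity(text_a, text_b):
    """Word-level similarity between two texts, from 0.0 to 1.0 (assumed measure)."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def similar_pairs(pages, threshold=0.60):
    """Yield (url_a, url_b, percent) for every pair scoring above the threshold.

    `pages` is a hypothetical dict of URL -> raw HTML, already scraped.
    """
    texts = {url: strip_tags(html) for url, html in pages.items()}
    for (url_a, text_a), (url_b, text_b) in combinations(texts.items(), 2):
        score = similarity(text_a, text_b)
        if score > threshold:
            yield url_a, url_b, round(score * 100, 1)


def write_csv(rows, out):
    """Write the similar pairs to an open text file object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["page_a", "page_b", "similarity_percent"])
    writer.writerows(rows)
```

Comparing every page against every other page grows quadratically: 30,000 pages is roughly 450 million pairs, so a brute-force loop like this is probably only practical on a sample. For the full set, some kind of fingerprinting step to shortlist candidate pairs before the detailed comparison would likely be needed.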

Is there a specific programming language that would be good for this? How many hours should I expect a programmer to need to develop this code?

Thanks.
10-05-2012, 08:31 AM   #2
kevin619
Use any file comparison tool (there are a lot of them out there), run it from the terminal, and call it from PHP with exec() so you can process the results:

http://php.net/manual/en/function.exec.php
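
The same idea of shelling out to a comparison tool can be sketched in Python rather than PHP, with subprocess.run playing the role of exec(). This is only an illustration: it assumes the standard diff utility is installed and that the stripped page text has already been saved to files, and files_differ is a made-up helper name.

```python
import subprocess


def files_differ(path_a, path_b, tool="diff"):
    """Run an external comparison tool and report whether two files differ.

    Relies on the diff exit-status convention:
    0 = identical, 1 = different, anything higher = error.
    """
    result = subprocess.run(
        [tool, path_a, path_b],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    raise RuntimeError(f"{tool} failed: {result.stderr.strip()}")
```

Note that plain diff only answers "same or different" plus a line-by-line listing. It does not give a similarity percentage, so a score would still have to be computed from its output or with a different tool.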