dance621
01-26-2007, 03:46 PM
Hi, I am trying to get some software developed to extract info from a website. The website lists info and descriptions of old books from several different sellers. For each book we are searching for, a URL would be created, looking something like this:
widgit.com/search?author+title+year
This bit is working OK. The page of results might contain between 1 and 100 different descriptions for each book.
eg
results
1). Spiders by dance621
year 1976
Publisher Smith and weston
Description: Great book about spiders.
Price $55
2). Spiders by dance621
year 1976
Publisher Smith and weston
Description: Long book with loads of pictures about spiders. dance621 did a great job. 77pages long.
Price $23
3). Spiders by dance621
year 1976
Publisher Smith Weston
Description: 77 pages. loads of pictures. loads of things about spiders. How to cook them. How to teach them cricket.
Price $109
We need to extract the prices so we can work out the average price. We also want to extract the two longest descriptions from each page.
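To make the extraction step concrete, here is a rough sketch in Python. It assumes the results page has already been reduced to plain text laid out exactly like the example above, with one "Description:" line and one "Price $" line per listing; a real page would be HTML and would need its own parsing first. The name summarise_page and the dictionary keys are made up for illustration.

```python
import re
from statistics import mean, median

# Assumed layout: each listing has a "Description: ..." line and a "Price $NN" line.
PRICE_RE = re.compile(r"^Price\s*\$\s*(\d[\d,]*(?:\.\d+)?)", re.MULTILINE)
DESC_RE = re.compile(r"^Description:\s*(.*)$", re.MULTILINE)

SAMPLE = """\
1). Spiders by dance621
year 1976
Publisher Smith and weston
Description: Great book about spiders.
Price $55
2). Spiders by dance621
year 1976
Publisher Smith and weston
Description: Long book with loads of pictures about spiders. dance621 did a great job. 77pages long.
Price $23
3). Spiders by dance621
year 1976
Publisher Smith Weston
Description: 77 pages. loads of pictures. loads of things about spiders. How to cook them. How to teach them cricket.
Price $109
"""

def summarise_page(text):
    """Return price statistics and the two longest descriptions for one page."""
    # Strip thousands separators such as "1,250" before converting to a number.
    prices = [float(p.replace(",", "")) for p in PRICE_RE.findall(text)]
    descriptions = [d.strip() for d in DESC_RE.findall(text)]
    # Longest description first; length is measured in characters.
    by_length = sorted(descriptions, key=len, reverse=True)
    return {
        "num_results": len(prices),
        "average_price": mean(prices) if prices else None,
        "lowest_price": min(prices) if prices else None,
        "highest_price": max(prices) if prices else None,
        "median_price": median(prices) if prices else None,
        "longest_description": by_length[0] if by_length else "",
        "second_longest_description": by_length[1] if len(by_length) > 1 else "",
    }

if __name__ == "__main__":
    for key, value in summarise_page(SAMPLE).items():
        print(f"{key}: {value}")
```

On the three example listings this gives 3 results, a lowest price of 23, a highest of 109, a median of 55 and an average of about 62.33, with listing 3 as the longest description and listing 2 as the second longest.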
The ideal output would be in an Excel format, so the result would have the URL, no. of results, average price, lowest price, highest price, median price, longest description, and second longest description.
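For the output, one simple option is a CSV file, which Excel opens directly. The sketch below uses Python's standard csv module to write one row per search URL with the columns listed above. The column names and the write_summary function are hypothetical, and the figures in the example row are just the ones worked out by hand from the three sample listings.

```python
import csv
import io

# Hypothetical column names, one per field in the desired output.
COLUMNS = [
    "url",
    "num_results",
    "average_price",
    "lowest_price",
    "highest_price",
    "median_price",
    "longest_description",
    "second_longest_description",
]

def write_summary(rows, fh):
    """Write a header line plus one CSV row per search URL to an open file."""
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

if __name__ == "__main__":
    example_row = {
        "url": "widgit.com/search?author+title+year",
        "num_results": 3,
        "average_price": 62.33,
        "lowest_price": 23,
        "highest_price": 109,
        "median_price": 55,
        "longest_description": "77 pages. loads of pictures. loads of things about spiders. How to cook them. How to teach them cricket.",
        "second_longest_description": "Long book with loads of pictures about spiders. dance621 did a great job. 77pages long.",
    }
    buffer = io.StringIO()
    write_summary([example_row], buffer)
    print(buffer.getvalue())
```

To produce a real file, pass a handle from open("results.csv", "w", newline="") instead of the in-memory buffer. Descriptions containing commas are quoted automatically by the csv module, so they stay in a single cell.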
Is this a Perl job? I plan to employ a coder to sort this out, but I'm not sure where to start. Any pointers would be much appreciated. Thanks, James