...

View Full Version : Manually Crawling for Data (w/o RSS Feed)



kraftomatic
12-05-2008, 09:33 PM
Hey Guys,

I'm trying to find some information on manually crawling/pulling news data from a particular site, when it doesn't offer an RSS feed.

I currently have a tabbed panel of RSS-fed content, but there's one place I need to manually pull the news info from.

Any good place to start reading up/looking?

Thanks in advance.

AlexV
12-05-2008, 10:31 PM
Not sure if I understand you...

You want to get some data from an external website and that data isn't XML (RSS) formatted?

If so, you will need to read the whole page, parse it, and then display the parsed data on your site...

Depending on which technology you can use (PHP or something else) and what the target data looks like, the solution may vary greatly...

More details are needed :)

kraftomatic
12-07-2008, 03:22 AM
Correct. I would like to retrieve it via PHP somehow.

The site is very simple. It's a news site with headlines, and clicking a headline takes you to the specific news page. All I would like to grab is the headline text and the link it points to.

Does that help?

194673
12-07-2008, 04:49 AM
Maybe something like the cURL function in combination with regular expressions?

kraftomatic
12-07-2008, 09:05 PM
I see something like this for pulling a direct RSS feed using cURL:

http://phpsense.com/php/php-curl-functions.html

But I'm not sure how I would get the info/links directly from a specific page ..

194673
12-08-2008, 12:26 AM
Well you can download the webpage source using cURL and then use a regular expression to find the HTML tag which encloses the heading. Then you can extract the info between the tags.
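
A minimal sketch of that cURL-plus-regex approach, assuming a placeholder URL and a generic anchor-tag pattern (the real pattern would be keyed to the page's actual markup):

<?php
// Fetch the raw HTML with cURL (URL is a placeholder).
$url = 'http://www.example.com/news.htm';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up if the site is slow
$html = curl_exec($ch);
curl_close($ch);

if ($html === false) {
    die('Could not fetch the page.');
}

// Grab every link: group 1 is the href, group 2 the link text.
preg_match_all('/<a\s+href="([^"]+)"[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[2] . ' => ' . $m[1] . "\n"; // headline text => link
}
?>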

kraftomatic
12-08-2008, 04:14 PM
Well, the source code looks like this:


<LI TYPE=news><A HREF="titlename.htm">new title here</A></LI>
<LI TYPE=news><A HREF="titlename2.htm">news title 2 here</A></LI>

Do you think that's the best way to do it? I'm curious what kind of load time/delay there would be in loading the whole page ... Any thoughts?

Thanks.
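
For markup like that sample, a pattern keyed to the LI TYPE=news items would pick out just the headlines. A sketch, with a hypothetical base URL used to resolve the relative links:

<?php
// $html is assumed to already hold the downloaded page source (from cURL or similar);
// the two sample lines below just stand in for it.
$html = '<LI TYPE=news><A HREF="titlename.htm">new title here</A></LI>'
      . '<LI TYPE=news><A HREF="titlename2.htm">news title 2 here</A></LI>';

// Match each news list item and capture the href and the title text.
preg_match_all('/<li\s+type=news>\s*<a\s+href="([^"]+)">(.*?)<\/a>/i', $html, $items, PREG_SET_ORDER);

$base = 'http://www.example.com/'; // hypothetical base URL for the relative links

foreach ($items as $item) {
    echo '<a href="' . htmlspecialchars($base . $item[1]) . '">'
       . htmlspecialchars($item[2]) . "</a><br />\n";
}
?>

As for the delay: the whole page has to be downloaded either way, so the load time depends mostly on the remote server. Caching the result so it is only fetched a few times a day keeps that hit off every page view.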

mlseim
12-08-2008, 04:18 PM
I think the term you're looking for is "web page scraping".
http://www.google.com/search?hl=en&q=web+page+scraping&btnG=Google+Search&aq=f&oq=

several techniques ... some ethical issues to deal with too.

kraftomatic
12-08-2008, 04:42 PM
Well you can download the webpage source using cURL and then use a regular expression to find the HTML tag which encloses the heading. Then you can extract the info between the tags.

Could I also use fsockopen instead of cURL? Seems that's more popular with hosts ...
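
fsockopen works too; you just have to speak HTTP yourself. A rough sketch, assuming a plain HTTP site on port 80 and a placeholder host/path:

<?php
// Fetch a page over a raw socket instead of cURL.
function fetch_page($host, $path) {
    $fp = fsockopen($host, 80, $errno, $errstr, 10); // 10-second connect timeout
    if (!$fp) {
        return false; // connection failed ($errstr holds the reason)
    }

    // Minimal HTTP/1.0 request so the server closes the connection when done.
    fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: Close\r\n\r\n");

    $response = '';
    while (!feof($fp)) {
        $response .= fgets($fp, 4096);
    }
    fclose($fp);

    // Everything after the blank line is the body; before it are the headers.
    $parts = explode("\r\n\r\n", $response, 2);
    return isset($parts[1]) ? $parts[1] : '';
}

$html = fetch_page('www.example.com', '/news.htm'); // hypothetical site
?>

If the host has allow_url_fopen enabled, file_get_contents($url) is an even simpler alternative to either.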

kraftomatic
12-08-2008, 04:44 PM
I think the term you're looking for is "web page scraping".
http://www.google.com/search?hl=en&q=web+page+scraping&btnG=Google+Search&aq=f&oq=

several techniques ... some ethical issues to deal with too.

Yes, I have found similar info. It won't be an issue, as I've confirmed with the site what this will be doing. It will simply be grabbing news titles and linking directly to their site for added traffic.

The only potential issue is the bandwidth hit from grabbing the links, although I would like it to happen only a limited number of times a day.
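
One way to keep it to a few fetches a day is a simple file cache: store the scraped HTML locally and only hit the other site when the copy is stale. A sketch, assuming a writable directory and the hypothetical fetch_page() helper from the fsockopen sketch above:

<?php
$cache_file = 'news_cache.html';
$max_age    = 6 * 3600; // re-fetch at most every 6 hours

if (!file_exists($cache_file) || (time() - filemtime($cache_file)) > $max_age) {
    // Cache is missing or stale: grab a fresh copy (cURL or fsockopen, as above).
    $fresh = fetch_page('www.example.com', '/news.htm'); // hypothetical helper and URL
    if ($fresh !== false) {
        file_put_contents($cache_file, $fresh);
    }
}

$html = file_get_contents($cache_file); // run the regex over this as before
?>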

mlseim
12-08-2008, 05:48 PM
If you were able to contact the other site's owner, maybe you could have them put up an RSS feed for the data you want. That would be a win-win situation for both of you: it would save their bandwidth, and it would give you the XML you need to easily display the information (with links back to their site).

That's really the best way it should be done.
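
For comparison, if they did put up a feed, displaying it would only take a few lines. A sketch with a placeholder feed URL:

<?php
// Load the (hypothetical) feed and print each headline as a link back to their site.
$feed = simplexml_load_file('http://www.example.com/news.rss');

if ($feed !== false) {
    foreach ($feed->channel->item as $item) {
        echo '<a href="' . htmlspecialchars((string) $item->link) . '">'
           . htmlspecialchars((string) $item->title) . "</a><br />\n";
    }
}
?>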


