Old 12-05-2008, 08:33 PM   PM User | #1
kraftomatic
Regular Coder

 
Join Date: Jul 2003
Posts: 593
Thanks: 16
Thanked 0 Times in 0 Posts
kraftomatic is an unknown quantity at this point
Manually Crawling for Data (w/o RSS Feed)

Hey Guys,

I'm trying to find some information on manually crawling/pulling news data from a particular site, when it doesn't offer an RSS feed.

I currently have a tabbed panel of RSS-fed content, but there's one site I need to pull the news info from manually.

Any good place to start reading up/looking?

Thanks in advance.
Old 12-05-2008, 09:31 PM   PM User | #2
AlexV
Regular Coder

 
AlexV's Avatar
 
Join Date: Oct 2008
Location: Near Montreal, Quebec - Canada.
Posts: 214
Thanks: 5
Thanked 22 Times in 22 Posts
AlexV is an unknown quantity at this point
Not sure if I understand you...

You want to get some data from an external website and that data isn't XML (RSS) formatted?

If so, you will need to read the whole page, parse it, and then display the parsed data on your side...

Depending on which technology you can use (PHP or other) and what the target data looks like, the solution may vary greatly...

More details are needed.
Old 12-07-2008, 02:22 AM   PM User | #3
kraftomatic
Correct. I would like to retrieve it via PHP somehow.

The site is very simple: a news site with headlines. Clicking a headline takes you to the specific news page. All I would like to grab is the text of each headline and the link it points to.

Does that help?
Old 12-07-2008, 03:49 AM   PM User | #4
194673
Regular Coder

 
Join Date: Dec 2006
Posts: 166
Thanks: 9
Thanked 4 Times in 4 Posts
194673 is an unknown quantity at this point
Maybe something like the cURL functions in combination with regular expressions?
Old 12-07-2008, 08:05 PM   PM User | #5
kraftomatic
I see something like this for pulling a direct RSS feed using cURL:

http://phpsense.com/php/php-curl-functions.html

But I'm not sure how I would get the info/links directly from a specific page ..
Old 12-07-2008, 11:26 PM   PM User | #6
194673
Well you can download the webpage source using cURL and then use a regular expression to find the HTML tag which encloses the heading. Then you can extract the info between the tags.
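
A minimal sketch of that idea, assuming the headlines are ordinary anchor tags with double-quoted href attributes. The helper names fetch_page() and extract_links() are made up for illustration, and the regular expression is only one possible pattern, not a general HTML parser:

```php
<?php
// Sketch only: fetch_page() and extract_links() are hypothetical helper names.

/**
 * Download a page's HTML with cURL. Returns null on failure.
 */
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $html = curl_exec($ch);
    curl_close($ch);

    return $html === false ? null : $html;
}

/**
 * Pull every <a href="...">text</a> pair out of an HTML string.
 * Returns a list of ['href' => ..., 'title' => ...] arrays.
 */
function extract_links(string $html): array
{
    // Case-insensitive (i) and dot-matches-newline (s); assumes double-quoted hrefs.
    $pattern = '~<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // strip_tags() drops any markup nested inside the link text.
        $links[] = ['href' => $m[1], 'title' => trim(strip_tags($m[2]))];
    }
    return $links;
}
```

Typical use would be something like extract_links(fetch_page($url) ?? ''), where $url is the news page. Note that this grabs every link on the page, so the pattern would need narrowing once the real markup is known.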
Old 12-08-2008, 03:14 PM   PM User | #7
kraftomatic
Well, the source code looks like this:

Code:
<LI TYPE=news><A HREF="titlename.htm">new title here</A></LI>
<LI TYPE=news><A HREF="titlename2.htm">news title 2 here</A></LI>
Do you think that's the best way to do it? I'm curious what kind of load time/delay there would be in loading the whole page ... Any thoughts?

Thanks.
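
For markup shaped like those two lines, a DOM parser is one possible alternative to a regular expression. The sketch below assumes PHP's DOM extension is available; parse_headlines() is a hypothetical helper name, and the XPath query is tailored to the snippet above rather than to the real page, which may differ:

```php
<?php
// Sketch only: parse_headlines() is a hypothetical helper name.

/**
 * Extract headline text and link targets from markup shaped like
 * <LI TYPE=news><A HREF="...">...</A></LI>.
 * Returns a list of ['href' => ..., 'title' => ...] arrays.
 */
function parse_headlines(string $html): array
{
    // Collect parser warnings internally so sloppy HTML does not spam the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $headlines = [];
    // The HTML parser lowercases tag and attribute names, so the query is lowercase.
    foreach ($xpath->query('//li[@type="news"]/a[@href]') as $a) {
        $headlines[] = [
            'href'  => $a->getAttribute('href'),
            'title' => trim($a->textContent),
        ];
    }
    return $headlines;
}
```

On load time: for a small page like this, the network fetch is likely to cost far more than the parsing, so the choice between regex and DOM probably matters less than how often the page is requested.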
Old 12-08-2008, 03:18 PM   PM User | #8
mlseim
Master Coder

 
mlseim's Avatar
 
Join Date: Jun 2003
Location: Cottage Grove, Minnesota
Posts: 9,045
Thanks: 8
Thanked 1,029 Times in 1,020 Posts
mlseim has a spectacular aura about
I think the term you're looking for is "web page scraping".
http://www.google.com/search?hl=en&q...earch&aq=f&oq=

several techniques ... some ethical issues to deal with too.
Old 12-08-2008, 03:42 PM   PM User | #9
kraftomatic
Quote:
Originally Posted by 194673
Well you can download the webpage source using cURL and then use a regular expression to find the HTML tag which encloses the heading. Then you can extract the info between the tags.
Could I also use fsockopen instead of cURL? It seems to be more widely supported by hosts ...
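
As a rough sketch of what the fsockopen route might look like: with a raw socket the HTTP request has to be written by hand and the response headers stripped off. The helper names build_request(), split_response() and fetch_with_socket() are hypothetical, and the code assumes plain HTTP on port 80 with no redirects, status checking, or HTTPS:

```php
<?php
// Sketch only: build_request(), split_response() and fetch_with_socket()
// are hypothetical helper names.

/** Build a minimal HTTP/1.0 GET request (servers should not send chunked bodies to 1.0 clients). */
function build_request(string $host, string $path): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n\r\n";
}

/** Split a raw HTTP response into [headers, body] at the first blank line. */
function split_response(string $raw): array
{
    $parts = explode("\r\n\r\n", $raw, 2);
    return [$parts[0], $parts[1] ?? ''];
}

/** Fetch a page over a plain socket. Returns the body, or null on failure. */
function fetch_with_socket(string $host, string $path = '/', int $timeout = 10): ?string
{
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return null; // connection failed; $errstr holds the reason
    }
    fwrite($fp, build_request($host, $path));
    $raw = stream_get_contents($fp); // read until the server closes the connection
    fclose($fp);
    if ($raw === false) {
        return null;
    }

    [, $body] = split_response($raw);
    return $body;
}
```

Where the host enables allow_url_fopen, file_get_contents() with a URL is usually a shorter way to get the same result.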
Old 12-08-2008, 03:44 PM   PM User | #10
kraftomatic
Quote:
Originally Posted by mlseim
I think the term you're looking for is "web page scraping".
http://www.google.com/search?hl=en&q...earch&aq=f&oq=

several techniques ... some ethical issues to deal with too.
Yes, I have found similar info. It won't be an issue, as I've confirmed with the site what this will be doing. It will simply grab news titles and link directly to their site for added traffic.

The only potential issue is the bandwidth hit in grabbing the links, so I would like to limit it to a small number of fetches a day.
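
One way to cap the number of fetches is a small file cache, sketched below. cached_fetch() is a hypothetical helper; the cache file location and lifetime are assumptions to be tuned, and $fetch can be any function that returns the page HTML or null on failure:

```php
<?php
// Sketch only: cached_fetch() is a hypothetical helper; the cache path and
// lifetime passed to it are assumptions.

/**
 * Return the cached copy if it is younger than $maxAge seconds;
 * otherwise call $fetch(), store its result, and return that.
 * If the fetch fails, fall back to a stale copy when one exists.
 */
function cached_fetch(string $cacheFile, int $maxAge, callable $fetch): ?string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not a stale cached value

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    $fresh = $fetch();
    if ($fresh !== null) {
        file_put_contents($cacheFile, $fresh, LOCK_EX); // lock to avoid interleaved writes
        return $fresh;
    }

    // Fetch failed: serve the old copy rather than nothing.
    if (is_file($cacheFile)) {
        $stale = file_get_contents($cacheFile);
        return $stale === false ? null : $stale;
    }
    return null;
}
```

With a lifetime of 6 * 3600 seconds, for example, the remote site would be requested at most about four times a day no matter how many visitors load the panel.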
Old 12-08-2008, 04:48 PM   PM User | #11
mlseim
If you were able to contact the other site's owner, maybe you could ask them to publish an RSS feed for the data you want. That would be a win-win for both of you: it would reduce their bandwidth, and it would give you the XML you need to easily display the information (with links back to their site).

That's really the best way to do it.
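
If the site did publish a feed, reading it could be as short as the sketch below. It assumes a standard RSS 2.0 layout (channel, item, title, link) and PHP's SimpleXML extension; rss_items() is a hypothetical helper name:

```php
<?php
// Sketch only: rss_items() is a hypothetical helper name and assumes an
// RSS 2.0 feed with <channel><item><title>/<link> elements.

/**
 * Turn an RSS 2.0 document into a list of ['href' => ..., 'title' => ...] arrays.
 * Returns an empty list if the XML cannot be parsed or has no items.
 */
function rss_items(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // keep parse errors out of the output
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($feed === false || !isset($feed->channel->item)) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'href'  => trim((string) $item->link),
            'title' => trim((string) $item->title),
        ];
    }
    return $items;
}
```

Compared with scraping, this keeps working when the site's page layout changes, as long as the feed itself stays stable.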