Stripping dynamic HTML from a string...



thomasgrant
02-17-2010, 06:41 PM
In a project I am working on at the moment, I am looking to pull specific data from ESPN's RSS feed. This was working fine until they recently changed the format for their feed. I now have to adjust my code to get the result I had before.

Here is a current example of one item from ESPN's Top News feed:


<item>
<dc:creator><![CDATA[ESPN.com news services]]></dc:creator>
<title><![CDATA[Donte Stallworth to sign contract with Baltimore Ravens, source says]]></title>
<description><![CDATA[<a href="http://api.tweetmeme.com/share?url=http://sports.espn.go.com/nfl/news/story?id=4921333&amp;campaign=rss&amp;source=ESPNHeadlines&amp;service=tinyurl.com&amp;source=espn"><img style="padding-left:10px;" align="right" border="0" style="border:none;" src="http://api.tweetmeme.com/imagebutton.gif?url=http://sports.espn.go.com/nfl/news/story?id=4921333&amp;campaign=rss&amp;source=ESPNHeadlines" height="49" width="41" /></a>Wide receiver Donte Stallworth, whose contract with Cleveland was terminated last week, will sign a one-year contract with the Baltimore Ravens, a league source told ESPN NFL Insider Adam Schefter.]]></description>
<pubDate>Wed, 17 Feb 2010 09:10:52 PST</pubDate>
<guid>http://sports.espn.go.com/nfl/news/story?id=4921333&amp;campaign=rss&amp;source=ESPNHeadlines</guid>
<link>http://sports.espn.go.com/nfl/news/story?id=4921333&amp;campaign=rss&amp;source=ESPNHeadlines</link>
</item>

The problem I'm encountering is that I want to pull only the text summary from the description field. They now include HTML as well: a linked image. I want to strip this away and leave just the text summary of the news item. In the above example, I want to end up with just: Wide receiver Donte Stallworth, whose contract with Cleveland was terminated last week, will sign a one-year contract with the Baltimore Ravens, a league source told ESPN NFL Insider Adam Schefter.

What is the best approach to this? Should I use the preg_split() function to pull out all of the HTML with a regular expression? Or is this example not suited to regular expressions?

I don't necessarily want someone to do the work for me, just point me in the right direction. I'd like the challenge of figuring it out for myself.

Thanks!

Fou-Lu
02-17-2010, 06:54 PM
Can you pull the full result as HTML? If so, use strip_tags() to remove any HTML from it. Since the only data you don't want is the elements themselves and not any text within them, strip_tags() should remove both the a and img tags and leave the summary.
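
For anyone reading later, here is a minimal sketch of that suggestion. It assumes the description has already been read out of the feed as a plain string; the helper name summary_from_description() and the shortened sample markup with example.com URLs are made up for illustration and are not ESPN's actual feed content.

```php
<?php
// Hypothetical helper (not from the thread): reduce a feed description
// to its plain-text summary.
function summary_from_description($description)
{
    // strip_tags() removes every HTML tag, including the attributes inside it,
    // and keeps the text between tags. trim() drops stray whitespace at the ends.
    return trim(strip_tags($description));
}

// Made-up, shortened stand-in for the description field shown in the question.
$description = '<a href="http://example.com/share"><img src="http://example.com/button.gif" height="49" width="41" /></a>'
             . 'Wide receiver Donte Stallworth will sign a one-year contract with the Baltimore Ravens.';

echo summary_from_description($description), "\n";
// prints: Wide receiver Donte Stallworth will sign a one-year contract with the Baltimore Ravens.
```

Note that strip_tags() does not decode entities, so if the summary text itself ever contains something like &amp;amp; you would probably want to run html_entity_decode() on the result as well.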

thomasgrant
02-17-2010, 10:03 PM
I was unaware of that particular function. That worked quite well, thank you.

Fou-Lu
02-17-2010, 10:25 PM
No problem, glad it worked out!


