How do you parse XML data?



ptmuldoon
04-30-2009, 04:49 PM
I'm trying to learn how to parse some xml for further use to possibly create some charts, and am pretty lost at how to proceed. The plan is to parse the xml into a new format to be used for charting.

I've started searching the boards, but still feel overwhelmed at this point.

I can view the XML by visiting the URL directly and appending &xml to it. For example, Report 77 is:

http://501.synsport.com/index.php?id=77&xml

In searching the boards, I tried to start small with the code below, but I'm getting a number of errors right from the start.



$xmlstr = file_get_contents('http://501.synsport.com/index.php?id=77&xml'); // read your file

$xml = new SimpleXMLElement($xmlstr);

And the first error along with a bunch of others is:

Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: Entity: line 12: parser error : Entity 'nbsp' not defined in C:\wamp\www\xml parse\test.php on line 5

TheShaner
04-30-2009, 05:01 PM
I believe the problem is that the file you're pointing to is not a real XML file. Go to that link again and do a View Source. That's the contents the XML parser is attempting to read, which you can see is not a proper XML file. It's XML contents written out for the web browser to display.

Instead, try pointing to a real XML file:
http://www.w3schools.com/XML/cd_catalog.xml

-Shane

CFMaBiSmAd
04-30-2009, 05:04 PM
Edit: Basically the same as above ^^^

Do a view source of that page and you will find out why. It is not actually an XML document. It is an HTML page that has dumped XML data between <pre> </pre> tags. I'm not sure whether the <pre> tags did it or they are running htmlentities() on it, but the < and > are actually & lt; and & gt; (without the spaces).

ptmuldoon
04-30-2009, 06:15 PM
Thanks for the help guys.

So, looking at the source data, I need to convert the HTML entities back to their characters (< and >, etc.) before continuing.

Now can I do that with file_get_contents? Or would I be better off using fsockopen to read the file?

I've been trying to use fsockopen, but I keep having problems with the url including http:// even though allow_url_fopen is set to On in my php.ini settings.

CFMaBiSmAd
04-30-2009, 06:33 PM
I would use preg_match() to get everything between the <pre> </pre> tags, then use html_entity_decode() to get it to a usable form to supply to the simple XML function.

Edit: you will also find by echoing the result of the file_get_contents() that you must be using cookies to access that page. A browser can, but a php script would need to use curl with cookies.

Specifically, you will get -


Cookie Scan Error
Synsport uses cookies to identify returning league owners so they don't need to log in every time they visit. We previously provided an alternative, but now we require browsers to accept cookies, and we detect that this browser is not accepting cookies. If you wish to continue, your options are to either configure this browser to accept cookies from Synsport or use another browser on your computer that is already configured to accept cookies.
Q1. Why did you disable the alternative identification method of passing the session identifier through the URL?
It's because of the evil spidering robots. Long ago we realized that with our millions of valid URLs, the Google bots were eating up tons of bandwidth and loading the servers, so we changed the robot.txt instructions to tell Google and Yahoo and the other search engines not to index our site, and they respect our wishes. However, there are many evil bots out there scanning sites for private information and email addresses, looking to send spam to those accounts later. They obviously ignore our wishes to be left alone by robots. Many of these spiders won't accept cookies, so by requiring them, we cut out a large percentage of these intruders. The ones that present valid cookies can be tracked by our sessions, so we can ban those that are hitting the servers at a rate faster than 1 page per second, which is many times faster a legitimate user can browse. It's a war against the spammers, and the dumb bots will get stuck on this page.
Q2. I use Internet Explorer 7.0. How do I enable cookies this browser?
Choose “Internet Options” from the Tools menu in IE 7.0
Click on the “Privacy” tab.
Click the “Default” button (or manually slide the bar down to “Medium”) under “Settings”.
Click “OK“.
Q3. I use Firefox 2+. How do I enable cookies this browser?
From the Tools Menu, click Options
From the Options window, click Privacy
Under Cookies check Accept cookies from sites
Click OK
Q4. I use Opera 9+. How do I enable cookies this browser?
From the Tools Menu, click Quick Preferences
Check "Enable Cookies"
Q5. I use Macintosh Safari 2+. How do I enable cookies this browser?
From the Safari menu, select Preferences
Select Securities
Select Accept Cookies
Check "Only from sites you navigate to"

ptmuldoon
04-30-2009, 07:45 PM
Thanks again for the help, and I'm making some progress. Is using preg_match the best way to find the beginning and ending <pre> tags to get the xml data only?

Here's where I am now


/* STEP 1. let’s create a cookie file */
$ckfile = tempnam ("/tmp", "CURLCOOKIE");

/* STEP 2. visit the homepage to set the cookie properly */
$ch = curl_init ("http://501.synsport.com/index.php");
curl_setopt ($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec ($ch);

/* STEP 3. visit cookiepage.php */
$ch = curl_init ("http://501.synsport.com/index.php?id=77&xml");
curl_setopt ($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);

$output = curl_exec ($ch);

if (preg_match("<pre>", $output))
{
echo "Found Start <br/>";
}

if (preg_match("</pre>", $output))
{
echo "Found End <br/>";
}


$a = html_entity_decode($output);
echo $a;

CFMaBiSmAd
04-30-2009, 08:06 PM
Yes (the function is something I found, seems to work as expected) -


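function between_tags($string, $tagname)
{
    // Non-greedy match, so only the first <tag>...</tag> pair is captured.
    $pattern = "/<$tagname>(.*?)<\/$tagname>/is";
    if (preg_match($pattern, $string, $matches)) {
        return $matches[1];
    }
    return false; // tag not found
}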

... your code ...

$output = curl_exec ($ch);
$output = between_tags($output,'pre');
$output = html_entity_decode($output,ENT_QUOTES);
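
To try the extract-and-decode step offline, here is a minimal sketch. The $html string is a made-up stand-in for the real page (its actual markup may differ), and extract_pre_xml() is a hypothetical helper name, not something from the site or from PHP itself.

```php
<?php
// Hypothetical helper: pull the text between the first <pre>...</pre> pair
// and turn the HTML entities back into real characters.
function extract_pre_xml($html)
{
    if (!preg_match('/<pre>(.*?)<\/pre>/is', $html, $matches)) {
        return null; // no <pre> block found
    }
    return html_entity_decode(trim($matches[1]), ENT_QUOTES);
}

// Made-up stand-in for the downloaded page: XML escaped inside <pre> tags.
$html = '<html><body><pre>&lt;?xml version="1.0"?&gt;
&lt;Synsport report="77"&gt;&lt;var name="WEEK"&gt;1&lt;/var&gt;&lt;/Synsport&gt;</pre></body></html>';

$decoded = extract_pre_xml($html);
$xml = new SimpleXMLElement($decoded);

echo 'Report: ' . $xml['report'] . ', week: ' . $xml->var;
```

If the page ever contains more than one <pre> block, this only picks up the first one.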

ptmuldoon
05-01-2009, 04:20 PM
OK, back at this some more today. Now that I've learned how to use curl a little, I've moved my testing offline with a sample data.xml taken from the original file, and I'm trying to parse/process the data.

What I can't figure out is how to parse some of the XML data that has spaces and quotes (" ") in it.

Sample XML data from file


<?xml version="1.0" encoding="ISO-8859-1"?>
<Synsport report="77">
<sample>
<var>TEST 1 AREA</var>
<var name="STAGE_STATUS">Week</var>
</sample>
<sample>
<var>TEST IN HERE</var>
<var name="STAGE_STATUS">Week</var>
</sample>
</Synsport>


And current code to read and parse the file.


$xmlstr = file_get_contents('data.xml'); // read your file

$xml = new SimpleXMLElement($xmlstr);

foreach ( $xml->sample as $sample ) {
$vartest = $sample->var;
echo $vartest . '<br/>';

//How do I include <var name="STAGE_STATUS"> ???
}

TheShaner
05-01-2009, 04:47 PM
You would get the attributes for the tag using:

$sample->var->attributes();

http://us.php.net/manual/en/function.simplexml-element-attributes.php

You should probably read over this whole section here. It'll explain just about everything you need to know in order to parse XML using SimpleXML:
http://us.php.net/manual/en/book.simplexml.php

-Shane

CFMaBiSmAd
05-01-2009, 04:49 PM
Here is how you access the name attribute directly -

$attrib = $xml->sample[0]->var[1]->attributes();
echo $attrib['name'];

It would take some experimenting to incorporate that into your existing code.
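
As a starting point for that experimenting, here is a small sketch that lists every attribute name and value on one element by looping over attributes(). The inline $data is just the sample XML from the earlier post, and $pairs is a hypothetical variable used to collect the output.

```php
<?php
// Sample XML from the earlier post, inlined so the sketch is self-contained.
$data = '<?xml version="1.0" encoding="ISO-8859-1"?>
<Synsport report="77">
<sample>
<var>TEST 1 AREA</var>
<var name="STAGE_STATUS">Week</var>
</sample>
<sample>
<var>TEST IN HERE</var>
<var name="STAGE_STATUS">Week</var>
</sample>
</Synsport>';

$xml = new SimpleXMLElement($data);

// Second <var> of the first <sample>.
$var = $xml->sample[0]->var[1];

// attributes() can be looped over as attribute name => attribute value.
$pairs = array();
foreach ($var->attributes() as $attrName => $attrValue) {
    $pairs[] = $attrName . '="' . $attrValue . '"';
}

echo implode(', ', $pairs) . ' => ' . $var;
```

An element without attributes, such as the first <var>, simply produces an empty loop.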

ptmuldoon
05-01-2009, 09:31 PM
I feel I should know this better than I do, yet I can't seem to figure out how you echo out both the attribute name and its value. I think/feel once I grasp that I should be able to begin manipulating the data for graphing and chart presentation (probably with xml/swf charts).

Updated sample XML and test code, combined into one file for easier testing.


$data = '<?xml version="1.0" encoding="ISO8859-1" ?>
<Synsport report="77">
<sample>
<var name="DESC">TEST 1 AREA</var>
<var name="WEEK">Week 1</var>
<var name="SCORE">88</var>
</sample>
<sample>
<var name="DESC">TEST 2 AREA</var>
<var name="WEEK">Week 2</var>
<var name="SCORE">96</var>
</sample>
</Synsport>';

$xml = new SimpleXMLElement($data);

foreach ( $xml->sample as $sample ) {
//How do you echo out the attribute name and value for each ???

$attrib = $xml->sample->var->attributes();
$value = $sample->var;
echo $attrib . '= ' . $value . '<br/>';

}

ptmuldoon
05-02-2009, 09:23 PM
Still learning here with parsing some xml data:)

Now why would the below only loop and give me the first set of sample data, Week 1 only, and not continue and give me Week 2 as well? It will loop, but it shows me Week 1 data twice.

Is there an easier way than placing foreach statements inside foreach statements?



$data = '<?xml version="1.0" encoding="ISO8859-1" ?>
<Synsport report="77">
<sample>
<var name="DESC">TEST 1 AREA</var>
<var name="WEEK">Week 1</var>
<var name="SCORE">88</var>
</sample>
<sample>
<var name="DESC">TEST 2 AREA</var>
<var name="WEEK">Week 2</var>
<var name="SCORE">96</var>
</sample>
</Synsport>';

$xml = new SimpleXMLElement($data);

foreach ( $xml->sample as $sample ){

foreach ($xml->sample->var as $var) {
echo $var['name'] . ' is ' . $var .'<br/>';
}
}

CFMaBiSmAd
05-02-2009, 09:31 PM
Because your inner foreach() loop is not using what you think it is. It should be -
foreach ($sample->var as $var) {
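
Put together with your sample data, the whole thing might look like the sketch below. Two things here are my own assumptions: I wrote the encoding label as ISO-8859-1, and I collect the lines in a $lines array only so the result is easy to inspect.

```php
<?php
$data = '<?xml version="1.0" encoding="ISO-8859-1" ?>
<Synsport report="77">
<sample>
<var name="DESC">TEST 1 AREA</var>
<var name="WEEK">Week 1</var>
<var name="SCORE">88</var>
</sample>
<sample>
<var name="DESC">TEST 2 AREA</var>
<var name="WEEK">Week 2</var>
<var name="SCORE">96</var>
</sample>
</Synsport>';

$xml = new SimpleXMLElement($data);

$lines = array();
foreach ($xml->sample as $sample) {
    // Loop over the <var> children of the CURRENT <sample>,
    // not $xml->sample, which always refers to the first one.
    foreach ($sample->var as $var) {
        $lines[] = $var['name'] . ' is ' . $var;
    }
}

echo implode('<br/>', $lines);
```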

ptmuldoon
05-02-2009, 10:49 PM
Sweet, thanks, and that makes perfect sense now that you pointed it out.

Now...continuing down the path. I assume it's more complex to dig deeper into the XML data as additional layers/children are added?

Looking back at the original xml file that I will be starting with here: http://501.synsport.com/index.php?id=77&xml

You can see that deeper into the xml (middle of the xml data), you have something like this


<block level="1" name="ScoringGrid">
<row level="2" number="0">
<block level="3" name="Header">
<row level="4" number="0">
<var name="WEEK">1</var>
</row>
<row level="4" number="1">
<var name="WEEK">2</var>
</row>
<row level="4" number="2">
<var name="WEEK">3</var>
</row>
<row level="4" number="3">
<var name="WEEK">4</var>
</row>
......... Continuing on.


I'm unsure how the levels work. Would I need to keep nesting additional foreach loops inside each other to get at the data I want to pull out?

I hope to eventually pull out enough data into a new xml file for eventual graphing.

FYI........This is all for fun for my fantasy-football league. Just something to keep learning with :)

CFMaBiSmAd
05-02-2009, 11:29 PM
If you use print_r on the object (at any level) it makes it easier to see how to access the available data and attributes -

echo "<pre>", print_r($xml,true), "</pre>";

or

echo "<pre>", print_r($xml->attributes(),true), "</pre>";
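
For the nested block/row levels, here is a rough sketch of two ways to reach the WEEK values: chaining the element names, or asking for them with xpath(). The $data below is a shortened, closed-off version of your fragment wrapped in a Synsport root. That wrapper is an assumption on my part, so the paths may need adjusting for the real file.

```php
<?php
// Shortened version of the fragment; the root element is an assumption.
$data = '<Synsport report="77">
<block level="1" name="ScoringGrid">
<row level="2" number="0">
<block level="3" name="Header">
<row level="4" number="0"><var name="WEEK">1</var></row>
<row level="4" number="1"><var name="WEEK">2</var></row>
<row level="4" number="2"><var name="WEEK">3</var></row>
</block>
</row>
</block>
</Synsport>';

$xml = new SimpleXMLElement($data);

// Option 1: chain the element names down to the level-4 rows.
$weeks = array();
foreach ($xml->block->row->block->row as $row) {
    $weeks[] = (string) $row->var;
}

// Option 2: let XPath find the <var> elements by their attributes.
$viaXpath = array();
$query = '//block[@name="Header"]/row/var[@name="WEEK"]';
foreach ($xml->xpath($query) as $var) {
    $viaXpath[] = (string) $var;
}

echo implode(', ', $weeks);
```

The xpath() version saves you from writing one foreach per level, at the cost of having to know the attribute names you are looking for.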

ptmuldoon
05-05-2009, 10:08 PM
Thanks again for the help. I was actually already printing/echoing out the array info for guidance as well. I've learned to use foreach loops within each other to reach the appropriate data.

Now, my next question is how to move from echoing the data I want to writing it into a new XML file with new tags. For example, within my code I have the loop below, but now I want to create the new XML data rather than echo it. It's probably pretty simple, and perhaps I'm just not thinking clearly today. I did try the asXML (http://us3.php.net/manual/en/function.simplexml-element-asXML.php) function, but I'm not sure that is the right approach, as I couldn't get it to work.


foreach ($level_6->var as $var)
{
if($var['name'] == 'WEEKLY_SCORE')
{
echo '<number>'.$var.'</number>';
}
}
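
For reference, one possible way to do this is to build a second SimpleXMLElement with addChild() and then call asXML() on it. This is only a sketch: the $level_6 element is faked with a small made-up string, and the chart_data root name is hypothetical.

```php
<?php
// Made-up stand-in for the $level_6 element used in the loop above.
$level_6 = new SimpleXMLElement(
    '<row>'
    . '<var name="WEEKLY_SCORE">88</var>'
    . '<var name="OTHER">ignored</var>'
    . '<var name="WEEKLY_SCORE">96</var>'
    . '</row>'
);

// New, empty document; the root name "chart_data" is hypothetical.
$chart = new SimpleXMLElement('<chart_data/>');

foreach ($level_6->var as $var) {
    if ((string) $var['name'] === 'WEEKLY_SCORE') {
        // Instead of echoing, append a <number> element to the new document.
        $chart->addChild('number', (string) $var);
    }
}

// With no argument asXML() returns the document as a string;
// passing a filename, e.g. $chart->asXML('chart.xml'), writes a file instead.
$newXml = $chart->asXML();
echo htmlspecialchars($newXml);
```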

ptmuldoon
05-13-2009, 06:18 PM
Hey guys. Sorry for bringing this older post up again, but can someone explain how I would get the value of the 'number' attribute from an XML file like the one below?



<block level="3" name="Averages">
<row level="4" number="0">
<var name="AVERAGE">74.6</var>
</row>
<row level="4" number="1">
<var name="AVERAGE">81.4</var>
</row>

etc, etc.
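
For reference, attributes on a SimpleXML element can be read with array-style access, so the 'number' value should be available as $row['number']. A minimal sketch, assuming the fragment is closed off after two rows and parsed on its own ($averages is a hypothetical variable):

```php
<?php
// The fragment from above, closed off so it parses on its own.
$data = '<block level="3" name="Averages">
<row level="4" number="0">
<var name="AVERAGE">74.6</var>
</row>
<row level="4" number="1">
<var name="AVERAGE">81.4</var>
</row>
</block>';

$block = new SimpleXMLElement($data);

$averages = array();
foreach ($block->row as $row) {
    // Array-style access reads an attribute; cast it to get a plain value.
    $number = (int) $row['number'];
    $averages[$number] = (float) $row->var;
}

foreach ($averages as $number => $average) {
    echo 'Row ' . $number . ': ' . $average . '<br/>';
}
```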


