
What's currently the best way to parse XML files?


WA
11-07-2002, 12:22 PM
I was reading up on a few tutorials on XML support in PHP, specifically, parsing XML files, and was wondering, what's currently the best way to go about the task? I understand there's SAX, DOM, and PRAX, though based on this article (http://www.wdvl.com/Authoring/Languages/PHP/Pro/prophp4_4.html), all three seem to have rather serious shortcomings.

Is there a more generic, default way of parsing XML in PHP, such as writing your own functions? Ideally something that simulates DOM, but without the problems mentioned in the article.

Thanks,

brothercake
11-07-2002, 12:47 PM
My understanding is that PRAX and DOM are both unproven technologies - PRAX has problems with large or complex documents, and DOM has memory leaks.

Of the three, only DOM has native XML-writing capabilities - and of course the fact that it can use XSL makes it inherently more attractive. So I think for the future, the DOM module is looking good; perhaps another 6-12 months of dev cycle would be enough to get it to professional reliability.

But for now, I've been using the SAX method - well documented and described at http://www.php.net/manual/en/ref.xml.php - and it has proved perfectly stable and reliable. I have also heard of issues with this, where it's considered resource-heavy, but I haven't had any actual problems so far.


Of course .... you could always open the XML doc using file() and then parse it with preg_split :D

WA
11-09-2002, 01:24 PM
Thanks brothercake. Your post has gotten me using SAX instead, which I thought would be a lot more work relative to the DOM, but has turned out ok so far.

Does anyone here use (or has anyone used) SAX on a popular site to parse large XML files (e.g. 10k), and if so, how does SAX perform in terms of efficiency and stability on the server?

Thanks,

brothercake
11-09-2002, 07:29 PM
I agree, it's proved very useful and comparatively easy. The thing that initially confused me is how the nodes are parsed - not in DOM order but as "all the open tags, then all the tag data, then all the close tags" - once I'd grokked that it was simple. It's also worth remembering, which the php.net documentation mentions, that tag data is not necessarily a string - it might be an array.
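
To make the handler order concrete, here is a minimal sketch that logs each SAX callback as it fires and collects the text of every title element. The sample XML, the helper name sax_trace and the element names are all made up for illustration; it also uses closures, which need a much newer PHP than the one discussed in this thread. Element names arrive in upper case because case folding is on by default. The text is appended to a buffer rather than assigned, since the character data handler can be called more than once for a single element, which may be what the remark about tag data is getting at.

```php
<?php
// Hypothetical helper: logs SAX events in firing order and collects <title> text.
function sax_trace($xml)
{
    $events = array();   // handler calls, in the order they fire
    $titles = array();   // text of each <title> element
    $buffer = '';        // text gathered since the last open or close tag

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$events, &$buffer) {
            $events[] = "open $name";
            $buffer = '';
        },
        function ($p, $name) use (&$events, &$titles, &$buffer) {
            if ($name === 'TITLE') {
                $titles[] = $buffer;
            }
            $events[] = "close $name";
            $buffer = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$buffer) {
            // May fire several times for one element, so append.
            $buffer .= $data;
        }
    );

    $ok = xml_parse($parser, $xml, true);
    $error = $ok ? null : xml_error_string(xml_get_error_code($parser));

    return array('events' => $events, 'titles' => $titles, 'error' => $error);
}

// Made-up sample document.
$result = sax_trace(
    '<news><item><title>First</title></item><item><title>Second</title></item></news>'
);
print_r($result['events']);
print_r($result['titles']);
```

For a single chain of nested tags the log reads as opens, then data, then closes; across sibling items the pattern simply repeats in document order.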

I can show you a couple of things:

http://www.mori.com/news.phtml - each of those items is an XML block from an 11k document (about 25 items). It gets around 400 views per day, and has performed admirably. No reported problems, and subjectively no slower than other, static pages of a similar size on the same site.

Also http://www.brothercake.com/Ref/MORI_XML/articles.phtml - this draws its data from a 160k XML doc with around 700 blocks of data. The data is parsed into an array and then sorted according to GET criteria. It works ... but it's kinda slow and so far still in dev, so I don't know how it will perform when used extensively. I'm gonna ask our server admin to "keep an eye on it ..." This is probably beyond what's reasonable :rolleyes:
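
A rough sketch of the "parse into an array, then sort by GET criteria" step might look like the following. It is not the code behind the page above: the record fields (title, date), the sort query parameter and the helper name sort_records are assumptions for illustration. The requested field is checked against a whitelist so an arbitrary query string cannot pick the sort key.

```php
<?php
// Hypothetical helper: sort parsed records by a whitelisted field name.
function sort_records(array $records, $field, array $allowed = array('title', 'date'))
{
    // Fall back to the first allowed field if the request names anything else.
    if (!in_array($field, $allowed, true)) {
        $field = $allowed[0];
    }
    usort($records, function ($a, $b) use ($field) {
        return strcmp($a[$field], $b[$field]);
    });
    return $records;
}

// Made-up records standing in for the array built by the SAX handlers.
$records = array(
    array('title' => 'Beta',  'date' => '2002-11-01'),
    array('title' => 'Alpha', 'date' => '2002-11-05'),
    array('title' => 'Gamma', 'date' => '2002-10-20'),
);

// "sort" is an assumed query parameter name, e.g. articles.phtml?sort=date
$field  = isset($_GET['sort']) ? $_GET['sort'] : 'title';
$sorted = sort_records($records, $field);
```

Sorting 700 records this way is cheap; if a page like this is slow, the parse itself is the more likely cost, which is where caching the parsed array helps.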

WA
11-11-2002, 01:15 AM
Quite impressive brothercake! Certainly gives me new ideas and perspective into using PHP SAX.

For my application, I'll be dealing with an XML file that doesn't change often (e.g. maybe once every few days), so my current thinking is to parse it only once a day, and store the results as an array, serialized in a text file. Then for subsequent requests throughout the day, I'll simply unserialize and display what's in this text file. Since the XML file is remotely located, this should save on my bandwidth as well.
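
The once-a-day idea could be sketched roughly as below, assuming the parsed result is a plain array. The helper name cached_data, the cache path and the stand-in loader are all hypothetical; in practice the loader would fetch the remote XML and run it through the SAX handlers. file_put_contents() needs PHP 5 or later; on older versions fopen()/fwrite() would do the same job.

```php
<?php
// Hypothetical helper: return cached data, calling $loader again only when
// the cache file is missing, unreadable, or older than $maxAge seconds.
function cached_data($cacheFile, $maxAge, $loader)
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        $data = unserialize(file_get_contents($cacheFile));
        if ($data !== false) {
            return $data;
        }
    }
    // Cache miss: do the expensive fetch-and-parse, then store the result.
    $data = call_user_func($loader);
    file_put_contents($cacheFile, serialize($data), LOCK_EX);
    return $data;
}

// Stand-in loader that counts how often the "expensive" work runs.
$calls  = 0;
$loader = function () use (&$calls) {
    $calls++;
    return array(array('title' => 'First'), array('title' => 'Second'));
};

$cacheFile = tempnam(sys_get_temp_dir(), 'xmlcache');
unlink($cacheFile);                                  // start with no cache

$first  = cached_data($cacheFile, 86400, $loader);   // parses and writes the cache
$second = cached_data($cacheFile, 86400, $loader);   // served from the text file
```

Using the file's modification time as the freshness check means no separate timestamp has to be stored alongside the data.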

brothercake
11-11-2002, 05:38 PM
Yeah that makes sense. We're doing something similar on our intranet, where the XML file is opened from the web server every half an hour, and saved as a delimited text file on the intranet server; this is purely for bandwidth reduction, and makes a big difference.