Why, whitespace, why?

05-10-2006, 10:14 AM
I'm trying to figure out how to read XML into JavaScript, and, frustrating as that is on its own, what really boggles my mind is Mozilla's default of NOT ignoring whitespace! I realize this may have its applications, but for the sake of my sanity (not to mention cross-browser compatibility), I NEED to parse my XML document WITHOUT whitespace!

All I want to do, essentially, is read in an XML document that has, let's say, 100 or so <character> nodes off the root, and output their text values into the HTML. Not so hard, right? But if I make an XML document that I can actually READ (so that I don't go insane), I'm going to wind up with many more than 100 nodes thanks to reading in the whitespace... what can I do about this?

I've seen custom functions that will remove the whitespace nodes for me, but certainly there's an easier way to do this?!?

I can't imagine implementations not anticipating this need with XML.

05-11-2006, 08:26 PM
Well, the reason why no rational implementation of an XML DOM ignores text nodes that contain only whitespace should be pretty clear once you add some other facts:

- In XML, the XML application is not bound to the engine, it's bound to the document, entirely unlike the HTML engines.
- For HTML, it's reasonable to remove whitespace nodes from the DOM when they are not significant. There are good arguments either way.
- XML has entirely different settings. Since the XML application and the XML engine are unconnected, the engine must defer the handling of whitespace.
- In XML, only validating engines have any knowledge at all about the whitespace handling of the XML application.
- But that doesn't help much since CSS allows changing the whitespace treatment, and whether whitespace between elements is significant or not.

(XML application = XML language, not "computer program handling XML")

In short, for XML there isn't really any choice. Keep the whitespace in the DOM or lose functionality.

As for your problem, there are a slew of solutions to it:
- Do you know the element types you want to handle? Then getElementsByTagName (for namespace-unaware engines like that in iew) and getElementsByTagNameNS (for namespace-aware engines, like pretty much everything else) might be a good solution.
- Do you know what element types may contain significant whitespace? If so, you can filter out the whitespace nodes from the childNodes nodelist beforehand.
- You could also do a simple check on each node for nodeType to tell whether it is an element or not, and jump to the next one if it isn't.
- XPath, NodeIterators or TreeWalkers should be able to do the filtering you want effectively, but are not supported by iew.
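As a rough illustration of the nodeType approach from the list above, here is a minimal sketch. The helper names elementChildren and textOf are made up for this example, and it assumes each <character> element holds plain text only.

```javascript
// nodeType values from the DOM specification.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Return only the element children of `parent`,
// skipping whitespace text nodes, comments and so on.
function elementChildren(parent) {
  var result = [];
  for (var i = 0; i < parent.childNodes.length; i++) {
    var child = parent.childNodes[i];
    if (child.nodeType === ELEMENT_NODE) {
      result.push(child);
    }
  }
  return result;
}

// Concatenate the values of the text-node children of `element`.
function textOf(element) {
  var text = '';
  for (var i = 0; i < element.childNodes.length; i++) {
    var child = element.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      text += child.nodeValue;
    }
  }
  return text;
}
```

In a browser you would pass in something like xmlDoc.documentElement, where xmlDoc is whatever your loading code hands you. If you only care about one element type, xmlDoc.getElementsByTagName('character') gets you the elements directly and never shows you the whitespace nodes in between.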

05-11-2006, 10:42 PM
It is not just Mozilla that keeps the whitespace. ALL browsers except intranet exploder do the same. IE is the non-standard one.

05-11-2006, 10:51 PM
- You could also do a simple check on each node for nodeType to tell whether it is an element or not, and jump to the next one if it isn't.
The nodeType values are:
1 = element
2 = attribute (This will come through as either a 2 or undefined for backwards compatibility)
3 = text node
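A text node is not automatically a whitespace node, so the nodeType check is often paired with a test on the node's value. The following is only a sketch: isWhitespaceOnlyText is a hypothetical helper name, and it treats space, tab, carriage return and line feed as whitespace, which are the characters XML itself counts as white space.

```javascript
// nodeType value for text nodes, from the DOM specification.
var NODE_TEXT = 3;

// True when `node` is a text node whose value consists only of
// XML whitespace (space, tab, carriage return, line feed).
// An empty text node also counts as whitespace-only here.
function isWhitespaceOnlyText(node) {
  return node.nodeType === NODE_TEXT && /^[ \t\r\n]*$/.test(node.nodeValue);
}
```

Inside a loop over childNodes you can then skip any node for which this returns true and still keep text nodes that carry real content.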

05-11-2006, 10:57 PM
Fine... IE is the exceptional one.

Fine... there are reasons to read in the white space.

It's even FINE that it DEFAULTS to reading in the whitespace!

But if there are such good arguments either way, then why is there no simple option to tell the parser how to handle whitespace?? Or is there?

05-11-2006, 11:54 PM
But if there are such good arguments either way, then why is there no simple option to tell the parser how to handle whitespace??
Well, the good arguments either way apply to HTML. And that is because, with the HTML engine, the browser knows where text nodes may or may not appear.

However, the XML engine doesn't know. The XML application may, or may not, allow text node children, but everything that has to do with the XML application happens after the XML engine has already parsed the XML. Take XHTML for example: Sure, it has the same whitespace treatment as HTML, and the browser knows this. But, the browser also uses the XML engine for all other XML content types, and they may have different whitespace treatment. So the engine is built so as to retain all the data.

The engine builds either the DOM itself, or an internal structure that can be interfaced through the DOM. Iew does the latter, and that's why its DOM works differently: the DOM is not its internal representation. The internal representation keeps the whitespace (since it allows setting arbitrary elements to display: inline, it has to keep the whitespace), but the DOM doesn't.
Or is there?
Well, in at least some XML engines there probably is such an option for the program that embeds the XML engine. But that is a compile-time option, I would wager, which means the script is stuck with what the program is built to use.

Just use one of the methods I listed above and work around it. It's really not that much of a roadblock unless you have really high demands on the performance of the script.

05-12-2006, 01:01 AM
Well, I have already found a script that removes the whitespace nodes... I will use that. I just wanted to be sure that I wasn't missing an option to be passed to the parser. It appears that I'm not...
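For anyone finding this thread later, a whitespace-removing script of that kind might look roughly like the sketch below. This is not the script I found, just an illustration, and stripWhitespaceNodes is a made-up name. It assumes whitespace-only text nodes are never significant anywhere in the document, which, as explained above, is not true for every XML language.

```javascript
// Recursively remove text nodes that contain only whitespace.
// Assumes such nodes are never significant in this document.
function stripWhitespaceNodes(node) {
  // Walk backwards so that removing a child does not shift
  // the indexes of the children still to be visited.
  for (var i = node.childNodes.length - 1; i >= 0; i--) {
    var child = node.childNodes[i];
    if (child.nodeType === 3 && /^[ \t\r\n]*$/.test(child.nodeValue)) {
      node.removeChild(child);          // whitespace-only text node
    } else if (child.nodeType === 1) {
      stripWhitespaceNodes(child);      // descend into elements
    }
  }
  return node;
}
```

Run it once on the document element right after loading, and childNodes will contain only the elements and the text that actually has content.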

However this still seems like a ridiculous problem. If the need was anticipated the option could easily exist... instead everyone has to do it all by hand. I guess I should just be happy that I'm not using a large XML file where removing whitespace nodes would reduce performance heavily. If that were the case I think I would have to resort to a custom parser.