
View Full Version : Cleaning useless whitespace in Mozilla DOM


Alex Vincent
09-29-2002, 02:50 AM
const notWhitespace = /\S/

function cleanWhitespace(node) {
  for (var x = 0; x < node.childNodes.length; x++) {
    var childNode = node.childNodes[x]
    if ((childNode.nodeType == 3)&&(!notWhitespace.test(childNode.nodeValue))) {
      // that is, if it's a whitespace text node
      node.removeChild(node.childNodes[x])
      x--
    }
    if (childNode.nodeType == 1) {
      // elements can have text child nodes of their own
      cleanWhitespace(childNode)
    }
  }
}

document.addEventListener("load", function() {
  cleanWhitespace(document)
}, true)


This script is intended to remove whitespace text nodes from a document. These nodes show up far more often than we want to admit, and lead to a DOM that differs between Mozilla and IE.

Make sure you use this only in documents where whitespace is expendable. XHTML documents are among these, as are MathML expressions and SVG images.

Theoretically, whitespace can be significant in some XML documents.
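
One way to honour the caveat above is to let the caller say which elements must keep their whitespace. The sketch below is a variant of cleanWhitespace, not the script as posted: the keepWhitespace callback and the PRE/TEXTAREA predicate are assumptions added for illustration. It also walks the children backwards, so a removal never shifts the index of a node that has not been visited yet.

```javascript
var notWhitespace = /\S/;

// Variant of cleanWhitespace. keepWhitespace is a hypothetical callback:
// it receives an element and returns true if that subtree must be left alone.
function cleanWhitespaceExcept(node, keepWhitespace) {
  // Iterate backwards so removing a child does not disturb unvisited indexes.
  for (var i = node.childNodes.length - 1; i >= 0; i--) {
    var child = node.childNodes[i];
    if (child.nodeType === 3 && !notWhitespace.test(child.nodeValue)) {
      node.removeChild(child); // whitespace-only text node
    } else if (child.nodeType === 1 && !(keepWhitespace && keepWhitespace(child))) {
      cleanWhitespaceExcept(child, keepWhitespace);
    }
  }
}

// Example predicate (an assumption): treat PRE and TEXTAREA as significant.
function isPreformatted(element) {
  return /^(pre|textarea)$/i.test(element.nodeName);
}
```

In a page this might be called as cleanWhitespaceExcept(document, isPreformatted) once the document has loaded.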

jkd
09-29-2002, 03:14 AM
document.addEventListener('load', function() {
  var treeWalker = document.createTreeWalker(document, NodeFilter.SHOW_TEXT, {
    acceptNode: function(node) {
      return /\S/.test(node.nodeValue) ? NodeFilter.FILTER_REJECT : NodeFilter.FILTER_ACCEPT
    }
  }, false);

  while (treeWalker.nextNode())
    treeWalker.currentNode.parentNode.removeChild(treeWalker.currentNode);

}, true);


Behold the awesomeness of DOM2 Traversal. :)

Alex Vincent
10-01-2002, 03:48 AM
I think I like mine better, as it can be much easier modified to work in IE browsers as well. We just change the const to a var, and use window.onload instead of document.addEventListener.

jkd
10-01-2002, 04:30 AM
Originally posted by Alex Vincent
I think I like mine better

Well naturally ;)


, as it can be much easier modified to work in IE browsers as well.


And the purpose would be...? IE only has empty text nodes when you programmatically create them through createTextNode().


We just change the const to a var, and use window.onload instead of document.addEventListener.

i.e. make the code bad. :p

Alex Vincent
10-01-2002, 05:03 AM
Originally posted by jkd

And the purpose would be...?

To avoid browser-sniffing code. :cool:

Note for all the spectators: Jason and I have been debating code for years. We love to present alternatives to each other's code and arguments.

Of course, he never admits that I can occasionally code better than he can...

jkd
10-01-2002, 05:25 AM
Originally posted by Alex Vincent
Of course, he never admits that I can occasionally code better than he can...

The exact opposite could be said of you. :p

I just don't like the thought of recursively calling cleanWhitespace on every node in the document. Not only do you get n steps where n is the number of nodes, but in each n, you iterate through its child nodes, which is some exponential growth of the number of calculations the code needs to go through.

Using treewalker (I have no idea how it works internally), it should just take a little bit longer to initialize the object, then perform n iterations, where n is the number of text nodes - much lower than the previous n.
Of course, this is in the case of a static DOM - because it is all dynamic in Mozilla, I'm sure this algorithm is slightly more than an n one, but I believe still less than n^(some power).

On average, I believe the TreeWalker solution would prove to be faster - but who feels like calling the two different versions a few hundred times, and averaging them out? ;)
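
The comparison proposed here could be run with a small harness like the one below. It is only a sketch: averageMillis and its parameters are names made up for illustration, and Date.now() has coarse resolution, so the figures would be rough. Because both cleaners modify the document, each run needs a fresh copy of the tree, which the hypothetical setup callback is meant to provide.

```javascript
// Run task(setup()) `runs` times and return the mean wall-clock time in ms.
// setup builds a fresh input for every run and is excluded from the timing.
function averageMillis(setup, task, runs) {
  var total = 0;
  for (var i = 0; i < runs; i++) {
    var input = setup();
    var start = Date.now();
    task(input);
    total += Date.now() - start;
  }
  return total / runs;
}
```

For example, averageMillis(function () { return fragment.cloneNode(true); }, cleanWhitespace, 200), where fragment is whatever test subtree you choose to measure with.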

beetle
10-14-2002, 08:34 PM
To the contrary, IE DOES create some empty text nodes...

Ever try placing what should be two horizontally adjacent images into the HTML?

<img src="pic1.jpg">
<img src="pic2.jpg">

IE renders a space between the two this way. You have to do this:

<img src="pic1.jpg"><img src="pic2.jpg">

to eliminate the space...
In short, IE makes them too...just not nearly as many as Gecko.

P.S. I have my own whitespace cleaner that I made a bit ago (very similar, but then, wouldn't it be?)...I've used it quite extensively and even on large HTML pages I've never seen it take longer than .5 secs. (AMD 650Mhz) I think even in today's broadband plentiful internet world, most people are patient enough for that.

jkd
10-14-2002, 08:48 PM
Originally posted by beetle
I think even in today's broadband plentiful internet world, most people are patient enough for that.

Has nothing to do with the Internet connection, rather, the computing power at its disposal.

Believe it or not, modifying the DOM of a page on the fly takes a relatively large number of operations. (Update the objects, notify the renderer something has changed, rerender the page, etc. And each of those isn't necessarily efficient or simple.)

beetle
10-14-2002, 08:55 PM
Originally posted by jkd
Has nothing to do with the Internet connection, rather, the computing power at its disposal.

I know that...I intended to use it as a comparison for the speed of operations versus what people are willing to wait for. Even on broadband connections quite a few pages take a second or two to access, so another .5sec or less is no biggie.

whammy
10-17-2002, 01:58 AM
Just curious... why not:

const notWhitespace = /\S+/

?

jkd
10-17-2002, 02:11 AM
You know it is no longer an empty text node when it can match \S just once. Using \S+ is unnecessary, as it doesn't matter if it has more than one nonwhitespace character or not.
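
A quick check of that point: test() only reports whether a match exists, so /\S/ and /\S+/ give the same answer for every string. The helper name and the sample strings below are arbitrary choices for illustration.

```javascript
// Returns true if the text contains no non-whitespace character at all.
function isWhitespaceOnly(text) {
  return !/\S/.test(text);
}

var samples = ["", " ", "\n\t  ", "a", "  a  ", "\n<b>\n"];
var agree = samples.every(function (s) {
  return /\S/.test(s) === /\S+/.test(s);
});
console.log(agree); // true
```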

whammy
10-17-2002, 04:23 AM
Thanks for clarifying that! That makes sense, I think... since a space is usually if not always defined as a string... I assume from your answer that the empty text node always consists of only one or more spaces or line feed characters? :D

jkd
10-17-2002, 04:54 AM
Originally posted by whammy
I assume from your answer that the empty text node always consists of only one or more spaces or line feed characters? :D

Generally newline characters, tabs, and spaces. (Whatever you use to pretty print your markup)

whammy
10-18-2002, 03:10 AM
That's about what I figured... so pretty much whatever matches /\s+/ if you were using a regular expression? Actually that's what I was trying to convey by my original post, but perhaps I wasn't very clear. :D
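
One caution on using /\s+/ as the test: unanchored, it matches any string that merely contains whitespace, so it cannot by itself identify a whitespace-only node. Anchoring it, or negating /\S/ as the original script does, gives the intended result. A small sketch (the function names are made up for illustration):

```javascript
function containsWhitespace(text) { return /\s+/.test(text); }   // unanchored
function onlyWhitespace(text)     { return /^\s+$/.test(text); } // anchored
function noNonWhitespace(text)    { return !/\S/.test(text); }   // the original test

console.log(containsWhitespace("some text")); // true, yet not a blank node
console.log(onlyWhitespace("some text"));     // false
console.log(onlyWhitespace("\n\t  "));        // true
console.log(noNonWhitespace("\n\t  "));       // true
```

The anchored and negated forms differ only on the empty string, which /^\s+$/ rejects and !/\S/ accepts.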

WA
10-18-2002, 11:12 AM
Sometimes it takes a while before you realize just how useful a code snippet is :). I'm currently playing around with using the DOM to retrieve a XML file, and the above really came in handy in getting a consistent document tree across browsers within the XML file. BTW, I resorted to using Alex's code, for sheer ease of legibility.

Is there a logic behind Mozilla/NS inserting whitespaces into a document in such a manner? It seems to accomplish nothing but complicate matters.

beetle
10-18-2002, 02:36 PM
Originally posted by WA
Is there a logic behind Mozilla/NS inserting whitespaces into a document in such a manner? It seems to accomplish nothing but complicate matters.

Well, jkd will tell you that it's nice because it 'preserves' the code format. Although this is true, I have serious reservations about how 'useful' that actually is.

jkd
10-18-2002, 06:50 PM
1. It is correct.

2. It preserves the format of the code. IE mangles all code it parses. Mozilla doesn't. This is useful when utilizing future DOM3 Load and Save modules to dynamically load or generate XML files. You'll be assured that the structure isn't modified from the original one you loaded.

3. It is not any harder to deal with if you are expecting it.

beetle
10-18-2002, 07:15 PM
1. Agree

2. Agree, but not a concern to me now.

3. Disagree. Doable, yes. Just as easy? No, because it takes extra time to figure out where these textnodes are going to show up, and just for that time commitment it becomes 'harder' (for lack of a better word) I mean, Vladdy wrote an entire script (http://www.vladdy.net/WebDesign/DOM_TreeViewer.html) just to help him solve these sort of inconsistencies. His tool makes dealing with this easier, but knowledge of the problem does not make it 'just as easy'

<div id="parent" onClick="alert(this.childNodes[1]);">
<span id="child1">text</span>
<span id="child2">text</span>
</div>

As far as I'm concerned, the above reference should ALWAYS find span#child2, regardless of whether or not there are valid (albeit empty) textNodes there. References like the one above should obey the structure of the HTML objects and valid textNodes but ignore 'empty' ones. In short, I'd rather not have to strip them out...It would be nice if Gecko would recognize them only when it's necessary or requested (such as the Save and Load interface you speak of)
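
Short of changing the parser, the behaviour asked for here can be approximated in script by indexing over element children only and leaving the text nodes where they are. The helper name elementChildren is an assumption for illustration, not part of any browser API discussed in this thread.

```javascript
// Return an array of the element children of node (nodeType 1),
// skipping text nodes, comments and anything else.
function elementChildren(node) {
  var result = [];
  for (var child = node.firstChild; child; child = child.nextSibling) {
    if (child.nodeType === 1) {
      result.push(child);
    }
  }
  return result;
}
```

With the markup above, elementChildren(this)[1] would be span#child2 whether or not whitespace text nodes are present.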

jkd
10-18-2002, 07:26 PM
I find it interesting that in IE, the Adobe SVG Plugin also parses indentation as empty text nodes. This leads me to prefer the idea of parsing them, because you have two separate XML parsers found in web browsers (the Adobe SVG one, and expat in Mozilla) that agree, as opposed to just one (MSXML in IE) that doesn't.

And expat is basically an industry standard, and is accessible in most scripting languages, and C/C++. And as aforementioned, Mozilla also uses expat.

WA
10-18-2002, 10:25 PM
When it comes to technologies, IE historically seems to have been more about ease of use and dare I say, practicality, while NS likes to take the more extreme route, whether in NS4 (extremely poor), or NS6 (extremely rigid in its application of standards). I don't want to argue which is better, though I will say getting rid of all those white spaces can become an agitating and resource hogging task. And as beetle says, unless one has a thorough understanding of the document tree, the task can also be unpredictable. IE's treatment of the document is more in line with logical thinking.

jkd
10-18-2002, 11:52 PM
Originally posted by WA
I will say getting rid of all those white spaces can become an agitating and resource hogging task.

Then don't get rid of them, and expect a more standard XML parser from IE.

Alex Vincent
10-19-2002, 03:09 AM
Originally posted by WA
Is there a logic behind Mozilla/NS inserting whitespaces into a document in such a manner? It seems to accomplish nothing but complicate matters.

http://bugzilla.mozilla.org/show_bug.cgi?id=26179

This bug has a very lengthy debate on whitespace text nodes in XML. Basically, mozilla.org believes there is no justification in a generic XML document to remove whitespace.

I agree with them: the duty of removing whitespace should be placed on the webpage author. Which is why I wrote the script.

Here's a lengthier article on the subject:

http://mozilla.org/docs/dom/technote/whitespace/

beetle
10-19-2002, 03:39 AM
Good reading, Alex. Thanks for that. I like what this person had to say...

------- Additional Comment #3 From lhylan 2000-05-31 23:58 -------
I'm reopening this one because the more I think about it, the more I think exposing whitespace as a text node is the wrong thing to do. Why? Because you can't do it consistently. Text nodes are invalid in between table rows and table cells, for example -- so the same newline that appears after a </td> cannot be represented as a text node the way it can be if it appears after a </p>. This harms the roundtripping argument, IMHO. Also, it seems to me that whitespace shouldn't affect the structure of a document; I consider an HTML document with no linebreaks to be structurally identical to one with all kinds of whitespace between tags. If newlines are represented as text nodes, the two documents would be structurally different.

whammy
10-19-2002, 03:43 AM
Ok.. say I use 5 spaces instead of a tab character when I code. Since many EDITORS screw up tabs. And I use a Line Feed character as the (standard XML) return character.

Good?

It's not only annoying, but if another developer uses tabs, the resulting code can be practically unreadable due to differences in text editors.

Sorry not exactly on topic...

beetle
10-28-2002, 08:50 PM
Differences in text editors? I don't get that....

Maybe you mean difference in settings. In every text-editor I've worked with you can specify the size of your tabulation...for example...this forum uses 8-length tabs for the CODE tags, but both my text-editor and Dreamweaver use 4-length tabs....

So, if I want something to post here identical to how i see it in my editor, I need to change my tab setting to 8.

is that what you mean?

whammy
10-29-2002, 12:37 AM
Yeah... exactly. If they're different, and you don't use spaces, I've seen some code that looks perfectly formatted in one developer's editor end up looking like:


This
is
some
formatted

code
but
it's
messed
up!


that ^ in mine (and others).

beetle
10-29-2002, 01:14 AM
Whammy...

I see your point...but using spaces adds to the filesize: every tab replaced by five spaces costs five bytes instead of one. At times this can be negligible, other times not. If you substituted five spaces for every tab in some code, you could increase filesize significantly.

The data is the same from screen to screen, only the output is different. The fundamental makeup of the code should NOT have to be changed to accommodate everyone's editor. Try this, I do it on any project that I know is gonna be shared or dispersed...at the top of the page...

<!-- This document is best viewed if you set your tab-spacing to 4 -->

Or something similar.

beetle
11-14-2002, 08:15 AM
Man oh man oh man. How did I (and so many people here) miss the normalize() method? :rolleyes: :rolleyes: I did a search (here, at sitepoint AND webxpertz) and came up with nothing on this method. No-one is talking about it.

Now, from my preliminary testing, this method doesn't remove visibly intruding textNodes like the functions in this post do, but any reference you make (firstChild, parentNode, etc) after enacting normalize() will grab the expected node, and not any intermediary textNode.

Again, wow. This makes my life easier :D

normalize() @
MSDN (http://msdn.microsoft.com/workshop/author/dhtml/reference/methods/normalize.asp)
mozilla.org (http://www.mozilla.org/docs/dom/domref/dom_el_ref50.html#1028981)
W3C (http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/core.html#ID-normalize)

jkd
11-14-2002, 03:50 PM
I was unsure if normalize() normalized the immediate child nodes of the node it is called on, or all child nodes within the depth of the subtree.

document.documentElement.normalize()

if it works on the entire subtree.

beetle
11-14-2002, 04:19 PM
My tests thus far indicate it normalizes all child nodes within the subtree.

Alex Vincent
11-16-2002, 04:12 AM
That's what the DOM spec says it should do. :)

Alex Vincent
06-17-2003, 11:53 PM
Originally posted by jkd

document.addEventListener('load', function() {
  var treeWalker = document.createTreeWalker(document, NodeFilter.SHOW_TEXT, {
    acceptNode: function(node) {
      return /\S/.test(node.nodeValue) ? NodeFilter.FILTER_REJECT : NodeFilter.FILTER_ACCEPT
    }
  }, false);

  while (treeWalker.nextNode())
    treeWalker.currentNode.parentNode.removeChild(treeWalker.currentNode);

}, true);


Behold the awesomeness of DOM2 Traversal. :)

EVIL, EVIL, BROKEN CODE!!!

And besides, it's hard to read.

Jason, your code just does not work.
Once you remove the node, you've got to reset the treeWalker.currentNode to a node still in the document.

Also, for some XML languages, whitespace nodes matter. XHTML is not one of them, except in the <pre/> element (which I think I can get away with removing whitespace nodes from anyway.)

(Note: this is something I forgot for my own code as well.)

This code works.


const nsIDOMNodeFilter = Components.interfaces.nsIDOMNodeFilter;
window.addEventListener("load", function(evt) {
  var filter = {
    acceptNode: function(node) {
      if ((/\S/.test(node.nodeValue) || (node.parentNode.namespaceURI != "http://www.w3.org/1999/xhtml"))) {
        return nsIDOMNodeFilter.FILTER_SKIP;
      }
      return nsIDOMNodeFilter.FILTER_ACCEPT;
    }
  }

  var treeWalker = document.createTreeWalker(document, nsIDOMNodeFilter.SHOW_TEXT, filter, true);
  while (treeWalker.nextNode()) {
    treeWalker.currentNode.parentNode.removeChild(treeWalker.currentNode);
    treeWalker.currentNode = document;
  }
  var output = document.getElementById("output");
  var node = document.documentElement.firstChild;
  output.appendChild(document.createTextNode(node.nodeType));
}, true);


And to think I was going to use that code you gave me in a DevEdge article I'm writing... :( Read the spec, Jason.
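
An alternative to resetting currentNode after each removal is to collect the matching nodes first and remove them once the walk is finished, so the walker never stands on a detached node. This is a sketch of that pattern, not the code from either post: the function name is made up, the document is passed in as a parameter, the XHTML namespace check is left out, and the numeric constants are the NodeFilter values from the DOM Traversal spec (SHOW_TEXT = 4, FILTER_ACCEPT = 1, FILTER_SKIP = 3).

```javascript
function removeWhitespaceTextNodes(doc) {
  var SHOW_TEXT = 4, FILTER_ACCEPT = 1, FILTER_SKIP = 3;
  var filter = {
    acceptNode: function (node) {
      // Accept only text nodes with no non-whitespace character.
      return /\S/.test(node.nodeValue) ? FILTER_SKIP : FILTER_ACCEPT;
    }
  };
  var walker = doc.createTreeWalker(doc, SHOW_TEXT, filter, false);

  // Pass 1: walk the untouched tree and remember what to remove.
  var doomed = [];
  while (walker.nextNode()) {
    doomed.push(walker.currentNode);
  }
  // Pass 2: remove, now that traversal is complete.
  for (var i = 0; i < doomed.length; i++) {
    doomed[i].parentNode.removeChild(doomed[i]);
  }
  return doomed.length; // number of nodes removed
}
```

In a page it might be called as removeWhitespaceTextNodes(document) from a load handler.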

jkd
06-18-2003, 02:05 AM
You went thread digging just to blast me for code I hadn't tested? Now you're getting desperate :p.

liorean
02-21-2004, 12:22 PM
Just a question, but how would you handle significant whitespace between nodes, such as "<a ...></a> <em>...</em> <img .../> <strong>...</strong>"? If I'm not mistaken, this function will remove those whitespaces.

Alex Vincent
02-23-2004, 12:29 AM
liorean, for some reason I don't see anything significant about that whitespace. Unless you're talking about single spaces.

The code does remove that, and if that's what you're referring to, then yes, that's probably a bug. Easily fixed, though.


acceptNode: function(node) {
  if ((/\S/.test(node.nodeValue) || (node.parentNode.namespaceURI != "http://www.w3.org/1999/xhtml"))) {
    return nsIDOMNodeFilter.FILTER_SKIP;
  }
  if (node.nodeValue.length == 1) {
    return nsIDOMNodeFilter.FILTER_SKIP;
  }
  return nsIDOMNodeFilter.FILTER_ACCEPT;
}

hemebond
11-26-2004, 09:51 PM
So does the normalize (sic) function actually work? I've never managed to get it to work. Is there a trick to it?

liorean
11-27-2004, 12:02 AM
It works, kinda. It just isn't suitable for this purpose. What it does is to merge one or several #text nodes into a single one. It doesn't remove any node except for when merging its value into another node.
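
To make the distinction concrete, here is a rough script-level imitation of what normalize() does to one node's direct children, following the DOM Level 2 description (merge adjacent text nodes, drop empty ones). It is an illustration only, not how any browser implements it, and the function name is made up. A whitespace-only node sitting between two elements has no text neighbour to merge into and is not empty, so it survives.

```javascript
// Imitates normalize() for the direct children of node (no recursion).
function normalizeChildrenSketch(node) {
  var child = node.firstChild;
  while (child) {
    var next = child.nextSibling;
    if (child.nodeType === 3) {
      if (child.nodeValue.length === 0) {
        node.removeChild(child);            // empty text node: dropped
      } else if (next && next.nodeType === 3) {
        child.nodeValue += next.nodeValue;  // merge the following text node
        node.removeChild(next);
        continue;                           // re-check child against its new neighbour
      }
    }
    child = next;
  }
}
```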

Passin Thru
11-27-2004, 05:54 PM
Re: jkd's post #6

There is some kind of growth going on in the function, but not for the reason you put forward. There are n nodes in total, and it visits them all once.

The TreeWalker uses much the same algorithm.

The reasons why it will be faster are that

1) it is "internal", not relying itself on script.

2) It is a proper list iterator. Collections are probably implemented internally as some weirdo data type like a linked list (Have you ever noticed that you can loop an array faster than you can loop a collection ?).

When moving through each flat collection (childNodes) iterator object keeps a reference to the current link, and moves directly on from there. Meanwhile, the scripted function accesses each childNode by index. This appears to be "direct access" - but it probably isn't. Internally, the list must be searched from [0] up to the required index each time.

I suppose that must lead to an arithmetic progression. Looping a collection by index gets more inefficient, compared with an iterator object, the longer the collection gets, to the tune of

n(n-1)/2 *n --> (n-1)/2

It could be that using neighbour relationships, instead of indices could remove this issue. So maybe trying to walk using node.nextSibling.
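
The neighbour-based walk suggested here could look like the sketch below. It is a variant for illustration, not the posted script, and the function name is made up. The next sibling is read before a node is removed, because a detached node no longer has one.

```javascript
function cleanWhitespaceBySibling(node) {
  var child = node.firstChild;
  while (child) {
    var next = child.nextSibling; // remember before any removal
    if (child.nodeType === 3 && !/\S/.test(child.nodeValue)) {
      node.removeChild(child);
    } else if (child.nodeType === 1) {
      cleanWhitespaceBySibling(child);
    }
    child = next;
  }
}
```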

Then again, all this could all be 'ked up.

liorean
11-27-2004, 08:41 PM
Originally posted by Passin Thru
There is some kind of growth going on in the function, but not for the reason you put forward. There are n nodes in total, and it visits them all once.

The TreeWalker uses much the same algorithm.

The reasons why it will be faster are that

1) it is "internal", not relying itself on script.

No more internal than the item() syntax is. They're both layers above an array of references, and TreeWalker has way more overhead in terms of scripting since it has to create closures while the loop doesn't.

Originally posted by Passin Thru
2) It is a proper list iterator. Collections are probably implemented internally as some weirdo data type like a linked list (Have you ever noticed that you can loop an array faster than you can loop a collection ?).

Of course, arrays are presumably more compact, don't need to be "live", don't have to carry synchronisation code etc.

Originally posted by Passin Thru
When moving through each flat collection (childNodes) iterator object keeps a reference to the current link, and moves directly on from there. Meanwhile, the scripted function accesses each childNode by index. This appears to be "direct access" - but it probably isn't. Internally, the list must be searched from [0] up to the required index each time.

No, it doesn't. The access times for object 0 and object length-1 (http://testsuite.liorean.net/dom/domcollection-access-speeds.html) are about equal over a sequence of 5000 accesses.

Originally posted by Passin Thru
I suppose that must lead to an arithmetic progression. Looping a collection by index gets more inefficient, compared with an iterator object, the longer the collection gets, to the tune of

n(n-1)/2 *n --> (n-1)/2

As I've just shown, it doesn't.

Originally posted by Passin Thru
It could be that using neighbour relationships, instead of indices could remove this issue. So maybe trying to walk using node.nextSibling.

Actually, the only place I see TreeWalkers or simple node traversal as faster than indexed access is in either concurrent handling (which neither is well suited to, really) or in precompiled static arrays, which DOMCollections are NOT, according to specification. TreeWalkers still have to walk through each element (so it doesn't get the benefit of just travelling #text nodes).






Alex: Whitespace is significant in the following cases:
- Formatting preserved contexts.
- Elements with CDATA content.
- Elements with #PCDATA content.
In both the latter cases, the normalised whitespace is significant (or is it only #PCDATA normalisation takes place? I'll have to go read the XML spec again...). In the former case, all whitespace is significant. In cases with element only content models, whitespace is not significant. Thus, an implementation can only know when it may skip whitespace nodes if it's got a knowledge of the DTD. However, thanks to CSS, we can change the first point as necessary, so it doesn't make sense to not contain the text nodes in the DOM tree, even if the behavior itself is entirely up to the user agent - iew does not break any spec by not containing the source-formatting-only whitespace.

Passin Thru
11-28-2004, 02:02 PM
Thanks for that liorean. I wasn't 100% sure about my premises, so I didn't want to sound too certain. The motivation was that I thought the criticism of the script - that it was somehow walking further than it should - was unfair.

Interesting, the test that accesses a 'near' and a 'distant' collection member. I think I've actually done similar myself, then completely forgotten about it. Your explanation has opened some things up a little. I have tried using a JScript Iterator and found that it actually seems slower than simple indexed looping. I've been thinking that I perhaps could have used a more efficient control structure for the iterator, but your info confirms that I shouldn't bother using it at all apart from for objects that can't be enumerated any other way.

Javascript 'arrays' surely aren't arrays either, when it comes to access, internally speaking. They just happen to be more 'digit oriented'.

Using the 'correct' item(i) method does seem marginally slower than array-style access. Does anyone think that scripts that use the convenient approach will one day pay the price, and fail ?

NakedIntruder
11-24-2006, 08:28 PM
Hi all,
I've written a way of doing this which is hopefully smaller and faster.


var someElementRef = document.getElementsByTagName('body')[0];
someElementRef.innerHTML = someElementRef.innerHTML.replace(/\B\s\B|[\n\r\t]/g,'');

Most of the above is for the purpose of example, it's the regular expression doing the work.

What it's looking for is;

1) \B\s\B - a single whitespace character with no word boundary on either side
2) \n - a line feed anywhere
3) \r - a carriage return anywhere
4) \t - a tab anywhere
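
Since the regular expression works on the serialized markup rather than on nodes, its effect can be tried on plain strings. The sketch below wraps it in a function (the name is made up for illustration) and shows two side effects worth knowing about: a lone space between two tags is removed, which is the significant-whitespace case raised earlier in the thread, and a line break inside running text is deleted rather than turned into a space.

```javascript
function stripMarkupWhitespace(html) {
  return html.replace(/\B\s\B|[\n\r\t]/g, '');
}

console.log(stripMarkupWhitespace('<div>\n\t<span>a b</span>\n</div>'));
// <div><span>a b</span></div>
console.log(stripMarkupWhitespace('<b>x</b> <i>y</i>'));
// <b>x</b><i>y</i>
console.log(stripMarkupWhitespace('one\ntwo'));
// onetwo
```

Note too that assigning to innerHTML rebuilds the subtree, so references to the old nodes, and any handlers attached to them by script, are lost.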


Thanks a lot,



Jamie Mason