View Full Version : HTML tag/node frequency statistics

Harry Armadillo
03-16-2005, 08:04 PM
A script I'm building needs to look at and fiddle with nearly every node in the document's body. To make it run as fast as possible, the switch/case statement (that decides what to do based on what kind of node it is) needs to be in the optimum order - common to rare.

For example: I don't care about #comment nodes. Are they common enough that a case '#comment': break; will save time? Or are they uncommon enough that it'll be faster to let them fall out the bottom of the switch (despite having to check them against more case statements)?

Does anyone know where I can find (or how can I generate) statistics on the relative frequency of nodes? I can examine pages myself one at a time (I have a bookmarklet that pops up a window with how many of each node type), but it would be too tedious to manually examine enough pages for good stats on the rarer tags.
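
To make the question concrete, here is a minimal sketch of the kind of walker being described: a recursive tree walk whose switch lists cases from common to rare, with an explicit early exit for #comment. The names walk, stats and fakeBody, and the particular case order, are assumptions for illustration and not taken from the actual script. Plain objects stand in for DOM nodes so it also runs outside a browser.

```javascript
// Hypothetical walker: cases are listed from (assumed) common to rare.
function walk(node, stats) {
  switch (node.nodeName.toLowerCase()) {
    case '#text':      // assumed to be the most common node type
      stats.text++;
      break;
    case '#comment':   // explicit early exit for nodes we don't care about
      break;
    case 'a':
      stats.links++;
      break;
    default:           // everything else falls out the bottom
      stats.other++;
  }
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) walk(kids[i], stats);
  return stats;
}

// Plain objects standing in for DOM nodes.
var fakeBody = { nodeName: 'BODY', childNodes: [
  { nodeName: '#text' },
  { nodeName: '#comment' },
  { nodeName: 'A', childNodes: [ { nodeName: '#text' } ] }
] };
var result = walk(fakeBody, { text: 0, links: 0, other: 0 });
```

In a browser, walk(document.body, ...) would do the same over the real tree. Whether the '#comment' case earns its place depends on how common comments are compared with the cases listed after it.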

03-16-2005, 09:03 PM
Is this of any assistance?



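<script language="JavaScript" type="text/javascript">
// Tag names to count, and the resulting [tagName, count] pairs.
var EAry=new Array('IMG','INPUT');
var PAry=new Array();

function Priority(){
 for (var i=0;i<EAry.length;i++){
  PAry[i]=new Array();
  // The rest of this loop body is missing from the archived post. What follows
  // is an assumed reconstruction: store the tag name and how often it occurs.
  PAry[i][0]=EAry[i];
  PAry[i][1]=document.getElementsByTagName(EAry[i]).length;
 }
 PAry.sort(tsoSortNumeric);
 // Assumed: show the sorted pairs in the first field of the Show form.
 document.forms['Show'].elements['Show1'].value=PAry.join(' | ');
}

// Sort [tagName, count] pairs by count, ascending; non-numeric counts compare equal.
function tsoSortNumeric(tso0,tso1){
 var tsoA=tso0[1], tsoB=tso1[1];
 if (isNaN(tsoA) || isNaN(tsoB)){ return 0; }
 return tsoA-tsoB;
}
</script>
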

<body onload="Priority();" >

<img src="111.gif" width="10" height="10">
<img src="111.gif" width="10" height="10">
<img src="111.gif" width="10" height="10">
<img src="111.gif" width="10" height="10">
<img src="111.gif" width="10" height="10">
<img src="111.gif" width="10" height="10">
<img src="111.gif" width="10" height="10">

<script> vic=0; </script>
<form name=Show id=Show style="position:absolute;visibility:visible;top:450px;left:0px;" >
<input size=100 name=Show1 >
<input size=10 name=Show2 >
<input size=10 name=Show3 >
<input size=10 name=Show4 >
<input size=10 name=Show5 >
<input size=10 name=Show6 >
</form>
</body>



Harry Armadillo
03-16-2005, 11:52 PM
Yeah, my bookmarklet does basically the same thing, only more so.
javascript:(function(){var total=new Array();function sortThem(a,b){return(b.count-a.count)}function countObj(nodeName){this.nodeName=nodeName;this.count=1;}function totalNodes(obj){var i=total.length;var n=obj.nodeName.toLowerCase();dude:{while(i--){if(n==total[i].nodeName){total[i].count++;break dude;};}total[total.length]=new countObj(n);}for(var i=0;i<obj.childNodes.length;i++)totalNodes(obj.childNodes[i])}totalNodes(document.body);total.sort(sortThem);var w=window.open('','_blank');for(var i=0;i<total.length;i++)w.document.write(total[i].count+" "+total[i].nodeName+"<br>");w.document.close();})()
Readable version:
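var total=new Array();
// sort helper: highest count first
function sortThem(a,b){ return(b.count-a.count); }
function countObj(nodeName){ this.nodeName=nodeName; this.count=1; }
function totalNodes(obj){
 var i=total.length;
 var n=obj.nodeName.toLowerCase();
 dude:{
  // bump the count if this node name has been seen before...
  while(i--){
   if(n==total[i].nodeName){ total[i].count++; break dude; }
  }
  // ...otherwise start a new entry for it
  total[total.length]=new countObj(n);
 }
 for(var i=0;i<obj.childNodes.length;i++)
  totalNodes(obj.childNodes[i]);
}
totalNodes(document.body);
total.sort(sortThem);
var w=window.open('','_blank');
for(var i=0;i<total.length;i++)
 w.document.write(total[i].count+" "+total[i].nodeName+"<br>");
w.document.close();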
Which gives me something like this for this page:
819 #text
125 font
86 br
81 #comment
79 a
71 div
56 td
37 option
33 tr
22 img
21 strong
16 table
16 tbody
15 input
6 span
6 script
4 form
4 b
3 optgroup
2 select
2 p
2 code
2 hr
1 i
1 body
1 thead
That's great for a single page, but I need that sort of list for the 'average' page (or for the web as a whole...). I could generate that sort of list on a bunch of random pages (and have), but I don't have an easy way to total them over a ton of pages.

03-17-2005, 12:00 AM
Why don't you let it build the source code for a JavaScript object? Then you run it on twenty different sites and get twenty objects that you place in an array. Add together all nodes of the same kind to get a total, divide by twenty, and you have the average.
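
A rough sketch of that averaging step, assuming each page's counts have been pasted in as an object literal keyed by node name. The first literal reuses a few numbers from the listing earlier in the thread; the second is made up, and the names pages and averageCounts are hypothetical.

```javascript
// Hypothetical per-page results, pasted in by hand as object literals.
var pages = [
  { '#text': 819, 'br': 86, 'a': 79 },   // a few counts from the listing earlier in the thread
  { '#text': 400, 'a': 21, 'hr': 2 }     // made-up second page
];

// Sum each node type over all pages, then divide by the number of pages.
// A node type missing from a page counts as zero for that page.
function averageCounts(pages) {
  var totals = {}, name, i;
  for (i = 0; i < pages.length; i++) {
    for (name in pages[i]) {
      if (pages[i].hasOwnProperty(name)) {
        totals[name] = (totals[name] || 0) + pages[i][name];
      }
    }
  }
  for (name in totals) {
    if (totals.hasOwnProperty(name)) totals[name] = totals[name] / pages.length;
  }
  return totals;
}

var averages = averageCounts(pages);
```

Note that this averages raw counts, so big pages weigh more than small ones. Converting each page to percentages first would weigh every page equally.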

Harry Armadillo
03-17-2005, 12:31 AM
How do I do that on twenty pages from twenty different sites without triggering cross-site scripting warnings?

03-17-2005, 01:05 AM
There's a discussion about that kind of thing and XMLHTTP requests a few threads down, I think... (I'd probably just visit a bunch of sites in IE, and then read the files from the cache, using the Shell/FSO).

03-17-2005, 03:10 AM
Harry: You don't. Twenty sites is low enough to collect one object literal for each manually. Then you manually enter those into the source code of the script that calculates the averages.

You see, the time you've spent worrying about how to do it automatically is probably far more than doing it manually would have taken.

03-17-2005, 04:33 AM
Eliminate the root of the problem....

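var processNode = {};
processNode['font'] = function(node)
{ /* process font node */
};
processNode['p'] = function(node)
{ /* process paragraph node */
};

function doNode(node)
{ // nodeName is upper case for HTML elements, so normalise it before the lookup
  var name = node.nodeName.toLowerCase();
  // skip node types that have no handler instead of throwing
  if (processNode.hasOwnProperty(name)) processNode[name](node);
}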

03-17-2005, 06:03 AM
I ran a rough performance test on both solutions (hash and switch) and got interesting, contrasting results for IE6 and Firefox. In IE, the hash is faster than the switch; in FF it's the other way around. It's also interesting to note that FF is much faster (about twice as fast) than IE at processing the code. Here's the code and the results:


//for hash
var processNode = new Array();
processNode['font'] = function(node)
{ /* process font node */
};
processNode['p'] = function(node)
{ /* process paragraph node */
};
processNode['div'] = function(node)
{ /* process div node */
};

function doNode(node)
{ processNode[node](node);
}

function process(node){
var s = new Date();
for (var i=0;i<100000;i++){
doNode(node); // loop body missing from the archived post; assumed to be the call under test
}
var e = new Date();
var d = (e-s)/1000; // elapsed time in seconds
document.getElementById('output1').innerHTML+=d+' '+node+'<br />';
}

//for switch
function doNode2(node)
{
switch (node){
case 'font':processNode2(node);break;
case 'p':processNode2(node);break;
case 'div':processNode2(node);break;
}
}

function processNode2(node){
/* process node */
}

function process2(node){
var s = new Date();
for (var i=0;i<100000;i++){
doNode2(node); // loop body missing from the archived post; assumed to be the call under test
}
var e = new Date();
var d = (e-s)/1000; // elapsed time in seconds
document.getElementById('output2').innerHTML+=d+' '+node+'<br />';
}


<input type="button" value="hash" onclick="process(prompt('node?',''))" />
<div id="output1"></div>
<hr />
<input type="button" value="switch" onclick="process2(prompt('node?',''))" />
<div id="output2"></div>


Times in seconds for 100000 calls:

hash (IE)     switch (IE)   hash (FF)     switch (FF)
1.112 div     1.262 div     0.631 div     0.571 div
1.101 div     1.261 div     0.641 div     0.56 div
1.102 div     1.261 div     0.641 div     0.561 div
1.102 div     1.251 div     0.641 div     0.561 div
1.102 div     1.252 div     0.641 div     0.56 div
1.102 p       1.212 p       0.631 p       0.551 p
1.092 p       1.212 p       0.631 p       0.551 p
1.101 p       1.212 p       0.641 p       0.551 p
1.092 p       1.202 p       0.631 p       0.55 p
1.101 p       1.212 p       0.631 p       0.551 p
1.101 font    1.161 font    0.641 font    0.541 font
1.112 font    1.162 font    0.651 font    0.541 font
1.102 font    1.161 font    0.651 font    0.541 font
1.101 font    1.162 font    0.651 font    0.541 font
1.111 font    1.162 font    0.641 font    0.541 font

After I ran all the repetitions for the hash method, I refreshed the page and then ran the ones for the switch method, to keep the conditions even.

Harry Armadillo
03-17-2005, 06:33 AM
Vladdy...won't work. The parent/child/sibling/grandparent/cousin relationships count, so I gotta just walk the tree.

liorean, haven't you ever had a puzzle that you just had to crack? Do I need to spend six hours figuring out how to cut a script down from 600 to 500 ms of run time? No, but I am going to. :) Anyway, I found some code from your xml http postings - I didn't realize that IE5.5 would let you do out-of-domain xmlhttp requests.

As I type this, an ugly hunk of code is pulling urls from a list in a textarea, grabbing the html, chopping off the pieces I don't want, and dumping the rest into an iframe. Then a script totals up the different types of nodes and updates an output area.
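
The "chopping off the pieces I don't want" step could look something like this minimal sketch, assuming the only goal is to keep what sits between the body tags of the fetched source. The name extractBody is hypothetical, and a regular expression is only a crude stand-in for a real HTML parser, so broken markup will trip it up.

```javascript
// Hypothetical helper: keep only what sits between <body ...> and </body>.
// Falls back to the whole string when no body tags are found.
function extractBody(html) {
  var match = /<body\b[^>]*>([\s\S]*?)<\/body\s*>/i.exec(html);
  return match ? match[1] : html;
}

var sample = '<html><head><title>t</title></head>' +
             '<body class="x"><p>hi</p><!-- note --></body></html>';
var bodyOnly = extractBody(sample);
```

The returned string is what would then be written into the iframe for counting.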

IE5.5 seems to parse HTML into a slightly different tree than Firefox (not to mention how IE creates nodes from broken tags), so I still want to figure out how to do something similar in FF.

Meanwhile, it's nice to know that small is five times as common as blockquote and that #text is 185 times more common than hr.

03-17-2005, 12:36 PM
I still want to figure out how to do something similar in FF.

That may be documented on xulplanet (if not, just save the source files, and read them locally). ;)

Harry Armadillo
03-17-2005, 08:14 PM
Not being the sort of person who stops just because something may be pointless...I tried hashes.

The various functions all wanted different sets of parameters, so I ended up building an object containing the minimal set and passing a reference to it. With the extra overhead, hashes are a lot slower. Especially with the error handling needed for IE's habit of making pointless nodes from broken tags (an '/img' node? a 'C160DB3548BEA4' node?)

FWIW, this is the relative frequency I found in the bodies of a sample of 3021 pages:
31.2818 #text
9.5278 br
9.1822 a
7.4688 td
6.2378 font
4.7963 tr
3.9655 img
3.8765 p
3.5474 span
3.2721 b
2.2265 ! or #comment
2.1986 center
1.7305 tbody
1.6318 table
1.5344 div
0.9987 option
0.9682 nobr
0.8739 li
0.8541 input
0.5880 i
0.3109 strong
0.2909 spacer
0.2580 hr
0.2282 noscript
0.2156 script
0.2067 small
0.1776 area
0.1602 form
0.1547 u
0.1535 ul
0.1340 dd
0.1056 body
0.0782 sub
0.0738 em
0.0671 big
0.0645 h2
0.0585 h1
0.0515 select
0.0512 h3
0.0486 th
0.0451 dt
0.0430 blockquote
0.0397 code
0.0337 pre
0.0301 h4
0.0277 map
0.0203 wbr
0.0090 style
0.0086 meta
0.0086 h5
0.0086 dl
0.0081 label
0.0080 size
0.0077 ol
0.0070 h6
0.0048 optgroup
0.0041 tt
0.0041 iframe
0.0034 noindex
0.0032 s
0.0028 textarea
0.0028 link
0.0018 base
0.0016 textbox
0.0016 noembed
0.0016 address
0.0015 caption
0.0012 dir
0.0011 strike
0.0008 cite
0.0008 acronym
0.0006 fieldset
0.0005 frame
0.0004 thead
0.0004 ilayer
0.0004 frameset
0.0004 col
0.0003 nolayer
0.0003 layer
0.0002 object
0.0002 menu
0.0001 nowrap
0.0001 embed
0.0001 dfn
0.0001 dev
0.0001 colgroup
0.0001 blink
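
One way to put a table like this to work on the original question is to estimate the average number of case comparisons per node for a given case order. The sketch below is only a model: it assumes the engine tests the cases one after another from top to bottom, which real engines may or may not do, and the function name expectedComparisons and the two case orders are made up for illustration. Only a handful of frequencies from the table are used, so the result is relative to that subset.

```javascript
// Average number of case comparisons per node, ASSUMING the cases are tested
// one after another, top to bottom. freq maps node name -> relative frequency.
// Names missing from caseOrder are charged for every case, since they fall
// out the bottom of the switch.
function expectedComparisons(caseOrder, freq) {
  var weighted = 0, total = 0, name, pos, i;
  for (name in freq) {
    if (!freq.hasOwnProperty(name)) continue;
    pos = caseOrder.length;               // cost of falling through every case
    for (i = 0; i < caseOrder.length; i++) {
      if (caseOrder[i] === name) { pos = i + 1; break; }
    }
    weighted += freq[name] * pos;
    total += freq[name];
  }
  return total ? weighted / total : 0;
}

// A few of the frequencies from the table.
var freq = { '#text': 31.2818, 'br': 9.5278, 'a': 9.1822, 'td': 7.4688,
             '#comment': 2.2265, 'hr': 0.2580 };

// Hypothetical case orders: without and with an explicit '#comment' case.
var withoutCommentCase = expectedComparisons(['#text', 'br', 'a', 'td', 'hr'], freq);
var withCommentCase = expectedComparisons(['#text', 'br', 'a', 'td', '#comment', 'hr'], freq);
```

Comparing the two numbers shows the trade-off: an explicit '#comment' case saves comparisons for comments only when enough cases sit below it, while every case listed after it pays one extra comparison.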

Harry Armadillo
03-19-2005, 08:33 AM
For compiling my stats with a real browser, the key line is
which causes Firefox to ask if I want to allow a script to do potentially unsafe and obviously evil things. Which of course I do. :)