  1. #1
    Regular Coder
    Join Date
    Feb 2005
    Posts
    400
    Thanks
    0
    Thanked 0 Times in 0 Posts

    HTML tag/node frequency statistics

    A script I'm building needs to look at and fiddle with nearly every node in the document's body. To make it run as fast as possible, the switch/case statement (which decides what to do based on the kind of node) needs its cases in the optimum order: common to rare.

    For example: I don't care about #comment nodes. Are they common enough that a case '#comment': break; will save time? Or are they uncommon enough that it'll be faster to let them fall out the bottom of the switch (despite having to check them against more case statements)?

    Does anyone know where I can find (or how can I generate) statistics on the relative frequency of nodes? I can examine pages myself one at a time (I have a bookmarklet that pops up a window with how many of each node type), but it would be too tedious to manually examine enough pages for good stats on the rarer tags.
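A minimal sketch of the kind of walker being described, assuming each node exposes `nodeName` and `childNodes` the way DOM nodes do. The function name `walk`, the `stats` counters, and the particular case order are illustrative assumptions, not the original script.

```javascript
// Hypothetical walker: visits every node under `root` and dispatches on
// nodeName, with the cases listed from (assumed) most to least common.
function walk(root, stats) {
  // Lowercase so the cases match however the browser reports the name.
  switch (root.nodeName.toLowerCase()) {
    case '#text':
      stats.text++;
      break;
    case 'a':
      stats.links++;
      break;
    case '#comment':
      // Explicit no-op case; whether it pays off depends on how common
      // comments are relative to the cases listed after it.
      break;
    default:
      stats.other++;
  }
  var kids = root.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    walk(kids[i], stats);
  }
  return stats;
}

// Usage with a plain-object stand-in for a DOM subtree.
var tree = {
  nodeName: 'BODY',
  childNodes: [
    { nodeName: '#text', childNodes: [] },
    { nodeName: 'A', childNodes: [{ nodeName: '#text', childNodes: [] }] },
    { nodeName: '#comment', childNodes: [] }
  ]
};
var result = walk(tree, { text: 0, links: 0, other: 0 });
```

In a browser the same function could be pointed at `document.body`; the plain-object tree is only there so the sketch runs anywhere.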

  • #2
    Senior Coder
    Join Date
    Mar 2005
    Location
    Portsmouth UK
    Posts
    4,467
    Thanks
    3
    Thanked 495 Times in 482 Posts
    is this of any assistance?

    PHP Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">

    <html>

    <head>
      <title></title>
    <script language="JavaScript" type="text/javascript">
    <!--
    EAry=new Array('IMG','INPUT');
    PAry=new Array();

    function Priority(){
     for (i=0;i<EAry.length;i++){
      PAry[i]=new Array();
      PAry[i][0]=EAry[i];
      PAry[i][1]=document.getElementsByTagName(EAry[i]).length;
     }
     PAry.sort(tsoSortNumeric);
     document.Show.Show1.value=PAry;
    }

    function tsoSortNumeric(tso0,tso1){
     tsoA=tso0[1]; tsoB=tso1[1];
     if (isNaN(tsoA)){ return 0;}
     else {
      if (isNaN(tsoB)){ return 0; }
      else { return tsoA-tsoB; }
     }
    }

    //-->
    </script>

    </head>

    <body onload="Priority();" >

    <img src="111.gif" width="10" height="10">
    <img src="111.gif" width="10" height="10">
    <img src="111.gif" width="10" height="10">
    <img src="111.gif" width="10" height="10">
    <img src="111.gif" width="10" height="10">
    <img src="111.gif" width="10" height="10">
    <img src="111.gif" width="10" height="10">

    <script> vic=0; </script>
    <form name=Show id=Show style="position:absolute;visibility:visible;top:450px;left:0px;" >
    <input size=100 name=Show1 >
    <input size=10 name=Show2 >
    <input size=10 name=Show3 >
    <input size=10 name=Show4 >
    <input size=10 name=Show5 >
    <input size=10 name=Show6 >
    </form>

    </body>

    </html> 

  • #3
    Regular Coder
    Join Date
    Feb 2005
    Posts
    400
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Yeah, my bookmarklet does basically the same thing, only more so.
    Code:
    javascript:(function(){var total=new Array();function sortThem(a,b){return(b.count-a.count)}function countObj(nodeName){this.nodeName=nodeName;this.count=1;}function totalNodes(obj){var i=total.length;var n=obj.nodeName.toLowerCase();dude:{while(i--){if(n==total[i].nodeName){total[i].count++;break dude;};}total[total.length]=new countObj(n);}for(var i=0;i<obj.childNodes.length;i++)totalNodes(obj.childNodes[i])}totalNodes(document.body);total.sort(sortThem);var w=window.open('','_blank');for(var i=0;i<total.length;i++)w.document.write(total[i].count+"  "+total[i].nodeName+"<br>");w.document.close();})()
    Readable version:
    Code:
    javascript:(function(){
      var total=new Array();
      function sortThem(a,b){
        return(b.count-a.count)
        }
      function countObj(nodeName){
        this.nodeName=nodeName;
        this.count=1;
        }
      function totalNodes(obj){
        var i=total.length;
        var n=obj.nodeName.toLowerCase();
        dude:{
          while(i--){
            if(n==total[i].nodeName){
              total[i].count++;
              break dude;
              };
            }
          total[total.length]=new countObj(n);
          }
        for(var i=0;i<obj.childNodes.length;i++)
          totalNodes(obj.childNodes[i])
        }
      totalNodes(document.body);
      total.sort(sortThem);
      var w=window.open('','_blank');
      for(var i=0;i<total.length;i++)
      w.document.write(total[i].count+"  "+total[i].nodeName+"<br>");
      w.document.close();
      })()
    Which gives me something like this for this page:
    Code:
    819 #text
    125 font
    86 br
    81 #comment
    79 a
    71 div
    56 td
    37 option
    33 tr
    22 img
    21 strong
    16 table
    16 tbody
    15 input
    6 span
    6 script
    4 form
    4 b
    3 optgroup
    2 select
    2 p
    2 code
    2 hr
    1 i
    1 body
    1 thead
    That's great for a single page, but I need that sort of list for the 'average' page (or for the web as a whole...). I could generate that sort of list on a bunch of random pages (and have), but I don't have an easy way to total them over a ton of pages.
    Last edited by Harry Armadillo; 03-16-2005 at 11:33 PM. Reason: speling

  • #4
    Master Coder
    Join Date
    Feb 2003
    Location
    Umeå, Sweden
    Posts
    5,575
    Thanks
    0
    Thanked 83 Times in 74 Posts
    Why don't you let it build the source code for a JavaScript object? Then you run it on twenty different sites and get twenty objects that you place in an array. Add together all nodes of the same kind to get a total, divide by twenty, and you have the average.
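A rough sketch of that averaging step. The first object reuses three counts from the single-page output posted earlier in the thread; the second page and the names `pages` and `averageCounts` are made up for illustration.

```javascript
// Per-page tallies, pasted in by hand as object literals.
// The second entry is invented sample data.
var pages = [
  { '#text': 819, font: 125, br: 86 },
  { '#text': 400, br: 20, td: 60 }
];

// Sum the counts for each node name across all pages, then divide by the
// number of pages to get the mean count per page.
function averageCounts(pageList) {
  var totals = {};
  for (var i = 0; i < pageList.length; i++) {
    var page = pageList[i];
    for (var name in page) {
      if (Object.prototype.hasOwnProperty.call(page, name)) {
        totals[name] = (totals[name] || 0) + page[name];
      }
    }
  }
  var averages = {};
  for (var key in totals) {
    if (Object.prototype.hasOwnProperty.call(totals, key)) {
      averages[key] = totals[key] / pageList.length;
    }
  }
  return averages;
}

var averages = averageCounts(pages);
```

A node name missing from a page simply contributes zero to that name's total, so tags that appear on only a few pages still end up with a small average.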
    liorean <[lio@wg]>
    Articles: RegEx evolt wsabstract , Named Arguments
    Useful Threads: JavaScript Docs & Refs, FAQ - HTML & CSS Docs, FAQ - XML Doc & Refs
    Moz: JavaScript DOM Interfaces MSDN: JScript DHTML KDE: KJS KHTML Opera: Standards

  • #5
    Regular Coder
    Join Date
    Feb 2005
    Posts
    400
    Thanks
    0
    Thanked 0 Times in 0 Posts
    How do I do that on twenty pages from twenty different sites without triggering cross-site scripting warnings?

  • #6
    Regular Coder
    Join Date
    Aug 2004
    Location
    codegoboom@yahoo.com
    Posts
    999
    Thanks
    0
    Thanked 0 Times in 0 Posts
    There's a discussion about that kind of thing and xml http requests a few threads down, I think... (I'd probably just visit a bunch of sites in IE, and then read the files from the cache, using the Shell/FSO).
    *this message will self destruct in n-seconds*

  • #7
    Master Coder
    Join Date
    Feb 2003
    Location
    Umeå, Sweden
    Posts
    5,575
    Thanks
    0
    Thanked 83 Times in 74 Posts
    Harry: You don't. Twenty sites is few enough to collect one object literal for each manually. Then you manually enter those into the source code of the script that calculates the averages.

    You see, the time you spent worrying about how to do it automatically is probably far more than it would have taken to do it manually.
    liorean <[lio@wg]>

  • #8
    Senior Coder
    Join Date
    Jun 2002
    Location
    Nashua, NH
    Posts
    1,724
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Eliminate the root of the problem....
    Code:
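    var processNode = {};   // plain object used as a lookup table
    processNode['font'] = function(node)
              { /* process font node */
              };
    processNode['p'] = function(node)
              { /* process paragraph node */
              };

    function doNode(node)
      { // nodeName is uppercase for HTML elements, so normalize before the lookup
        var name = node.nodeName.toLowerCase();
        // skip node types that have no handler instead of throwing
        if (processNode.hasOwnProperty(name)) processNode[name](node);
      }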
    Vladdy | KL
    "Working web site is not the one that looks the same on common graphical browsers running on desktop computers, but the one that adequately delivers information regardless of device accessing it"

  • #9
    Supreme Master coder! glenngv's Avatar
    Join Date
    Jun 2002
    Location
    Philippines
    Posts
    10,966
    Thanks
    0
    Thanked 236 Times in 233 Posts
    I did a rough performance test on both solutions (hash and switch) and found interesting results for IE6 and Firefox: the two browsers disagree. In IE, hash is faster than switch, but in FF it is the other way around. It's also interesting to note that FF is much faster (about twice as fast) than IE at processing the code. Here's the code and the results:

    script:
    Code:
    //for hash
    var processNode = new Array();
    processNode['font'] = function(node)
              { /* process font node */
              };
    processNode['p'] = function(node)
              { /* process paragraph node */
              };
    processNode['div'] = function(node)
              { /* process div node */
              };
                        
    function doNode(node)
    { processNode[node](node);
    }
    
    function process(node){
      var s = new Date();
      for (var i=0;i<100000;i++){
        doNode(node);
      }
      var e = new Date();
      var d = (e-s)/1000;
      alert(d);
      document.getElementById('output1').innerHTML+=d+' '+node+'<br />';  
    }
    
    //for switch
    function doNode2(node)
    { 
      switch (node){
        case 'font':processNode2(node);break;
        case 'p':processNode2(node);break;
        case 'div':processNode2(node);break;
      }
    }
    
    function processNode2(node){
      /* process node */
    }
    
    function process2(node){
      var s = new Date();
      for (var i=0;i<100000;i++){
        doNode2(node);
      }
      var e = new Date();
      var d = (e-s)/1000;
      alert(d);
      document.getElementById('output2').innerHTML+=d+' '+node+'<br />';  
    }
    form:
    Code:
    <form>
    <div>
     <input type="button" value="hash" onclick="process(prompt('node?',''))" />
     <div id="output1"></div>
    </div>
    <hr />
    <div>
     <input type="button" value="switch" onclick="process2(prompt('node?',''))" />
     <div id="output2"></div>
    </div>
    </form>
    Results (seconds per 100,000 calls):
    Code:
    hash (IE)     switch (IE)   hash (FF)     switch (FF)
    1.112 div     1.262 div     0.631 div     0.571 div
    1.101 div     1.261 div     0.641 div     0.56 div
    1.102 div     1.261 div     0.641 div     0.561 div
    1.102 div     1.251 div     0.641 div     0.561 div
    1.102 div     1.252 div     0.641 div     0.56 div
    1.102 p       1.212 p       0.631 p       0.551 p
    1.092 p       1.212 p       0.631 p       0.551 p
    1.101 p       1.212 p       0.641 p       0.551 p
    1.092 p       1.202 p       0.631 p       0.55 p
    1.101 p       1.212 p       0.631 p       0.551 p
    1.101 font    1.161 font    0.641 font    0.541 font
    1.112 font    1.162 font    0.651 font    0.541 font
    1.102 font    1.161 font    0.651 font    0.541 font
    1.101 font    1.162 font    0.651 font    0.541 font
    1.111 font    1.162 font    0.641 font    0.541 font
    After running all the repetitions for the hash method, I refreshed the page and then ran the repetitions for the switch method, so that both started from the same state.

  • #10
    Regular Coder
    Join Date
    Feb 2005
    Posts
    400
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Vladdy... won't work. The parent/child/sibling/grandparent/cousin relationships count, so I gotta just walk the tree.

    liorean, haven't you ever had a puzzle that you just had to crack? Do I need to spend six hours figuring out how to cut a script down from 600 to 500 ms of run time? No, but I am going to. Anyway, I found some code from your xml http postings - I didn't realize that IE5.5 would let you do out-of-domain xmlhttp requests.

    As I type this, an ugly hunk of code is pulling urls from a list in a textarea, grabbing the html, chopping off the pieces I don't want, and dumping the rest into an iframe. Then a script totals up the different types of nodes and updates an output area.

    IE5.5 seems to parse HTML into a slightly different tree than Firefox (not to mention how IE creates nodes from broken tags), so I still want to figure out how to do something similar in FF.

    Meanwhile, it's nice to know that small is five times as common as blockquote and that #text is 185 times more common than hr.

  • #11
    Regular Coder
    Join Date
    Aug 2004
    Location
    codegoboom@yahoo.com
    Posts
    999
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Harry Armadillo
    I still want to figure out how to do something similar in FF.
    That may be documented on xulplanet (if not, just save the source files, and read them locally).
    *this message will self destruct in n-seconds*

  • #12
    Regular Coder
    Join Date
    Feb 2005
    Posts
    400
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Not being the sort of person who stops just because something may be pointless...I tried hashes.

    The various functions all wanted different sets of parameters, so I ended up building an object containing the minimal set and passing a reference to it. With the extra overhead, hashes are a lot slower. Especially with the error handling needed for IE's habit of making pointless nodes from broken tags (an '/img' node? a 'C160DB3548BEA4' node?)

    FWIW, this is the relative frequency I found in the bodies of a sample of 3021 pages:
    Code:
    31.2818  #text      
     9.5278  br         
     9.1822  a          
     7.4688  td         
     6.2378  font       
     4.7963  tr         
     3.9655  img        
     3.8765  p          
     3.5474  span       
     3.2721  b          
     2.2265  ! or #comment 
     2.1986  center     
     1.7305  tbody      
     1.6318  table      
     1.5344  div        
     0.9987  option     
     0.9682  nobr       
     0.8739  li         
     0.8541  input      
     0.5880  i          
     0.3109  strong     
     0.2909  spacer     
     0.2580  hr         
     0.2282  noscript   
     0.2156  script     
     0.2067  small      
     0.1776  area       
     0.1602  form       
     0.1547  u          
     0.1535  ul         
     0.1340  dd         
     0.1056  body       
     0.0782  sub        
     0.0738  em         
     0.0671  big        
     0.0645  h2         
     0.0585  h1         
     0.0515  select     
     0.0512  h3         
     0.0486  th         
     0.0451  dt         
     0.0430  blockquote 
     0.0397  code       
     0.0337  pre        
     0.0301  h4         
     0.0277  map        
     0.0203  wbr        
     0.0090  style      
     0.0086  meta       
     0.0086  h5         
     0.0086  dl         
     0.0081  label      
     0.0080  size       
     0.0077  ol         
     0.0070  h6         
     0.0048  optgroup   
     0.0041  tt         
     0.0041  iframe     
     0.0034  noindex    
     0.0032  s          
     0.0028  textarea   
     0.0028  link       
     0.0018  base       
     0.0016  textbox    
     0.0016  noembed    
     0.0016  address    
     0.0015  caption    
     0.0012  dir        
     0.0011  strike     
     0.0008  cite       
     0.0008  acronym    
     0.0006  fieldset   
     0.0005  frame     
     0.0004  thead     
     0.0004  ilayer    
     0.0004  frameset  
     0.0004  col       
     0.0003  nolayer   
     0.0003  layer     
     0.0002  object   
     0.0002  menu     
     0.0001  nowrap    
     0.0001  embed     
     0.0001  dfn       
     0.0001  dev       
     0.0001  colgroup  
     0.0001  blink
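One way to put numbers like these to work is to estimate the average number of case comparisons per node for a given case order. The sketch below uses four frequencies copied from the table and assumes the switch tests its cases one by one in source order, which a real engine may or may not do; `expectedComparisons` is a hypothetical helper, and nodes outside the four listed names are ignored.

```javascript
// Relative frequencies copied from the table above (a subset).
var freq = { '#text': 31.2818, br: 9.5278, a: 9.1822, '#comment': 2.2265 };

// Expected comparisons per node, assuming cases are tested sequentially.
// Names present in `frequencies` but absent from `caseOrder` fall through
// and pay for every comparison.
function expectedComparisons(caseOrder, frequencies) {
  var total = 0;
  for (var name in frequencies) {
    if (Object.prototype.hasOwnProperty.call(frequencies, name)) {
      total += frequencies[name];
    }
  }
  var cost = 0;
  var covered = 0;
  for (var i = 0; i < caseOrder.length; i++) {
    var f = frequencies[caseOrder[i]] || 0;
    cost += f * (i + 1);   // a match at position i costs i + 1 comparisons
    covered += f;
  }
  cost += (total - covered) * caseOrder.length;   // fall-through nodes
  return cost / total;
}

var commonFirst = expectedComparisons(['#text', 'br', 'a', '#comment'], freq);
var rareFirst = expectedComparisons(['#comment', 'a', 'br', '#text'], freq);
```

With these four entries, common-to-rare order works out to about 1.7 comparisons per node against about 3.3 for the reverse order, which is the effect the original question was after.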

  • #13
    Regular Coder
    Join Date
    Feb 2005
    Posts
    400
    Thanks
    0
    Thanked 0 Times in 0 Posts
    For compiling my stats with a real browser, the key line is
    Code:
      netscape.security.PrivilegeManager.enablePrivilege("UniversalBrowserRead");
    which causes Firefox to ask if I want to allow a script to do potentially unsafe and obviously evil things. Which of course I do.

