binaryWeapon
10-11-2008, 03:09 AM
I am interested in writing a tokenizer (and then a parser) for JavaScript, in JavaScript. I must admit, when I started (on a whim) I didn't know anything about parsers; I didn't even know what a tokenizer was at the time. I just winged it, learned all I could from various sites, and came up with a 500 line tokenizer. It recognizes strings, numbers (including hexadecimal, octal, floating point, and integer), comments, punctuators, keywords, identifyers, and also undefined and null. I didn't get around to distinguishing regex literals from division/division assignment, because I ran into a problem.
My tokenizer loops through every character in the string, and, depending on the character, and a few following it, narrows down the choices of token-types and then writes the token type and the value to an array. This works perfectly - until you get up to thousands of characters. I took the 500 lines of the tokenizer code itself and put into the tokenizer, and discovered that in Firefox it froze up and then stopped shortly after 7000 characters (there were 20,000 characters in the document). In Netscape, it stopped after 2000 chars. In Opera, it stopped at 18,000 chars. IE didn't work at all (can't say I'm surprised).
The only success was Google Chrome. It really showed the power of the V8 engine; the moment I told it to tokenize, all 20,000 characters were immediately tokenized. But even Chrome could only handle up to 37,000 chars.
My question is this: How could I cut down on the recursion? My problem was that I had way to much recursion - browsers will automatically stop after a certain amount. Is there any way to do the same thing with less? Basically, how are tokenizers usually made?
Any link, article, comment, insight, or help of any kind is much appreciated!
My tokenizer loops through every character in the string, and, depending on the character, and a few following it, narrows down the choices of token-types and then writes the token type and the value to an array. This works perfectly - until you get up to thousands of characters. I took the 500 lines of the tokenizer code itself and put into the tokenizer, and discovered that in Firefox it froze up and then stopped shortly after 7000 characters (there were 20,000 characters in the document). In Netscape, it stopped after 2000 chars. In Opera, it stopped at 18,000 chars. IE didn't work at all (can't say I'm surprised).
The only success was Google Chrome. It really showed the power of the V8 engine; the moment I told it to tokenize, all 20,000 characters were immediately tokenized. But even Chrome could only handle up to 37,000 chars.
My question is this: How could I cut down on the recursion? My problem was that I had way to much recursion - browsers will automatically stop after a certain amount. Is there any way to do the same thing with less? Basically, how are tokenizers usually made?
Any link, article, comment, insight, or help of any kind is much appreciated!