View Full Version : Resolved Tokenizer Recursion Problem

10-11-2008, 04:09 AM
I am interested in writing a tokenizer (and then a parser) for JavaScript, in JavaScript. I must admit, when I started (on a whim) I didn't know anything about parsers; I didn't even know what a tokenizer was at the time. I just winged it, learned all I could from various sites, and came up with a 500 line tokenizer. It recognizes strings, numbers (including hexadecimal, octal, floating point, and integer), comments, punctuators, keywords, identifyers, and also undefined and null. I didn't get around to distinguishing regex literals from division/division assignment, because I ran into a problem.

My tokenizer loops through every character in the string, and, depending on the character, and a few following it, narrows down the choices of token-types and then writes the token type and the value to an array. This works perfectly - until you get up to thousands of characters. I took the 500 lines of the tokenizer code itself and put into the tokenizer, and discovered that in Firefox it froze up and then stopped shortly after 7000 characters (there were 20,000 characters in the document). In Netscape, it stopped after 2000 chars. In Opera, it stopped at 18,000 chars. IE didn't work at all (can't say I'm surprised).

The only success was Google Chrome. It really showed the power of the V8 engine; the moment I told it to tokenize, all 20,000 characters were immediately tokenized. But even Chrome could only handle up to 37,000 chars.

My question is this: How could I cut down on the recursion? My problem was that I had way to much recursion - browsers will automatically stop after a certain amount. Is there any way to do the same thing with less? Basically, how are tokenizers usually made?

Any link, article, comment, insight, or help of any kind is much appreciated!

10-11-2008, 06:11 AM
I tried something similar and posted a question to another forum.
Never got any action on it, but you are welcome to see if it would help you.
See: http://www.webdeveloper.com/forum/showthread.php?t=139173&highlight=tokenizer

I found a great explanation of compilers, parsers and tokenizers at: http://home.iae.nl/users/mhx/crenshaw/tiny.html

I also had some test code, but it may take awhile to find it again.
I was trying to create a programable RPN calculator with Javascript.
Went pretty well until I came to 'if...then...else...elseif' and other
conditional logic statements.

Other sites of interest I have found are:

While looking for more information I 'googled' using terms such as
tiny, compiler, assembler, tokens, tokenizer, parser
in various combinations.

10-12-2008, 01:40 AM
Turns out I posted too soon. After a night of rest I immediately realized that I was using tail-recursion - basically just recursion that does the same thing as a loop. I don't know why I didn't do this before - I just used a loop instead of recursion. I was also able to use named continue statements to get around the fact that I need to continue an outer loop and not an inner loop.

That being said, I really appreciate the post, jmrk - the sites were very useful.