06-08-2007, 05:02 PM | #6
DH2006
New Coder

 
Join Date: May 2006
Posts: 42
Quote:
Originally Posted by liorean
Sadly, there is to my knowledge no easy-to-use parser for HTML that generates an HTMLDocument object. DOMParser requires well-formed XML, for instance, and so does [object XMLHttpRequest].responseXML. I can imagine there is a way to get around it using innerHTML, or using iframe.src='data:text/html,'+encodeURIComponent(source); or something similar. I advise you to try one of those.
I will make one last attempt by searching the Mozilla/Firefox source for the function that parses not-well-formed HTML when a page is loaded in the browser. If I'm lucky, that function can be reached through XPCOM. Otherwise I will go back to using regex functions.
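
As a rough illustration of the regex fallback mentioned above, here is a minimal sketch that pulls link targets out of markup that may not be well formed. The function name extractLinks and the sample markup are hypothetical and not taken from the actual project. A regular expression is not a real HTML parser, so this will misread some inputs (for example, markup inside comments or scripts).

```javascript
// Hypothetical helper (not from the original project): collect href values
// from <a> tags without requiring well-formed markup.
function extractLinks(html) {
  // Matches <a ... href="x">, <a ... href='x'> and unquoted <a ... href=x>.
  // The "i" flag makes tag and attribute names case-insensitive.
  var pattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  var links = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Exactly one of the three capture groups is filled per match.
    if (match[1] !== undefined) {
      links.push(match[1]);
    } else if (match[2] !== undefined) {
      links.push(match[2]);
    } else {
      links.push(match[3]);
    }
  }
  return links;
}

// Sample input with unclosed tags, mixed case and mixed quoting.
var sample = '<p><a href="http://a.example/">A<a class=x HREF=\'b.html\'>B</a><A href=c.html>C';
var found = extractLinks(sample);
```

Because nothing is built into a DOM tree, memory use stays roughly proportional to the number of links kept, which is the reason to consider it for thousands of pages a day.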

innerHTML will work, but my experience so far is that it is not only slow but will probably also use far too much memory.
An iframe should work too, but will likely use too much memory as well, since my project is a search application for my own use that should load at least around 10,000 search pages a day. (It should also display around 10,000 of the linked pages it finds per day, so I'm already curious how that will affect memory :P )
I need my own application because I want to be able to implement my own analysis functions and tools easily.
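
For reference, the iframe route from the quote could be sketched as below. This assumes the MIME type in the quote is meant to be text/html; the helper name htmlToDataUrl is hypothetical. Only the URL construction is shown, since assigning it to an iframe needs a browser or extension context.

```javascript
// Hypothetical helper: wrap raw HTML source in a data: URL so the browser's
// own tolerant HTML parser handles it when the URL is loaded.
function htmlToDataUrl(source) {
  // encodeURIComponent escapes characters such as <, >, spaces and #
  // that would otherwise break or truncate the URL.
  return 'data:text/html,' + encodeURIComponent(source);
}

var dataUrl = htmlToDataUrl('<p>a b</p>');

// In a browser context (assumption: "iframe" is an existing iframe element):
// iframe.src = dataUrl;
```

Each page loaded this way is still fully parsed and rendered by the browser, so the memory concern stays the same.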

Thank you very much for your help!
Greetings, Dieter