...

How to get the links in a responseText/responseXML?



DH2006
06-08-2007, 03:20 AM
Hi!
How could I find the links (like in document.links[0] ...) in the content I retrieved through an XMLHttpRequest?
I get the error: MyThis.tempObj has no properties.



<html><head>
<script language="javascript" type="text/javascript">
MyXMLHttpRequest = function(fuURL,fuCallMeOnLoad) {
var MyThis = this;
this.status = 0; // see HTTP response codes / XMLHttpRequest response codes
this.theURL = fuURL; this.HTMLofURL = ''; this.CallMeOnLoad = fuCallMeOnLoad;
this.extractedLinks = new Array();
function onLoad (e) {
MyThis.status = 200;
MyThis.HTMLofURL = e.target.responseText;
var tempObj = e.target.responseXML; //That's not working
alert(tempObj.links[1]); // and so this gives an error
// MyThis.CallMeOnLoad(MyThis.theURL,MyThis.HTMLofURL);
}
this.LoadPage = function() {
try { netscape.security.PrivilegeManager.enablePrivilege("UniversalXPConnect"); } catch (e) { alert("Permission UniversalXPConnect denied."); }
var r = new XMLHttpRequest();
r.onload = onLoad;
r.open ("GET", this.theURL, true);
r.send (null);
}
}
function Show(t,h) {alert(t + "\n" + h);}
var Seite = new MyXMLHttpRequest('http://www.google.com',Show);
Seite.LoadPage();
</script></head><body> Hello! </body></html>

I have already tried a few other things, such as a parser, documentElement, ... but somehow responseXML is not there.

Greetings
Dieter

liorean
06-08-2007, 06:19 AM
The page you're requesting is not an XML page. It's only natural that responseXML is not available for it.

DH2006
06-08-2007, 09:56 AM
The page you're requesting is not an XML page. It's only natural that responseXML is not available for it.

I suspected something like that, but I wasn't sure, because the head of the HTML source declares it as

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

Therefore I tried something like new DomParser(e.target.responseText) (I can't remember the exact term), which should return a document object that I could then work with.
That didn't work (probably because I had a syntax error). It produced a long, ugly error message.

Now I have found this example:

var MyresponseXML = new DOMParser().parseFromString(e.target.responseText, 'text/xml');


Do you think that looks right? I will try it now, of course.
Greetings
Dieter

DH2006
06-08-2007, 10:11 AM
Well, that didn't work.
The first error message was "nicht wohlgeformt" (literally "not well-formed" in English).
(A second error message then followed, of course, saying "MyThis.tempObj.links has no properties".)

Instead of text/xml I also tried text/html, but that threw an exception with "NS_ERROR_NOT_IMPLEMENTED" in the parseFromString function.

Could I use setMimeHeader? But I read that it only works in IE. (I'm using Firefox.)

Thanks in advance for all your help!
Greetings, Dieter
P.S.
Actually, I first started by extracting the links with regex functions from the source of a loaded iframe, but everyone suggested just using XMLHttpRequest instead. Firefox really doesn't seem to have a simple function to get the source of a loaded website/window/document/iframe/frame. So now I tried XMLHttpRequest and the links property, and again it doesn't work.

liorean
06-08-2007, 02:02 PM
Sadly, there is to my knowledge no easy-to-use parser for HTML that generates an HTMLDocument object. DOMParser requires well-formed XML, for instance, and so does [object XMLHttpRequest].responseXML. I can imagine there is a way to get around it using innerHTML, or using iframe.src='data:text/html,'+encodeURIComponent(source); or something similar. I advise you to try one of those.
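
A minimal sketch of the data: URI idea mentioned above, assuming the fetched markup is available as a string. The function name buildHtmlDataUri is hypothetical, and the charset parameter is an added assumption that is not part of the suggestion. Relative links inside a document loaded this way would presumably no longer resolve against the original site, so this is mainly useful when the links are absolute.

```javascript
// Hypothetical helper: wraps an HTML string in a data: URI so that it can be
// assigned to iframe.src and parsed by the browser's own HTML parser.
function buildHtmlDataUri(source) {
  // encodeURIComponent escapes characters such as '#', '%' and '&' that
  // would otherwise cut the URI short or be misread.
  return "data:text/html;charset=utf-8," + encodeURIComponent(source);
}

// Possible usage in a page (not executed here):
//   iframe.onload = function () { /* inspect iframe.contentDocument.links */ };
//   iframe.src = buildHtmlDataUri(e.target.responseText);
```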

DH2006
06-08-2007, 06:02 PM
Sadly, there is to my knowledge no easy-to-use parser for HTML that generates an HTMLDocument object. DOMParser requires well-formed XML, for instance, and so does [object XMLHttpRequest].responseXML. I can imagine there is a way to get around it using innerHTML, or using iframe.src='data:text/html,'+encodeURIComponent(source); or something similar. I advise you to try one of those.

I will make one last attempt by searching the Mozilla/Firefox source for the function that parses non-well-formed HTML when pages are loaded in the browser. If I'm lucky, that function can be reached through XPCOM. Otherwise I will use regex functions again.
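
The regex fallback could look roughly like the following. This is a sketch, not a real HTML parser: the name extractHrefs and the pattern are assumptions for illustration. It only looks at href attributes of <a> tags, and it neither decodes entities such as &amp; nor resolves relative URLs.

```javascript
// Hypothetical helper: collects raw href values of <a> tags from an HTML string.
function extractHrefs(html) {
  // Matches <a ... href="..."> with double-quoted, single-quoted or unquoted
  // values. The \b after "a" keeps tags such as <abbr> from matching.
  var pattern = /<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  var hrefs = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Exactly one of the three capture groups is set for each match.
    if (match[1] !== undefined) {
      hrefs.push(match[1]);
    } else if (match[2] !== undefined) {
      hrefs.push(match[2]);
    } else {
      hrefs.push(match[3]);
    }
  }
  return hrefs;
}

// Possible usage inside the onLoad handler (not executed here):
//   MyThis.extractedLinks = extractHrefs(e.target.responseText);
```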

innerHTML will work, but my experience so far is that it is not only slow but will probably use far too much memory.
An iframe should work too, but will likely also use too much memory, as my project is a search application for my own use that should load at least around 10000 search pages a day. (It should also display around 10000 of the linked pages found per day, so I'm already curious how that will affect memory :P )
I need my own application because I want to implement my own analysis functions and tools easily.

Thank you very much for your help!
Greetings, Dieter

DH2006
06-08-2007, 06:09 PM
I could try reading through some of the extensions that extract information from Google.de and other search engines. At least there are some that alter the pages. Let's see; maybe some of them don't load the pages in iframes/frames first but handle this differently (somehow faster and with less memory).

rnd me
06-15-2007, 09:09 AM
Sadly, there is to my knowledge no easy to use parser for HTML that generates an HTMLDocument object. DOMParser requires well formed XML for instance, and ...

cheer up, and turn it into valid XML!

//ff2 only
var s = new XMLSerializer();
var d = document;
var str = s.serializeToString(d);
alert(str);


instead of d, you can feed it your responseBody.


you can also create a hidden div, and set the innerHTML to the responseText.
then something like:

var newLinks=hiddenDiv.getElementsByTagName("a");

should work, as long as it is hidden, it shouldn't take that long.
but use the spankin new serializer if you can.
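
The hidden-div idea, spelled out as a sketch. extractLinksViaDiv is a hypothetical helper. It takes the document as a parameter and never attaches the div to the page, which goes slightly beyond the suggestion above. Be aware that the browser may still fetch images referenced in the markup once innerHTML is set.

```javascript
// Hypothetical helper: lets the browser's HTML parser do the work by
// assigning the fetched markup to a detached <div>.
function extractLinksViaDiv(doc, html) {
  var container = doc.createElement("div"); // never added to the page
  container.innerHTML = html;
  var anchors = container.getElementsByTagName("a");
  var hrefs = [];
  for (var i = 0; i < anchors.length; i++) {
    // getAttribute returns the href as written in the markup; the .href
    // property would resolve it against the current page instead.
    var href = anchors[i].getAttribute("href");
    if (href !== null) {
      hrefs.push(href);
    }
  }
  return hrefs;
}

// Possible usage in a page (not executed here):
//   var newLinks = extractLinksViaDiv(document, e.target.responseText);
```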


