How to extract HTML text?
joeantony
09-18-2006, 07:30 AM
Hey,
How would I extract all the text elements in an HTML page?
I need all the words that are visible in the page, in an array.
How easily could this be achieved? :confused:
- Joe
nikkiH
09-18-2006, 07:29 PM
That depends on how messy the page is, and if the author used VALID html. ;)
Remember that the browser will often guess at what the page author wanted, so it might render as text things that are not properly inside the elements you'd expect (such as td, div, etc.).
Is it your page? If so, you have control there. If not, good luck with that. :D
nickfox
09-19-2006, 11:00 AM
You could get the body element and then iterate down through its children using recursion, picking up the nodes whose nodeType is TEXT_NODE along the way and putting them into an array. I don't think it would be too hard. You might have to filter out some garbage, like nikki said. What exactly are you trying to do?
mrhoo
09-20-2006, 06:24 AM
You can start with this and tinker with it:
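function deepText(who) {
    // Collect the data of every text node under `who`, in document order.
    var tmp = [];
    var tex;
    var pa = who.childNodes;
    var cnt = 0;
    while (pa && pa[cnt]) {
        tex = pa[cnt++];
        // nodeType 3 is a text node; empty ones are skipped.
        if (tex.nodeType === 3 && tex.data) tmp.push(tex.data);
        // Anything else with children gets searched recursively.
        else if (tex.hasChildNodes()) tmp = tmp.concat(deepText(tex));
    }
    return tmp;
}
var A = deepText(document.body);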
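Note that deepText returns whole text nodes, including whitespace-only ones and the contents of script and style elements. If you want individual words, something along these lines might be a starting point. It is only a rough sketch: visibleWords is a made-up name, the list of skipped tags is my own assumption, and it does not detect elements hidden with CSS.

```javascript
// Hypothetical helper (not part of the original post): returns the words
// found in text nodes under `root`, skipping script/style/noscript contents.
// Assumes DOM-like nodes exposing nodeType, nodeName, data and childNodes.
function visibleWords(root) {
    var words = [];
    var kids = root.childNodes || [];
    for (var i = 0; i < kids.length; i++) {
        var node = kids[i];
        if (node.nodeType === 3) {
            // Text node: split on runs of whitespace and drop empty pieces.
            var parts = (node.data || "").split(/\s+/);
            for (var j = 0; j < parts.length; j++) {
                if (parts[j]) words.push(parts[j]);
            }
        } else if (node.nodeType === 1 &&
                   !/^(script|style|noscript)$/i.test(node.nodeName)) {
            // Element that normally renders its text: recurse into it.
            words = words.concat(visibleWords(node));
        }
    }
    return words;
}
```

In a browser you would call it as visibleWords(document.body) once the page has loaded. Punctuation stays attached to the words, so strip it afterwards if you need bare words.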