Thread: Scrapping
09-25-2012, 07:59 PM | #5
rnd me
Senior Coder

 
regexp is hard to write and harder to maintain; i always preferred using DOM methods.
regexp can break on a simple redesign or widget addition.
finding the start point is easy, but the end point is harder, and subject to change with content.
with regexp, you almost always end up having to "over bite" (match too much) and work your way back from the end.
for example, you can find a div with a certain ID pretty easily, but if it nests 5 other divs, where the heck do you stop?
trying to match end tags in RegExp is next to impossible for any non-guru.
you would need to call exec() several times on the same regexp with the g flag (instead of a one-shot like match or split) to step through the matches, and yikes, what a pita...
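
to make the "over bite" problem concrete, here is a rough sketch. the html string and both regexps are made up for illustration (they are not taken from any real page): a lazy match stops at the first closing div, a greedy one runs to the last, and neither lands on the target div's real end tag. the loop at the end shows the repeated exec() stepping described above.

```javascript
// hypothetical markup: a target div that nests another div, followed by a sibling div
var html = '<div id="post"><div class="inner">hello</div> world</div><div id="ad">buy</div>';

// lazy match: stops at the FIRST </div>, cutting the target short
var lazy = /<div id="post">([\s\S]*?)<\/div>/.exec(html)[1];
// lazy === '<div class="inner">hello'

// greedy match: runs to the LAST </div>, swallowing the sibling div too
var greedy = /<div id="post">([\s\S]*)<\/div>/.exec(html)[1];
// greedy === '<div class="inner">hello</div> world</div><div id="ad">buy'

// stepping through every match needs a /g regexp and repeated exec() calls;
// each call resumes from re.lastIndex and returns null when nothing is left
var re = /<div[^>]*>/g, m, openTags = [];
while ((m = re.exec(html)) !== null) {
    openTags.push(m[0]);
}
// openTags now holds the three opening div tags, in document order
```

getting the real end of the "post" div would mean counting opening and closing tags yourself, which is exactly the bookkeeping a dom parser already does for you.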

dom can be much more targeted/filtered/precise, and you can use easy-to-read css selectors instead of RegExp gobbledygook.
plus you don't have to worry about finding the end tag, it's immune to whitespace/char encoding changes, and it resists breakage from authors inserting duplicate-shaped html.


demo in firebug right here, then adapt as needed:

Code:
function aGet(turl, callback) {
    var XHRt = new XMLHttpRequest();
    XHRt.onreadystatechange = function () {
        // readyState 4 = request finished; only pass along successful responses
        if (XHRt.readyState == 4 && XHRt.status == 200) {
            callback(XHRt.responseText);
        }
    };
    XHRt.open("GET", turl, true);
    XHRt.send();
    return XHRt;
}

aGet("/", function (str) {
    // parse the fetched html into a detached element so it can be queried
    var t = document.createElement("head");
    t.innerHTML = str;

    // adjust here:
    var path = "#collapseobj_forumbit_1 .alt1Active div [href$='2']";
    var hit = t.querySelector(path); // null when nothing matches
    alert((hit && hit.textContent) || "NO MATCH");
});
edit:
of course, i can hear them now, "what about the poor IE7 users?".
you can still cater for them: you can still use the dom, just not CSS selectors (IE7 has no querySelectorAll).
it's still a LOT easier to jump to the nearest ID and .getElementsByTagName() your way down to the content you need than it is splitting hairs with regexp.
i say forget about IE7; click jackers and zombie bots will probably finish those boxes off in the next 6 months, and 2/3 of IE7 boxes are in china, which is likely not your site's target demo anyway...
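
for anyone who does need the no-selector route, here is a minimal sketch of the "jump to an ID, then tag-name your way down" idea. textByTagPath is a hypothetical helper name, not a browser API, and it assumes the content lives in a document object that has getElementById (a detached element like the one in the demo does not have that method). it only ever follows the first match at each step, so treat it as a starting point.

```javascript
// hypothetical helper: start at the element with the given id, then narrow
// down by tag name one step at a time, always following the first match.
// returns the text of the final element, or null if any step finds nothing.
function textByTagPath(doc, id, tagNames) {
    var nodes = [doc.getElementById(id)];
    for (var i = 0; i < tagNames.length && nodes[0]; i++) {
        // getElementsByTagName searches all descendants, not just direct children
        nodes = nodes[0].getElementsByTagName(tagNames[i]);
    }
    var hit = nodes[0];
    if (!hit) { return null; }
    // old IE has no textContent, so fall back to innerText
    return hit.textContent || hit.innerText || "";
}

// possible use on a live page (the id is the one from the selector in the demo):
// alert(textByTagPath(document, "collapseobj_forumbit_1", ["div", "a"]) || "NO MATCH");
```

unlike the css selector version, this can't filter on classes or attributes such as href; you would have to loop over the returned collection and check those by hand.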
__________________
my site (updated 5/13)
STATS (2013/5) HTML5:90.2% MOB:14% IE7:0.5% IE8:8.8% IE9:11.4% IE10:6.5%

Last edited by rnd me; 09-25-2012 at 08:22 PM..