...

ISO regex guru for help in multi-line pattern match



MrBiggZ
12-17-2010, 05:19 PM
Hi!

I'm hoping there is a regex guru in the house. I've been getting pretty good at this on a novice level and now it's time to step up the game a little bit.

From this little section of HTML code below I'd like to pull the addresses out. I tried this:



#<div style="margin-bottom:.*?>\s(.*>)<br>\s(.*?)\s<br>\s(\d{3}-\d{3}-\d{4}\s<br>#m


Alas .. my results yield nothing =( I do know that the end of each line in the HTML is a LF, not CR/LF. I did try using the s modifier instead of the m. Still no luck! Oh .. I'm using preg_match_all and not just preg_match.

I've seen multi-line patterns done, but they were congested with \/.*+ and other things I couldn't follow. Could you please help me with this one and break it down for me so I can understand what is going on? I think once I get this one under my belt and can follow it, I can do others.

Hope you can help! I'll consider it an xmas present! =) Probably the only one I'll be getting this year! :eek:

Thanks much in advance!


<div id="directions">
<div class="item">

<div class="number">1</div> Steak 'N Shake
<div style="margin-bottom: 20px;">
3810 W. Washington<br>
Indianapolis, IN 46241 <br>
317-241-0483 <br><br>

Hours: <br>
Dining Room:
<div>Monday: Anytime</div>
<div>Tuesday: Anytime</div>
<div>Wednesday: Anytime</div>
<div>Thursday: Anytime</div>
<div>Friday: Anytime</div>
<div>Saturday: Anytime</div>
<div>Sunday: Anytime</div>
<div class="show">Drivethrough: </div>
<div class="show">Monday: </div>
<div class="show">Tuesday: </div>
<div class="show">Wednesday: </div>
<div class="show">Thursday: </div>
<div class="show">Friday: </div>
<div class="show">Saturday: </div>
<div class="show">Sunday: </div>

</div>

timgolding
12-17-2010, 05:50 PM
The DOM would probably be better for parsing HTML documents.
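
For anyone wondering what the DOM route might look like, here is a minimal sketch. It assumes PHP's bundled DOM extension is enabled, uses a trimmed and properly closed copy of the markup from the first post, and assumes the first three text lines inside the margin-bottom div are always street, city/state/zip and phone. The variable names are only illustrative.

```php
<?php
// Trimmed sample of the markup from the first post (closing tags added).
$html = <<<'END'
<div id="directions">
<div class="item">
<div class="number">1</div> Steak 'N Shake
<div style="margin-bottom: 20px;">
3810 W. Washington<br>
Indianapolis, IN 46241 <br>
317-241-0483 <br><br>
Hours: <br>
</div>
</div>
</div>
END;

// Collect parser warnings internally instead of printing them for sloppy HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$addresses = [];
// Find each div with a margin-bottom style that sits directly inside an "item" div.
foreach ($xpath->query('//div[@class="item"]/div[contains(@style, "margin-bottom")]') as $div) {
    $lines = [];
    // Keep only the non-empty text nodes; the <br> elements are skipped.
    foreach ($div->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE && trim($node->nodeValue) !== '') {
            $lines[] = trim($node->nodeValue);
        }
    }
    // Assumption: the first three text lines are street, city/state/zip, phone.
    $addresses[] = array_slice($lines, 0, 3);
}

print_r($addresses);
```

The XPath expression does the job of the long regex, and line endings (LF or CR/LF) stop mattering because trim() removes them.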

Inigoesdr
12-17-2010, 09:02 PM
Tim is right, using the DOM is a lot more reliable than regular expressions for parsing HTML. That being said, I believe something like this is what you're looking for:

$input = <<<END
<div id="directions">
<div class="item">

<div class="number">1</div> Steak 'N Shake
<div style="margin-bottom: 20px;">
3810 W. Washington<br>
Indianapolis, IN 46241 <br>
317-241-0483 <br><br>

Hours: <br>
Dining Room:
<div>Monday: Anytime</div>
<div>Tuesday: Anytime</div>
<div>Wednesday: Anytime</div>
<div>Thursday: Anytime</div>
<div>Friday: Anytime</div>
<div>Saturday: Anytime</div>
<div>Sunday: Anytime</div>
<div class="show">Drivethrough: </div>
<div class="show">Monday: </div>
<div class="show">Tuesday: </div>
<div class="show">Wednesday: </div>
<div class="show">Thursday: </div>
<div class="show">Friday: </div>
<div class="show">Saturday: </div>
<div class="show">Sunday: </div>

</div> <div class="item">

<div class="number">2</div> Steak 'N Shake2
<div style="margin-bottom: 20px;">
3810 W. Washington2<br>
Indianapolis, IN 462412 <br>
317-241-04832 <br><br>
END;

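// Captures, per item: [1] name, [2] street, [3] city/state/zip, [4] phone.
// "#" is the pattern delimiter, so "/" needs no escaping; "s" and "i" are modifiers.
$count = preg_match_all('#<div\s*class="item">\s*<div[^>]*>[^<]*</div>\s*([^\r\n\t]*?)[\r\n\t]*<div[^>]*>\s*([^\r\n\t]*?)<br[^>]*>[\s\r\n\t]*([^\r\n\t]*?)\s*<br[^>]*>\s*([0-9\-]*)#si', $input, $matches);

// Drop the full-match entries so only the captured groups remain.
unset($matches[0]);

var_dump($matches);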

That gives output like this:

array(4) {
  [1]=>
  array(2) {
    [0]=>
    string(14) "Steak 'N Shake"
    [1]=>
    string(15) "Steak 'N Shake2"
  }
  [2]=>
  array(2) {
    [0]=>
    string(18) "3810 W. Washington"
    [1]=>
    string(19) "3810 W. Washington2"
  }
  [3]=>
  array(2) {
    [0]=>
    string(22) "Indianapolis, IN 46241"
    [1]=>
    string(23) "Indianapolis, IN 462412"
  }
  [4]=>
  array(2) {
    [0]=>
    string(12) "317-241-0483"
    [1]=>
    string(13) "317-241-04832"
  }
}


I'm sure there are edge cases that you will have to tweak for, so keep that in mind.
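
On the m versus s question from the first post, a small sketch of what each modifier changes may help. The test string is a made-up two-line snippet using LF line endings, and the variable names are only illustrative.

```php
<?php
$text = "3810 W. Washington<br>\nIndianapolis, IN 46241 <br>";

// No modifier: "." stops at the line feed, so the pattern cannot span both lines.
$plain = preg_match('#Washington.*Indianapolis#', $text);

// m only changes ^ and $ to match at line boundaries; "." still stops at the line feed.
$multi = preg_match('#Washington.*Indianapolis#m', $text);

// s lets "." match the line feed as well, so the pattern can cross lines.
$dotall = preg_match('#Washington.*Indianapolis#s', $text);

// With m, ^ can anchor at the start of the second line; without it, only at the very start.
$anchored   = preg_match('#^Indianapolis#m', $text);
$unanchored = preg_match('#^Indianapolis#', $text);

var_dump($plain, $multi, $dotall, $anchored, $unanchored);
```

So "multi-line" matching in the sense of a pattern that runs across line breaks is about s (or about avoiding "." altogether with \s and negated classes), while m is only about where ^ and $ are allowed to match.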

Lamped
12-18-2010, 01:50 AM
A little tip for you: .* is almost always a bad idea. Instead of "margin-bottom:.*?>", writing "margin-bottom:[^>]*>" leads to less confusion and fewer errors.

I use "The Regex Coach" from http://weitz.de/regex-coach/ for developing and testing complex expressions, and I strongly recommend it. The Windows version runs perfectly in Wine too.
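
A rough illustration of the difference, using a shortened one-line version of the sample markup (the variable names are made up for the example):

```php
<?php
$html = '<div class="number">1</div><div style="margin-bottom: 20px;">3810 W. Washington<br>';

// Lazy dot: when the rest of the pattern fails, .*? is allowed to grow past
// the first ">" and keep going until something further along matches.
preg_match('#<div.*?>(\d+ [^<]*)<br>#s', $html, $lazy);

// Negated class: [^>]* can never consume a ">", so the match stays inside one tag.
preg_match('#<div[^>]*>(\d+ [^<]*)<br>#s', $html, $strict);

echo $lazy[0], "\n";   // full match begins at the first <div ...>
echo $strict[0], "\n"; // full match begins at the margin-bottom <div ...>
```

Both patterns capture the same street line here, but the lazy version quietly swallowed the whole "number" div to get there. On a larger page that kind of overreach is what tends to produce surprising matches.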

MrBiggZ
12-18-2010, 07:09 AM
Thanks for your replies!

Ok! This:


#<div\s*class="item">\s*<div[^>]*>[^<]*</div>\s*([^\r\n\t]*?)[\r\n\t]*<div[^>]*>\s*([^\r\n\t]*?)<br[^>]*>[\s\r\n\t]*([^\r\n\t]*?)\s*<br[^>]*>\s*([0-9\-]*)#si

Loses me! =(

Tell me if I'm right or not! This <div[^>]*>[^<]*</div> means: after the <div, anything that is not a > zero or more times, then a >, then anything that is not a < zero or more times, then a </div>.

Now does \s* mean more than one space, tab or line break?

I'm confused on this one: <br[^>]*>[\s\r\n\t]*([^\r\n\t]*?)

Dumb question is .. how do you train your brain to think this way? I just haven't found the ins and outs of it yet. If I had a good teacher I'd be better off. I'm semi-noobish so go ahead and beat me up! :eek:

Inigoesdr
12-18-2010, 09:25 AM
This <div[^>]*>[^<]*</div> means: after the <div, anything that is not a > zero or more times, then a >, then anything that is not a < zero or more times, then a </div>.
Correct.


Now does \s* mean more than one space, tab or line break?

I'm confused on this one: <br[^>]*>[\s\r\n\t]*([^\r\n\t]*?)
Yeah, you don't really need \r\n\t in that one. I left them in there because I added the \s last, and... lazy.
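
A quick way to check this for yourself (a minimal sketch; the test strings are made up): \s already matches space, tab, carriage return and line feed, and * means "zero or more", so \s* also matches when there is no whitespace at all.

```php
<?php
$text = "<br> \r\n\t 317-241-0483";

// \s already covers \r, \n and \t, so these two patterns accept the same input.
preg_match('#<br>[\s\r\n\t]*([0-9\-]+)#', $text, $verbose);
preg_match('#<br>\s*([0-9\-]+)#', $text, $short);

// "*" is zero or more: the pattern still matches with no whitespace after <br>.
$zero = preg_match('#<br>\s*([0-9\-]+)#', '<br>317-241-0483');

var_dump($verbose[1] === $short[1], $zero);
```

If at least one whitespace character is required, \s+ is the quantifier to use instead.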

Dumb question is .. how do you train your brain to think this way? I just haven't found the ins and outs of it yet. If I had a good teacher I'd be better off. I'm semi-noobish so go ahead and beat me up! :eek:
You seem to have a better knowledge of regex than most people who ask questions about it. As far as getting your brain to think like that... I'm not sure. The easiest way I have found to develop/test regular expressions is to use Regex Buddy (http://www.regexbuddy.com/) (commercial). I have used The Regex Coach, mentioned earlier, and it's a good free solution, but Regex Buddy is the best piece of software I've used for regex thus far.

Screen:
http://www.regexbuddy.com/screens/regexbuddy.png

MrBiggZ
12-19-2010, 05:31 AM
Yes sir, I've seen Regex Buddy, but my wallet cried at 40 bucks. I guess if I was doing this for a living .. it would be a good investment. But I mainly do this just to keep my mind sharp and to put that degree that hangs on the wall behind me, which has NEVER been used, to some use. =((

Cobol programmer by schooling. When I graduated in '88, PCs cost about what a POS used car costs now. The interwebz didn't even exist in the civilian world yet.

I guess it's going to have to be repetition learning to get this down. *sigh*

Thanks for all your help!

low tech
12-19-2010, 05:43 AM
Hi

http://www.gskinner.com/RegExr/

FREE online learning tool:-)

LT

MrBiggZ
12-20-2010, 01:49 AM
Hi

http://www.gskinner.com/RegExr/

FREE online learning tool:-)

LT

Thanks bud! Haven't run across that one yet! Once the torture, I mean holidays, are over I'll have to apply myself a bit more on it! Consider it bookmarked! :thumbsup:


