Resolved RegEx Modifiers



coolcamo8642
02-17-2011, 08:28 PM
I've got the following sets of code I'm trying to extract data from using a regular expression:


<TD STYLE=text-align:center;background-color:yellow;color:red;font-size:10px;font-weight:bold; CLASS="dedefault"><p class="centeraligntext">ONLINE COURSE
</TD>
<TD STYLE=text-align:center;font-size:10px;background:WHITE; CLASS="dedefault"><p class="centeraligntext">
3
</TD>


I'd like to extract both the "ONLINE COURSE" and the "3"; this is the expression I'm using right now:


$REGEX_3='#\<p class\=\"centeraligntext\"\>(.+?)\<\/TD\>#s';


Since one value is on the same line as the <p> and the other isn't, I need it to ignore the linebreak so that both can be detected with the same expression. I've tried adding \m to make it multiline, but I don't know precisely where to put it. I've tried a few different spots, but even when it doesn't cause a syntax error, it doesn't work.

Can anyone tell me how I need to modify the expression to ignore linebreaks?

Thanks in advance.

venegal
02-17-2011, 08:57 PM
There is no such thing as "ignore the linebreak". It's only about how to match it.

The multiline modifier only changes whether "^" and "$" match the beginning and end of lines, or of the whole string. You don't have those in your regexp, so you don't need that modifier.
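A minimal sketch of that difference, using a made-up three-line string (not the HTML from the question) so the effect of "m" on "^" and "$" is easy to see:

```php
<?php
// Illustrative input only: three short lines, no trailing newline.
$text = "alpha\nbeta\ngamma";

// Without "m", ^ and $ anchor to the whole string, so no single word matches.
$without = preg_match_all('#^\w+$#', $text, $plainMatches);

// With "m", ^ and $ also match at every line boundary.
$with = preg_match_all('#^\w+$#m', $text, $multiMatches);

var_dump($without);         // int(0)
var_dump($with);            // int(3)
var_dump($multiMatches[0]); // the three words, one per line
```

Note that the dot is not involved at all here, which is why "m" has no bearing on the expression in the question.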

Now, the real problem: You are trying to match a string that includes linebreaks with (.+?). The dot metacharacter won't match linebreaks unless you use the "s" modifier, so normally it would just stop at the end of a line, which is not what you want.

But you already use the "s" modifier, and your regexp works perfectly. I have no idea what your problem is supposed to be. Maybe you're using preg_match, which will stop after the first match, instead of preg_match_all.

The right place for modifiers, by the way, is after the closing regexp delimiter, which, in your case, is after the second "#".
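Here is a rough sketch of the preg_match vs. preg_match_all difference, run against the two cells from the first post. The variable names other than $REGEX_3 are just for illustration, and the trim() calls are only there to make the output readable:

```php
<?php
// The two table cells from the first post, held in a nowdoc string.
$html = <<<'HTML'
<TD STYLE=text-align:center;background-color:yellow;color:red;font-size:10px;font-weight:bold; CLASS="dedefault"><p class="centeraligntext">ONLINE COURSE
</TD>
<TD STYLE=text-align:center;font-size:10px;background:WHITE; CLASS="dedefault"><p class="centeraligntext">
3
</TD>
HTML;

// Pattern as posted: the "s" modifier goes after the closing "#" delimiter.
$REGEX_3 = '#\<p class\=\"centeraligntext\"\>(.+?)\<\/TD\>#s';

// preg_match() stops after the first match.
preg_match($REGEX_3, $html, $first);

// preg_match_all() keeps going; $all[1] holds every capture of group 1.
$count = preg_match_all($REGEX_3, $html, $all);

var_dump(trim($first[1]));            // the first cell only
var_dump($count);                     // number of matches found
var_dump(array_map('trim', $all[1])); // both cells, trimmed for display
```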

coolcamo8642
02-18-2011, 02:49 AM
Thanks for the clarification. It seems like I was causing a syntax error when I had attempted to change it earlier, because it's working now as you said. Part of what's extracted ends up on a new line, but that's fine.

I think regular expressions are one of the most frustrating aspects of programming, so I really appreciate your detail!

venegal
02-18-2011, 03:02 AM
The line breaks are in the matched string, because you're matching them with the (.+?). If you don't want them, you can easily change the regexp to

$REGEX_3='#\<p class\=\"centeraligntext\"\>\s*(.+?)\s*\<\/TD\>#s';

The \s* should eat away the whitespace (including line breaks) just fine. Notice that this way, as long as the value itself sits on a single line, you don't even need the "s" modifier, because the line breaks around it are already consumed by the \s*, so the dot doesn't have to match them any more.

So,

$REGEX_3='#\<p class\=\"centeraligntext\"\>\s*(.+?)\s*\<\/TD\>#';
should also do the trick.
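A quick way to check this, sketched with simplified stand-ins for the two cells (the STYLE and CLASS attributes are dropped, and the two-line cell at the end is a hypothetical case that is not in the original data):

```php
<?php
// Simplified stand-ins for the two cells from the first post.
$html = "<TD><p class=\"centeraligntext\">ONLINE COURSE\n</TD>\n"
      . "<TD><p class=\"centeraligntext\">\n3\n</TD>\n";

$withS    = '#\<p class\=\"centeraligntext\"\>\s*(.+?)\s*\<\/TD\>#s';
$withoutS = '#\<p class\=\"centeraligntext\"\>\s*(.+?)\s*\<\/TD\>#';

preg_match_all($withS, $html, $a);
preg_match_all($withoutS, $html, $b);

var_dump($a[1] === $b[1]); // same captures either way for this input
var_dump($b[1]);           // captures without the surrounding line breaks

// Hypothetical cell whose value spans two lines: here the dot has to cross
// a line break, so only the version with "s" can match.
$twoLines = "<p class=\"centeraligntext\">LINE ONE\nLINE TWO\n</TD>";
var_dump(preg_match($withoutS, $twoLines)); // int(0)
var_dump(preg_match($withS, $twoLines));    // int(1)
```

So dropping "s" is safe for values like the ones posted, but it is worth keeping if a cell could ever hold more than one line.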

Oh, and another thing: I'd only escape characters if I really need to, because part of why regexps can be frustrating is that they are hard to read, and escaping stuff without really needing to do that makes that problem even worse.

So, you could easily write it like this:

$REGEX_3 = '#<p class="centeraligntext">\s*(.+?)\s*</TD>#';
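If you want to convince yourself that the two spellings behave the same, something like the following should do, again with simplified stand-in cells rather than the full markup. The "/" needs no backslash here because the delimiter is "#", and "<", ">", "=" and the double quote are not special in this position:

```php
<?php
// Simplified stand-ins for the two cells from the first post.
$html = "<TD><p class=\"centeraligntext\">ONLINE COURSE\n</TD>\n"
      . "<TD><p class=\"centeraligntext\">\n3\n</TD>\n";

$escaped   = '#\<p class\=\"centeraligntext\"\>\s*(.+?)\s*\<\/TD\>#';
$unescaped = '#<p class="centeraligntext">\s*(.+?)\s*</TD>#';

preg_match_all($escaped, $html, $a);
preg_match_all($unescaped, $html, $b);

var_dump($a === $b); // bool(true): identical full matches and captures
var_dump($b[1]);     // the two extracted values
```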


