
variable pattern/modifier


duniyadnd
01-01-2003, 07:08 AM
I'm pretty new to pattern searching, so I know the script I wrote below is WAAAAAY less efficient than it could be. I'm guessing it has a runtime of O(infinity).. :o

Just curious if anyone can try to shorten the code and make it more efficient.



"/(<\s*a((\s*\w*|\w*\s*\1)+)(\=*)\s*(\"*|\'*)((\s*\w*|\w*\s*\1)+)(\"*|\'*)((\s*\w*|\w*\s*\1)+)href(\s*)=(\s*)(\"|\')((\s*\w*|\w*\s*\1)+)(\"|\').*\s*>((\s*\w*|\w*\s*\1)+)<\/a>)/ix"


Here is what it does: it looks for <a href="somelink">somelinkname</a>

It should still catch the link if the tag is typed over several lines, and it should skip any other attributes, like class="something", and carry on. However, I don't think it does that for anything after the href. It ignores case.

I tried using the /s modifier at the end to make it read the input as a single line, but it caused problems for me (i.e. it wouldn't read the pages at all). If anyone can make the above code more efficient, I would greatly appreciate it.

Thanks
Duniyadnd

duniyadnd
01-01-2003, 07:10 AM
For some reason I don't see the edit option, so I'll just continue down here.

Wherever a 1 has an asterisk in front of it, put a "\" there instead; it just came out wrong. Actually, there are a bunch of errors in the one above, so I'll try to post it without the PHP syntax.

"/(<\s*a((\s*\w*|\w*\s*\1)+)(\=*)\s*(\"*|\'*)((\s*\w*|\w*\s*\1)+)(\"*|\'*)((\s*\w*|\w*\s*\1)+)href(\s*)=(\s*)(\"|\')((\s*\w*|\w*\s*\1)+)(\"|\').*\s*>((\s*\w*|\w*\s*\1)+)<\/a>)/ix"

Try this one instead.

Duniyadnd

mordred
01-01-2003, 04:04 PM
I have a notion of what you want to achieve with this RegExp, though you haven't stated exactly what you want the pattern to find. Below is an example that might help you along:


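// Sample input: the tag spans several lines and has extra attributes.
$str = '
Here is <a class="x1" href=\'urlX.html\' id="blah">
a site
</a>
you should visit.
';

// i = case-insensitive, s = the dot also matches newlines.
$test = preg_match(
    '/<a(?:.*?)\s+href\=[\'"](.+?)[\'"]\s*(?:.*?)>(.+?)<\/a>/is',
    $str,
    $matches
);

// $matches[0] is the whole match, [1] the URL, [2] the link text.
if ($test && count($matches) == 3) {
    echo "URL: " . $matches[1] . "<br />";
    echo "Text: " . $matches[2] . "<br />";
}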


If you want to match more than one <a> tag in a given string, have a look at preg_match_all(). It's basically the same as preg_match() but allows for multiple matches and working upon them.
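
As a rough sketch of the preg_match_all() variant, assuming a pattern like the one in the example above: the sample input and the variable names ($html, $links) are made up for illustration, and PREG_SET_ORDER is just one way to arrange the results (one array per matched tag).

```php
<?php
// Hypothetical sample input with two links (not from the original post).
$html = '<p><a href="one.html">First</a> and
<a class="nav" href=\'two.html\'>Second</a></p>';

// PREG_SET_ORDER returns one array per match instead of one array per group.
$count = preg_match_all(
    '/<a(?:.*?)\s+href\=[\'"](.+?)[\'"]\s*(?:.*?)>(.+?)<\/a>/is',
    $html,
    $matches,
    PREG_SET_ORDER
);

$links = array();
foreach ($matches as $m) {
    // $m[1] is the URL, $m[2] is the link text.
    $links[] = array('url' => $m[1], 'text' => $m[2]);
}

foreach ($links as $link) {
    echo $link['url'] . ' => ' . $link['text'] . "\n";
}
```

preg_match_all() returns the number of full matches, so $count can be used to check whether anything was found at all.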

I do not know exactly how fine-grained and forgiving your RegExp needs to be. From my experience I can say that sometimes "less is more", especially concerning maintenance. After all, for a lot of developers RegExps just look as if the cat jumped on the keyboard... but if you have any specific questions on how to tailor the expression to fit your needs, just ask!

mordred
01-01-2003, 04:09 PM
Arghl, this forum eats my backslashes again. I just hate it. :)

duniyadnd
01-01-2003, 04:20 PM
Geeeeez... that's exactly what I was looking for... now all I have to do is figure out how you did that.. :D Believe me, I haven't slept for two nights because I was trying to get this thing to work efficiently. I guess it's back to the drawing board to learn how all this works. Heh, as you can see, this is the first time I've tried to implement this type of procedure, oh well....

And yeah, I was using preg_match_all to begin with.

I would appreciate it if you left a brief explanation of how it works. I mean, I can go through the manual, but the examples they show are kind of limited, and I figured I would learn more by trying the most complicated case I could think of.

Thanks,
Duniyadnd

mordred
01-01-2003, 05:56 PM
Ok, here's a short explanation. I guess you know already how preg_match and preg_match_all work, so I'll concentrate on the RegExp. Let's dissect it step by step:


<a(?:.*?)

The tag starts with '<', directly followed by 'a' (fairly obvious). Then follows a group (enclosed in the parentheses). Groups are normally captured and put in the $matches array, but since I have no use for this one later, I tell the RegExp not to capture it (i.e. not to put it in $matches) by using ?: .
Then follows a strange-looking thingy: .*?. The dot means to match any character. The asterisk controls the repetition of the dot (i.e. how often it may appear) and stands for "zero or more occurrences". On its own, .* would match as much as it possibly can, and that's not really what we want. That is called "greedy" matching. We have to tell it not to match greedily, i.e. to stop as soon as the rest of the pattern can match, and we do so by appending a question mark after the repetition character.
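
A small sketch of the greedy/ungreedy difference described above. The input string and the variable names are invented purely for illustration.

```php
<?php
// Hypothetical input, chosen only to show the difference.
$str = '<b>one</b> and <b>two</b>';

// Greedy: .* grabs as much as possible, so the match runs to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $str, $greedy);

// Ungreedy: .*? stops at the FIRST place where the rest of the pattern fits.
preg_match('/<b>(.*?)<\/b>/', $str, $lazy);

echo $greedy[1] . "\n"; // one</b> and <b>two
echo $lazy[1] . "\n";   // one
```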


\s+href\=

\s matches any whitespace character, and the + means there must be one or more of them before the "href" attribute. Then follows the "=" sign, which I escaped since I couldn't remember if it had a special meaning in RegExps or not (it doesn't, but the backslash is harmless). :D


[\'"](.+?)[\'"]

Our URL is enclosed in either single or double quotes, which are contained in character classes. A class simply means: match one of the contained characters. The group in the middle looks much like the one explained earlier and also performs an ungreedy match of any character, only that there must be at least one (+ instead of *). This time it is a capturing group, so the URL ends up in $matches[1]. By doing this you exclude empty href attributes.
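
To see the quote classes and the "at least one character" rule in isolation, here is a minimal sketch. The test strings are invented, and the pattern is only the href part of the full expression.

```php
<?php
// Only the href part of the full pattern, for illustration.
$pattern = '/href=[\'"](.+?)[\'"]/';

$double = preg_match($pattern, 'href="a.html"', $m1); // 1, URL in $m1[1]
$single = preg_match($pattern, "href='b.html'", $m2); // 1, URL in $m2[1]
$empty  = preg_match($pattern, 'href=""', $m3);       // 0, nothing between the quotes

echo $m1[1] . "\n"; // a.html
echo $m2[1] . "\n"; // b.html
```

One caveat: this holds for the short inputs above. Inside a longer tag such as href="" class="x", the ungreedy .+? can still run on to a later quote, so an empty href is not rejected in every case.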


\s*(?:.*?)>

Some whitespace might follow, and then the next group is exactly the same as the one explained earlier; it ends when the greater-than sign is found.
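
What the ?: buys you can be sketched by comparing a capturing and a non-capturing version of the same group. The tag string and the variable names here are assumptions made for the example.

```php
<?php
// Hypothetical opening tag, for illustration only.
$tag = '<a class="x1" href="urlX.html">';

// Capturing group: the skipped attributes land in $withCapture[1],
// which pushes the URL to index 2.
preg_match('/<a(.*?)\s+href=["\'](.+?)["\']/i', $tag, $withCapture);

// Non-capturing group: nothing is stored for it, so the URL is at index 1.
preg_match('/<a(?:.*?)\s+href=["\'](.+?)["\']/i', $tag, $withoutCapture);

echo count($withCapture) . "\n";    // 3
echo count($withoutCapture) . "\n"; // 2
echo $withoutCapture[1] . "\n";     // urlX.html
```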


(.+?)<\/a>

To match the text, we use this ungreedy group again and have to find at least one character. Can you see a "pattern" here? :D
At the end there must be the closing tag, and the slash must be escaped because we used slashes as the delimiter characters (those that enclose the complete pattern).
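
The escaping is tied to the choice of delimiter. As a sketch (the input and variable names are invented), the same closing-tag match can be written with # as the delimiter, in which case the slash needs no backslash:

```php
<?php
// Hypothetical input, for illustration only.
$str = '<a href="urlX.html">a site</a>';

// Slash delimiters: the slash in </a> must be escaped.
preg_match('/>(.+?)<\/a>/', $str, $withSlash);

// Hash delimiters: the slash is an ordinary character here.
preg_match('#>(.+?)</a>#', $str, $withHash);

echo $withSlash[1] . "\n"; // a site
echo $withHash[1] . "\n";  // a site
```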

/is

The modifiers at the end mean:
i = case-insensitive search
s = the dot (.) matches also newlines.
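
A quick sketch of what each modifier changes, using a made-up input with an upper-case tag and a line break inside the link text:

```php
<?php
// Hypothetical input: upper-case tag, link text spread over two lines.
$str = "<A HREF=\"x.html\">two\nlines</A>";

// No modifiers: fails because of the upper-case tag.
$plain = preg_match('/<a href="x\.html">(.+?)<\/a>/', $str);

// i only: the case is ignored, but the dot still stops at the newline.
$iOnly = preg_match('/<a href="x\.html">(.+?)<\/a>/i', $str);

// i and s: the case is ignored and the dot also matches the newline.
$both = preg_match('/<a href="x\.html">(.+?)<\/a>/is', $str, $m);

echo $plain . $iOnly . $both . "\n"; // 001
```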

That's it. Was it understandable? :)
The really tricky part is greedy vs. ungreedy matching. Once you've grasped that, you'll eventually start liking regular expressions, because they're a very powerful tool.

duniyadnd
01-02-2003, 08:03 AM
Great, thanks. I should have specified which parts I didn't know, but I figure other people can learn from this example. The parts that confused me were the ones involving the "?".

Thanks again
Duniyadnd