What could possibly be wrong here? (regex)

08-14-2008, 09:20 AM
I have a simple preg_match script here that is supposed to get the URL:


$string = "<a href=\"http://example.com/page.html\" title=\"my page\">";
$get = preg_match('href.*?title', $string);
echo $get;


I've tested my regex and it's correct, but when I run this code I get this error:
Warning: preg_match() [function.preg-match]: Delimiter must not be alphanumeric or backslash in /homepages/21/h545454582/htdocs/pages/curl.php on line 4

...obviously there are no alphanumeric or backslash in my delimiter, so can anybody please tell me what is going on here?

08-14-2008, 09:32 AM
You need to 'bound' your pattern with delimiters:

$get = preg_match('href.*?title', $string);
// Change to:
$get = preg_match('/href.*?title/', $string);
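The delimiter does not have to be a slash. As a minimal sketch, reusing the $string from the first post (the names $withSlash and $withHash are only for illustration), any matching pair of non-alphanumeric, non-backslash characters should do:

```php
<?php
$string = "<a href=\"http://example.com/page.html\" title=\"my page\">";

// Without delimiters, PHP takes the first character of the pattern ("h")
// as the delimiter and warns that it is alphanumeric.
// With a proper delimiter pair the same pattern runs fine:
$withSlash = preg_match('/href.*?title/', $string);
$withHash  = preg_match('#href.*?title#', $string);

var_dump($withSlash, $withHash); // int(1) int(1)
```

A delimiter like # can be handy when the pattern itself contains slashes, as URLs usually do.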

You will likely want to make the actual search more targeted. preg_match returns an integer (1 if the pattern matched, 0 if it did not, or false on error), not the matched text. So, you'll want something more like:

$matches = array();
preg_match('/<a href="(.*)" title="(.*)">.*<\/a>/msi', $string, $matches);

The () are subpattern matches, so the results will be stored in $matches[1] and $matches[2] for the URL and title. If you need multiple results, use preg_match_all, which continues the search and returns a multidimensional array. In that case, though, you'll likely need to make the pattern 'ungreedy' with the U modifier.
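To make the capture groups and the multiple-result case concrete, here is a rough sketch. The two-link $html string is made up for illustration, and it uses lazy quantifiers (.*?) rather than the U modifier, which is a swapped-in but equivalent way to get ungreedy matching:

```php
<?php
// Hypothetical input with two complete <a> tags.
$html = '<a href="http://example.com/a.html" title="first">A</a> '
      . '<a href="http://example.com/b.html" title="second">B</a>';

$pattern = '/<a href="(.*?)" title="(.*?)">.*?<\/a>/si';

// preg_match stops at the first match; the groups land in $m[1] and $m[2].
$found = preg_match($pattern, $html, $m);

// preg_match_all collects every match; PREG_SET_ORDER groups results per link,
// so $all[0] is the first link and $all[1] is the second.
$count = preg_match_all($pattern, $html, $all, PREG_SET_ORDER);

echo $m[1], "\n";      // http://example.com/a.html
echo $all[1][2], "\n"; // second
```

With greedy (.*) groups and no U modifier, the same input would produce a single match stretching from the first href to the last </a>.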

08-14-2008, 09:47 AM
Thanks Fou-lu, it worked after downgrading from php 5 to 4. And I will take your suggestion as well, but I'm curious, what is the 'msi' at the end of the pattern?

08-14-2008, 10:39 AM
Those are pattern modifiers. I'm not sure what your usage is, so I added them in. Here are their meanings:
m - Multiline. Makes ^ and $ match at the start and end of each line, not just at the start and end of the whole string.
s - DotAll. Allows the wildcard '.' character to also match newline characters.
i - Caseless. Matches regardless of letter case.
i is probably the most common, followed by m. U is for ungreedy, which you'll need if you're finding all matches in, say, a text document or a scraped webpage. The ungreedy modifier tells the quantifiers to stop at the first closing boundary instead of continuing. Otherwise, you'd likely get one result out of a document with 10 <a href="" title="">.</a>, because the pattern matches from the first <a href=""... all the way to the very last </a>. Matching is 'greedy' by default, so it takes in as much data as it possibly can.
To be honest, a Perl pro can give you a much better explanation of patterns. When it comes to patterns, I'd recommend asking them before a PHP developer. Just mention it's in PHP :)
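A small sketch of what each modifier changes. The test strings and variable names are invented for illustration only:

```php
<?php
$text = "First line\nsecond LINE";

// m: ^ and $ also match at the start/end of each line.
$withoutM = preg_match('/^second/', $text);   // 0
$withM    = preg_match('/^second/m', $text);  // 1

// s: the dot also matches newline characters.
$withoutS = preg_match('/First.*second/', $text);   // 0
$withS    = preg_match('/First.*second/s', $text);  // 1

// i: letter case is ignored.
$withI = preg_match('/second line/i', $text); // 1

// U: quantifiers stop at the first possible end instead of the last.
preg_match('/"(.*)"/',  'a="1" b="2"', $greedy);
preg_match('/"(.*)"/U', 'a="1" b="2"', $lazy);

echo $greedy[1], "\n"; // 1" b="2
echo $lazy[1], "\n";   // 1
```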

Did it not work in PHP 5? I'll have to check it out tomorrow, it's almost 3 am here so I should probably get some sleep :$

08-14-2008, 10:45 AM
Great info there... just one more thing...

For the string:
blah blah other stuffhere href="http://example.com/" blah blah other stuff here

This regex...

How would I get just the URL without the " at the end?

08-14-2008, 10:51 AM
Actually, that did not work. Can you show me how to get just the URL, ignoring everything else around it?

I'm not sure but I'm thinking it would have to use the items at the beginning and end of the url...
Start:href=" End:" blah

08-14-2008, 12:43 PM
Replace $result with the variable that is holding the output http://example.com/" :
$result = str_replace('"', '', $result);

08-14-2008, 08:53 PM
The reason it did not work is that it's not a 'complete' <a> tag. It will work if you use the format <a href="yourlink.html" title="">a</a>. If you have something like: 'blah blah href="yourlink.html" blah blah blah', you can match it with:

$string = 'blah blah href="yourlink.html" blah blah blah';
$matches = array();
preg_match('/href="(.*)"/msi', $string, $matches);

Which is similar to your original pattern.
$matches[0] will contain the full match, and $matches[1] will contain yourlink.html. Note that the greedy (.*) runs to the last " in the string, so if more quotes follow the URL, use ([^"]*) instead. You need to be careful about arbitrary searches on href, though. This pattern will capture the value whether or not it's inside an <a> tag, so if you need to ignore the <a> matches, you'll want a negative lookbehind such as (?<!<a ) in front of the pattern. A lookahead like (?!<a) would have no effect there, because the text at that position starts with href.
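One possible way to get just the URL, without the trailing quote and without a later str_replace, is to capture "anything that is not a quote". This is only a sketch: the $mixed string is hypothetical, and the lookbehind variation assumes the href sits directly after "<a " with a single space:

```php
<?php
$string = 'blah blah href="http://example.com/" blah blah other stuff here';

// [^"]* matches everything up to, but not including, the closing quote,
// so the capture holds the bare URL.
if (preg_match('/href="([^"]*)"/i', $string, $matches)) {
    $url = $matches[1];
} else {
    $url = null;
}
echo $url, "\n"; // http://example.com/

// Hypothetical variation: skip href attributes that directly follow "<a ".
$mixed = '<a href="in-tag.html">x</a> and href="loose.html"';
preg_match_all('/(?<!<a )href="([^"]*)"/i', $mixed, $loose);
print_r($loose[1]); // only loose.html
```

The lookbehind has to be fixed-length in PCRE, so it would not cover tags with other attributes before href; for real HTML a DOM parser is probably the safer route.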