...

regex help required to grab links from hrefs



dudey
08-28-2006, 02:50 PM
Hi,

I am very new to PHP and have been doing quite well (if I do say so myself), but I have hit a bit of a brick wall with regular expressions and wondered if anyone could point me in the right direction...

What I am trying to do is pull the URL and the link text out of an anchor tag into two separate variables, so that


$link = "<a href=\"/media/press_releases/\">News &amp; Press Releases</a>";
would end up as two variables...

$url = "/media/press_releases/";
$title = "News &amp; Press Releases";

I assume I'm going to need to use preg_match, but my knowledge of regexes is abysmal and I don't really understand them properly.

Any help that can be given would be very much appreciated.

Thank you,

dudey

chump2877
08-28-2006, 03:06 PM
I'm sure there's a way to combine this into one regex, and I'm not the king of regex either, but this should work, I think:


$link = "<a href=\"/media/press_releases/\">News &amp; Press Releases</a>";
preg_match('(\b[a-zA-Z0-9]+://[^( |\>)]+\b)',$link,$matches);
preg_match('/(<a)(.+?)(>)(.+?)(<\/a>)/',$link,$matches2);

echo "URL is: ".$matches[0]."<br>";
echo "Title is: ".$matches2[4];

dudey
08-28-2006, 03:18 PM
Hmmm, almost works ... apart from the url bit.

Can you possibly explain what each part of your script is actually doing, so that I can have a play with it (hopefully with a little understanding)? For instance, why that specific part of the array?

Thanks for your help.

dudey
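
A note on why the URL part comes back empty: the first pattern looks for a scheme followed by `://`, so it can only match absolute URLs. The relative path `/media/press_releases/` contains no `://` and therefore never matches. A quick check of that reading, where `$absolute` and the example.com address are made up purely for comparison:

```php
<?php
// The URL pattern from the post above, unchanged.
// The outer parentheses act as the regex delimiters here.
$pattern = '(\b[a-zA-Z0-9]+://[^( |\>)]+\b)';

// The link from the original question (relative URL).
$relative = '<a href="/media/press_releases/">News &amp; Press Releases</a>';

// Hypothetical absolute link, not from the thread, only for comparison.
$absolute = '<a href="http://example.com/media/press_releases/">News</a>';

$relativeHit = preg_match($pattern, $relative, $m1);
$absoluteHit = preg_match($pattern, $absolute, $m2);

var_dump($relativeHit); // int(0): no "://" anywhere in the string
var_dump($absoluteHit); // int(1): "http://" satisfies the scheme part
```

So the pattern is not broken as such; it simply targets a different kind of URL than the one in `$link`.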

dudey
08-28-2006, 03:24 PM
Ah, I see ... I think...

preg_match('/(<a)(.+?)(>)(.+?)(<\/a>)/',$link,$matches2);
echo "Title is: ".$matches2[4];

the [4] is getting the fourth instance of whatever is between the brackets ... so in this case it would be the content between the closing > of the opening tag and the end tag of </a> ... so I guess '.+?' means 'any character' ?
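
To make the numbering concrete: each pair of parentheses is a capture group, numbered by its opening bracket from left to right, and index 0 holds the whole match. In `.+?`, the `.` is "any character except a newline", `+` is "one or more", and the trailing `?` makes it lazy (match as little as possible). A small sketch of both points; the two-link string `$two` is invented for illustration:

```php
<?php
$link = '<a href="/media/press_releases/">News &amp; Press Releases</a>';
preg_match('/(<a)(.+?)(>)(.+?)(<\/a>)/', $link, $matches2);
print_r($matches2);
// [0] whole match, [1] "<a", [2] the attributes, [3] ">",
// [4] the link text, [5] "</a>"

// Lazy versus greedy on a string with two links (made-up example).
$two = '<a href="/a">One</a> <a href="/b">Two</a>';
preg_match('/<a.+?>(.+?)<\/a>/', $two, $lazy);   // stops at the first ">" and first "</a>"
preg_match('/<a.+>(.+)<\/a>/', $two, $greedy);   // runs as far right as it can

echo $lazy[1], "\n";   // One
echo $greedy[1], "\n"; // Two
```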

dudey
08-28-2006, 03:41 PM
still can't seem to get the url part of it to work though :( ... any ideas?

Thanks for your help so far

dudey

chump2877
08-28-2006, 03:50 PM
<?php

$link = '<a href="/media/press_releases/">News &amp; Press Releases</a>';
preg_match('/(<a)(.*?)(href="|href=\')(.+?)("|\')(.*?)(>)([^<>]+?)(<\/a>)/i',$link,$matches);

echo "URL is: ".$matches[4]."<br>";
echo "Title is: ".$matches[8];

?>
Give that a whirl... I just fooled around with it some more, but it may not be perfect for finding all URLs...

The "URL" is whatever the fourth parenthesized pattern matched... the "Title" is whatever the eighth parenthesized pattern matched...

For the URL: (.+?) matches any character one or more times, but the ? makes it lazy, so it isn't so "greedy" as to run past the closing (") character...

Similar logic for the Title...

You can also try looking here: http://www.codingforums.com/showthread.php?t=76949

Edit: Added a couple of things to my regex pattern...
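
One possible tightening of the pattern above, offered as a sketch and not as the poster's method: capture the opening quote and use a backreference so that the same quote character has to close the attribute, and keep only the two groups that are actually needed. The function name `extractLink` and its return shape are hypothetical, chosen just for this example.

```php
<?php
/**
 * Hypothetical helper (not from the thread): returns
 * ['url' => ..., 'title' => ...] for the first <a href="..."> in $html,
 * or null when no such link is found.
 */
function extractLink(string $html): ?array
{
    // (["']) captures the opening quote; \1 requires the same quote to close it.
    // i = case-insensitive, s = let "." match newlines inside the link text.
    $pattern = '~<a\b[^>]*\bhref=(["\'])(.*?)\1[^>]*>(.*?)</a>~is';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }

    return ['url' => $m[2], 'title' => $m[3]];
}

$link = '<a href="/media/press_releases/">News &amp; Press Releases</a>';
$result = extractLink($link);

echo "URL is: " . $result['url'] . "<br>";
echo "Title is: " . $result['title'];
```

Like any regex run against HTML, this stays brittle (unquoted attributes and nested tags in the link text are not handled); for arbitrary pages a DOM parser such as PHP's DOMDocument is usually the more robust route.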

dudey
08-28-2006, 03:56 PM
Great stuff.
Thanks very much, it is much appreciated.

dudey

chump2877
08-28-2006, 04:05 PM
...:thumbsup:


