Extracting URL



afrojojo
06-18-2010, 12:00 PM
I need some help extracting the first URL from a string, whether it starts with http or https. Thanks.

Input
this is some text http://yahoo.com/directory/file.php?id=1&blah=1 this is some texthttp://msn.com this is some text
this is some text https://google.com.

Output
http://yahoo.com/directory/file.php?id=1&blah=1

Rowsdower!
06-18-2010, 12:52 PM
You need to explore the preg_match() (http://php.net/manual/en/function.preg-match.php) function for this.

Set up a regular expression (http://www.google.com/search?hl=en&source=hp&q=php+regular+expression+url&aq=f&aqi=g2&aql=&oq=&gs_rfai=CHVqXwV0bTKeSFpaWMcnftKMKAAAAqgQFT9BubGc) to match the link pattern you need. Most likely you would set this up to read all data between "http" and the first blank space afterward. The exact code will depend on your project.
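
To make that concrete, here is a minimal sketch of the "everything between http and the next blank space" idea, run against the sample input from the first post. The pattern and the variable names ($text, $firstUrl) are my own assumptions for illustration, not a pattern anyone in this thread posted, and a production-grade URL matcher would likely need to be stricter.

```php
<?php
// Sample input from the first post (two lines joined with a newline).
$text = "this is some text http://yahoo.com/directory/file.php?id=1&blah=1 this is some texthttp://msn.com this is some text\n"
      . "this is some text https://google.com.";

// Assumed pattern: "http" or "https", then "://", then every
// non-whitespace character up to the next blank space.
$pattern = '~https?://\S+~i';

$firstUrl = null;
if (preg_match($pattern, $text, $match)) {
    $firstUrl = $match[0]; // index 0 holds the full match
}

echo $firstUrl, PHP_EOL; // http://yahoo.com/directory/file.php?id=1&blah=1
```

preg_match() stops after the first hit, which is why only the yahoo.com link comes back.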

afrojojo
06-18-2010, 06:46 PM
Yeah, it really confuses me. I've tried by myself with no luck.

Input
this is some text http://yahoo.com/directory/file.php?id=1&blah=1 this is some texthttp://msn.com this is some text
this is some text https://google.com.

Output
http://yahoo.com/directory/file.php?id=1&blah=1

I would like to be able to control which url is extracted. If I have a variable set to 1, the output is http://yahoo.com/directory/file.php?id=1&blah=1. If the variable is set to 3, the output is https://google.com. Any ideas?

Rowsdower!
06-18-2010, 10:02 PM
If you use preg_match_all() you will get an array of results, which you could then choose/use in any order you want to.
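
As a rough sketch of that, assuming PHP 7.1 or later: nthUrl() below is a hypothetical helper name, the pattern is the simple "up to the next whitespace" one rather than anything from this thread, and the rtrim() call is my own assumption about how to drop the period that ends the sample sentence.

```php
<?php
// Hypothetical helper: return the $n-th URL (1-based) found in $text,
// or null when there are fewer than $n URLs.
function nthUrl(string $text, int $n): ?string
{
    // Assumed pattern: scheme, "://", then everything up to whitespace.
    preg_match_all('~https?://\S+~i', $text, $matches);
    $urls = $matches[0]; // every full match, in order of appearance

    if ($n < 1 || $n > count($urls)) {
        return null;
    }

    // Trim sentence punctuation trailing the URL, e.g. "https://google.com."
    return rtrim($urls[$n - 1], '.,;');
}

$text = "this is some text http://yahoo.com/directory/file.php?id=1&blah=1 this is some texthttp://msn.com this is some text\n"
      . "this is some text https://google.com.";

echo nthUrl($text, 1), PHP_EOL; // http://yahoo.com/directory/file.php?id=1&blah=1
echo nthUrl($text, 3), PHP_EOL; // https://google.com
```

Setting the variable to 2 would give http://msn.com, since the pattern is not anchored to a word boundary and still finds the link glued onto "text".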

afrojojo
06-19-2010, 03:25 PM
$pattern = '/((?:https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$])/i';
// preg_match_all() puts every full match in $match[0].
if (preg_match_all($pattern, $string, $match)) {
    $string = $match[0][0]; // first URL found
    echo $string;
}


So I figured it out and came up with this. I had another question though.

Example:
$string = '<a href="http://google.com">Google</a> this is some text http://yahoo.com this is some texthttp://msn.com';

If the string was what I have above, how do I make the pattern stop at a " or '? Otherwise the first match produced would be http://google.com">Google</a>. It only stops at the first space. I would like it to stop at the first space, the first ", or the first '.

So I would want $match[0][0] to be http://google.com, $match[0][1] to be http://yahoo.com, and so on.

Keleth
06-19-2010, 11:40 PM
Well, not really the pattern I'd use to search for a url... Plus, unless I'm missing something, I don't see why it should catch an apostrophe or double quote anyway... I don't see either as valid characters in your expression?

If someone types in just google.com or www.google.com do you wanna ignore it?

A quick google search some time back helped me find this site

http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

The author has a pretty good regex you can use and explains it a bit.

afrojojo
06-20-2010, 03:58 AM
I would like to ignore URLs without a protocol. The URLs I'm dealing with will always have a protocol.

The reason I want to stop after apostrophe or quotes is because of the <a> tag using them to wrap the url.

I don't want this - http://yahoo.com">Yahoo</a>
I want this - http://yahoo.com
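
One way to spell out "stop at the first space, the first ", or the first '" is a negated character class. This is only a sketch: the pattern below is an assumption of mine (it also stops at < and >), not the pattern posted earlier in the thread, and it deliberately ignores URLs without a protocol.

```php
<?php
$string = '<a href="http://google.com">Google</a> this is some text http://yahoo.com this is some texthttp://msn.com';

// [^\s"'<>]+ keeps consuming characters until it reaches whitespace,
// a double quote, a single quote, or an angle bracket.
$pattern = '~https?://[^\s"\'<>]+~i';

preg_match_all($pattern, $string, $match);

// $match[0] holds: http://google.com, http://yahoo.com, http://msn.com
print_r($match[0]);
```

With this, $match[0][0] is http://google.com and $match[0][1] is http://yahoo.com.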

Keleth
06-20-2010, 04:55 AM
Again... are you getting quotes in your test cases? Because I see nothing in your pattern that should match a quote of any kind...

afrojojo
06-20-2010, 04:58 AM
That's why I need your help.

Keleth
06-20-2010, 06:20 AM
Well... I ran the code you put above and it works as expected... there are no problems... it stops at the quotes just like you wanted to and as it should. The code you gave pulls the 3 URLs perfectly. That's why I'm confused what you need help with.
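
For reference, this is roughly the test described here: the pattern exactly as posted on 06-19, run against the example string from the same post. The only change is that the string is wrapped in single quotes so the double quotes inside it are valid PHP.

```php
<?php
// Example string from the 06-19 post, single-quoted so it parses.
$string = '<a href="http://google.com">Google</a> this is some text http://yahoo.com this is some texthttp://msn.com';

// Pattern copied from the 06-19 post. Neither character class contains
// " or ', so a match cannot run past a literal quote.
$pattern = '/((?:https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$])/i';

preg_match_all($pattern, $string, $match);

// $match[0] holds: http://google.com, http://yahoo.com, http://msn.com
print_r($match[0]);
```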

afrojojo
06-20-2010, 03:07 PM
Really? It still gives me junk.

http://yahoo.com">test</a> :confused:

Keleth
06-20-2010, 04:06 PM
Sorry man, I copied and pasted your code verbatim and it works for me. You might wanna make sure you didn't change your local code since you posted it here. Again, I see NO reason why you should get any quotes in your results, and the fact that you are baffles me completely.

Try copying and pasting your code from here into an online regex program like http://gskinner.com/RegExr/ and see what happens.
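
One more thing that might be worth checking, and this is purely a guess since nothing in the thread confirms it: if the text has already been run through something like htmlspecialchars() before it reaches the regex, the quote is no longer a literal " but the entity &quot;, and &, ; and letters are all allowed by the pattern. The sketch below assumes that situation and shows the difference decoding makes.

```php
<?php
$pattern = '/((?:https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$])/i';

// ASSUMPTION: the input arrives entity-encoded. Not confirmed in the thread.
$encoded = htmlspecialchars('<a href="http://yahoo.com">test</a>');

// On the encoded text the match runs straight through the entities.
preg_match($pattern, $encoded, $m);
$fromEncoded = $m[0];
echo $fromEncoded, PHP_EOL; // http://yahoo.com&quot;&gt;test&lt;/a&gt

// Decoding first restores the literal quote, so the match stops there.
preg_match($pattern, html_entity_decode($encoded), $m);
$fromDecoded = $m[0];
echo $fromDecoded, PHP_EOL; // http://yahoo.com
```

Echoed into a browser, the first result would probably render much like the junk output shown above, so viewing the page source rather than the rendered page may help tell the two cases apart.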


