
View Full Version : Regular Expression - retrieving website url



tagnu
08-26-2008, 01:18 PM
Hi all,
One more regex question.

I'd like to retrieve the original url of sites from yahoo search results.

For example:
1. www.example.com

2. subdomain.example1.com (subdomain)

If I go for this expression:

http://*[^/]*

I only get http://uk.wrs.yahoo.com/ from both links.

But how do I retrieve the highlighted sites?



http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/
SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html

http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A5YS7HAx.;_ylu=X3oDMTEwaXAxMWJuBHNlYwNzcgRwb3MDNQRjb2xvA2luMl9pbnRsBHZ0aWQD/
SIG=11r0pjgq9/EXP=1219835696/**http%3A//subdomain.example1.com/index.html

Thank you
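
To see why the expression above stops at the Yahoo host, it may help to run it against one of the links. This is only a sketch: `sample` and `firstTry` are illustrative names, and the slashes are escaped because an unescaped `/` would end a JavaScript regex literal.

```javascript
// One of the redirect links from the post, joined into a single string.
var sample = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/" +
  "SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html";

// "http:/" literally, then zero or more "/", then everything up to the next "/".
var firstTry = sample.match(/http:\/\/*[^\/]*/g);

// Only the leading Yahoo host matches; the embedded link starts with
// "http%3A", which never satisfies the literal "http:" in the pattern.
console.log(firstTry);
```

The embedded destination is skipped because its colon is percent-encoded, so any working pattern has to accept `%3A` as well as `:`.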

Philip M
08-26-2008, 01:57 PM
This should move you forward:-



<script type = "text/javascript">

var a = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/" + "SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html"

var b = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A5YS7HAx.;_ylu=X3oDMTEwaXAxMWJuBHNlYwNzcgRwb3MDNQRjb2xvA2luMl9pbnRsBHZ0aWQD/" + "SIG=11r0pjgq9/EXP=1219835696/**http%3A//subdomain.example1.com/index.html"

var x = a.match(/(http%3A.+)/);
x[0] = x[0].replace (/\%3A/,":")
alert (x[0]);

var y = b.match(/(http%3A.+)/);
y[0] = y[0].replace (/\%3A/,":")
alert (y[0]);

</script>
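
The replace above only decodes the colon. A possible alternative, assuming the real destination always follows the `**` marker seen in the two sample links, is to cut the string at that marker and let the built-in decodeURIComponent undo the percent-encoding. The names `getTarget` and `linkA` are made up for this sketch.

```javascript
// Returns the destination URL embedded after "**", or null if the marker is absent.
function getTarget(redirectUrl) {
  var marker = redirectUrl.lastIndexOf("**");
  if (marker === -1) {
    return null;
  }
  // decodeURIComponent turns "%3A" back into ":" and decodes any other escapes.
  // It throws a URIError on malformed escapes, which this sketch does not handle.
  return decodeURIComponent(redirectUrl.substring(marker + 2));
}

var linkA = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/" +
  "SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html";

console.log(getTarget(linkA)); // http://www.example.com/index.html
```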


With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea. It is hard to be sure where they are going to land, and it could be dangerous sitting under them as they fly overhead.

abduraooft
08-26-2008, 02:51 PM
With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea. It is hard to be sure where they are going to land, and it could be dangerous sitting under them as they fly overhead.

Lol, I thought the above code was for something else.

Cranford
08-26-2008, 02:54 PM
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Any Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<script type="text/javascript">

var nStr1 = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html";
var nStr2 = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A5YS7HAx.;_ylu=X3oDMTEwaXAxMWJuBHNlYwNzcgRwb3MDNQRjb2xvA2luMl9pbnRsBHZ0aWQD/SIG=11r0pjgq9/EXP=1219835696/**http%3A//subdomain.example1.com/index.html";

function getDomain(urlStr){

var nDomain = urlStr.substring(urlStr.lastIndexOf('//')+2,urlStr.lastIndexOf('/'));
return nDomain;
}

function init(){

alert(getDomain(nStr1));
alert(getDomain(nStr2));
}

onload = init;

</script>
</head>
<body>

</body>
</html>
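
getDomain above relies on the last '/' in the string being the one that ends the host name. That holds for the two sample links, but not for a destination with a deeper path or with no path at all. A sketch of a variant that stops at the first '/' after the final '//' instead (`getHost` is a hypothetical name, not part of the code above):

```javascript
function getHost(urlStr) {
  var start = urlStr.lastIndexOf("//") + 2; // first character of the host
  var end = urlStr.indexOf("/", start);     // first "/" after the host, if any
  return end === -1 ? urlStr.substring(start) : urlStr.substring(start, end);
}

var prefix = "http://uk.wrs.yahoo.com/SIG=11kp4e70q/EXP=1219835696/**";

console.log(getHost(prefix + "http%3A//www.example.com/index.html")); // www.example.com
console.log(getHost(prefix + "http%3A//www.example.com/a/b/c.html")); // www.example.com
console.log(getHost(prefix + "http%3A//www.example.com"));            // www.example.com
```

Like the original, this still assumes the destination's own path does not contain another '//'.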

Philip M
08-26-2008, 03:02 PM
Another example showing that there are more ways than one of killing a cat.

tagnu
08-29-2008, 01:13 PM
Thank you Philip, the code works fine.
But in my case, some links contain 'http://example.com' rather than 'http%3A//example.com'.

I was really looking for an expression that would accommodate both cases.

tagnu
08-29-2008, 01:18 PM
var nStr1 = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html";
var nStr2 = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A5YS7HAx.;_ylu=X3oDMTEwaXAxMWJuBHNlYwNzcgRwb3MDNQRjb2xvA2luMl9pbnRsBHZ0aWQD/SIG=11r0pjgq9/EXP=1219835696/**http%3A//subdomain.example1.com/index.html";

function getDomain(urlStr){

var nDomain = urlStr.substring(urlStr.lastIndexOf('//')+2,urlStr.lastIndexOf('/'));
return nDomain;
}

function init(){

alert(getDomain(nStr1));
alert(getDomain(nStr2));
}

onload = init;

</script>



Cranford, thank you for the effort. Your snippet suits my needs and I'm going with it for now, but I'm still curious whether there's a regex for this.
Prefixing nDomain with 'http://' makes it easier to use the result as an attribute value on an HTML element.

If 'http://' is left off, the browser treats nDomain as a relative URL and prefixes it with the current domain, so you get

'http://uk.wrs.yahoo.com/example.com'


function getDomain(urlStr){

var nDomain = urlStr.substring(urlStr.lastIndexOf('//')+2,urlStr.lastIndexOf('/'));
return 'http://' + nDomain;
}
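
The prefixing described above is ordinary relative-URL resolution: a value without a scheme is resolved against the page's own address. A small sketch using the standard URL constructor shows both outcomes. It assumes an environment that provides URL (current browsers and Node.js do, though it postdates this thread), and the variable names are illustrative.

```javascript
// Address of the page the link would be placed on.
var base = "http://uk.wrs.yahoo.com/";

// No scheme: treated as a path relative to the base.
var withoutScheme = new URL("example.com", base).href;

// Scheme present: treated as an absolute URL, the base is ignored.
var withScheme = new URL("http://example.com", base).href;

console.log(withoutScheme); // http://uk.wrs.yahoo.com/example.com
console.log(withScheme);    // http://example.com/
```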



ps: I'm learning regex using Expresso (http://www.ultrapico.com/ExpressoDownload.htm). I'll update this post as soon as I find a good regex.

Philip M
08-30-2008, 07:43 AM
Thank you Philip, the code works fine.
But in my case, some links contain 'http://example.com' rather than 'http%3A//example.com'.

I was really looking for an expression that would accommodate both cases.



<script type = "text/javascript">

var a = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/" + "SIG=11kp4e70q/EXP=1219835696/**http%3A//www.example.com/index.html"

var b = "http://uk.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A5YS7HAx.;_ylu=X3oDMTEwaXAxMWJuBHNlYwNzcgRwb3MDNQRjb2xvA2luMl9pbnRsBHZ0aWQD/" + "SIG=11r0pjgq9/EXP=1219835696/**http%3A//subdomain.example1.com/index.html"

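// Decode the colon first so "http%3A//" and "http://" look alike.
a = a.replace(/%3A/,":");
// [^\b] consumes one character before "http", so the match cannot start at
// position 0 and the leading Yahoo URL is skipped. Group 1 holds the target.
var x = a.match(/[^\b](http.+)/);
alert (x[1]);

b = b.replace(/%3A/,":");
var y = b.match(/[^\b](http.+)/);
alert (y[1]);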

</script>



Taking the liberty of modifying Cranford's solution:-


function getDomain(urlStr){

var nDomain = urlStr.substring(urlStr.lastIndexOf('**')+2,urlStr.lastIndexOf('/'));
return nDomain;
}


You can test your regular expressions at: http://www.claughton.clara.net/regextester.html

tagnu
08-30-2008, 04:19 PM
Thank you!

Got the regex http(.){1,4}\/\/[^\/]*/g

Description:
http followed by
(.){1,4} any characters, minimum 1 and maximum 4 (to cover both : and %3A, and also https),
\/\/ then // (escaped, so \/\/)
[^\/]* any characters except / (escaped, so \/)
/g return all occurrences of the match



var urlStr = "http://in.wrs.yahoo.com/_ylt=A8pWBj2w5bNIa84A54S7HAx.;_ylu=X3oDMTEwZTl2dThqBHNlYwNzcgRwb3MDNwRjb2xvA2luMl9pbnRsBHZ0aWQD/SIG=11kp4e70q/EXP=1219835696/**http%3A//www.thesdf.org/index.html"

var res = urlStr.match(/http(.){1,4}\/\/[^\/]*/g);
document.write("count:"+ res.length + "<br />");

for(var i=0;i<res.length;i++)
document.write(res[i]+ "<br/>");



ps: don't forget the /g

With the g flag, match() returns an array containing every match. Without the g flag, it returns an array holding only the first match plus its captured groups. In both cases it returns null if no match is found.
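
The difference the g flag makes can be checked directly. A minimal sketch; the input string and its `a.example` / `b.example` hosts are invented for illustration:

```javascript
var text = "http://a.example/x/**http%3A//b.example/y";

// With g: every match, no capture groups.
var all = text.match(/http(.){1,4}\/\/[^\/]*/g);

// Without g: first match at [0], captured group at [1], position in .index.
var first = text.match(/http(.){1,4}\/\/[^\/]*/);

// No match at all: null in both modes.
var none = "no links here".match(/http(.){1,4}\/\/[^\/]*/g);

console.log(all);      // [ 'http://a.example', 'http%3A//b.example' ]
console.log(first[0]); // http://a.example
console.log(first[1]); // :
console.log(none);     // null
```

Because the result can be null, code that reads `res.length` should check for null first.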
I'm learning!
Helpful resources: http://www.javascriptkit.com/javatutors/redev3.shtml

Philip M
08-30-2008, 04:30 PM
Thank you!

Got the regex http(.){1,3}\/\/[^\/]*/g

Description:
http followed by
(.){1,3} any characters, min 1 or max 3 :, %3A,
\/\/ and // (escaped so \/\/)
[^\/]* any character except / (escaped so \/)
/g return all occurrences



To be picky, that does not work for https://
So make it http(.){1,4}\/\/[^\/]*/g

tagnu
08-30-2008, 07:20 PM
To be picky, that does not work for https://
So make it http(.){1,4}\/\/[^\/]*/g

That's true! Thanks for pointing that out.

So a better regex is
http(.){1,4}\/\/[^\/]*/g

Updated the previous post too.
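
One thing to be aware of: (.){1,4} accepts any one to four characters after "http", not just the separators it was written for. A possibly tighter variant spells out the allowed forms instead. This is a sketch rather than a tested replacement for every Yahoo link, and `loose`, `strict` and the sample strings are illustrative names and values.

```javascript
// The pattern from the thread: any 1 to 4 characters between "http" and "//".
var loose = /http(.){1,4}\/\/[^\/]*/g;

// Optional "s", then either ":" or "%3A" (case-insensitive), then "//" and the host.
var strict = /https?(?::|%3A)\/\/[^\/]*/gi;

var urlStr = "http://in.wrs.yahoo.com/SIG=11kp4e70q/EXP=1219835696/**http%3A//www.thesdf.org/index.html";

console.log(urlStr.match(loose));  // [ 'http://in.wrs.yahoo.com', 'http%3A//www.thesdf.org' ]
console.log(urlStr.match(strict)); // [ 'http://in.wrs.yahoo.com', 'http%3A//www.thesdf.org' ]

// Text that is not a URL at all still satisfies the loose pattern.
console.log("httpabc//x".match(loose));  // [ 'httpabc//x' ]
console.log("httpabc//x".match(strict)); // null
```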


