Regular Expressions


]|V|[agnus
06-28-2004, 08:44 PM
I wasn't sure where to place this topic, but here seemed as good a place as any, given that regular expressions are in everything from JavaScript to ColdFusion to (I assume) most programming languages. (Though some smelly guy on the bus one day was telling me C had no such pattern matching methods. This seemed highly unlikely to me, but I don't know C...)

At any rate, I've just recently discovered regular expressions, and I am very intrigued by their powers.

1) What are some good exercises, tutorials or just good reads on using regular expressions?

2) Is this a proper expression to check for a valid e-mail address format: [[:alnum:]]+\@[[:alnum:]]+\.[[:alnum:]]+

(I speak without language-specific context because it has appeared to me that there is a general syntax for regular expressions recognized by most languages I've encountered that employ them. Is this correct?)

trib4lmaniac
06-28-2004, 09:05 PM
Regexps are amazing, and for practice, try this: regexp.php (http://mq.astronomyforbeginners.com/regexp.php).
There are a few examples on there (including that email validation regexp above), my favorite being the one that removes doubled words. www.regularexpressions.info is a good site; I taught myself most regexp stuff from there and also stole the highlighting for the above regexp.php :D

Edit:
Updated the link as the address has changed!

]|V|[agnus
06-28-2004, 09:08 PM
Excellent. Thanks, br0. ;)

allida77
06-28-2004, 09:16 PM
I usually use:
http://www.regexlib.com/
http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/

That regexp.php is nice.

Unit
06-28-2004, 09:16 PM
Also, C natively does not support regular expressions. There are libraries that facilitate the evaluation of regular expressions.

trib4lmaniac
06-28-2004, 09:20 PM
Originally posted by allida77:
That regexp.php is nice.

Any suggestions for examples at the bottom of the page are very much appreciated :thumbsup:

trib4lmaniac
06-28-2004, 09:21 PM
Originally posted by allida77:
http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/

lol, it's liorean!
Don't use that article though; the tutorial on re.info is much better, as JS doesn't support all of the regexp features.

]|V|[agnus
06-28-2004, 09:49 PM
\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}\b

That is the re located in the first paragraph of the front page over at re.info. That's what they list to check for a valid email address. Shouldn't "seth.ras_mussen@hotmail.com" pass as a match for that then? I'm having problems with it not doing so.

trib4lmaniac
06-28-2004, 10:15 PM
Originally posted by ]|V|[agnus:
\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}\b

That is the re located in the first paragraph of the front page over at re.info. That's what they list to check for a valid email address. Shouldn't "seth.ras_mussen@hotmail.com" pass as a match for that then? I'm having problems with it not doing so.
Do you have case-insensitivity turned on? That address is lowercase, but [A-Z0-9] only covers uppercase letters and digits. Either turn on case-insensitivity or change the character classes to [a-zA-Z0-9._%-] :thumbsup:

]|V|[agnus
06-28-2004, 10:29 PM
I thought you might be right there, but upon further inspection, "seth@seth.com" is not matching either, so something is horribly awry... :(

Are the boundary escape sequences necessary for this if what I'm checking against is a form field?

trib4lmaniac
06-28-2004, 10:34 PM
If you're checking against a form field, use the one on my regexp.php, which changes the \b's to ^ and $ (start of string and end of string, respectively).
If you match up seth@seth.com on the regexp page with case-insensitivity turned ON, it matches.
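
The two suggestions above (case-insensitivity, and ^ and $ in place of \b) can be sketched in JavaScript. This is only an illustration under assumptions: the thread does not say which language the form check runs in, and the names emailPattern and isEmailLike are made up here.

```javascript
// Hypothetical helper. The pattern is the one quoted from re.info above,
// with \b swapped for ^ and $, plus the "i" flag for case-insensitivity.
const emailPattern = /^[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}$/i;

function isEmailLike(value) {
  return emailPattern.test(value);
}

console.log(isEmailLike("seth@seth.com"));               // true
console.log(isEmailLike("seth.ras_mussen@hotmail.com")); // true
console.log(isEmailLike("mail me at seth@seth.com"));    // false: the anchors require the whole string to match

// The same pattern without the "i" flag rejects a lowercase address.
const caseSensitive = /^[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}$/;
console.log(caseSensitive.test("seth@seth.com"));        // false
```

Anchoring suits a form field because the field should contain nothing but the address; \b is more useful when searching for addresses inside longer text.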

Willy Duitt
06-28-2004, 10:37 PM
Originally posted by ]|V|[agnus:
\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}\b

That is the re located in the first paragraph of the front page over at re.info. That's what they list to check for a valid email address. Shouldn't "seth.ras_mussen@hotmail.com" pass as a match for that then? I'm having problems with it not doing so.

Oh, the coder_seth reincarnate....

I wondered what happened to you. :rolleyes:

....Willy

]|V|[agnus
06-28-2004, 10:52 PM
Oh, the coder_seth reincarnate....

I wondered what happened to you. :rolleyes:

:yawn:

Tribal: thanks. Even though I swear I tried that, per the same recommendation from re.info, I seem to have it squared away now. That's a slick page you put together there. Kudos.

]|V|[agnus
06-28-2004, 10:58 PM
[0-9]{3}

Shouldn't that match "666" but not "6666" or "6" or "66", etc.? ONLY a set of THREE NUMERIC digits?

Willy Duitt
06-28-2004, 10:59 PM
Originally posted by ]|V|[agnus:
:yawn:

ROTFLMAO :p

What's that supposed to mean?
You get tired of your user names or....

That's the third user name I have known you to use on this forum.
Most of us that frequent these forums use the same nick for years, and there is a certain sense of pride regarding this unique identity and being recognized as such.

If I'm not boring you, can I ask why the need for the repeated changes of nicks?

.....Willy

Antoniohawk
06-28-2004, 11:02 PM
Trib4l or anyone else, I would be most appreciative if you would compile a list of good regex tutorials for the new computer programming FAQ. If you decide to do so, here's the thread. [http://www.codingforums.com/showthread.php?t=40811]

trib4lmaniac
06-28-2004, 11:09 PM
Originally posted by ]|V|[agnus:
[0-9]{3}

Shouldn't that match "666" but not "6666" or "6" or "66", etc. ONLY a set of THREE NUMBERIC digits?
[0-9]{3} will match any numeric character 3 times ie. '666', '123', '174' but not '6', '66', '6666', '1234', '17'. If you want to check for three successive numbers use "([0-9])\1\1".

]|V|[agnus
06-29-2004, 09:55 PM
:blank stare:

You confuse me, Willy.

At any rate, this is only the second name I've registered here. The reason being that the first was temporarily banned as I made a less than good first impression. I wanted to start over anonymous. Didn't quite work out. Meh...

AND NOW! BACK TO THE TOPIC...

([0-9]+\.[0-9]{2}|[0-9]+)

This is what I came up with as a pattern that should match valid dollar amounts of either "xxx.xx" format or "xxx" format, where "x" is any digit 0-9. The above pattern is matching "xxx.x" for instance though. Where am I going wrong? I'm trying to say "a series of digits of arbitrary length, followed by a period, followed by exactly two digits OR a series of digits of arbitrary length."

]|V|[agnus
06-29-2004, 09:57 PM
[0-9]{3} will match any numeric character 3 times ie. '666', '123', '174' but not '6', '66', '6666', '1234', '17'. If you want to check for three successive numbers use "([0-9])\1\1".

Right, and I was getting a match for "6666" against that re... I swear, but something must have been kooky. I dunno... *shrug* Just limited the input to three characters and moved on. :)

Why would "[0-9]\1\1" match any successive series of three digits? I am working with ColdFusion REs currently, and nothing in the documentation I'm working with mentions the "\1" escape sequence.

trib4lmaniac
06-29-2004, 11:27 PM
Originally posted by ]|V|[agnus:
([0-9]+\.[0-9]{2}|[0-9]+)

This is what I came up with as a pattern that should match valid dollar amounts of either "xxx.xx" format or "xxx" format, where "x" is any digit 0-9. The above pattern is matching "xxx.x" for instance though. Where am I going wrong? I'm trying to say "a series of digits of arbitrary length, followed by a period, followed by exactly two digits OR a series of digits of arbitrary length."
Use ^([0-9]+\.[0-9]{2}$|[0-9]+$) to check for amounts. "^" = start of string, "$" = end of string. Your regexp was allowing a match anywhere in the string, so the [0-9]+ alternative was matching the first three digits and ignoring the ".x".

Also, for more info on the \1: it's called a backreference (http://www.regularexpressions.info/brackets.html).
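
To make the difference concrete, here is a small JavaScript sketch (JavaScript is an assumption; the original question was about ColdFusion, and the variable names are made up). The anchored pattern is written as ^(...)$, which should behave the same as putting $ inside each alternative.

```javascript
// Unanchored: may match any part of the input.
const unanchored = /([0-9]+\.[0-9]{2}|[0-9]+)/;
// Anchored: the whole input must be a dollar amount.
const anchored = /^([0-9]+\.[0-9]{2}|[0-9]+)$/;

console.log(unanchored.test("123.4"));     // true
console.log("123.4".match(unanchored)[0]); // "123" (the second alternative matched the leading digits)

console.log(anchored.test("123.4"));  // false
console.log(anchored.test("123.45")); // true
console.log(anchored.test("123"));    // true
```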

trib4lmaniac
06-29-2004, 11:28 PM
Oh, and the reason 6666 matches [0-9]{3} is again because of the start of string - end of string thing. The engine finds three digits (666) then stops. To match 3 digits only, use ^[0-9]{3}$
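
The same point in a short JavaScript sketch (again an assumption, since the thread's examples are from ColdFusion and PHP; the names are invented). It also shows the backreference pattern mentioned earlier: ([0-9])\1\1 needs the capturing parentheses, and it matches one digit repeated three times rather than any three digits.

```javascript
const threeAnywhere = /[0-9]{3}/;   // three digits somewhere in the input
const exactlyThree = /^[0-9]{3}$/;  // the input is exactly three digits

console.log(threeAnywhere.test("6666"));     // true
console.log("6666".match(threeAnywhere)[0]); // "666"
console.log(exactlyThree.test("6666"));      // false
console.log(exactlyThree.test("666"));       // true
console.log(exactlyThree.test("66"));        // false

// \1 refers back to whatever the first parenthesised group captured.
const sameDigitThrice = /^([0-9])\1\1$/;
console.log(sameDigitThrice.test("666")); // true
console.log(sameDigitThrice.test("123")); // false
```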

]|V|[agnus
06-30-2004, 12:06 AM
Awesome, thanks so much. :thumbsup:

trib4lmaniac
06-30-2004, 06:02 PM
By the way, for anyone using regexps in PHP, here's a tip for you... Use single quotes around patterns, as this makes them literal, so you don't have to escape dollar signs, backslashes, etc.

]|V|[agnus
07-08-2004, 07:52 PM
^http://([[:alnum:]._%-]+\.|)[[:alnum:]._%-]+\.[[:alnum:]._%-]{2,4}(/|)$

This is a pattern I came up with for matching a web address. Should match

-domain w/ subdomain, w/ trailing slash
-domain w/o subdomain, w/ trailing slash
-domain w/o subdomain, w/o trailing slash
-domain w/ subdomain, w/o trailing slash

Anybody see any issues, or am I gettin' the hang o' this stuff? ;)

trib4lmaniac
07-11-2004, 11:58 PM
Hmm, close; it should work. For a minor improvement, though, instead of using ...|) to make something optional, use a ? (optional):
^http://([[:alnum:]._%-]+\.)?[[:alnum:]._%-]+\.[[:alnum:]._%-]{2,4}(/)?$

]|V|[agnus
07-12-2004, 08:51 AM
Hmm, fair enough. I've been wondering...

[[:alnum:]._%-]

The "." in that set is not the literal, correct? So then doesn't it mean "any character?" If so, why are any of the other characters, or the [:alnum:] set necessary?

A PHP book I have listed this as an email pattern, for example:

.+@.+\..+

trib4lmaniac
07-12-2004, 01:50 PM
Inside a character set a dot is literal; it only takes the form of a meta-character outside of classes. So [:alnum:] is necessary :D
Also, that email regexp is very basic. Not all characters are allowed in an address (unless inside double quotes, but handling that would take a very complex regexp!)
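
A quick way to see the difference is to test a dot inside and outside a class. The sketch below uses JavaScript as an assumed test bed, with made-up variable names; the last pattern is the one quoted from the PHP book.

```javascript
const dotOutside = /^.$/;   // "." as a meta-character: any single character
const dotInside = /^[.]$/;  // "." inside a class: a literal dot only

console.log(dotOutside.test("a")); // true
console.log(dotInside.test("a"));  // false
console.log(dotInside.test("."));  // true

// The book's pattern accepts almost anything containing "@" followed later by "."
const loose = /.+@.+\..+/;
console.log(loose.test("not an address @ all . really")); // true
```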

]|V|[agnus
07-12-2004, 06:05 PM
Thanks so much for all your help with this. It just takes somebody like you and a few practical examples for somebody like me to get a good foothold on something like REs. My applications are already smarter and better for it. :)

]|V|[agnus
07-12-2004, 07:27 PM
http://codingforums.com/showthread.php?t=41611

Over in this thread, some people offered this simple pattern, /\w/, as an RE that would allow A-Z, a-z, 0-9 and _

If \w covers "_" as well, why do we have to include "_" in the character sets for our email and www address patterns? I ask operating under the assumption that [:alnum:] == \w

trib4lmaniac
07-12-2004, 07:45 PM
I think \w is the same as [[:alnum:]_]

]|V|[agnus
07-12-2004, 07:50 PM
Interesting. I found some information that said otherwise. Hrm..

Grant Palin
07-12-2004, 07:52 PM
callback (http://www.regularexpressions.info/brackets.html).

Great link! An excellent resource for learning regular expressions!

Thanks for sharing it! :thumbsup:

]|V|[agnus
07-12-2004, 08:06 PM
Regarding character classes like \w and what they cover: http://www.regular-expressions.info/charclass.html

\w stands for "word character". Exactly which characters it matches differs between regex flavors. In all flavors, it will include [A-Za-z]. In most, the underscore and digits are also included. In some flavors, word characters from other languages may also match. In EditPad Pro, for example, the actual character range depends on the script you have chosen in Options|Font. If you are using the Western script, characters with diacritics used in languages such as French and Spanish will be included. If you are using the Cyrillic script, Russian characters will be included, etc. The best way to find out is to do a couple of tests with the regex flavor you are using. In the screen shot, you can see the characters matched by \w in PowerGREP when using the Western script.

I have no idea how to tell what "flavor" of RegEx I'm using (note, I didn't look around on the site to see if they explain how yet either!) but even so, it sounds like it might be a safe, general rule (unless you know for sure what "flavor") to express things in terms of character ranges that all flavors support, yes?

trib4lmaniac
07-12-2004, 09:13 PM
Probably best :thumbsup:
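
Following the quoted advice to "do a couple of tests with the regex flavor you are using", here is what such a test might look like in JavaScript. This covers only JavaScript regex literals without the u or v flags, where \w behaves like [A-Za-z0-9_]; other flavors may differ, as the quote says.

```javascript
const wordChar = /^\w$/;

// "\u00e9" is e with an acute accent, written as an escape to avoid encoding issues.
const samples = ["a", "Z", "7", "_", "-", "\u00e9"];
const results = samples.map((ch) => wordChar.test(ch));

samples.forEach((ch, i) => {
  console.log(JSON.stringify(ch), results[i]);
});
// letters, digits and "_" give true; "-" and the accented letter give false
```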

]|V|[agnus
07-13-2004, 08:28 PM
^[0-9]+(\.[0-9]+)?$

I want this to match integers and floating point numbers. Thumbs up or down? :)

So far so good!

(Thanks again to all who helped me with this here. Others should no doubt get good assistance from this thread and the resources linked within. REs are not so bad as they appear at a glance, and once you start to grasp them, you can begin to make so many little and large enhancements to your applications.)

I'm in love. :)
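
For anyone wanting to try that number pattern, a few test inputs in JavaScript (an assumed language; numberPattern is a made-up name) show what it accepts and what it leaves out. Whether signs, a leading dot, or exponents should be allowed depends on the application.

```javascript
const numberPattern = /^[0-9]+(\.[0-9]+)?$/;

console.log(numberPattern.test("42"));   // true
console.log(numberPattern.test("3.14")); // true
console.log(numberPattern.test("3."));   // false: digits are required after the dot
console.log(numberPattern.test(".5"));   // false: digits are required before the dot
console.log(numberPattern.test("-1"));   // false: no sign allowed
console.log(numberPattern.test("1e5"));  // false: no exponent notation
```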

]|V|[agnus
07-13-2004, 10:38 PM
var validWWW = /^http:\/\/([[:alnum:]._%-]+\.)?[[:alnum:]._%-]+\.[[:alnum:]._%-]{2,4}(\/)?$/;

This is the valid Web site address pattern you helped me with, Tribal, only I've added some extra escapes because I think JavaScript requires them. The extra escapes are to make sure the "/" are interpreted as literals for the pattern, and not JS comments or something, I believe.

Does that look okay? I am getting addresses like "http://www.testers.com/" returned as invalid against that pattern.

Edit: here is the full validation script for that field...


//Validate web address
var w = f.webSite.value; // "f" is a reference to the <form>
if (w.substring(0,7) !== "http://") {w = "http://" + w;} //make sure protocol is at beginning
var validWWW = /^http:\/\/([[:alnum:]._%-]+\.)?[[:alnum:]._%-]+\.[[:alnum:]._%-]{2,4}(\/)?$/;
if (w !== "http://") {
    if (!w.match(validWWW)) {
        e[e.length] = "The Web site address you entered (" + w + ") is invalid. A valid Web site address is of the 'http://www.domain.com/' format.";
    }
}


I think that could be refined, but basically... if they want to not have a web address, then they could leave it with the default "http://" and it would just move on... otherwise, it'll make sure the protocol is at the beginning and then check their entry against the pattern.

trib4lmaniac
07-15-2004, 01:36 PM
Originally posted by ]|V|[agnus:
I want this to match integers and floating point numbers. Thumbs up or down? :)

:thumbsup:

Originally posted by ]|V|[agnus:
var validWWW = /^http:\/\/([[:alnum:]._%-]+\.)?[[:alnum:]._%-]+\.[[:alnum:]._%-]{2,4}(\/)?$/;

Just a note: I'm not sure if you would need to escape the hyphens inside the character classes, as they denote a range. But seeing as the hyphen is at the end of the class, it may be OK!

trib4lmaniac
07-15-2004, 01:38 PM
It's OK, you're fine; I did some research. Hyphens are OK at the beginning or the end of a class!
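
A small JavaScript check of the hyphen rule (the variable names are invented for illustration): between two characters a hyphen forms a range, while at the end of a class it is just a hyphen.

```javascript
const range = /^[a-c]$/;         // a, b or c
const literalHyphen = /^[ac-]$/; // a, c or a literal "-"

console.log(range.test("b"));         // true
console.log(range.test("-"));         // false
console.log(literalHyphen.test("-")); // true
console.log(literalHyphen.test("b")); // false

// The class used in this thread ends with "%-", so its hyphen is literal too.
const threadClass = /^[A-Z0-9._%-]+$/;
console.log(threadClass.test("A-B")); // true
```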

]|V|[agnus
07-15-2004, 04:48 PM
but I'm not okay! :)

valid web addresses such as "http://domain.com" are not matching :(

trib4lmaniac
07-16-2004, 12:55 AM
I don't really know much about js regexps. But without modifying your re too much... regexp.php (http://mq.astronomyforbeginners.com/regexp.php?string=http%3A%2F%2Fdomain.com&pattern=%5Ehttp%3A%2F%2F%28%5B%5B%3Aalnum%3A%5D._%25-%5D%2B%5C.%29%3F%5B%5B%3Aalnum%3A%5D._%25-%5D%2B%5C.%5B%5B%3Aalnum%3A%5D._%25-%5D%7B2%2C4%7D%2F%3F%24&replacement=&match_type=preg_match&submit=Go)

Also, right at the end of the re you have (/)? The brackets are not needed, as the optional meta-character works on a single character, /? (Aren't you glad I've saved you two whole characters?)

]|V|[agnus
07-17-2004, 10:48 AM
Heh.. yes, yes I am.

]|V|[agnus
07-26-2004, 07:03 PM
move along now...

liorean
07-26-2004, 08:24 PM
lol, its liorean!
Don't use that article though, the tutorial on re.info is much better as js doesn't support all of the regexp features.
Regex is individual to each language. JavaScript has an incomplete Perl-like syntax, but doesn't even look at POSIX. However, what you can find in Perl isn't regex any longer, it's contextual expressions. PCRE is as close to complete as you can find, and POSIX is a kind of lowest common denominator, except not all implementations support it. All in all, you'll find that regexes differ between pretty much all implementations, so your best friend would be documentation for the implementation you're interested in.
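
The remark that JavaScript "doesn't even look at POSIX" probably explains the failing web address check earlier in the thread. In a JavaScript regex literal, [[:alnum:] is read as an ordinary class containing the characters "[", ":", "a", "l", "n", "u" and "m", so the pattern no longer means what was intended. The sketch below is one possible fix under that assumption, with explicit ranges in place of [:alnum:]; the names posixStyle and asciiStyle are made up.

```javascript
// The pattern as posted: valid JavaScript, but [:alnum:] is not a POSIX class here.
const posixStyle = /^http:\/\/([[:alnum:]._%-]+\.)?[[:alnum:]._%-]+\.[[:alnum:]._%-]{2,4}(\/)?$/;
console.log(posixStyle.test("http://www.testers.com/")); // false
console.log(posixStyle.test("http://domain.com"));       // false

// Same structure with explicit ranges. The slashes are escaped because "/"
// delimits the regex literal.
const asciiStyle = /^http:\/\/([A-Za-z0-9._%-]+\.)?[A-Za-z0-9._%-]+\.[A-Za-z0-9._%-]{2,4}\/?$/;
console.log(asciiStyle.test("http://www.testers.com/")); // true
console.log(asciiStyle.test("http://domain.com"));       // true
console.log(asciiStyle.test("http://localhost"));        // false: a dot is required
```

Like the original, this only checks the general shape of an http:// address with no path; it is not a full URL validator.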

trib4lmaniac
07-26-2004, 09:00 PM
Hehe, you're back :thumbsup:

liorean
07-26-2004, 09:10 PM
Originally posted by trib4lmaniac:
Hehe, you're back :thumbsup:

No, but I felt like checking in on you to see that you are behaving while I'm offline...

trib4lmaniac
07-26-2004, 09:25 PM
No, but I felt like checking in on you to see that you are behaving while I'm offline...
lol, ok.