08-17-2007, 05:39 PM
Given a passage with links in it, I want to process this so that it'll end up as plain text with no links.
Regular expressions have to be the most complicated thing I have ever attempted to learn. Even after reading a step by step guide I still can't wrap my head around them.
What I think I have is the basic syntax of the expression. I know I need to match strings starting with <a and ending with > (I'll remove the closing element of the link with str_replace since that's always going to be "</a>"). And I know that between the start and end could be any number of any characters. Therefore, the expression should, if I've understood this correctly, be...
eregi("^<a(.*)>$", $text, $matches);
Assuming that's right, what I don't get is all the escaping that tends to go on with regular expressions. The tutorial I read mentioned escaping certain characters with backslashes, but made no mention of forward slashes, yet they always seem to feature quite heavily in regular expressions. Can anyone enlighten me?
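A minimal sketch of why the pattern above finds nothing in a passage: ^ and $ tie the match to the start and end of the whole string, so it only matches when the string is nothing but a tag. The sketch uses preg_match with the i modifier in place of eregi (the ereg functions were deprecated in PHP 5.3 and removed in PHP 7). The sample $text and the unanchored pattern are assumptions for illustration only.

```php
<?php
// Hypothetical sample passage with one link in the middle.
$text = 'Read <a href="page.html">this page</a> first.';

// Anchored: ^ and $ tie the pattern to the start and end of the whole string,
// so a tag in the middle of a passage is never matched.
$anchored = preg_match('/^<a(.*)>$/i', $text, $m1);

// Unanchored: the tag can appear anywhere. [^>]* stops at the first ">".
$unanchored = preg_match('/<a\b[^>]*>/i', $text, $m2);

var_dump($anchored);   // int(0)
var_dump($unanchored); // int(1)
echo $m2[0], "\n";     // <a href="page.html">
```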
08-17-2007, 08:22 PM
Regexlib.com (http://regexlib.com/) is a good place to start. I wrote this code, which does what you're looking for:
$test = 'text text text <a href="www.com/blah.ext?arg=123&arg=321" other="1">WWW.LINK.COM</a> MORE TEXT';
$new = preg_replace('/<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/', '$2', $test);
This is just an example; you can adapt it to your needs. You don't need to create a new variable like I did. I used this regex (http://regexlib.com/REDetails.aspx?regexp_id=968).
08-17-2007, 08:55 PM
Thank you a thousand times.
I was nowhere near. Everything I've read says "." is any character, and "*" is any number of the previous character. Is there a simple reason (other than "that's just the way it is" :) ) why trying to match any number of any character in the middle doesn't work?
In the meantime I will try and decipher what you've done so I can finally start to learn the syntax.
08-17-2007, 09:07 PM
It does work; the expression I used has 3 of them. The syntax is the only difference. By the way, if you're looking for an easy way to test regular expressions you can try out this program (http://weitz.de/regex-coach/). It has some simple features that can help you out: breaking down the expression, saying the expression in words, showing you the parts of it, testing matches and replacing, etc. Good luck!
08-17-2007, 09:22 PM
Is there a simple reason (other than "that's just the way it is" :) ) why trying to match any number of any character in the middle doesn't work?
Like Inigoesdr said, it does work. The difference with his is that his preg_replace expression replaced the entire link with the text that the anchor tags surrounded. You were having to do two separate expressions, which would work also.
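A rough sketch of the two approaches being compared, run on a made-up $text. The short patterns here are simplified assumptions for illustration, not the expression posted earlier in the thread.

```php
<?php
// Hypothetical sample passage.
$text = 'See <a href="a.html">the docs</a> for more.';

// Two steps: strip the opening tag with a regex, then the fixed closing tag.
$twoStep = preg_replace('/<a\b[^>]*>/i', '', $text);
$twoStep = str_ireplace('</a>', '', $twoStep);

// One step: capture the link text and replace the whole element with it.
$oneStep = preg_replace('/<a\b[^>]*>(.*?)<\/a>/is', '$1', $text);

echo $twoStep, "\n"; // See the docs for more.
echo $oneStep, "\n"; // See the docs for more.
```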
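As for the backslashes, you use them either to escape characters that would otherwise have a special meaning or to create special sequences, like \s (which matches any whitespace character: a space, tab or newline). Forward slashes are used at the start and end of a pattern in the preg functions because those functions use Perl-compatible regular expression syntax, which requires the pattern to be wrapped in delimiters.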
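To make the two uses of the backslash concrete, here is a small sketch. The test strings and variable names are made up for the example.

```php
<?php
// \s is a shorthand character class: it matches a space, a tab or a newline.
$space = preg_match('/a\sb/', "a b");
$tab   = preg_match('/a\sb/', "a\tb");

// An unescaped "." matches any character; "\." matches only a literal dot.
$anyChar    = preg_match('/^3.14$/', '3x14');
$literalDot = preg_match('/^3\.14$/', '3x14');

// With "/" as the delimiter, a "/" inside the pattern has to be escaped.
$slash = preg_match('/<\/a>/', 'text</a>');

var_dump($space, $tab);            // int(1) int(1)
var_dump($anyChar, $literalDot);   // int(1) int(0)
var_dump($slash);                  // int(1)
```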
08-17-2007, 10:52 PM
You can use almost any character to mark the beginning and end of the expression: it normally has to be the same character at both ends, it can't be a letter, digit, backslash or whitespace, and if it also appears inside the pattern it has to be escaped with a backslash there. Though, from what I've seen, people tend to use the forward slash the most.
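As an illustration, the same pattern can be written with different delimiters. Picking one that doesn't occur in the pattern saves escaping the "/" in "</a>". The sample $text and the simplified pattern are assumptions for the example.

```php
<?php
// Hypothetical sample passage.
$text = 'one <a href="x.html">two</a> three';

// The same pattern with three different delimiters.
$slash = preg_replace('/<a\b[^>]*>(.*?)<\/a>/i', '$1', $text); // "/" escaped inside
$hash  = preg_replace('#<a\b[^>]*>(.*?)</a>#i', '$1', $text);  // no escaping needed
$tilde = preg_replace('~<a\b[^>]*>(.*?)</a>~i', '$1', $text);  // no escaping needed

echo $slash, "\n"; // one two three
echo $hash, "\n";  // one two three
echo $tilde, "\n"; // one two three
```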
08-18-2007, 08:49 PM
It does work; the expression I used has 3 of them. The syntax is the only difference. By the way, if you're looking for an easy way to test regular expressions you can try out this program. It has some simple features that can help you out: breaking down the expression, saying the expression in words, showing you the parts of it, testing matches and replacing, etc. Good luck!
Thanks, that's a useful tool. Well, I've done some testing with it and realise I was simply missing a question mark. Apparently you have to say "zero or one of any number of any character", not just "any number of any character", which I find slightly counter-intuitive, but there we go.
What I'm wondering, given that this works (at least according to the Regex Coach) and can be replaced with $1 to get the outcome I'm looking for, is there any benefit to using the much longer expression you gave me?
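On the question mark: placed directly after "*", it does not mean "zero or one". It makes the quantifier lazy, so it takes as few characters as it can instead of as many as it can. A small sketch of the difference, using a made-up $text and two stand-in patterns (not necessarily the expression tested in the Regex Coach):

```php
<?php
// Hypothetical sample with more than one ">" in it.
$text = '<a href="x.html">one</a> and <b>two</b>';

// Greedy: .* takes as much as it can, so the match runs to the LAST ">".
preg_match('/<a(.*)>/', $text, $greedy);

// Lazy: the "?" after "*" makes it take as little as it can,
// so the match stops at the FIRST ">".
preg_match('/<a(.*?)>/', $text, $lazy);

echo $greedy[0], "\n"; // <a href="x.html">one</a> and <b>two</b>
echo $lazy[0], "\n";   // <a href="x.html">
```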
08-18-2007, 09:17 PM
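It depends on what you want to use it for. The one I posted will only take out links (<a> tags with an href attribute), while yours will remove every <a> tag, including named anchors, for example. You should make sure you match case-insensitively too (in the preg functions, that's the i modifier after the closing delimiter).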
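A sketch of that difference on a made-up input containing a named anchor and an upper-case link. The longer pattern is the one posted earlier in the thread with the i modifier added. The short pattern is an assumed stand-in for the simpler expression.

```php
<?php
// Hypothetical input: a named anchor without href, then an upper-case link.
$text = '<a name="top">Top</a> <A HREF="page.html">Page</A>';

// The longer expression from earlier in the thread, with the i modifier added.
$linksOnly = '/<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/i';

// A short pattern that matches any <a ...> element (assumed for illustration).
$allAnchors = '/<a\b[^>]*>(.*?)<\/a>/i';

$withLinksOnly  = preg_replace($linksOnly, '$2', $text);
$withAllAnchors = preg_replace($allAnchors, '$1', $text);

echo $withLinksOnly, "\n";  // <a name="top">Top</a> Page
echo $withAllAnchors, "\n"; // Top Page
```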