regular expression to check for single and double line breaks (auto-link URLs too!)


]|V|[agnus
11-10-2004, 10:02 PM
ColdFusion has a function called ParagraphFormat() which takes a string and plops in <P></P> tags for you. Aside from the problem of not being able to control the case of the P's, it doesn't insert them in the smartest way, nor in the cleanest way. So I'm trying to write my own pFormat tag to achieve this, but I don't know what pattern to use to match "a carriage return followed by anything that is not a carriage return."

If that is found, I want to insert the <br /> tags, but if two carriage returns are found in succession I want to insert "</p>(carriage return)<p>" The latter is easy enough, and the former seems like it should be easy, but I guess I'm just not savvy enough with REs. :\

regularexpression.info says that "q[^u]" should match "q followed by a character that is not u". so i tried the same thing... this is how the RE looks in ColdFusion:

#Chr(13)#[^#Chr(13)#]

(#Chr(13)# is a coldfusion function that returns ASCII character 13, which is a carriage return)

Using that, I get <br />s inserted where every instance of Chr(13) is. If there are two carriage returns in a row, that should not match, should it?
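
A likely explanation, sketched here in JavaScript rather than ColdFusion: if the text uses Windows-style CRLF line endings (an assumption, the post does not say where the text comes from), every carriage return is followed by a line feed, and a line feed counts as "not a carriage return". The pattern therefore matches at every line break, single or double. The sample strings below are made up for illustration.

```javascript
// Assumption: the input uses CRLF ("\r\n") line endings.
const crPattern = /\r[^\r]/g;

const singleBreak = "one\r\ntwo";      // one line break
const doubleBreak = "one\r\n\r\ntwo";  // two line breaks in a row

// Each "\r" is followed by "\n", so each line break matches on its own.
const singleMatches = singleBreak.match(crPattern) || [];
const doubleMatches = doubleBreak.match(crPattern) || [];

console.log(singleMatches.length); // 1
console.log(doubleMatches.length); // 2
```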

liorean
11-10-2004, 10:46 PM
Here, just make sure you do everything in the right order:
str=str // Just explaining myself
.replace(/\r\n|\r|\n/g,'\n') // Normalise all line ending sequences to a common form
.replace(/\n\n+/g,'<\/p>\n<p>') // Replace two or more line endings with paragraphs
.replace(/\n/g,'<br/>'); // Replace single line endings with rowbreaks.
If you wish, you can do something about the case where you have more than two line endings by using a callback function instead of that string in the replacement.

]|V|[agnus
11-10-2004, 11:09 PM
thanks liorean! that normalizing bit is a good idea. i only had to change that last pattern to:

[^>]\n

so that i didn't get:

<p>aljdfhalksdjhflkasjdhflkasjdhf</p><br />
<p>asfasfasfasfasfasf etc...</p><br />

Thanks! :)
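
The stray <br /> shows up because the paragraph replacement string itself contains a newline, which the last replace then converts. Note also that a pattern like [^>]\n consumes the character before the newline, so that character has to be put back in the replacement. A minimal sketch of one way to avoid both issues, assuming it is acceptable to split the text into paragraphs first (formatParagraphs is a hypothetical name, not something from the thread):

```javascript
// Hypothetical helper: wrap blank-line-separated chunks in <p> tags
// and turn the remaining single line breaks into <br />.
function formatParagraphs(text) {
  return text
    .replace(/\r\n|\r|\n/g, "\n")      // normalise line endings first
    .split(/\n{2,}/)                   // two or more breaks separate paragraphs
    .filter((chunk) => chunk !== "")   // drop empties from leading/trailing breaks
    .map((chunk) => "<p>" + chunk.replace(/\n/g, "<br />") + "</p>")
    .join("\n");
}

console.log(formatParagraphs("a\r\nb\r\n\r\nc"));
```

Because the single breaks are replaced inside each paragraph before the pieces are joined, the newline between </p> and <p> is never seen by the <br /> step.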

AaronW
11-11-2004, 12:25 AM
I wrote something similar (but far more complicated) with PHP which basically allows users to write any HTML they want without having to write their own <p>, <br />, and </p> tags.

Assumes paragraphs are separated by double line breaks, as does yours. Also assumes all block tags are separated by double line breaks (unless they're the first in the string). Feel free to port it to CF (if that's possible... I think the eval modifier is for PHP's PCREs only... But you can work around it with arrays/callbacks):


function to_html ($str)
{
$block_tag_list = array ('address', 'applet', 'blockquote', 'button', 'center', 'dd', 'del', 'dir', 'div', 'dl', 'dt', 'fieldset', 'form', 'frameset', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'iframe', 'ins', 'isindex', 'li', 'map', 'menu', 'noframes', 'noscript', 'object', 'ol', 'p', 'pre', 'script', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'ul');

$tags = '<' . implode ('[^>]*>|<', $block_tag_list) . '[^>]*>';
$pattern = <<<PATTERN
/
(\A|\\n\\n)(?!$tags) # Start of string or two line breaks, not followed by a block tag
(.+?) # Just about anything
(\Z|\\n\\n) # End of string or two line breaks
/isex
PATTERN;

$str = str_replace ("\r\n", "\n", $str);
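// Run the replacement twice: each match consumes its trailing blank line,
// so the paragraph right after a match is only picked up on the second pass.
$str = preg_replace ($pattern, "'\\1<p>' . nl2br ('\\2') . '</p>\\3'", $str);
$str = preg_replace ($pattern, "'\\1<p>' . nl2br ('\\2') . '</p>\\3'", $str);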
return $str;
}
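
For anyone porting the idea to an engine without the eval modifier, here is a rough JavaScript sketch of the same approach using split and a callback instead. It is not a port of the function above: the block tag list is abbreviated for illustration, and toHtml, blockTags and startsWithBlockTag are hypothetical names.

```javascript
// Abbreviated list for illustration; the full list in the post is longer.
const blockTags = ["blockquote", "div", "h1", "h2", "h3", "ol", "p", "pre", "table", "ul"];
const startsWithBlockTag = new RegExp("^<(?:" + blockTags.join("|") + ")\\b[^>]*>", "i");

// Wrap each blank-line-separated chunk in <p>...</p> unless it already
// starts with a block-level tag; single line breaks get a <br /> (like nl2br).
function toHtml(text) {
  return text
    .replace(/\r\n/g, "\n")
    .split(/\n\n/)
    .map((chunk) =>
      startsWithBlockTag.test(chunk)
        ? chunk
        : "<p>" + chunk.replace(/\n/g, "<br />\n") + "</p>")
    .join("\n\n");
}

console.log(toHtml("Intro\nline\n\n<ul><li>x</li></ul>\n\nOutro"));
```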

]|V|[agnus
11-11-2004, 05:51 PM
Nice Aaron, I want to look at that later.

I added a bit to auto-link URLs found in the text. Here is my pattern:

(http://)?(www\.)?[0-9a-z_%-]+[0-9a-z_%.-]+\.[0-9a-z_%-]+[0-9a-z_%.-]+(\/[0-9a-z_%.-]+\.[0-9a-z_%.-]+)?

Can that be condensed or improved somehow? It seems to work great, but I just feel like it's a little verbose.

]|V|[agnus
11-11-2004, 06:23 PM
ahahahahah... the answer to my question is a sound "YES, it can be MUCH simpler":

\b(www|http|\S+@)\S+\b

hehe... matches email too!
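
A small JavaScript sketch of how that short pattern could drive the replacement, using a callback to pick the href. The names autoLink and linkPattern, the "http://" prefix for www links and the "mailto:" prefix for addresses are assumptions for illustration, not taken from the post. The trailing \b is what keeps a sentence-ending period out of the link.

```javascript
// The pattern from the post, with global and case-insensitive flags added.
const linkPattern = /\b(www|http|\S+@)\S+\b/gi;

// Hypothetical helper: wrap every match in an <a> tag.
function autoLink(text) {
  return text.replace(linkPattern, (match, prefix) => {
    let href = match;
    if (prefix.endsWith("@")) {
      href = "mailto:" + match;   // looks like an email address
    } else if (prefix.toLowerCase() === "www") {
      href = "http://" + match;   // no protocol given, assume http
    }
    return '<a href="' + href + '">' + match + "</a>";
  });
}

console.log(autoLink("See www.example.com."));
console.log(autoLink("mail bob@example.com now"));
```

This does no HTML escaping and will also link words that merely start with "www" or "http", so treat it as a starting point only.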

AaronW
11-11-2004, 08:17 PM
Simplicity has its drawbacks ;) My link checker, as my paragraph-maker, is rather complex, but the good news is that mine is also very smart ;)

For instance: "yahoo.com" will get linked. It won't link the quotes though. yahoo.com/. will get linked, and the period will not be linked. Most link parsers require whitespace on either end of the link, which sucks (especially when your link concludes a bracket set, like http://yahoo.com).

Here's that one:

(Runs THREE regular expressions on the string):


function parse_links ($str)
{
$pattern_most = '/(\b)(?<!href="|src="|\/|>)((www\.|ftp\.|https:\/\/|http:\/\/|ftp:\/\/)(.[^(\s|\)|\(|\]|"|,|\'|!|\#|\.\s)]+)+)/ise';
$replace_most = "'\\1<a href=\"'.(preg_match ('/^(https|http|ftp):\/\//i', '\\2') ? str_replace ('@', '@', '\\2') : str_replace ('@', '@', 'http://\\2')).'\">'.str_replace ('@', '@', '\\2').'</a>'";

$pattern_email = '/(\b)(?<!mailto:|>)([a-z]+@(.[^(\s|\)|\(|\]|"|,|\'|!|\#|\.\s)]+)+)/ise';
$replace_email = "'\\1<a href=\"mailto:'.str_replace ('@', '@', '\\2').'\">\\2</a>'";

$pattern_simple = '/(?<=\s|\A)(?<!\/)([a-z_-]+(\.[a-z_-]+)+(.[^(\s|\)|\(|\]|"|,|\'|!|\#|\.\s)]+)+)/isx';
$replace_simple = '<a href="http://\\1">\\1</a>';

$str = preg_replace (array ($pattern_most, $pattern_email, $pattern_simple), array ($replace_most, $replace_email, $replace_simple), $str);

return stripslashes ($str);
}


I have the commented version of that saved too, which I need for editing it. If you port this, and find a bug, let me know :)

]|V|[agnus
11-11-2004, 09:22 PM
Thanks Aaron, and you're right about complexity. However, my more complex pattern was faring worse than this simple one. This one is most of the way there, and I think with some tweaks it could be much smarter without being excessively verbose.

AaronW
11-11-2004, 09:45 PM
Heh, it all depends on who'll be giving the input I suppose. If it were just me, I'd have left it simple (or just written <a href=""> manually).

Oh, mine doesn't re-link the links in href="" either. Heh.

]|V|[agnus
11-11-2004, 11:19 PM
Originally posted by AaronW:
Oh, mine doesn't re-link the links in href="" either. Heh.

whatchu mean by that?

]|V|[agnus
11-11-2004, 11:35 PM
your $pattern_most and $pattern_simple are recognized as malformed by CF. :\ it chokes at "(?< ... )"

AaronW
11-12-2004, 12:16 AM
Originally posted by ]|V|[agnus:
whatchu mean by that?

I mean you can pass it a string like this:

---
Go to <a href="http://yahoo.com">http://www.yahoo.com/</a> and www.google.com.
---

And it won't turn it into this:

---
Go to <a href="<a href="http://yahoo.com">http://yahoo.com</a>"><a href="http://yahoo.com">http://www.yahoo.com/</a></a> and <a href="http://www.google.com">http://www.google.com</a>.
---

But rather:

---
Go to <a href="http://yahoo.com">http://www.yahoo.com/</a> and <a href="http://www.google.com">http://www.google.com</a>.
---

As for it choking on (?< ...), that's odd... The $pattern_email uses lookbehind assertions too... Keep in mind that the two regexes that are crashing on you both use the 'e' modifier, which is PHP-only I believe. See: http://ca.php.net/pcre.pattern.modifiers for the explanation of what it does. Should give you an idea on how to emulate it in CF. I could in fact eliminate the need for the 'e' modifier if you were willing to lose the functionality of it linking links without protocols (www.yahoo.com won't be linked, but http://www.yahoo.com/ will) and subsequently links without www. (yahoo.com won't link). Your call. Would be neat to have it working fully in CF though.
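
If the engine rejects lookbehind, one common workaround is to let the pattern match existing <a>...</a> elements first and hand them back unchanged, so URLs inside them are never seen by the linking branch. A minimal JavaScript sketch of that idea follows; it is a simplified alternative, not a port of parse_links, it only handles links that start with a protocol or "www.", and linkOutsideAnchors is a hypothetical name.

```javascript
// First alternative: a whole existing anchor. Second: a bare URL.
const anchorOrUrl = /<a\b[^>]*>[\s\S]*?<\/a>|\b(?:https?:\/\/|www\.)[^\s<>"]+\b/gi;

// Hypothetical helper: link bare URLs, leave existing anchors untouched.
function linkOutsideAnchors(html) {
  return html.replace(anchorOrUrl, (match) => {
    if (match[0] === "<") {
      return match; // already an anchor, return as-is
    }
    const href = /^www\./i.test(match) ? "http://" + match : match;
    return '<a href="' + href + '">' + match + "</a>";
  });
}

console.log(linkOutsideAnchors(
  'Go to <a href="http://yahoo.com">http://www.yahoo.com/</a> and www.google.com.'
));
```

The callback plays the role of the 'e' modifier here: it decides per match whether a protocol needs to be prepended.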