another regex question
I am slowly getting my head around regex, but really it is mostly a mystery to me.
Here's the thing: I have a string (although I have no idea how that string will look). All I know is that the string will contain a word (I don't know what that word is either). I don't know if the string will be a paragraph, a sentence or a sentence fragment (the sentence may be cut off, either at the start or the end). But I need to get as much of the sentence containing the word as possible, without getting too much. So I figure that these are the "rules":
- Start capturing from the closest word before the variable word that starts with a capital/uppercase.
- If there is no word that starts with a capital before the variable word, start capturing from the start of the string.
- Equally, if the part of the string after the variable word contains a full stop/period, finish capturing at the full stop.
- If not, capture until the end of the string.
I know it's not perfect logic, but it doesn't have to be - all I want to do is to be able to show the word in some sort of context, like Word does when you do spellcheck. Any suggestions? |
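For reference, the four "rules" above can be followed literally without any one-line regex. This is only a minimal sketch: the function name contextByRules is made up for illustration, and it assumes the first occurrence of the word is the one wanted and that only a full stop ends the capture, exactly as the rules are stated.

```javascript
// Hypothetical helper (not from the thread): apply the four rules literally.
function contextByRules(text, word) {
  const at = text.indexOf(word);          // first occurrence of the word
  if (at === -1) return null;             // word not present at all

  // Rules 1 and 2: start at the closest capitalised word before the
  // variable word, or at the start of the string if there is none.
  const before = text.slice(0, at);
  const capital = /\b[A-Z]/g;
  let start = 0;
  let m;
  while ((m = capital.exec(before)) !== null) {
    start = m.index;                      // keep the last (closest) one
  }

  // Rules 3 and 4: stop at the first full stop after the word,
  // or at the end of the string if there is none.
  const stop = text.indexOf(".", at);
  const end = stop === -1 ? text.length : stop + 1;

  return text.slice(start, end);
}
```

As the rules themselves admit, this is imperfect: a capitalised name in the middle of a sentence would be taken as the starting point.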
Something like this:
Code:
(?:^|\.)\s?([^.]*wibble[^.]*)(?:$|\.)
But I haven't tried to match a capital letter. |
This version
Code:
(?:^|\.|\;)\s?([A-Z][^.]*wibble[^.]*)(?:$|\.) |
Here's my suggestion:-
Code:
<html>- Pupil's answer to Catholic Elementary School test. |
thanks Andrew - the first one was very close. I changed it to
Code:
(?:|^)?[\w]([^.]*wibble[^.]*)($:|\.|\?|\!|$)
seems right to me. Thank you both for your suggestions. |
No, wait - that doesn't work. It ends if the sentence ends with a full stop, but keeps going if it is a ! or ? :(
|
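One likely reason for the behaviour just described: [^.]* only excludes the full stop, so it happily runs past ! and ?. Below is a minimal sketch that puts all three terminators inside the negated class. The helper names sentenceRegex and extractSentence are made up for illustration, and the sketch assumes the word consists of ordinary word characters (so that \b works) and that the first sentence containing it is the one wanted.

```javascript
// Hypothetical helpers (not from the thread).
function sentenceRegex(word) {
  // Escape regex metacharacters so the word is matched literally.
  const safe = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // [^.!?]* stops at any of the three terminators, not just the full stop.
  // The trailing [.!?]? keeps the terminator if the sentence has one.
  return new RegExp("[^.!?]*\\b" + safe + "\\b[^.!?]*[.!?]?");
}

function extractSentence(text, word) {
  const m = sentenceRegex(word).exec(text);
  // The match may begin with the space that followed the previous
  // sentence's terminator, so trim it.
  return m ? m[0].trim() : null;
}
```

Because the pattern is not anchored, the engine takes the leftmost position from which a match is possible, which is just after the previous terminator (or the start of the string).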
Use mine!
Code:
var ts = text.split(/\.|;|\?|!/); // split at period or semi-colon or ? or ! |
Here's the thing: Let's say the string is this:
"The dog jumped over the moon. He was happy to see me. I left in a hurry" and the word is "happy". In that case, all I want is "He was happy to see me."
If it's "was happy to see me. I left in a hurry", all I want is "was happy to see me."
If it's "The dog jumped over the moon. He was happy to see", all I want is "He was happy to see".
Splitting it on the punctuation is probably the safest way, but then I have to loop through the array to find out which split is the one that I want. Which is why regex seems to be the answer... |
And what about
"aardvarks whistle. happy dogs bark" ??? What do you want to get out of that? Logically, it would be "happy dogs bark", as the period before "happy" belongs in another sentence. But it's your call. |
in that case I would want happy dogs bark
but sentences will always begin with a capital, and end with . or ! or ? the problem is that the string that contains the word may not be a complete sentence. |
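Given that sentences end with . or ! or ?, the "split on punctuation, then find the right piece" idea from earlier in the thread does not need a hand-written loop. A minimal sketch follows; the name sentenceContaining is made up for illustration, and it assumes an environment with Array.prototype.find and String.prototype.includes (ES2015 or later). Note that includes is a plain substring test, so "happy" would also hit "unhappy".

```javascript
// Hypothetical helper (not from the thread).
function sentenceContaining(text, word) {
  // Cut the text into pieces: a run of non-terminators followed by
  // at most one terminator, so the punctuation stays with its sentence.
  const pieces = text.match(/[^.!?]+[.!?]?/g) || [];
  // Take the first piece that contains the word.
  const hit = pieces.find(function (piece) {
    return piece.includes(word);
  });
  return hit ? hit.trim() : null;
}
```

Using match with the g flag instead of split keeps the closing . ! or ? attached to each piece, which split would throw away.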
Here's my answer.
I'll let you figure out if you can combine the 4 regexp's into one. Note that I stop on the first match, because some text patterns will match more than one of the regexps, but the regexps are purposely ordered by most desirable match. The hack to get rid of a leading period is just that: a hack. But it works.
Code:
<script type="text/javascript"> |
My answer includes code to handle that case. It could be less code if you were *SURE* that a sentence always starts with a capital letter. |
This revision
Code:
(?:|^)?[\w]([^.]*wibble[^.]*)($:|\.|\?|\!|$)
Code:
(?:^|\.|\?\!)?[\w]([^.]*wibble[^.]*)(?:\.|\?|\!|$) |
Here's a slightly better version. Handles the sentence *before* "happy" ending with ? or ! (not just period).
Has the interesting effect of changing *which* "happy" is found in demo #5. If you really wanted the first one found, I could fix it to do that. But I'm assuming that's a case you aren't too worried about.
Code:
<script> |
Andrew: I'm pretty sure this is wrong:
(?:^|\.|\?\!)
The ^ character only means negation when used inside of [ ]. In any case, you forgot the | between \? and \! if you were looking for "or" conditions. And also, in any case, you are missing parens. But I'm pretty sure that should be
(?:[^\.\?\!])
But I think that
(?!(\.|\?|\!))
would also work. ?! is a *negative* non-capture. The ! is the negation character for captures, not the ^
Did you test it? Against many samples, as I did?
********* EDIT: I did test it. I tested both your version:
/(?:^|\.|\?|\!)?[\w]([^.]*happy[^.]*)(?:\.|\?|\!|$)/
(I added the missing | before the first \!)
And my modification:
/(?:[^\.\?\!])?[\w]([^.]*happy[^.]*)(?:([\.|\?|\!]|$))/;
Neither passed all tests. Neither could find "happy" in
aardvarks whistle. happy dogs bark
Neither isolated the sentence in either
aardvarks whistle dixie! happy dogs bark
or
happy happy happy! and even more happy?
(that is, in both cases they returned the entire test string)
I will say that your (?:^|\.|\?|\!) seemed to have mostly worked. Surprised me. |
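A note on why (?:^|\.|\?|\!) "mostly worked": outside of [ ], ^ is the start-of-input anchor, so that group reads "start of string, or a terminator". Inside [ ] it negates the character class, and (?!...) is a zero-width negative lookahead. The three patterns below are made-up minimal examples (not taken from any of the posted regexes) that show the difference.

```javascript
// ^ outside brackets: matches only at the start of the input.
const startOrPeriod = /(?:^|\.)\s?cat/;

// ^ inside brackets: "any single character that is not a full stop".
const notPeriodBefore = /[^.]cat/;

// (?!...) is a negative lookahead: it consumes nothing and succeeds
// only if the next character is NOT one of . ? !
const noTerminatorFirst = /^(?![.?!])/;

const results = {
  anchorAtStart: startOrPeriod.test("cat"),            // start of input, then "cat"
  anchorMidString: startOrPeriod.test("a cat"),        // no start or "." before "cat"
  anchorAfterPeriod: startOrPeriod.test("dog. cat"),   // "." then a space then "cat"
  classNeedsChar: notPeriodBefore.test("cat"),         // nothing before "cat" to consume
  classRejectsPeriod: notPeriodBefore.test(".cat"),    // the char before "cat" is "."
  classAcceptsSpace: notPeriodBefore.test("a cat"),    // the char before "cat" is " "
  lookaheadPlain: noTerminatorFirst.test("happy"),     // first char is "h"
  lookaheadPeriod: noTerminatorFirst.test(".happy")    // first char is "."
};
```

One practical difference: [^.] has to consume a character, so it cannot match at the very start of the string, whereas ^ and (?!...) consume nothing.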