CodingForums.com

xelawho 01-31-2013 04:41 PM

another regex question
 
I am slowly getting my head around regex, but really it is mostly a mystery to me.

Here's the thing: I have a string (although I have no idea how that string will look). All I know is that the string will contain a word (I don't know what that word is either). I don't know if the string will be a paragraph, a sentence or a sentence fragment (the sentence may be cut off, either at the start or the end).

But I need to get as much of the sentence containing the word as possible, without getting too much.

So I figure that these are the "rules":

- Start capturing from the closest word before the variable word that starts with a capital/uppercase.
- If there is no word that starts with a capital before the variable word, start capturing from the start of the string.
- Equally, if the part of the string after the variable word contains a full stop/period, finish capturing at the full stop.
- If not, capture until the end of the string.

I know it's not perfect logic, but it doesn't have to be - all I want to do is to be able to show the word in some sort of context, like Word does when you do spellcheck.

Any suggestions?
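A minimal sketch of the four rules listed above, written without any regex so the logic is easy to follow. The name `extractContext` is hypothetical, and the code assumes the word is matched as a plain, case-sensitive substring. It takes the rules literally, so a capitalised name in mid-sentence would also start the capture.

```javascript
// Hypothetical helper: returns the context around the first occurrence of
// `word` in `text`, following the four rules in the list above.
// Returns null when the word does not occur.
function extractContext(text, word) {
    var at = text.indexOf(word);
    if (at === -1) { return null; }

    // Rules 1 and 2: walk backwards to the closest word that starts with a
    // capital letter; fall back to the start of the string.
    var start = 0;
    for (var i = at - 1; i >= 0; i--) {
        var startsWord = (i === 0) || /\s/.test(text.charAt(i - 1));
        if (startsWord && /[A-Z]/.test(text.charAt(i))) {
            start = i;
            break;
        }
    }

    // Rules 3 and 4: stop at the first full stop after the word,
    // otherwise run to the end of the string.
    var stop = text.indexOf(".", at + word.length);
    var end = (stop === -1) ? text.length : stop + 1;

    return text.substring(start, end);
}
```

For example, `extractContext("was happy to see me. I left in a hurry", "happy")` returns `"was happy to see me."`, because no capitalised word precedes "happy".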

AndrewGSW 01-31-2013 06:43 PM

Something like this:

Code:

(?:^|\.)\s?([^.]*wibble[^.]*)(?:$|\.)
You can test it here.

But I haven't tried to match a capital letter.
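A rough sketch of how the pattern above might be used with a word held in a variable. The wrapper name `sentenceWith` is hypothetical, and `word` is assumed to contain plain letters only (no regex metacharacters).

```javascript
// Hypothetical wrapper around the pattern above.
// Assumption: `word` contains no regex metacharacters.
function sentenceWith(text, word) {
    var re = new RegExp("(?:^|\\.)\\s?([^.]*" + word + "[^.]*)(?:$|\\.)");
    var m = re.exec(text);
    // Group 1 holds the sentence itself; the delimiters on either side are
    // matched by non-capturing groups and are therefore left out.
    return m ? m[1] : null;
}
```

Note that the closing full stop is matched outside the capture, so the returned sentence does not include it.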

AndrewGSW 01-31-2013 06:48 PM

This version

Code:

(?:^|\.|\;)\s?([A-Z][^.]*wibble[^.]*)(?:$|\.)
looks for either a full stop or a semicolon, and the sentence should start with a capital letter.
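Requiring `[A-Z]` at the start of the capture has a side effect worth checking: a fragment whose first sentence was cut off, and so begins in lower case, no longer matches at all. The sketch below illustrates this; `capitalisedSentenceWith` is a hypothetical name and `word` is assumed to contain no regex metacharacters.

```javascript
// Hypothetical wrapper around the capital-letter version of the pattern.
// Assumption: `word` contains no regex metacharacters.
function capitalisedSentenceWith(text, word) {
    var re = new RegExp("(?:^|\\.|;)\\s?([A-Z][^.]*" + word + "[^.]*)(?:$|\\.)");
    var m = re.exec(text);
    return m ? m[1] : null;
}

// A complete sentence is found as before.
var whole = capitalisedSentenceWith(
    "The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy");

// A fragment starting in lower case is rejected, because the capture must
// begin with [A-Z] and the only capitalised sentence does not contain the word.
var fragment = capitalisedSentenceWith("was happy to see me. I left in a hurry", "happy");
```

Here `whole` is `"He was happy to see me"` and `fragment` is `null`.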

Philip M 01-31-2013 07:11 PM

Here's my suggestion:-

Code:

<html>
<head>
</head>
<body>

Enter word to find <input type = "text" id = "theword" onblur = "findit()">

<script type = "text/javascript">

var text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ipsum leo, scelerisque at dapibus ac, consectetur vel ipsum. Morbi et metus ut diam molestie ullamcorper. Suspendisse rutrum semper semper. Donec volutpat neque in lorem tempus scelerisque. Curabitur dignissim rhoncus quam ac suscipit. Donec viverra quam lobortis neque porta a sagittis urna tristique. Suspendisse nec lacus nisi. Pellentesque fermentum massa sit amet magna hendrerit vestibulum. Sed elit libero, scelerisque eu eleifend ut, interdum gravida nunc. Etiam ut nisi sapien, et tempus sem. Nam vel mi est. Mauris congue felis ut ante bibendum vehicula. Nullam nec sapien arcu, eget cursus lorem. Donec blandit, dolor tristique ornare dictum, arcu sapien vulputate dolor, et placerat risus odio ut magna. Ut magna mauris, pellentesque at ultricies vitae, fermentum vitae dolor."

//var ts = text.split(/\.|;/);  // split at period or semi-colon
var ts = text.split(".");  // split at period only

function findit() {
    var tofind = document.getElementById("theword").value;   // read the word once, not on every pass
    // case insensitive; the global flag is not needed, and test() with "g" keeps state between calls
    var regexp = new RegExp(tofind, 'i');
    var intext = false;
    for (var i = 0; i < ts.length; i++) {
        if (regexp.test(ts[i])) {
            intext = true;
            alert("The word " + tofind + " was found in the sentence:- " + "\n" + ts[i]);
        }
    }
    if (!intext) { alert("The word " + tofind + " was not found."); }
}

</script>

</body>
</html>

xelawho 01-31-2013 07:53 PM

thanks Andrew - the first one was very close. I changed it to
Code:

(?:|^)?[\w]([^.]*wibble[^.]*)($:|\.|\?|\!|$)
to start the capture at the beginning of the sentence or the beginning of the string, instead of the end of the previous one, and to end on a full stop, exclamation mark, question mark or just the end of the string.

seems right to me. Thank you both for your suggestions.

xelawho 01-31-2013 08:01 PM

no, wait - that doesn't work. it ends if the sentence ends with a full stop, but keeps going if it is a ! or ? :(
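One likely reason: `[^.]*` excludes only the full stop, so it happily runs through any `!` or `?` on its way to the next `.` or the end of the string. A minimal sketch of the idea of excluding all three terminators follows; `sentenceAround` is a hypothetical name, and `word` is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper. Assumption: `word` contains no regex metacharacters.
function sentenceAround(text, word) {
    // [^.!?]* cannot cross any sentence terminator, so the match stays inside
    // one sentence; the optional [.!?] picks up the terminator if there is one.
    var re = new RegExp("[^.!?]*" + word + "[^.!?]*[.!?]?");
    var m = re.exec(text);
    // Drop the whitespace left over from the gap after the previous sentence.
    return m ? m[0].replace(/^\s+/, "") : null;
}

// For comparison: excluding only "." lets the match run past the "?".
var tooMuch = /[^.]*happy[^.]*/.exec("Was she happy? I think so.")[0];
```

With this input `tooMuch` is `"Was she happy? I think so"`, whereas `sentenceAround("Was she happy? I think so.", "happy")` returns `"Was she happy?"`.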

Philip M 01-31-2013 08:02 PM

Use mine!

Code:

var ts = text.split(/\.|;|\?|!/);  // split at period or semi-colon or ? or !
Does your regex allow you to find a variable word? Or a phrase? Not just wibble!
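On the variable word or phrase: any of the patterns in this thread can be built with the `RegExp` constructor, but user input should probably be escaped first, since characters such as `.` or `(` have a special meaning inside a pattern. A small sketch, where `escapeRegExp` and `makeFinder` are hypothetical helper names:

```javascript
// Hypothetical helper: put a backslash in front of every regex metacharacter
// so that the input is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: a case-insensitive regex for a literal word or phrase.
function makeFinder(phrase) {
    return new RegExp(escapeRegExp(phrase), "i");
}
```

Without the escaping, a search for "a.b" would also match "aXb", and a search for "(" would throw a syntax error when the regex is constructed.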

xelawho 01-31-2013 08:23 PM

Here's the thing: let's say the string is this:
"The dog jumped over the moon. He was happy to see me. I left in a hurry"

and the word is "happy"

in that case, all I want is
"He was happy to see me."

If it's
"was happy to see me. I left in a hurry"

all I want is
"was happy to see me."

If it's
"The dog jumped over the moon. He was happy to see"

all I want is:
"He was happy to see"

splitting it on the punctuation is probably the safest way, but then I have to loop through the array to find out which split is the one that I want. Which is why regex seems to be the answer...
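For what it's worth, the loop over the pieces can be quite short. The sketch below uses `match` with a global regex rather than `split`, so that each piece keeps its own terminator. The name `firstSentenceContaining` is hypothetical, and the word is matched as a plain, case-sensitive substring.

```javascript
// Hypothetical helper: cut the text into sentence-like pieces and return the
// first piece that contains `word`, or null if none does.
function firstSentenceContaining(text, word) {
    // Each piece is a run of non-terminators plus the terminator that ends it, if any.
    var pieces = text.match(/[^.!?]+[.!?]?/g) || [];
    for (var i = 0; i < pieces.length; i++) {
        if (pieces[i].indexOf(word) !== -1) {
            // Remove the space that followed the previous sentence.
            return pieces[i].replace(/^\s+/, "");
        }
    }
    return null;
}
```

On the three example strings above, with the word "happy", this returns "He was happy to see me.", "was happy to see me." and "He was happy to see" respectively.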

Old Pedant 01-31-2013 08:43 PM

And what about
"aardvarks whistle. happy dogs bark"
???

What do you want to get out of that?

Logically, it would be "happy dogs bark", as the period before "happy" belongs in another sentence. But it's your call.

xelawho 01-31-2013 08:49 PM

in that case I would want happy dogs bark

but sentences will always begin with a capital, and end with . or ! or ?

the problem is that the string that contains the word may not be a complete sentence.

Old Pedant 01-31-2013 09:18 PM

Here's my answer.

I'll let you figure out if you can combine the four regexps into one.

Note that I stop on the first match, because some text patterns will match more than one of the regexps, but the regexps are purposely ordered by most desirable match.

The hack to get rid of a leading period is just that: a hack. But it works.

Code:

<script type="text/javascript">
function findSentenceByWord( text, word )
{
    var re1 = new RegExp( "[A-Z\\.][^A-Z\\.]+?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
    var re2 = new RegExp( "^[\\s\\S]*?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
    var re3 = new RegExp( "[A-Z\\.][^A-Z\\.]+?" + word + "[\\s\\S]*$", "" );
    var re4 = new RegExp( "^[\\s\\S]*?" + word + "[\\s\\S]*$", "" );
    var res = [ re1, re2, re3, re4 ];
    for ( var r = 0; r < res.length; ++r )
    {
        var re = res[r];
        if ( re.test( text ) )
        {
            document.write("Match on regexp " + (r+1) + "<br/>");
            var m = text.match(re)[0];
            if ( m.charAt(0) == "." ) { m = m.substring(1); }
            document.write( m + "<br/>");
            return;
        }
    }
}

function demo( text, word )
{
    document.write( "<hr/>Testing <i><b>" + text + "</b></i> for word " + word + "<br/>" );
    findSentenceByWord( text, word );
}   

demo( "The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy" );
demo( "was happy to see me. I left in a hurry", "happy" );
demo( "The dog jumped over the moon. He was happy to see", "happy" );
demo( "aardvarks whistle. happy dogs bark", "happy" );
demo( "happy happy happy! and even more happy?", "happy" );
demo( "all the happy dogs", "happy" );
</script>

I dump out which regexp matched so that you can see that indeed all 4 are needed, depending on the input.

Old Pedant 01-31-2013 09:20 PM

Quote:

Originally Posted by xelawho (Post 1310116)
in that case I would want happy dogs bark

but sentences will always begin with a capital, and end with . or ! or ?

If that is true, why did you include this example:
Quote:

If it's
"was happy to see me. I left in a hurry"

"was happy to see me." does not start with a capital letter.

My answer includes code to handle that case. It could be less code if you were *SURE* that a sentence always starts with a capital letter.

AndrewGSW 01-31-2013 09:22 PM

This revision
Code:

(?:|^)?[\w]([^.]*wibble[^.]*)($:|\.|\?|\!|$)
is incorrect. Should be
Code:

(?:^|\.|\?\!)?[\w]([^.]*wibble[^.]*)(?:\.|\?|\!|$)
(?: denotes a non-capturing group, and the | at the beginning was incorrect. So the previous sentence might also end with a ? or !
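A small illustration of the difference between a capturing and a non-capturing group, using a made-up test string. With `(?: ... )` the delimiter still has to match, but it does not take up a slot in the result array.

```javascript
var sample = "One. Two wibble three.";   // made-up test string

// Capturing group for the delimiter: the sentence ends up in group 2.
var withCapture = /(^|\.)\s?([^.]*wibble[^.]*)/.exec(sample);

// Non-capturing group for the delimiter: the sentence is group 1.
var withoutCapture = /(?:^|\.)\s?([^.]*wibble[^.]*)/.exec(sample);
```

Here `withCapture` has three entries (the whole match, ".", and "Two wibble three"), while `withoutCapture` has two (the whole match and "Two wibble three").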

Old Pedant 01-31-2013 09:34 PM

Here's a slightly better version. Handles the sentence *before* "happy" ending with ? or ! (not just period).

Has the interesting effect of changing *which* "happy" is found in demo #5. If you really wanted the first one found, I could fix it to do that. But I'm assuming that's a case you aren't too worried about.
Code:

<script>
function findSentenceByWord( text, word )
{
    var re1 = new RegExp( "[A-Z\\.\\?\\!][^A-Z\\.\\?\\!]+?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
    var re2 = new RegExp( "^[\\s\\S]*?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
    var re3 = new RegExp( "[A-Z\\.\\?\\!][^A-Z\\.\\?\\!]+?" + word + "[\\s\\S]*$", "" );
    var re4 = new RegExp( "^[\\s\\S]*?" + word + "[\\s\\S]*$", "" );
    var res = [ re1, re2, re3, re4 ];
    for ( var r = 0; r < res.length; ++r )
    {
        var re = res[r];
        if ( re.test( text ) )
        {
            document.write("Match on regexp " + (r+1) + "<br/>");
            var m = text.match(re)[0];
            m = m.replace( /^[\.\?\!]?\s*/, "" );
            document.write( m + "<br/>");
            return;
        }
    }
}

function demo( text, word )
{
    document.write( "<hr/>Testing <i><b>" + text + "</b></i> for word " + word + "<br/>" );
    findSentenceByWord( text, word );
}   

demo( "The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy" );
demo( "was happy to see me. I left in a hurry", "happy" );
demo( "The dog jumped over the moon. He was happy to see", "happy" );
demo( "aardvarks whistle. happy dogs bark", "happy" );
demo( "aardvarks whistle dixie! happy dogs bark", "happy" );
demo( "happy happy happy! and even more happy?", "happy" );
demo( "all the happy dogs", "happy" );
</script>
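A variant of the same idea that returns its result instead of writing to the page, which makes it easier to call from other code or to test outside a browser. This is only a sketch: the names `locateSentence` and `escapeRegExp` are hypothetical, and escaping the word is an addition that the code above does not have.

```javascript
// Hypothetical helper: escape regex metacharacters in the search word.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Tries the same four patterns in the same order and returns the first hit as
// { rule: 1..4, sentence: "..." }, or null when the word does not occur.
function locateSentence(text, word) {
    var w = escapeRegExp(word);
    var patterns = [
        "[A-Z.?!][^A-Z.?!]+?" + w + "[^.?!]*[.?!]",   // capital or terminator ... word ... terminator
        "^[\\s\\S]*?" + w + "[^.?!]*[.?!]",           // start of string ... word ... terminator
        "[A-Z.?!][^A-Z.?!]+?" + w + "[\\s\\S]*$",     // capital or terminator ... word ... end of string
        "^[\\s\\S]*?" + w + "[\\s\\S]*$"              // start of string ... word ... end of string
    ];
    for (var r = 0; r < patterns.length; r++) {
        var m = new RegExp(patterns[r]).exec(text);
        if (m) {
            // Strip a leading terminator and whitespace picked up from the previous sentence.
            return { rule: r + 1, sentence: m[0].replace(/^[.?!]?\s*/, "") };
        }
    }
    return null;
}
```

For instance, `locateSentence("aardvarks whistle. happy dogs bark", "happy")` gives `{ rule: 3, sentence: "happy dogs bark" }`.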


Old Pedant 01-31-2013 09:41 PM

Andrew: I'm pretty sure this is wrong:
(?:^|\.|\?\!)

The ^ character only means negation when used inside of [ ].

In any case, you forgot the | between \? and \! if you were looking for "or" conditions. And also, in any case, you are missing parens.

But I'm pretty sure that should be
(?:[^\.\?\!])
But I think that
(?!(\.|\?|\!))
would also work. (?! starts a *negative lookahead*: it matches only where the enclosed pattern does not follow, and it consumes no characters. In that construct the ! is the negation character, not the ^
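For reference, a short sketch of how the constructs under discussion behave in JavaScript. The variable names and test strings are made up for illustration.

```javascript
// Outside [ ], ^ anchors the match to the start of the string.
var anchor = /^a/;

// Inside [ ], a leading ^ negates the class: any single character except ".".
var negatedClass = /[^.]/;

// (?! ... ) is a negative lookahead: "a" only where it is not followed by ".".
// The lookahead itself consumes nothing.
var lookahead = /a(?!\.)/;

// Start of string OR a literal ".", wrapped in a non-capturing group.
var eitherStart = /(?:^|\.)\s?happy/;
```

So `anchor.test("ba")` is false, `negatedClass.test(".")` is false, `lookahead.test("a.")` is false, and `eitherStart` matches both "happy" at the start of a string and ". happy" after a full stop.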

Did you test it? Against many samples, as I did?

*********

EDIT: I did test it.

I tested both your version:
/(?:^|\.|\?|\!)?[\w]([^.]*happy[^.]*)(?:\.|\?|\!|$)/
(I added the missing | before the first \!)

And my modification:
/(?:[^\.\?\!])?[\w]([^.]*happy[^.]*)(?:([\.|\?|\!]|$))/;

Neither passed all tests.
Neither could find "happy" in aardvarks whistle. happy dogs bark

Neither isolated the sentence in either
aardvarks whistle dixie! happy dogs bark
or
happy happy happy! and even more happy?
(that is, in both cases they returned the entire test string)

I will say that your (?:^|\.|\?|\!) seemed to have mostly worked. Surprised me.


All times are GMT +1.