Function to strip out non-alphabetical characters?



jeddi
11-04-2009, 09:05 PM
Hi,

Is there any function to strip out non-alphabetical characters so that
I only get words left ?

I don't need anything with numbers or other non-alphabetical characters
All I want is the words.

Is there any function for that ? or do I have to use
a regex for that ?

met
11-04-2009, 09:54 PM
A quick Google search turned up this:



function allowAlphabets($string){

//create an array which has only allowed characters set
$allow_characters=array('a','b','c','d','e','f','g','h','i','j','k',
'l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B',
'C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S',
'T','U','V','W','X','Y','Z');

//start of the regex pattern: a negated character class
//(no parentheses here: inside [...] they are literal characters and would be kept)
$pattern = "@[^";

//append each allowed character to the class
foreach ($allow_characters as $char) {
$pattern .= preg_quote($char, "@");
}

//close the character class and the pattern
$pattern .= "]@";

//replace every non-alphabetic character with a space
$after = preg_replace($pattern, " ", $string);

//return the string with only letters and spaces left
return $after;
}



http://blog.sachinkraj.com/how-to-strip-all-non-alphabetic-characters-from-a-string/

That version builds its regex in a loop, but yes, a single short regex would do the same job.
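
For comparison, here is a minimal sketch of the same idea written as one character class. The function name stripNonLetters is made up for illustration and does not come from the linked blog post; like the function above, it replaces each non-letter with a space rather than deleting it.

```php
<?php
// Hypothetical helper; the name is illustrative only.
function stripNonLetters(string $string): string
{
    // [^a-zA-Z] matches any single character that is not an ASCII letter.
    return preg_replace('/[^a-zA-Z]/', ' ', $string);
}

echo stripNonLetters("Hello, World 42!"), "\n";
```

Because every removed character becomes a space, the result can contain runs of spaces; whether that matters depends on what is done with the string next.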

kbluhm
11-04-2009, 10:09 PM
$alpha = preg_replace( '/[^a-z\s]/i', '', $text );

$words = preg_split( '/\s+/', $alpha, -1, PREG_SPLIT_NO_EMPTY );

print_r( $words );


Note that punctuated words such as "don't" will come out unpunctuated, i.e. "dont" (is that even a word?).
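
If contractions matter, one possible variation is to add the apostrophe to the allowed set. This is only a sketch under that assumption; the sample sentence is invented for the example.

```php
<?php
$text = "Don't stop, it's 5 o'clock!";

// Same approach as above, with the apostrophe added to the allowed characters.
$alpha = preg_replace('/[^a-z\'\s]/i', '', $text);

// Split on runs of whitespace and drop empty pieces.
$words = preg_split('/\s+/', $alpha, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

The trade-off is that apostrophes used as quotation marks would be kept as well, so this only helps when the input uses them mainly inside words.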

tomws
11-04-2009, 10:15 PM
That seems a bit more complicated than necessary if just looking for letters and spaces. Won't a simple preg_replace like this work?

$str = preg_replace('/[^\w ]/i','',$str);
Test with this:

$str = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus commodo ipsum vel lorem fermentum pretium. Ut lacus lorem, tempus et condimentum at, aliquam in quam. Sed vulputate orci non lectus varius non blandit odio ornare. Nulla vulputate mi tristique magna facilisis pulvinar. Nullam mattis tincidunt cursus. Ut semper mollis sollicitudin. Vivamus varius velit in velit lacinia sed tincidunt libero tincidunt. Nullam nulla urna, consectetur ut mattis sed, sollicitudin et mauris. Cras laoreet placerat tellus, in vulputate ipsum pharetra id. In tellus metus, bibendum ut vulputate vel, congue ac odio. Etiam tempor consequat tellus, vel ultrices mauris laoreet eget. Cras felis felis, tristique eu aliquet vitae, tincidunt non purus. Ut bibendum pellentesque risus ut porta. Cras vel nibh mauris. Etiam sollicitudin gravida felis quis dictum. Nullam hendrerit scelerisque tellus ac mollis. Duis scelerisque, ante dictum mollis mattis, nibh risus eleifend nunc, at egestas tortor quam nec augue. Nullam eleifend est ut neque facilisis ac volutpat orci commodo. ';
echo $str,"<br/><br/>";
$str = preg_replace('/[^\w ]/i','',$str);
echo $str,"<br/><br/>";

When you say "words", I presume you mean you want to keep the spaces, too. If not, remove the space from the regex.

EDIT: kbluhm got it while I was pasting.

jeddi
11-05-2009, 09:23 AM
Thanks for all the replies :)

To clarify, I don't want to change any data, just ignore the non-alphabetical
blocks, so "%6fgTw" and "Gr^>>ht))" should both be ignored but "henry"
would make it in.

What does the w
in ('/[^\w ]/i','',$str); mean ?

I guess that it means word ?
Is that a dictionary word or any alphabetical block of letters surrounded by spaces ?

I was passed this regex which finds 3 and 4 word phrases
that have 1-5 letters in them.

#((?:\b\w{1,5}\b\s+){3,4})#

If someone can talk me through this, I would really like to understand it.
what is the b for ?

Thanks

tomws
11-05-2009, 03:26 PM
To clarify, I don't want to change any data, just ignore the non-alphabetical
blocks, so "%6fgTw" and "Gr^>>ht))" should both be ignored but "henry"
would make it in.

That's different from your original problem description of removing non-alphabetic characters. You rather want to remove any words which contain any non-alphabets. EDIT: Correcting myself after a re-read, you don't even want to remove, but just ignore. Clarify what you're trying to do - in specific terms.
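
If "ignore" means "skip any whitespace-separated block that is not purely letters", a rough sketch could look like the following. It assumes blocks are separated by whitespace and that the ctype extension is available (it normally is); the sample string reuses the examples from the question.

```php
<?php
$text = 'henry %6fgTw Gr^>>ht)) said hello';

// Split on runs of whitespace, dropping empty pieces.
$tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Keep only tokens made up entirely of letters; $text itself is left unchanged.
$words = array_values(array_filter($tokens, 'ctype_alpha'));

print_r($words);
```

So there is a built-in function for the test itself: ctype_alpha() returns true only when every character in the string is a letter, which makes mixed blocks easy to skip without altering the original data.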


What does the w
in ('/[^\w ]/i','',$str); mean ?

I guess that it means word ?
Is that a dictionary word or any alphabetical block of letters surrounded by spaces ?

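That's shorthand for a "word character": a letter, a digit, or an underscore. It has nothing to do with dictionary words. Since it also lets digits and underscores through, use [a-z] with the i flag (as kbluhm did) if you want letters only.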
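
A quick way to see what \w actually lets through is to run it next to an explicit letter class. The sample string and variable names are invented for this sketch.

```php
<?php
$str = 'abc_123 def!';

// \w covers letters, digits and the underscore, so only "!" is removed here.
$withW = preg_replace('/[^\w ]/', '', $str);

// An explicit letter class removes digits and underscores as well.
$lettersOnly = preg_replace('/[^a-z ]/i', '', $str);

echo $withW, "\n";       // abc_123 def
echo $lettersOnly, "\n"; // abc def
```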



I was passed this regex which finds 3 and 4 word phrases
that have 1-5 letters in them.

#((?:\b\w{1,5}\b\s+){3,4})#

If someone can talk me through this, I would really like to understand it.
what is the b for ?


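\b is a word boundary: a zero-width position between a word character (\w) and a non-word character, or the start or end of the string. It matches the position only and does not consume the space or tab itself. I'm not a regex specialist, but I think replacing the {1,5} with a + will remove the length limit, and swapping \w for [a-z] (with the i flag) will restrict it to letters. Note that \b on its own still sees the "Gr" in "Gr^>>ht))" as a word, so ignoring mixed-character "words" means checking each whitespace-separated block as a whole. Removing the {3,4} ought to remove the 3 or 4 words restriction.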
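
To illustrate how word boundaries behave on the sample input, here is a small sketch comparing \b with a stricter "whole block" match. The lookaround pattern is one possible approach, not something suggested earlier in the thread, and the variable names are made up.

```php
<?php
$text = 'henry %6fgTw Gr^>>ht)) said';

// \b alone: "Gr" and "ht" each sit between word boundaries, so they match too.
preg_match_all('/\b[a-z]+\b/i', $text, $m);
$withBoundaries = $m[0];

// Lookarounds: no non-whitespace character directly before or after the letters,
// so only blocks made up entirely of letters match.
preg_match_all('/(?<!\S)[a-z]+(?!\S)/i', $text, $m);
$wholeTokens = $m[0];

print_r($withBoundaries); // henry, Gr, ht, said
print_r($wholeTokens);    // henry, said
```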

jeddi
11-06-2009, 06:04 AM
That's great,
Thanks Tom for helping out
I do look at the manual a lot but finding these
little bits is sometimes hard, so I appreciate the explanation. :thumbsup:

BTW -

in:
#((?:\b\w{1,5}\b\s+){3,4})#

any idea what the s means ?

kbluhm
11-06-2009, 08:55 AM
in:
#((?:\b\w{1,5}\b\s+){3,4})#

any idea what the s means ?

Any whitespace character.

http://www.php.net/manual/en/regexp.reference.backslash.php
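
Putting the pieces from this thread together, the pattern can be written out with the x modifier so that each part carries a comment. This is only a sketch: the delimiter is changed from # to / because # starts a comment in extended mode, and the sample sentence is invented.

```php
<?php
$pattern = '/
    (                 # capture the whole phrase
      (?:             # non-capturing group: one word plus the gap after it
        \b\w{1,5}\b   # 1 to 5 word characters between word boundaries
        \s+           # one or more whitespace characters
      ){3,4}          # the group repeated 3 or 4 times
    )
/x';

preg_match($pattern, 'the cat sat on a mat today', $m);

// The repetition is greedy, so it takes 4 words when it can.
var_dump($m[1]); // "the cat sat on " including the trailing space
```

Note that the captured phrase ends with whitespace because \s+ is inside the repeated group; trim() would tidy that up if needed.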


