...

in_array always returning false



cfructose
07-16-2007, 08:03 PM
in_array($current_letter, $greek_chars) always returns false, even when the variable $current_letter holds a Greek character.

I'm trying to get 'span lang' tags inserted automatically into a document around isolated Greek words:


// add <span lang> tags around Greek words
$greek_chars = array("α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "ς", "τ", "υ", "φ", "χ", "ψ", "ω", "ά", "έ", "ή", "ί", "ό", "ύ", "ώ", "ϊ", "Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Ά", "Έ", "Ή", "Ί", "Ό", "Ύ", "Ώ", "Ϊ");
$greek_lang_open = "<span lang=\"el\">";
$greek_lang_close = "</span>";
$span_open_length = strlen($greek_lang_open); //measures how many chars there are so we can jump forward later after inserting
$span_close_length = strlen($greek_lang_close); //measures how many chars there are so we can jump forward later after inserting
$pos=0; //pointer
$cc_text_length = strlen($cc_text);
$gr1 = "no";
$gr2 = "no";

while ($pos<$cc_text_length) { //scan file
    $current_letter = substr($cc_text,$pos,1); //grab next character from pointer
    if ($gr1 == "no" && $gr2 == "no" && in_array($current_letter, $greek_chars)) {
        $gr1 = "yes";
        $cc_text = str_replace($current_letter,$greek_lang_open.$current_letter,$cc_text); // add open span lang tag
        $pos = $pos+$span_open_length;
    }
    elseif ($gr1 == "yes" && in_array($current_letter, $greek_chars)) {
        $gr1 = "no";
        $gr2 = "yes";
    }
    elseif ($gr2 == "yes" && !(in_array($current_letter, $greek_chars))) {
        $gr2 = "no";
        $gr1 = "no";
        $cc_text = str_replace($current_letter,$greek_lang_close.$current_letter,$cc_text); // add close span lang tag
        $pos = $pos+$span_close_length;
    }
    $pos++; //increment the pointer
    $cc_text_length = strlen($cc_text); //re-measure length of $cc_text, as it's changed from our adding span tags
}

Any ideas what silly thing I've done wrong here?

Also, it's terribly slow - could anyone push me down a different path...?
:-)

Cheers

NancyJ
07-16-2007, 08:14 PM
how long is the text that you're processing? (can you give a sample?)

cfructose
07-16-2007, 08:24 PM
Each article = about four or five A4 pages of text.

Typical sentences are 90% English, with Greek words dotted all over the place in linguistic discussion, usually with English translations afterwards.

NancyJ
07-16-2007, 08:46 PM
Depending on the speed of your server, this may run quicker than what you've got:



$cc_words = explode(" ", $cc_text);
foreach($cc_words as $key => $word)
{
    $cc_words[$key]= process_greek($word);
}
$cc_text = implode(" ", $cc_words);

function process_greek($word)
{
    $greek_lang_open = "<span lang=\"el\">";
    $greek_lang_close = "</span>";
    $greek_chars = array("α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "ς", "τ", "υ", "φ", "χ", "ψ", "ω", "ά", "έ", "ή", "ί", "ό", "ύ", "ώ", "ϊ", "Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Ά", "Έ", "Ή", "Ί", "Ό", "Ύ", "Ώ", "Ϊ");
    $letters = str_split($word);
    foreach($letters as $letter)
    {
        if(in_array($letter, $greek_chars))
        {
            return $greek_lang_open.$word.$greek_lang_close;
        }
    }
    return $word;
}

~0.3s on my PC for 6 sides of A4 (12pt Times)

It doesn't solve your big problem of the Greek characters not being recognised, but it at least runs faster.
It works, but pretty much the only word it didn't flag as Greek was 'in'. The problem is in the way it's identifying the Greek letters. I tried running htmlentities on the characters but most weren't in my table. If you're lucky your table might have more of them than mine (print_r(get_html_translation_table(HTML_ENTITIES));), so it's worth a try.

Sorry I couldn't come up with a more useful solution. But if PHP thinks Γ is a G then it will match them :(
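
For what it's worth, one likely reason the comparison misbehaves: in UTF-8 each Greek letter takes two bytes, while substr($text, $pos, 1) and str_split() both hand back single bytes, so no single piece can ever equal a whole letter in $greek_chars. The sketch below assumes the text is UTF-8; $word, $bytes and $chars are illustrative names only, not anything from the scripts above.

```php
<?php
// "φίλος" written with escapes so the example does not depend on how this file is saved.
$word = "\u{03C6}\u{03AF}\u{03BB}\u{03BF}\u{03C2}";

// str_split() cuts the string into single bytes, so every two-byte
// Greek letter is broken in half.
$bytes = str_split($word);

// Splitting on an empty pattern with the /u modifier yields whole characters.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // 10 (bytes)
echo count($chars), "\n"; // 5 (characters)

var_dump(in_array("\u{03C6}", $bytes, true)); // bool(false): no single byte equals "φ"
var_dump(in_array("\u{03C6}", $chars, true)); // bool(true)
```

Splitting by character rather than by byte should let the in_array() test behave as intended, provided the array literals in the script are saved in the same encoding as the text.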

cfructose
07-16-2007, 09:00 PM
Unbelievably helpful and erudite response - thanks a lot!
:-)

Well, this lovely little bit of code I once saw on a forum will turn all the Greek characters (along with all high-level UTF-8 stuff) into, say, &#916; etc.


$cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
$cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);

I guess I could get around the 'how to get php to recognise Greek chars' problem that way, but then how would you suggest I go about searching for each ASCII code?

Another snag that I've just noticed (well, much bigger than a snag!) is that several articles have many whole sentences in Greek rather than only individual words. If I move my 'pointer' along to the next character and allow it to continue past spaces, punctuation etc. until it bumps into a group of 6 symbols that doesn't have the format & # digit digit digit ; would that work?

I'm ridiculously out of my depth here, sorry!
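
The "keep going past spaces and punctuation until something non-Greek turns up" idea can also be written as a single regular expression on the raw UTF-8 text, without converting to entities first. This is only a sketch: wrap_greek() is a made-up name, the list of separators is a guess at what should count as "inside" a Greek passage, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: wraps each run of Greek words in one <span lang="el">.
function wrap_greek(string $text): string
{
    // \p{Greek} matches any character of the Greek script (needs the /u modifier).
    // A run is one Greek word, optionally followed by more Greek words that are
    // separated only by whitespace or simple punctuation.
    $pattern = '/\p{Greek}+(?:[\s,.;\']+\p{Greek}+)*/u';
    $result = preg_replace($pattern, '<span lang="el">$0</span>', $text);
    return $result ?? $text; // preg_replace() returns null on invalid UTF-8
}

echo wrap_greek("Ο φίλος βοήθησε το Δημήτρη. The friend helped Demetris."), "\n";
echo wrap_greek("the word for friend (φίλος), and the name"), "\n";
```

With this pattern a whole Greek sentence should end up in one span, with the trailing full stop and the English left outside. It does not skip text inside HTML tags, so attribute values containing Greek would be wrapped too.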

NancyJ
07-16-2007, 09:27 PM
There aren't any ASCII codes for those characters. They might exist in ANSI extended ASCII but I'm not sure (non-standard characters aren't my forte).
If you run your Greek alphabet through ord(), most of them come up with 63 (?)

This site lists the HTML entities and hex codes for the characters, but I don't know if that will be any use to you. I don't know if the regexp will do a better job of character recognition.
http://htmlhelp.com/reference/html40/entities/symbols.html

Of course, whatever solution you come up with, you'll get a lot of non-Greek words matched because your array contains A,B,E,Z,H,I,K,M,N,O,P,T,Y (unless my computer is reading those characters wrong)
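
On the look-alike capitals and the ord() result: in UTF-8 the Greek capitals are separate characters from the Latin ones even where they look identical, and ord() only reports the first byte of a multi-byte character. A result of 63 ('?') usually suggests the letters had already been replaced by question marks when the file was saved in a non-Unicode encoding, though that is a guess about the setup here. A small check, with illustrative variable names:

```php
<?php
$latin_a     = "A";        // U+0041, one byte
$greek_alpha = "\u{0391}"; // Greek capital alpha, two bytes in UTF-8

var_dump($latin_a === $greek_alpha); // bool(false): different characters
echo bin2hex($latin_a), "\n";        // 41
echo bin2hex($greek_alpha), "\n";    // ce91
echo ord($greek_alpha), "\n";        // 206, the first byte only
echo ord("?"), "\n";                 // 63
```

So an array of genuine Greek capitals should not match Latin A, B, E and so on, as long as both the script and the text really are UTF-8.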

cfructose
07-16-2007, 10:12 PM
Ah, I was most probably using the wrong term when I said ASCII - did I mean ANSI? Well, regardless, the code:


$cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
$cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);

that I mentioned before renders, for example, περνά η μέλλισα

as:


&amp;#960;&amp;#949;&amp;#961;&amp;#957;&amp;#940; &amp;#951; &amp;#956;&amp;#941;&amp;#955;&amp;#955;&amp;#953;&amp;#963;&amp;#945;

in 'view source', which displays in the browser perfectly (imagine that each "amp;" isn't there!)
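
A note for anyone trying those two lines on a newer PHP: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so they no longer run there. A rough equivalent using preg_replace_callback() is sketched below. utf8_to_entities() is a made-up name, and the sketch assumes the text is valid UTF-8.

```php
<?php
// Hypothetical helper: replaces every non-ASCII UTF-8 character with &#NNN;.
function utf8_to_entities(string $text): string
{
    $result = preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $bytes = array_map('ord', str_split($m[0]));
        // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
        $code = $bytes[0] & (0xFF >> (count($bytes) + 1));
        // Each continuation byte contributes its low 6 bits.
        foreach (array_slice($bytes, 1) as $byte) {
            $code = ($code << 6) | ($byte & 0x3F);
        }
        return '&#' . $code . ';';
    }, $text);
    return $result ?? $text; // null means the input was not valid UTF-8
}

// "περνά", written with escapes so the file encoding does not matter
echo utf8_to_entities("\u{03C0}\u{03B5}\u{03C1}\u{03BD}\u{03AC}"), "\n";
// prints &#960;&#949;&#961;&#957;&#940;
```

The arithmetic is the same as in the original pair of expressions (strip the marker bits, then shift in 6 bits per following byte), just done in a callback instead of evaluated code.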

As for the Greek capital letters that look the same as the Latin ones, it didn't occur to me that it would have the same encoding (duh!) - That is a problem.

Should I conclude that it's simply too ambitious to hope that I might automate all those TENS OF THOUSANDS of span tags? Or perhaps there's a wildly different approach you might suggest...?

Again, thanks so much for your invaluable input and time.

cfructose
07-16-2007, 10:36 PM
By the way, I just tried using the code you suggested, but I'm thwarted once again by absolutely NO words being identified as Greek. It's effectively telling me the same thing as was happening with my code: there are no matches between the data in $cc_text and the array.

Is it possible that the cause has something to do with the ob_start() I'm using earlier?

My code is:


ob_start();
include ("text/$include_name");//puts main text data into $cc_text
$cc_text = ob_get_clean();
foreach ($definition as $term => $def) {// Find and replace glossary terms
$cc_text = str_replace( $term, '<span class="help" title="' . strip_tags( $def ) . '">' . $term . '</span>', $cc_text );
}

...shortly after which I ran the span tag insertion part.

Could there be a connection?

NancyJ
07-16-2007, 10:43 PM
I tried running the code you provided on "περνά η μέλλισα" and nothing happened. Is there some other step that you're not mentioning here? I ran it on a passage of text that had some Greek words in it and it only converted é and into something very strange. It looked like kanji.

The ob_start shouldn't affect the code. It's interesting that nothing is being matched for you, whereas nearly everything is being matched for me. Can you post your whole script (and, if possible, a snippet of an article you're trying to match)?

cfructose
07-16-2007, 11:13 PM
The only other 'step' I can think of is that my code is saved as UTF-8 (no BOM, though that shouldn't make any difference), and that the font displaying the text can handle everything in Unicode.

What on earth might give the two of us such different results?

Well, for what it's worth, here's a large chunk of code that I'm sure is 99% irrelevant, but here we go:


<?php
include ("phpinserts/lang_choices.php");

$include_name = basename($_SERVER['PHP_SELF']); // set the name of the data file to include as the same as the root file
$chapter = substr($include_name, 0, 2);
$subchapter = substr($include_name, 3, 2);

if (substr($include_name, 0, 9) != "subscribe") {
include ("phpinserts/log_in_form.php");
}

// defining security variables
$secure_ka = "secure_".$chapter."_".$subchapter;
//$include_name_stripped = substr($include_name, 0, strrpos($include_name, '.')); //strrpos strips everything from the "." I.e. the ".php" filename.

if ($logged_in == "no" && $include_name == "password_change.php") {
echo "Please log in to change your password.";
}
elseif ($payment_received != "yes" && ($include_name == "subscribe_step_2.php" || $include_name == "subscribe_step_3.php")) {
echo "<div class=\"bigemptyspace\"></div><div class=\"bigemptyspace\"></div><p class=\"alertbox\" style=\"vertical-align: super;\"><img src=\"images/no_entry.gif\" alt=\"".$please_buy."\" style=\"width: 2em; height: 2em;\"/>Access Denied!<img src=\"images/no_entry.gif\" alt=\"".$please_buy."\" style=\"width: 2em; height: 2em;\"/></p><span class=\"warning\">Please complete <i>PayPal</i> payment in order to proceed.</span> <a href=\"subscribe_step_1.php\">Return to Step 1</a>";
}
elseif ($$secure_ka != 1) { // a 'variable variable' to compare $secure_ka (the $secure variable name as data) with a variable name created from that data (which was already definied as either 0 or 1 in config). Prevents direct access to 'members only' files via URL.
if ($include_name != "login.php" && $include_name != "logout.php") {
echo "<br/><h1>".$page."</h1><h1 style=\"position: relative; left: -0.07em; top: -1.25em; color: #00b8ff; font-size: 1.7em; font-weight: bold;\">".$page."</h1>"; // print title with shading
}


// cc text input + str_replace of all <span> glossary terms
ob_start();
if ($page == "Home") {
include ("text/home.php");
}
else {
include ("text/$include_name"); //puts main text data into $cc_text
}
$cc_text = ob_get_clean();
foreach ($definition as $term => $def) { // Find and replace terms
$cc_text = str_replace( $term, '<span class="help" title="' . strip_tags( $def ) . '">' . $term . '</span>', $cc_text );
}


// add <span lang> tags around Greek words
$greek_chars = array("α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "ς", "τ", "υ", "φ", "χ", "ψ", "ω", "ά", "έ", "ή", "ί", "ό", "ύ", "ώ", "ϊ", "Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Ά", "Έ", "Ή", "Ί", "Ό", "Ύ", "Ώ", "Ϊ");
$greek_lang_open = "<span lang=\"el\">";
$greek_lang_close = "</span>";
$span_open_length = strlen($greek_lang_open); //measures how many chars there are so we can jump forward later after inserting
$span_close_length = strlen($greek_lang_close); //measures how many chars there are so we can jump forward later after inserting
$pos=0; //pointer
$cc_text_length = strlen($cc_text);
$gr1 = "no";
$gr2 = "no";

while ($pos<$cc_text_length) { //scan file (source for the 'while' and 'pointer': http://gr2.php.net/strstr)
    $current_letter = substr($cc_text,$pos,1); //grab next character from pointer
    if ($gr1 == "no" && $gr2 == "no" && in_array($current_letter, $greek_chars)) {
        $gr1 = "yes";
        $cc_text = str_replace($current_letter,$greek_lang_open.$current_letter,$cc_text); // add open span lang tag
        $pos = $pos+$span_open_length;
    }
    elseif ($gr1 == "yes" && in_array($current_letter, $greek_chars)) {
        $gr1 = "no";
        $gr2 = "yes";
    }
    elseif ($gr2 == "yes" && !(in_array($current_letter, $greek_chars))) {
        $gr2 = "no";
        $gr1 = "no";
        $cc_text = str_replace($current_letter,$greek_lang_close.$current_letter,$cc_text); // add close span lang tag
        $pos = $pos+$span_close_length;
    }
    $pos++; //increment the pointer
    $cc_text_length = strlen($cc_text); //re-measure length of $cc_text, as it's changed from our adding span tags
}


//convert all quotes and Greek characters (all multi-byte UTF-8) to numeric HTML entities.
$cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
$cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);

echo $cc_text; // Display formatted article
?>

And an excerpt from one of the pieces of text to be included (upon which I'm trying to perform all the span inserts):


Going back to our example of "James helped his friend", and 'reversing the action', we get "The friend helped James". Note that both the word James (which is a noun (a proper noun)), and the word friend remain unaltered regardless of their rôle in the sentence – regardless of their case. We could say, therefore, that the nominative (subject) and accusative (object) cases of all nouns in English (except pronouns) are identical; there is nothing to help us distinguish their rôles other than their position in the sentence. In Greek, however, the word for friend (φίλος), and the equivalent to the name James (Δημήτρης) do change their form according to their rôle, i.e. all Greek nouns must be appropriately modified to account for their case:

Ο Δημήτρης βοήθησε το φίλο του. Demetris helped his friend.
Ο φίλος βοήθησε το Δημήτρη. The friend helped Demetris.

The underlined letters above show the parts which undergo change according to case. Note the final ς of the nouns which is present in the nominative, but absent in the accusative. To complicate matters further, the articles (the word the) also change according to the case of the noun, as can be seen above by the addition of a τ in the accusative.

NancyJ
07-16-2007, 11:36 PM
Heh, I couldn't even run the code; looks like my computer gave up. Interestingly, I've tried on 3 different servers and I can't get the text to output the Greek characters, although I obviously have them because I can see them in my editor and on this site and any others I look at. I really don't have much experience with non-ASCII character sets.

cfructose
07-17-2007, 12:24 AM
Well, thank you so much, Nancy. I may not have a solution, but I've learned a lot from you! :-)

cfructose
08-03-2007, 12:48 AM
After endless struggling, I've succeeded! I'm posting this in case anyone finds it useful...

(Note that the code, while working perfectly, could do with quite a bit of tightening to make it more efficient, eg, some regular expressions for the 'skip punctuation' bit etc)...


<?php
//convert all quotes and Greek characters (all multi-byte UTF-8) to numeric HTML entities.
$cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
$cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);


// create greek character array
$ansi=902;
while($ansi<=974) {
$greek_chars[] = "&#".$ansi.";";
$ansi++;
}
// add angled quotation marks to array
$greek_chars[] = "&#171;";
$greek_chars[] = "&#187;";

$greek_lang_open = "<span lang=\"el\">";
$greek_lang_close = "</span>";
$span_open_length = strlen($greek_lang_open); //measures how many chars there are so we can jump forward later after inserting
$span_close_length = strlen($greek_lang_close); //measures how many chars there are so we can jump forward later after inserting
$pos=0; //pointer
$cc_text_length = strlen($cc_text);
$current_letter_length = 6;
$first_greek_letter_found = "no";
$number_of_close_tag_occurences = 0;
$tag_detected = "no";

while ($pos<$cc_text_length) { //scan file
    //grab characters in search of Greek ANSI codes
    $current_letter = substr($cc_text,$pos,$current_letter_length);

    if ($tag_detected == "no") {

        //if 1 Greek letter has already been found:
        if ($first_greek_letter_found == "yes") {
            if (in_array($current_letter, $greek_chars)) {
                //if next one is Greek too, move pointer
                $pos = $pos+$current_letter_length-1;
            }
            //as long as the next group of 6 symbols doesn't start with space or punctuation, we'll put in the close span tag
            elseif ((substr($current_letter, 0, 1) != " ") && (substr($current_letter, 0, 1) != ",") && (substr($current_letter, 0, 1) != "'") && (substr($current_letter, 0, 1) != ";") && (substr($current_letter, 0, 1) != ".")) {
                // add close span tag
                $cc_text = substr_replace($cc_text, $greek_lang_close.$current_letter, $pos, $current_letter_length);
                //re-measure length of $cc_text, as it's changed from our adding span tags
                $cc_text_length = strlen($cc_text);
                $pos = $pos+$span_close_length;
                $first_greek_letter_found = "no";//reset the var
            }
        }

        //if no Greek letters have been found, but this one IS Greek:
        elseif ($first_greek_letter_found == "no" && in_array($current_letter, $greek_chars)) {
            $first_greek_letter_found = "yes";
            //add open span lang tag
            $cc_text = substr_replace($cc_text, $greek_lang_open.$current_letter, $pos, $current_letter_length);
            //re-measure length of $cc_text, as it's changed from our adding span tags
            $cc_text_length = strlen($cc_text);
            $pos = $pos+$current_letter_length+$span_open_length-1;
        }

        //if no Greek letter found, but 1st symbol is <
        elseif (substr($current_letter, 0, 1) == "<") {
            $tag_detected = "yes";
        }
    }

    //if tag_detected = yes, we want to skip all text until the close of the tag (">") to avoid putting spans within, say, image alt text that includes Greek characters.
    else {
        if (substr($current_letter, 0, 1) == ">") {
            $tag_detected = "no";
        }
    }
    $pos++; //increment the pointer
}
?>
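
As for the tightening with regular expressions mentioned above, here is one way it might look. It keeps the same idea (work on the numeric entities &#902; to &#974;, allow spaces and simple punctuation inside a Greek passage, leave anything inside a tag alone) but lets preg_split() and preg_replace() do the scanning. wrap_greek_entities() is a made-up name, the separator list is copied from the 'skip punctuation' test, and it has not been run against the real articles.

```php
<?php
// Hypothetical helper: wraps runs of Greek numeric entities in <span lang="el">,
// leaving everything inside HTML tags (e.g. alt text) untouched.
function wrap_greek_entities(string $html): string
{
    // Numeric entities &#902; to &#974;, the same range as the $greek_chars loop.
    $greek = '&#9(?:0[2-9]|[1-6][0-9]|7[0-4]);';
    // A run of Greek entities, allowing space , ; ' . between Greek words.
    $run = "/(?:{$greek})+(?:[ ,;'.]+(?:{$greek})+)*/";

    // Split into text and tags; with DELIM_CAPTURE the odd-indexed pieces are the tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // text between tags
            $parts[$i] = preg_replace($run, '<span lang="el">$0</span>', $part);
        }
    }
    return implode('', $parts);
}

echo wrap_greek_entities('<img alt="&#966;"/> see &#966;&#943; &#964;&#959;. End'), "\n";
// prints <img alt="&#966;"/> see <span lang="el">&#966;&#943; &#964;&#959;</span>. End
```

Unlike the loop, this sketch does not include the angled quotation marks in a Greek run, and trailing punctuation stays outside the closing tag.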


