  1. #1
    Regular Coder
    Join Date
    Feb 2007
    Location
    London
    Posts
    225
    Thanks
    16
    Thanked 2 Times in 2 Posts

    in_array always returning false

    in_array($current_letter, $greek_chars) is returning false always, despite there being a Greek character in the variable $current_letter.

    I'm trying to get 'span lang' tags automatically inserted into a document around isolated Greek words:

    PHP Code:
    // add <span lang> tags around Greek words
    $greek_chars = array("α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "ς", "τ", "υ", "φ", "χ", "ψ", "ω", "ά", "έ", "ή", "ί", "ό", "ύ", "ώ", "ϊ", "Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Ά", "Έ", "Ή", "Ί", "Ό", "Ύ", "Ώ", "Ϊ");
    $greek_lang_open = "<span lang=\"el\">";
    $greek_lang_close = "</span>";
    $span_open_length = strlen($greek_lang_open); //measures how many chars there are so we can jump forward later after inserting
    $span_close_length = strlen($greek_lang_close); //measures how many chars there are so we can jump forward later after inserting
    $pos = 0; //pointer
    $cc_text_length = strlen($cc_text);
    $gr1 = "no";
    $gr2 = "no";

    while ($pos < $cc_text_length) {    //scan file
        $current_letter = substr($cc_text, $pos, 1); //grab next character from pointer
        if ($gr1 == "no" && $gr2 == "no" && in_array($current_letter, $greek_chars)) {
            $gr1 = "yes";
            $cc_text = str_replace($current_letter, $greek_lang_open.$current_letter, $cc_text); // add open span lang tag
            $pos = $pos + $span_open_length;
        }
        elseif ($gr1 == "yes" && in_array($current_letter, $greek_chars)) {
            $gr1 = "no";
            $gr2 = "yes";
        }
        elseif ($gr2 == "yes" && !(in_array($current_letter, $greek_chars))) {
            $gr2 = "no";
            $gr1 = "no";
            $cc_text = str_replace($current_letter, $greek_lang_close.$current_letter, $cc_text); // add close span lang tag
            $pos = $pos + $span_close_length;
        }
        $pos++; //increment the pointer
        $cc_text_length = strlen($cc_text); //re-measure length of $cc_text, as it's changed from our adding span tags
    }

    Any ideas what silly thing I've done wrong here?
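
One likely cause, assuming both the script and `$cc_text` are UTF-8: `substr()` and `strlen()` count bytes, and each Greek letter takes two bytes in UTF-8, so `substr($cc_text, $pos, 1)` returns half a letter that can never equal a whole letter in the array. A minimal sketch of the difference follows; `mb_substr()` and `mb_strlen()` need the mbstring extension, which is an assumption about the server, and the `\u{...}` escape needs PHP 7 or later.

```php
<?php
// "\u{03C6}" is φ (Greek small letter phi).
$text = "\u{03C6}";
$greek_chars = array("\u{03C6}");

$byte = substr($text, 0, 1);              // first byte only (0xCF), not a whole letter
$char = mb_substr($text, 0, 1, 'UTF-8');  // first character, i.e. both bytes

$byte_length = strlen($text);             // counts bytes
$char_length = mb_strlen($text, 'UTF-8'); // counts characters

$byte_found = in_array($byte, $greek_chars); // the lone byte matches nothing in the array
$char_found = in_array($char, $greek_chars); // the whole character matches
```

The same reasoning applies to any pointer arithmetic based on `strlen()`: positions are byte offsets, not letter counts.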

    Also, it's terribly slow - could anyone push me down a different path...?
    :-)

    Cheers

  • #2
    Senior Coder NancyJ's Avatar
    Join Date
    Feb 2005
    Location
    Bradford, UK
    Posts
    3,174
    Thanks
    19
    Thanked 66 Times in 65 Posts
    how long is the text that you're processing? (can you give a sample?)

  • #3
    Regular Coder
    Each article = about four or five A4 pages of text.

    Typical sentences are 90% English with words in Greek dotted all over the place in linguistic discussion, usually with English translations afterwards.

  • #4
    Senior Coder NancyJ's Avatar
    Depending on the speed of your server this may run quicker than what you've got:

    PHP Code:
    $cc_words = explode(" ", $cc_text);
    foreach ($cc_words as $key => $word)
    {
      $cc_words[$key] = process_greek($word);
    }
    $cc_text = implode(" ", $cc_words);

    function process_greek($word)
    {
      $greek_lang_open = "<span lang=\"el\">";
      $greek_lang_close = "</span>";
      $greek_chars = array("α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "ς", "τ", "υ", "φ", "χ", "ψ", "ω", "ά", "έ", "ή", "ί", "ό", "ύ", "ώ", "ϊ", "Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Ά", "Έ", "Ή", "Ί", "Ό", "Ύ", "Ώ", "Ϊ");
      $letters = str_split($word);
      foreach ($letters as $letter)
      {
        if (in_array($letter, $greek_chars))
        {
          return $greek_lang_open.$word.$greek_lang_close;
        }
      }
      return $word;
    }

    ~0.3s on my PC for 6 sides of A4 (12pt Times)

    It doesn't solve your big problem of the Greek characters not being recognised, but it at least runs faster.
    It works - but pretty much the only word it didn't say was Greek was 'in'. The problem is in the way it's identifying the Greek letters. I tried running htmlentities on the characters but most weren't in my table. If you're lucky your table might have more of them than mine (print_r(get_html_translation_table(HTML_ENTITIES))), so it's worth a try.

    Sorry I couldn't come up with a more useful solution. But if PHP thinks Γ is a G then it will match them
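
The word-by-word idea can be kept while avoiding the byte problem: `str_split()` also cuts a string into single bytes, so two-byte Greek letters never reach `in_array()` intact. A hedged sketch, assuming UTF-8 input, PHP 7+, and a PCRE build with Unicode property support; `process_greek_word` is a hypothetical name, and the `\p{Greek}` script test stands in for the hand-written character array.

```php
<?php
// Hypothetical variant of process_greek(): wraps a word if it contains any Greek-script character.
function process_greek_word(string $word): string
{
    // The /u modifier makes PCRE treat the subject as UTF-8;
    // \p{Greek} matches characters belonging to the Greek script.
    if (preg_match('/\p{Greek}/u', $word) === 1) {
        return '<span lang="el">' . $word . '</span>';
    }
    return $word;
}

$words  = explode(' ', 'the word φίλος means friend');
$result = implode(' ', array_map('process_greek_word', $words));
```

Because the test is on the script property rather than on glyph shape, a Latin word such as 'in' is left alone.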

  • #5
    Regular Coder
    Unbelievably helpful and erudite response - thanks a lot!
    :-)

    Well, this lovely little bit of code I once saw on a forum will turn all the Greek characters (along with all high level UTF-8 stuff) into numeric entities, say, &#916; etc.

    PHP Code:
    $cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
    $cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);
    I guess I could get around the 'how to get php to recognise Greek chars' problem that way, but then how would you suggest I go about searching for each ASCII code?

    Another snag that I've just noticed (well, much bigger than a snag!) is that several articles have many whole sentences in Greek rather than only individual words. If I move my 'pointer' along to the next character and allow it to continue past spaces, punctuation etc until it bumps into a group of 6 symbols that doesn't have the format: & # digit digit digit ; would that work?

    I'm ridiculously out of my depth here, sorry!

  • #6
    Senior Coder NancyJ's Avatar
    There aren't any ASCII codes for those characters. They might exist in ANSI extended ASCII but I'm not sure (non-standard characters aren't my forte).
    If you run your Greek alphabet through ord() most of them come up with 63 (?)

    This site lists the HTML entities and hex codes for the characters but I don't know if that will be any use to you. I don't know if the regexp will do a better job of character recognition:
    http://htmlhelp.com/reference/html40...s/symbols.html

    Of course, whatever solution you come up with, you'll get a lot of non-Greek words matched because your array contains A,B,E,Z,H,I,K,M,N,O,P,T,Y (unless my computer is reading those characters wrong)
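
On the look-alike capitals: in Unicode, Greek Α (U+0391) and Latin A (U+0041) are separate code points, so a byte-for-byte UTF-8 comparison does not confuse them; they only collapse if something transliterates the text first. Note also that `ord()` inspects only the first byte of a string, so a result of 63 (`?`) suggests the letters had already been replaced before PHP compared them, though that is a guess about the setup described here. A small check, assuming PHP 7+ for the `\u{...}` escape:

```php
<?php
$greek_alpha = "\u{0391}"; // GREEK CAPITAL LETTER ALPHA
$latin_a     = "A";        // LATIN CAPITAL LETTER A

$greek_hex = bin2hex($greek_alpha); // the two UTF-8 bytes of the Greek letter
$latin_hex = bin2hex($latin_a);     // the single ASCII byte

$same       = ($greek_alpha === $latin_a); // compared byte for byte
$first_byte = ord($greek_alpha);           // ord() only sees the lead byte
```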

  • #7
    Regular Coder
    Ah, I was most probably using the wrong term when I said ASCII - did I mean ANSI? Well, regardless, the code:

    PHP Code:
    $cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
    $cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);
    that I mentioned before renders, for example, περνά η μέλλισα

    as:

    Code:
    &amp;#960;&amp;#949;&amp;#961;&amp;#957;&amp;#940; &amp;#951; &amp;#956;&amp;#941;&amp;#955;&amp;#955;&amp;#953;&amp;#963;&amp;#945;
    in 'view source', which displays in the browser perfectly (imagine that each "amp;" isn't there!)
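
The `e` modifier in those two calls was deprecated in PHP 5.5 and removed in PHP 7, so on current PHP the same arithmetic has to go through `preg_replace_callback()`. A sketch of an equivalent follows; `utf8_to_entities` is a hypothetical name, and like the original it assumes valid UTF-8 input and handles only two- and three-byte sequences.

```php
<?php
// Hypothetical helper: converts 2- and 3-byte UTF-8 sequences to decimal numeric entities.
function utf8_to_entities(string $text): string
{
    // Two-byte sequences (110xxxxx 10xxxxxx): 5 bits from the lead byte, 6 from the next.
    $text = preg_replace_callback('/[\xc0-\xdf][\x80-\xbf]/', function (array $m): string {
        $code = (ord($m[0][0]) - 192) * 64 + (ord($m[0][1]) - 128);
        return '&#' . $code . ';';
    }, $text);

    // Three-byte sequences (1110xxxx 10xxxxxx 10xxxxxx): 4 + 6 + 6 bits.
    return preg_replace_callback('/[\xe0-\xef][\x80-\xbf]{2}/', function (array $m): string {
        $code = (ord($m[0][0]) - 224) * 4096
              + (ord($m[0][1]) - 128) * 64
              + (ord($m[0][2]) - 128);
        return '&#' . $code . ';';
    }, $text);
}

$entities = utf8_to_entities("\u{03C0}\u{03B5}"); // the letters π and ε
```

For π the bytes are 0xCF 0x80, so the code is (207 - 192) * 64 + (128 - 128) = 960, matching the first entity in the output quoted above.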

    As for the Greek capital letters that look the same as the Latin ones, it didn't occur to me that it would have the same encoding (duh!) - That is a problem.

    Should I conclude that it's simply too ambitious to hope that I might have automated all those TENS OF THOUSANDS of span tags? Or perhaps there's a wildly different approach you might suggest...?

    Again, thanks so much for your invaluable input and time.
    Last edited by cfructose; 07-16-2007 at 09:18 PM. Reason: formatting

  • #8
    Regular Coder
    By the way, I just tried using the code you suggested, but I'm thwarted once again by absolutely NO words being identified as Greek. It's effectively telling me the same thing as was happening with my code: there are no matches between the data in $cc_text and the array.

    Is it possible that the cause has something to do with the ob_start() I'm using earlier?

    My code is:

    PHP Code:
    ob_start();
    include ("text/$include_name");//puts main text data into $cc_text
    $cc_text = ob_get_clean();
    foreach ($definition as $term => $def) {// Find and replace glossary terms
       $cc_text = str_replace( $term, '<span class="help" title="' . strip_tags( $def ) . '">' . $term . '</span>', $cc_text );
    }

    ...shortly after which I ran the span tag insertion part.

    Could there be a connection?

  • #9
    Senior Coder NancyJ's Avatar
    I tried running the code you provided on "περνά η μέλλισα" and nothing happened. Is there some other step that you're not mentioning here? I ran it on a passage of text that had some greek words on it and it only converted é and into something very strange. It looked like kanji.

    The ob_start shouldn't affect the code. It's interesting that nothing is being matched for you, whereas nearly everything is being matched for me. Can you post your whole script (and if possible a snippet of an article you're trying to match)?

  • #10
    Regular Coder
    The only other 'step' I can think of is that my code is saved as UTF-8 (no BOM, though that shouldn't make any difference), and that the font displaying the text can handle everything in Unicode.

    What on earth might give the two of us such different results?
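
The UTF-8 detail may well be the difference: if one copy of the script or text is saved as UTF-8 and another in a legacy single-byte Greek encoding, the same letter is a different byte sequence and no comparison between the two can match. A quick check, assuming the mbstring extension is installed; the sample byte strings are illustrative only.

```php
<?php
$utf8_alpha   = "\xCE\xB1"; // α encoded as UTF-8 (two bytes)
$legacy_alpha = "\xE1";     // α encoded as ISO-8859-7 (one byte)

$utf8_ok   = mb_check_encoding($utf8_alpha, 'UTF-8');
$legacy_ok = mb_check_encoding($legacy_alpha, 'UTF-8'); // a lone lead byte is not valid UTF-8

// Convert the legacy text so both sides of a comparison use the same encoding.
$converted = mb_convert_encoding($legacy_alpha, 'UTF-8', 'ISO-8859-7');
```

Running `mb_check_encoding()` on `$cc_text` right after `ob_get_clean()` would show whether the included article arrives as valid UTF-8.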

    Well, for what it's worth, here's a large chunk of code that I'm sure is 99% irrelevant, but here we go:

    PHP Code:
    <?php
        include ("phpinserts/lang_choices.php");

        $include_name = basename($_SERVER['PHP_SELF']);    // set the name of the data file to include as the same as the root file
        $chapter = substr($include_name, 0, 2);
        $subchapter = substr($include_name, 3, 2);

        if (substr($include_name, 0, 9) != "subscribe") {
            include ("phpinserts/log_in_form.php");
        }

        // defining security variables
        $secure_ka = "secure_".$chapter."_".$subchapter;
        //$include_name_stripped = substr($include_name, 0, strrpos($include_name, '.'));    //strrpos strips everything from the "." I.e. the ".php" filename.

        if ($logged_in == "no" && $include_name == "password_change.php") {
            echo "Please log in to change your password.";
        }
        elseif ($payment_received != "yes" && ($include_name == "subscribe_step_2.php" || $include_name == "subscribe_step_3.php")) {
            echo "<div class=\"bigemptyspace\"></div><div class=\"bigemptyspace\"></div><p class=\"alertbox\" style=\"vertical-align: super;\"><img src=\"images/no_entry.gif\" alt=\"".$please_buy."\" style=\"width: 2em; height: 2em;\"/>Access Denied!<img src=\"images/no_entry.gif\" alt=\"".$please_buy."\" style=\"width: 2em; height: 2em;\"/></p><span class=\"warning\">Please complete <i>PayPal</i> payment in order to proceed.</span> <a href=\"subscribe_step_1.php\">Return to Step 1</a>";
        }
        elseif ($$secure_ka != 1) {        // a 'variable variable' to compare $secure_ka (the $secure variable name as data) with a variable name created from that data (which was already defined as either 0 or 1 in config). Prevents direct access to 'members only' files via URL.
            if ($include_name != "login.php" && $include_name != "logout.php") {
                echo "<br/><h1>".$page."</h1><h1 style=\"position: relative; left: -0.07em; top: -1.25em; color: #00b8ff; font-size: 1.7em; font-weight: bold;\">".$page."</h1>";    // print title with shading
            }


    // cc text input + str_replace of all <span> glossary terms
    ob_start();
    if ($page == "Home") {
        include ("text/home.php");
    }
    else {
        include ("text/$include_name");    //puts main text data into $cc_text
    }
    $cc_text = ob_get_clean();
    foreach ($definition as $term => $def) {    // Find and replace terms
        $cc_text = str_replace( $term, '<span class="help" title="' . strip_tags( $def ) . '">' . $term . '</span>', $cc_text );
    }


    // add <span lang> tags around Greek words
    $greek_chars = array("α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "ς", "τ", "υ", "φ", "χ", "ψ", "ω", "ά", "έ", "ή", "ί", "ό", "ύ", "ώ", "ϊ", "Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Ά", "Έ", "Ή", "Ί", "Ό", "Ύ", "Ώ", "Ϊ");
    $greek_lang_open = "<span lang=\"el\">";
    $greek_lang_close = "</span>";
    $span_open_length = strlen($greek_lang_open); //measures how many chars there are so we can jump forward later after inserting
    $span_close_length = strlen($greek_lang_close); //measures how many chars there are so we can jump forward later after inserting
    $pos = 0; //pointer
    $cc_text_length = strlen($cc_text);
    $gr1 = "no";
    $gr2 = "no";

    while ($pos < $cc_text_length) {    //scan file  (source for the 'while' and 'pointer': http://gr2.php.net/strstr)
        $current_letter = substr($cc_text, $pos, 1); //grab next character from pointer
        if ($gr1 == "no" && $gr2 == "no" && in_array($current_letter, $greek_chars)) {
            $gr1 = "yes";
            $cc_text = str_replace($current_letter, $greek_lang_open.$current_letter, $cc_text); // add open span lang tag
            $pos = $pos + $span_open_length;
        }
        elseif ($gr1 == "yes" && in_array($current_letter, $greek_chars)) {
            $gr1 = "no";
            $gr2 = "yes";
        }
        elseif ($gr2 == "yes" && !(in_array($current_letter, $greek_chars))) {
            $gr2 = "no";
            $gr1 = "no";
            $cc_text = str_replace($current_letter, $greek_lang_close.$current_letter, $cc_text); // add close span lang tag
            $pos = $pos + $span_close_length;
        }
        $pos++; //increment the pointer
        $cc_text_length = strlen($cc_text); //re-measure length of $cc_text, as it's changed from our adding span tags
    }


    //convert all quotes and greek characters (all high-level UTF-8) to asciis.
    $cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
    $cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);

    echo $cc_text;    // Display formatted article
    }    // closes the elseif ($$secure_ka != 1) block opened above; this brace was absent from the posted excerpt, so its position is a guess
    ?>
    And an excerpt from one of the pieces of text to be included (upon which I'm trying to perform all the span inserts):

    Going back to our example of "James helped his friend", and 'reversing the action', we get "The friend helped James". Note that both the word James (which is a noun (a proper noun)), and the word friend remain unaltered regardless of their rôle in the sentence – regardless of their case. We could say, therefore, that the nominative (subject) and accusative (object) cases of all nouns in English (except pronouns) are identical; there is nothing to help us distinguish their rôles other than their position in the sentence. In Greek, however, the word for friend (φίλος), and the equivalent to the name James (Δημήτρης) do change their form according to their rôle, i.e. all Greek nouns must be appropriately modified to account for their case:

    Ο Δημήτρης βοήθησε το φίλο του. Demetris helped his friend.
    Ο φίλος βοήθησε το Δημήτρη. The friend helped Demetris.

    The underlined letters above show the parts which undergo change according to case. Note the final ς of the nouns which is present in the nominative, but absent in the accusative. To complicate matters further, the articles (the word the) also change according to the case of the noun, as can be seen above by the addition of a τ in the accusative.

  • #11
    Senior Coder NancyJ's Avatar
    Heh, I couldn't even run the code - looks like my computer gave up. Interestingly, I've tried on 3 different servers and I can't get the text to output the Greek characters - although I obviously have them because I can see them in my editor and on this site and any others I look at. I really don't have much experience with non-ASCII character sets.

  • #12
    Regular Coder
    Well, thank you so much, Nancy. I may not have a solution, but I've learned a lot from you! :-)

  • #13
    Regular Coder

    Solution!

    After endless struggling, I've succeeded! I'm posting this in case anyone finds it useful...

    (Note that the code, while working perfectly, could do with quite a bit of tightening to make it more efficient, eg, some regular expressions for the 'skip punctuation' bit etc)...

    PHP Code:
    <?php
    //convert all quotes and greek characters (all high-level UTF-8) to asciis.
    $cc_text = preg_replace('/([\xc0-\xdf].)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 192) * 64 + (ord(substr('$1', 1, 1)) - 128)) . ';'", $cc_text);
    $cc_text = preg_replace('/([\xe0-\xef]..)/se', "'&#' . ((ord(substr('$1', 0, 1)) - 224) * 4096 + (ord(substr('$1', 1, 1)) - 128) * 64 + (ord(substr('$1', 2, 1)) - 128)) . ';'", $cc_text);


    // create greek character array
    $ansi = 902;
    while ($ansi <= 974) {
        $greek_chars[] = "&#".$ansi.";";
        $ansi++;
    }
    // add angled quotation marks to array
    $greek_chars[] = "&#171;";
    $greek_chars[] = "&#187;";

    $greek_lang_open = "<span lang=\"el\">";
    $greek_lang_close = "</span>";
    $span_open_length = strlen($greek_lang_open); //measures how many chars there are so we can jump forward later after inserting
    $span_close_length = strlen($greek_lang_close); //measures how many chars there are so we can jump forward later after inserting
    $pos = 0; //pointer
    $cc_text_length = strlen($cc_text);
    $current_letter_length = 6;
    $first_greek_letter_found = "no";
    $number_of_close_tag_occurences = 0;
    $tag_detected = "no";

    while ($pos < $cc_text_length) {    //scan file
        //grab characters in search of Greek ANSI codes
        $current_letter = substr($cc_text, $pos, $current_letter_length);

        if ($tag_detected == "no") {

            //if 1 Greek letter has already been found:
            if ($first_greek_letter_found == "yes") {
                if (in_array($current_letter, $greek_chars)) {
                    //if next one is Greek too, move pointer
                    $pos = $pos + $current_letter_length - 1;
                }
                //as long as the next group of 6 symbols doesn't start with space or punctuation, we'll put in the close span tag
                elseif ((substr($current_letter, 0, 1) != " ") && (substr($current_letter, 0, 1) != ",") && (substr($current_letter, 0, 1) != "'") && (substr($current_letter, 0, 1) != ";") && (substr($current_letter, 0, 1) != ".")) {
                    // add close span tag
                    $cc_text = substr_replace($cc_text, $greek_lang_close.$current_letter, $pos, $current_letter_length);
                    //re-measure length of $cc_text, as it's changed from our adding span tags
                    $cc_text_length = strlen($cc_text);
                    $pos = $pos + $span_close_length;
                    $first_greek_letter_found = "no";//reset the var
                }
            }

            //if no Greek letters have been found, but this one IS Greek:
            elseif ($first_greek_letter_found == "no" && in_array($current_letter, $greek_chars)) {
                $first_greek_letter_found = "yes";
                //add open span lang tag
                $cc_text = substr_replace($cc_text, $greek_lang_open.$current_letter, $pos, $current_letter_length);
                //re-measure length of $cc_text, as it's changed from our adding span tags
                $cc_text_length = strlen($cc_text);
                $pos = $pos + $current_letter_length + $span_open_length - 1;
            }

            //if no Greek letter found, but 1st symbol is <
            elseif (substr($current_letter, 0, 1) == "<") {
                $tag_detected = "yes";
            }
        }

        //if tag_detected = yes, we want to skip all text until the close of the tag (">") to avoid putting spans within, say, image alt text that includes Greek characters.
        else {
            if (substr($current_letter, 0, 1) == ">") {
                $tag_detected = "no";
            }
        }
        $pos++; //increment the pointer
    }
    ?>

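For the tightening mentioned above, one possible shortcut, offered as a sketch rather than a drop-in replacement: work on the raw UTF-8 text (before any entity conversion), split off the HTML tags so attribute values such as alt text are left alone, and wrap each run of Greek words with a single regular expression. `wrap_greek_outside_tags` is a hypothetical name; the separator list (space, comma, full stop, semicolon, apostrophe) mirrors the punctuation skipped in the loop, and the code assumes PHP 7+ and PCRE with Unicode property support.

```php
<?php
// Hypothetical helper: wraps runs of Greek text in <span lang="el">, leaving HTML tags untouched.
function wrap_greek_outside_tags(string $html): string
{
    // Split into text and tags; with PREG_SPLIT_DELIM_CAPTURE the tags land at odd indexes.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // leave the input unchanged if the split fails
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a tag: skip it so attribute values are never wrapped
        }
        // One Greek word (or guillemet), optionally followed by more Greek words
        // separated only by spaces or simple punctuation.
        $wrapped = preg_replace(
            '/[\p{Greek}«»]+(?:[ ,.;\']+[\p{Greek}«»]+)*/u',
            '<span lang="el">$0</span>',
            $part
        );
        $parts[$i] = $wrapped ?? $part; // preg_replace() returns null on invalid UTF-8
    }

    return implode('', $parts);
}

$out = wrap_greek_outside_tags('<img alt="φίλος"> the word φίλος here');
```

Unlike the loop, this version closes the span directly after the last Greek letter, so trailing punctuation stays outside the span. It treats anything between `<` and `>` as a tag, which is the same simplification the loop makes.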
