  1. #1
    Regular Coder
    Join Date
    Dec 2009
    Location
    Hong Kong
    Posts
    119
    Thanks
    8
    Thanked 0 Times in 0 Posts

    Counting ONLY Chinese characters?

    Is there a way to count ONLY Chinese characters (or multibyte characters)?

    PHP Code:

    $ch = "很好-nice";

    echo mb_detect_encoding($ch);  // UTF-8

    echo mb_strlen($ch, 'UTF-8');  // 7

    Is there a way to get the count for just the Chinese characters, i.e. 2 for that example?

    Background: I fetch webpages and I want to filter out Chinese webpages since I don't analyse their data. Any workaround is welcome too!

  • #2
    New Coder
    Join Date
    Jan 2010
    Posts
    29
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Hmm, that's a good question. If you don't find a better solution, I'd suggest stripping out all English (Roman) characters and then calling strlen() on what is left. I'm not sure it would work, but I'd try it.
    Nerd Stuff (code, rrdtool, monitoring, etc):

    blog.anthonyhurst.com
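The strip-then-count idea from post #2 can be sketched as below. This is only a sketch under assumptions: the function name countNonAscii is made up for illustration, the input is assumed to be valid UTF-8, and mb_strlen() is used instead of strlen() because strlen() would report bytes (6 for the two characters in the example) rather than characters.

```php
<?php
// Hypothetical helper: counts the characters left after removing all ASCII.
// Assumes $text is valid UTF-8 and the mbstring extension is loaded.
function countNonAscii(string $text): int
{
    // Remove every code point in the ASCII range U+0000..U+007F.
    $stripped = preg_replace('/[\x00-\x7F]/u', '', $text);
    if ($stripped === null) {
        // preg_replace() returns null on error, e.g. invalid UTF-8 input.
        return 0;
    }
    // mb_strlen() counts characters; plain strlen() would count bytes.
    return mb_strlen($stripped, 'UTF-8');
}

echo countNonAscii("很好-nice"), "\n"; // 2
```

Note that this counts every non-ASCII character, so accented Latin letters or Cyrillic text would be counted as well. It may still be good enough to tell a mostly Western page from a mostly Chinese one.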

  • #3
    12k
    New Coder
    Join Date
    Jan 2012
    Posts
    29
    Thanks
    0
    Thanked 6 Times in 6 Posts
    I don't believe there is a built-in function for this, but you can use something such as ord($character) to get the ID of each character in the string. Then check whether that ID falls within a certain range, and if it doesn't, add it to the Chinese character count. (I'm not sure of the exact IDs that bound the right characters.)
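A rough sketch of the range check described in post #3, with a few assumptions spelled out. ord() only looks at a single byte, so this version uses mb_ord() (PHP 7.2+) to get the full Unicode code point. The range U+4E00..U+9FFF is the basic CJK Unified Ideographs block; the extension blocks are not covered. Instead of counting everything outside a "Western" range, it counts what falls inside the CJK range. The function name countCjkByCodePoint is hypothetical.

```php
<?php
// Hypothetical helper: counts characters in the basic CJK Unified Ideographs
// block (U+4E00..U+9FFF). Assumes valid UTF-8 input and PHP 7.2+ for mb_ord().
function countCjkByCodePoint(string $text): int
{
    $count = 0;
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        // mb_substr() extracts one whole character; ord() would see one byte.
        $codePoint = mb_ord(mb_substr($text, $i, 1, 'UTF-8'), 'UTF-8');
        if ($codePoint >= 0x4E00 && $codePoint <= 0x9FFF) {
            $count++;
        }
    }
    return $count;
}

echo countCjkByCodePoint("很好-nice"), "\n"; // 2
```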

  • #4
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,978
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
    Quote Originally Posted by 12k View Post
    I don't believe there is a built-in function for this, but you can use something such as ord($character) to get the ID of each character in the string. Then check whether that ID falls within a certain range, and if it doesn't, add it to the Chinese character count. (I'm not sure of the exact IDs that bound the right characters.)
    This is really the only way, I suspect. Cast your string to an array, map it through ord(), and then apply an array filter to pull out the values within the range of Chinese characters (which I don't know either).
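The split, map and filter pipeline from post #4 might look roughly like this. It is a sketch, not a definitive implementation: the name countCjkWithArrays is invented, mb_ord() (PHP 7.2+) stands in for ord() because ord() works on single bytes, and the range U+4E00..U+9FFF (the basic CJK Unified Ideographs block) is an assumption about what "Chinese characters" should cover.

```php
<?php
// Hypothetical helper. Assumes valid UTF-8 input and PHP 7.2+ for mb_ord().
function countCjkWithArrays(string $text): int
{
    // "Cast" the string to an array of characters; the /u modifier makes
    // the split happen on code points rather than on bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return 0; // splitting failed, e.g. on invalid UTF-8
    }
    // Map each character to its Unicode code point.
    $codePoints = array_map(function ($char) {
        return mb_ord($char, 'UTF-8');
    }, $chars);
    // Keep only the code points inside U+4E00..U+9FFF.
    $cjk = array_filter($codePoints, function ($codePoint) {
        return $codePoint >= 0x4E00 && $codePoint <= 0x9FFF;
    });
    return count($cjk);
}

echo countCjkWithArrays("我不明白我写什么-I have no idea what I write!"), "\n"; // 8
```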

  • #5
    Regular Coder
    Join Date
    Dec 2009
    Location
    Hong Kong
    Posts
    119
    Thanks
    8
    Thanked 0 Times in 0 Posts
    I'd just like to know whether a page is mainly Chinese (or high-bit) or not. I played around a bit and found a dirty method:

    PHP Code:
    $enc = 'UTF-8';  // just as an example

    $all = "我不明白我写什么-I have no idea what I write!";
    echo mb_strlen($all, $enc); // 37
    echo strlen($all); // 53

    $ch = "我不明白我写什么";
    echo mb_strlen($ch, $enc); // 8
    echo strlen($ch); // 24

    $en = "I have no idea what I write!";
    echo mb_strlen($en, $enc); // 28
    echo strlen($en); // 28

    $char = "<>?#@!()[]{}*'=";
    echo mb_strlen($char, $enc); // 22
    echo strlen($char); // 30

    //--------------------------------

    $url = "http://www.chinadaily.com.cn/hqzx/";

    $source = file_get_contents($url);
    // Read the declared charset from the page itself
    preg_match('~charset=([-a-z0-9_]+)~i', $source, $charset);

    $enc = $charset[1];

    $source = strip_tags($source);

    $a = mb_strlen($source, $enc);  // number of characters
    $b = strlen($source);           // number of bytes

    echo $a . "<br>";
    echo $b . "<p>";

    echo "Ratio = " . $a / $b;  // 0.47
    For a Western-script page the output will be 1 (or near 1 if the page contains a few multibyte characters). For a Chinese page it will be 0.40-0.95.

    It's the best I came up with.
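As an alternative to the character-to-byte ratio above, the share of Chinese characters could be measured directly with PCRE's Unicode script property \p{Han}. This is a sketch under assumptions: hanShare is a made-up name, the text is assumed to have been converted to UTF-8 already (for example with mb_convert_encoding()), and PCRE is assumed to be built with Unicode property support, which is the usual case. \p{Han} also matches Japanese kanji, so it cannot tell the two languages apart.

```php
<?php
// Hypothetical helper: fraction of characters that belong to the Han script.
// Assumes $text is valid UTF-8 and PCRE supports Unicode properties.
function hanShare(string $text): float
{
    $total = mb_strlen($text, 'UTF-8');
    if ($total === 0) {
        return 0.0; // avoid division by zero on empty input
    }
    // preg_match_all() returns the number of matches, or false on error.
    $han = preg_match_all('/\p{Han}/u', $text);
    return $han === false ? 0.0 : $han / $total;
}

printf("%.2f\n", hanShare("很好-nice")); // 0.29 (2 of 7 characters)
```

The cut-off for calling a page "mainly Chinese" is left to the caller; it would need tuning against real pages.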

