...

Counting ONLY Chinese characters?



fail
01-15-2012, 02:41 PM
Is there a way to count ONLY Chinese characters (or multibyte characters in general)?




$ch = "很好-nice";

echo mb_detect_encoding($ch); // UTF-8

echo mb_strlen($ch,'UTF-8'); // 7



Is there a way to get the count for just the Chinese characters, meaning 2 for that example?

Background: I fetch webpages and I want to filter out Chinese webpages since I don't analyse their data. Any workaround is welcome too!

fatecaresx13
01-16-2012, 07:17 AM
Hmm, that's a good question. If you don't find a solution, I'd suggest stripping out all English (or Roman) characters and then doing a strlen() on what's left. Not even sure it would work, but I'd try it.
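
A minimal sketch of that idea, assuming the text is valid UTF-8 and that "Roman characters" means everything in the ASCII range. The function name countNonAscii() is made up for illustration. It uses mb_strlen() rather than strlen() for the final count, because strlen() counts bytes and a Chinese character takes three bytes in UTF-8.

```php
<?php
// Sketch only: assumes $text is valid UTF-8 and treats every ASCII
// character (letters, digits, punctuation, whitespace) as "Roman".
function countNonAscii(string $text): int
{
    // Remove every ASCII character; /u makes PCRE read the string as UTF-8.
    // preg_replace() returns null on invalid UTF-8, so fall back to ''.
    $stripped = preg_replace('/[\x00-\x7F]/u', '', $text) ?? '';

    // Count characters, not bytes.
    return mb_strlen($stripped, 'UTF-8');
}

echo countNonAscii("很好-nice"); // 2
```

Note that this counts anything outside ASCII, so accented Latin letters or Cyrillic would be counted as well.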

12k
01-16-2012, 07:57 AM
I don't believe there is a built-in function for this, but you can use something such as ord($character) to get the numeric code of each character in the string. Then check whether the code falls within a certain range, and if it isn't, add it to the Chinese character count. (Not sure of the exact codes that bound the range.)

Fou-Lu
01-16-2012, 02:07 PM
This is really the only way, I'd suspect. Split your string into an array, map it through ord(), and then apply an array filter to pull out the values within the range of Chinese characters (which I have no idea of either).
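
A rough sketch of the split / map / filter approach, under a few assumptions: the input is UTF-8, PHP is 7.4 or newer, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as "Chinese". It swaps ord() for mb_ord(), since ord() only looks at a single byte and would not give the code point of a multibyte character. The name countCjk() is hypothetical.

```php
<?php
// Sketch only: assumes UTF-8 input and PHP 7.4+ (mb_str_split, mb_ord).
// Only the basic CJK Unified Ideographs block is checked; extension
// blocks and CJK punctuation are ignored here.
function countCjk(string $text): int
{
    // String -> array of single characters.
    $chars = mb_str_split($text, 1, 'UTF-8');

    // Character -> Unicode code point.
    $codes = array_map(function ($c) {
        return mb_ord($c, 'UTF-8');
    }, $chars);

    // Keep only code points inside U+4E00..U+9FFF.
    $cjk = array_filter($codes, function ($cp) {
        return $cp >= 0x4E00 && $cp <= 0x9FFF;
    });

    return count($cjk);
}

echo countCjk("很好-nice"); // 2
```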

fail
01-18-2012, 06:01 AM
I'd just like to know whether a page is mainly Chinese (or multibyte) or not. I played around a bit and found a dirty method:



$enc = 'UTF-8'; // just as an example

$all = "我不明白我写什么-I have no idea what I write!";
echo mb_strlen($all, $enc); // 37
echo strlen($all); // 53

$ch = "我不明白我写什么";
echo mb_strlen($ch, $enc); // 8
echo strlen($ch); // 24

$en = "I have no idea what I write!";
echo mb_strlen($en, $enc); // 28
echo strlen($en); // 28

$char = "<>?#@!()[]{}*'=";
echo mb_strlen($char, $enc); // 15
echo strlen($char); // 15

//--------------------------------

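$url = "http://www.chinadaily.com.cn/hqzx/";

$source = file_get_contents($url);

// Take the charset from the page's markup; fall back to UTF-8 if none is found.
$enc = preg_match('~charset=["\']?([-a-z0-9_]+)~i', $source, $charset) ? $charset[1] : 'UTF-8';

$source = strip_tags($source);

// Keep the counts as plain integers so the division below is numeric.
$a = mb_strlen($source, $enc); // number of characters
$b = strlen($source);          // number of bytes

echo $a."<br>";
echo $b."<p>";

// Characters per byte; guard against an empty page.
echo "Ratio = ".($b > 0 ? $a/$b : 1); // 0.47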


For a Western-script page the output will be 1 (or near 1 if it contains a few multibyte characters). For a Chinese page it will be 0.40-0.95.

It's the best I came up with.
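
The same ratio could be wrapped into a pair of helpers so it can be tried on plain strings without fetching a page. This is only a sketch: charRatio() and looksMultibyte() are made-up names, and the 0.95 cut-off is an assumption taken from the upper end of the range quoted above, not a tested threshold.

```php
<?php
// Sketch only: hypothetical helpers, assumed 0.95 cut-off.
function charRatio(string $text, string $enc = 'UTF-8'): float
{
    $bytes = strlen($text); // byte count
    // Characters per byte; an empty string is treated as ratio 1.
    return $bytes > 0 ? mb_strlen($text, $enc) / $bytes : 1.0;
}

function looksMultibyte(string $text, float $cutoff = 0.95): bool
{
    return charRatio($text) < $cutoff;
}

echo round(charRatio("我不明白我写什么-I have no idea what I write!"), 2); // 0.7
var_dump(looksMultibyte("I have no idea what I write!")); // bool(false)
var_dump(looksMultibyte("我不明白我写什么")); // bool(true)
```

The ratio depends on the encoding: a Chinese character takes three bytes in UTF-8 but two in GB2312 or GBK, so a single cut-off may not suit every page.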


