I'd just like to know whether a page is mainly Chinese (or other high-bit text) or not. I played around a bit and found a
dirty method:
PHP Code:
$enc = 'UTF-8'; // just as an example
$all = "我不明白我写什么-I have no idea what I write!";
echo mb_strlen($all, $enc); // 37 (characters)
echo strlen($all); // 53 (bytes: each Chinese character takes 3 bytes in UTF-8)
$ch = "我不明白我写什么";
echo mb_strlen($ch, $enc); // 8
echo strlen($ch); // 24
$en = "I have no idea what I write!";
echo mb_strlen($en, $enc); // 28
echo strlen($en); // 28
$char = "<>?#@!()[]{}*•º«'=äüö¥";
echo mb_strlen($char, $enc); // 22
echo strlen($char); // 30
//--------------------------------
$url = "http://www.chinadaily.com.cn/hqzx/";
$source = file_get_contents($url);
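// Use the charset declared in the page; fall back to UTF-8 if none is found
$enc = preg_match('~charset=["\']?([-a-z0-9_]+)~i', $source, $charset) ? $charset[1] : 'UTF-8';
$source = strip_tags($source);
$a = mb_strlen($source, $enc); // number of characters
$b = strlen($source); // number of bytes
echo $a."<br>";
echo $b."<p>";
echo "Ratio = ".($a/$b); // 0.47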
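For a page in a Western script, the ratio of characters to bytes will be 1 (or close to 1 if it uses characters such as äüµ³). For a Chinese page it will be around 0.40-0.95, because each Chinese character takes more than one byte.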
It's the best I came up with.
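If you want to reuse the idea, it could be wrapped up roughly like this. It's only a sketch: the function names char_byte_ratio and is_mostly_multibyte are made up for illustration, and the 0.95 cutoff is an assumption taken from the range quoted above, not a tested value. It also assumes the mbstring extension is loaded and that the encoding you pass in is one it knows.

```php
<?php
// Hypothetical helper: ratio of characters to bytes in $text.
// 1.0 means every character is a single byte (e.g. plain ASCII).
function char_byte_ratio(string $text, string $enc = 'UTF-8'): float
{
    $bytes = strlen($text);            // byte count
    if ($bytes === 0) {
        return 1.0;                    // empty input: treat as single-byte text
    }
    return mb_strlen($text, $enc) / $bytes; // character count / byte count
}

// Hypothetical helper: true when the ratio falls below the cutoff.
// The default cutoff of 0.95 is an assumption, not a measured threshold.
function is_mostly_multibyte(string $text, string $enc = 'UTF-8', float $cutoff = 0.95): bool
{
    return char_byte_ratio($text, $enc) < $cutoff;
}

var_dump(is_mostly_multibyte("我不明白我写什么"));             // ratio 8/24  -> bool(true)
var_dump(is_mostly_multibyte("I have no idea what I write!")); // ratio 28/28 -> bool(false)
```

Keep in mind the ratio only says "lots of multi-byte characters". It can't tell Chinese apart from Japanese, Korean, or Cyrillic text, so a real page may need a lower cutoff or an extra check.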