01-15-2012, 01:41 PM | #1
fail
Counting ONLY Chinese characters?

Is there a way to count ONLY Chinese characters (or, more generally, multibyte characters)?

PHP Code:

$ch = "很好-nice";

echo mb_detect_encoding($ch);  // UTF-8

echo mb_strlen($ch, 'UTF-8');  // 7
Is there a way to get the count of just the Chinese characters, i.e. 2 for that example?

Background: I fetch webpages and I want to filter out Chinese webpages since I don't analyse their data. Any workaround is welcome too!
01-16-2012, 06:17 AM | #2
fatecaresx13
Hmm, that's a good question. If you don't find a better solution, I'd suggest stripping out all English (or Roman) characters and then calling strlen() on what's left. I'm not even sure it would work, but I'd try it.
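A minimal sketch of that strip-then-count idea, assuming the input is UTF-8 and the mbstring extension is loaded. The function name countNonAscii is made up for illustration. Note that a plain strlen() on the remainder would return bytes (6 for the example string), so the sketch counts characters with mb_strlen() instead. It also counts any non-ASCII character, not only Chinese ones, so accented Latin letters are included.

```php
<?php
// Hypothetical helper: remove every ASCII byte, then count what is left.
function countNonAscii(string $text): int
{
    // No /u flag needed: the class only matches single ASCII bytes,
    // so multibyte UTF-8 sequences are left intact.
    $stripped = preg_replace('/[\x00-\x7F]/', '', $text);

    // Count characters, not bytes, in the remainder.
    return mb_strlen($stripped, 'UTF-8');
}

echo countNonAscii("很好-nice"), "\n"; // 2
echo countNonAscii("Zürich"), "\n";    // 1 (the ü is non-ASCII too)
```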
01-16-2012, 06:57 AM | #3
12k
I don't believe there is a built-in function for this, but you can use something such as ord($character) to get the numeric code of each character in the string. Then check whether that code falls within a certain range, and if it doesn't, add it to the Chinese character count. (I'm not sure of the exact codes that bound the relevant characters.)
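For comparison, a regular expression with a Unicode script property comes close to a one-liner. This is only a sketch, assuming UTF-8 input and a PCRE build with Unicode property support. The name countHan is invented here. \p{Han} matches characters in the Han script, which also covers Japanese kanji, so it cannot tell Chinese and Japanese text apart.

```php
<?php
// Hypothetical helper: count characters belonging to the Unicode Han script.
function countHan(string $text): int
{
    // The /u flag makes PCRE treat the subject as UTF-8.
    $n = preg_match_all('/\p{Han}/u', $text);

    // preg_match_all() returns false on error (e.g. invalid UTF-8).
    return $n === false ? 0 : $n;
}

echo countHan("很好-nice"), "\n"; // 2
```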
01-16-2012, 01:07 PM | #4
Fou-Lu
Quote:
Originally Posted by 12k
I don't believe there is a built-in function for this, but you can use something such as ord($character) to get the numeric code of each character in the string. Then check whether that code falls within a certain range, and if it doesn't, add it to the Chinese character count. (I'm not sure of the exact codes that bound the relevant characters.)
This is really the only way, I suspect. Split your string into an array of characters, map it through ord(), and then apply an array filter to keep only the values that fall within the range of Chinese characters (which I have no idea of either).
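The split, map, filter steps could be sketched roughly as below, assuming UTF-8 input. Two things are assumptions of this sketch rather than anything settled in the thread: ord() only reads a single byte, so a small helper (utf8CodePoint, a made-up name) rebuilds the code point from the bytes of each character, and the range used is the CJK Unified Ideographs block U+4E00 to U+9FFF, which covers the common characters but not the rarer extension blocks.

```php
<?php
// Hypothetical helper: decode one UTF-8 character into its code point via ord().
function utf8CodePoint(string $ch): int
{
    $len = strlen($ch);
    $first = ord($ch[0]);
    if ($len === 1) {
        return $first; // plain ASCII
    }
    // The lead byte carries fewer payload bits the longer the sequence is.
    $masks = [2 => 0x1F, 3 => 0x0F, 4 => 0x07];
    $cp = $first & $masks[$len];
    for ($i = 1; $i < $len; $i++) {
        // Each continuation byte contributes its low 6 bits.
        $cp = ($cp << 6) | (ord($ch[$i]) & 0x3F);
    }
    return $cp;
}

// Hypothetical helper: split, map to code points, filter by range, count.
function countCjkIdeographs(string $text): int
{
    // Split into characters rather than bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return 0; // input was not valid UTF-8
    }
    $codes = array_map('utf8CodePoint', $chars);
    $cjk = array_filter($codes, function (int $cp): bool {
        return $cp >= 0x4E00 && $cp <= 0x9FFF; // assumed range, see above
    });
    return count($cjk);
}

echo countCjkIdeographs("很好-nice"), "\n"; // 2
```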
01-18-2012, 05:01 AM | #5
fail
I'd just like to know whether a page is mainly Chinese (or otherwise multibyte) or not. I played around a bit and found a dirty method:

PHP Code:
$enc = 'UTF-8';  // just as an example

$all = "我不明白我写什么-I have no idea what I write!";
echo mb_strlen($all, $enc); // 37
echo strlen($all); // 53

$ch = "我不明白我写什么";
echo mb_strlen($ch, $enc); // 8
echo strlen($ch); // 24

$en = "I have no idea what I write!";
echo mb_strlen($en, $enc); // 28
echo strlen($en); // 28

$char = "<>?#@!()[]{}*•º«'=äüö¥";
echo mb_strlen($char, $enc); // 22
echo strlen($char); // 30

//--------------------------------

$url = "http://www.chinadaily.com.cn/hqzx/";

$source = file_get_contents($url);

// Use the charset declared in the page; fall back to UTF-8 if none is found
$enc = preg_match('~charset=["\']?([-a-z0-9_]+)~i', $source, $charset) ? $charset[1] : 'UTF-8';

$source = strip_tags($source);

// Keep the counts numeric; add the markup only when printing
$a = mb_strlen($source, $enc);  // characters
$b = strlen($source);           // bytes

echo $a . "<br>";
echo $b . "<p>";

echo "Ratio = " . ($a / $b);  // 0.47
For a page in a Western script the output will be 1 (or close to 1 if it uses characters such as äüµ³). For a Chinese page it will be 0.40-0.95.

It's the best I came up with.
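The same idea wrapped up as a reusable function, as a rough sketch only. It assumes the mbstring extension is loaded; the name charByteRatio and the choice to return 1.0 for an empty string are made up for this example. Any cut-off for calling a page "mainly Chinese" would have to be chosen by experiment.

```php
<?php
// Hypothetical helper: characters divided by bytes.
// 1.0 means every character is a single byte; the value drops as the
// share of multibyte characters grows.
function charByteRatio(string $text, string $encoding = 'UTF-8'): float
{
    $bytes = strlen($text);
    if ($bytes === 0) {
        return 1.0; // avoid dividing by zero on empty input
    }
    return mb_strlen($text, $encoding) / $bytes;
}

echo charByteRatio("I have no idea what I write!"), "\n"; // 1
echo round(charByteRatio("我不明白我写什么"), 2), "\n";     // 0.33
```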