BOM removal and UTF-8 browser detection


cfructose
02-20-2007, 02:21 PM
I'm using Chinese, Russian, Hebrew & Greek characters on most pages of a site (along with Latin).
I therefore need to encode in UTF-8, and am using the meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Fine up to here (I think), BUT the BOM is being displayed in some browsers, resulting in either blank space or the dreaded "i-diaeresis, right-angle-quotation-mark, inverted question mark" rearing its ugly head.

This screws up my formatting, nudging graphics around etc - the problem being compounded by the high number of scattered php includes to data files which ALSO require multiple alphabets, leading to more blank spaces and gobbledygook throughout each page, not just at the top.

I HAD been using Notepad2 as a text editor (shoot me!), and later learned that Notepad was automatically adding a BOM, and so went about removing the EF BB BF at the beginning of each page manually with a hex editor.
(I've also changed text editor to one that gives me the option!) :-)

Having removed the BOMs (aka "Bane Of My" existence), my formatting is back on track when viewed in, say, Firefox, but the browser is not automatically rendering the pages as UTF-8.

How do I get round this?

Yes, I can instruct users to change the encoding from the View menu, but even if they're not scared off by that (simple as it is), they'd have to change it every time they visited a new page, which is clearly an unacceptable nuisance.

So:
Does the BOM override the meta tag I mentioned above? - I.e. is there a hierarchy?
(It would seem that the tag does nothing, now that the BOMs have been removed)

How can I retain browsers' ability to detect that I want UTF-8 while preventing certain ones from wreaking havoc with my layout when they print the BOM?

I'm desperately hoping that I've fundamentally misunderstood how it all works (despite the inordinate number of pages I've been reading on the subject) and that there's a simple solution: pleeeeeeeez somebody come and tell me I'm being an idiot...!

croatiankid
02-20-2007, 03:32 PM
I'm also interested, because I've always noticed that in browsers you can specify the encoding. Is this just the encoding that is used if none is specified (i.e. "default")?

Also, you should know that your server should be configured to send the pages with a UTF-8 charset (I'd ask my hosting provider).

pakmannen
02-20-2007, 03:42 PM
BOM stands for Byte Order Mark, and it is one way for a browser to auto-detect UTF-8 encoding. It is, however, not handled by PHP, which passes the three bytes through as ordinary output, and that makes it pretty much unusable in PHP files and includes. Get hold of an editor that can save UTF-8 files without a BOM and go from there.

For utf-8 encoding to work properly you need to:
- Save your files with utf-8 encoding without BOM (you need an editor which supports this, PSPad is a good one)
- Include the charset meta tag
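
For files that have already been saved with a BOM, the cleanup can also be scripted rather than done by hand in a hex editor. This is only a minimal sketch: strip_utf8_bom is a hypothetical helper name, and the only fact it relies on is that the UTF-8 BOM is the three bytes EF BB BF mentioned earlier in the thread.

```php
<?php
// Hypothetical helper (not from the thread): returns $text without a
// leading UTF-8 BOM, or unchanged if there is none.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // the three bytes EF BB BF

    // strncmp compares raw bytes, which is what we want here.
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Example with an in-memory string standing in for a file's contents.
$withBom = "\xEF\xBB\xBF<p>Hello</p>";
echo strip_utf8_bom($withBom), "\n";
```

To apply it to a real file, one could read it with file_get_contents, pass the result through the helper and write it back with file_put_contents, after taking a backup.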

For some servers, that is all you need to do. If the browser still fails to pick up the charset, however, the server is probably set to override it with its own charset in the HTTP header. If you cannot change that setting (that is, you don't have access to the server configuration), there are ways around it: either set the content type in the HTTP header with PHP (http://se.php.net/header) or use an .htaccess file (http://www.w3.org/International/questions/qa-htaccess-charset).

(The content type in the header always overrides the meta tag)
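
A rough sketch of the PHP route, assuming the page is served through PHP and nothing has been output yet. content_type_header is a hypothetical helper used only to keep the header string in one place; header() and headers_sent() are the standard PHP functions.

```php
<?php
// Hypothetical helper: builds the Content-Type header line.
function content_type_header(string $charset = 'utf-8', string $mime = 'text/html'): string
{
    return "Content-Type: {$mime}; charset={$charset}";
}

// header() only works before any output has been sent. A BOM sitting in
// front of the opening PHP tag counts as output, which is one more reason
// to save the files without it.
if (!headers_sent()) {
    header(content_type_header());
}
```

This needs to run at the very top of each page (or in a shared include that is loaded first), before any HTML or whitespace.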

cfructose
02-20-2007, 03:43 PM
Hi CroatianKid,

A week ago I posted a 'user-ticket' to my host provider about that - still awaiting response.

I imagine with the hac^eks in your language, you've run into the same problem!

Thanks.

PS - I was following your quest with the A List Apart "Holy Grail" a while ago - I read all your posts in various places, and was going through the same troubles. I feel your pain! I'm just about to post a LONG message about it - please read it, if you have the time. Nice to meet you, by the way.

cfructose
02-20-2007, 03:47 PM
Hi,

Sorry - I didn't notice your reply before I posted that last message, only the one after.

Lots of useful info, thanks.

I can't live without php OR UTF-8 - life ain't fair!
OK...onwards and upwards!
:-)

pakmannen
02-20-2007, 03:54 PM
Hi! :)

I have been dealing with the same problem myself lately, but sort of the other way around. My server (Ubuntu) was set to override the charset with UTF-8, and this caused a great deal of confusion. It's all good now.

PHP works great with UTF-8, as long as you skip the BOM. Also remember that the standard string functions in PHP assume one byte per character (as in the ISO encodings), which means that you will get wrong values from strlen and other functions that deal with the length of strings whenever the text contains multi-byte characters. Instead, use mb_strlen (mb = multibyte) like this: mb_strlen($string, 'utf-8');
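
A small illustration of the difference, assuming the mbstring extension is installed. The Swedish word is written with explicit UTF-8 byte escapes so that the result does not depend on how the script file itself is saved.

```php
<?php
// "smörgås" as UTF-8 bytes: ö is C3 B6, å is C3 A5.
$word = "sm\xC3\xB6rg\xC3\xA5s";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

echo "bytes: $bytes, characters: $chars\n";
```

The word has seven letters, but two of them take two bytes each, so the byte count comes out two higher than the character count.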

cfructose
02-20-2007, 04:05 PM
Mmmm, thanks for that tip.
It raises a question: when I use substr in the code below, am I in any danger?

<?php
$b_pressed = "<img class=\"b-anim\" src=\"images/b-anim.gif\" alt=\"\"/>";
if (substr($page, 0, 5) == "Revie") {echo $b_pressed;}
?>

There are several pages, all of which start with the word "Review", and for all of which I want the animated gif to display, so I opted for comparing the first few letters of $page, which takes its value from basename($_SERVER['PHP_SELF']).

It works perfectly, but now I'm worried... Would it go wrong if using, say, å, ä, ö from Swedish (which many of my pages do indeed use)?

Could you be kind enough to explain that to me?

Thanks.

croatiankid
02-20-2007, 04:15 PM
Quote from cfructose:
"Hi CroatianKid,

A week ago I posted a 'user-ticket' to my host provider about that - still awaiting response.

I imagine with the hac^eks in your language, you've run into the same problem!

Thanks.

PS - I was following your quest with the A List Apart "Holy Grail" a while ago - I read all your posts in various places, and was going through the same troubles. I feel your pain! I'm just about to post a LONG message about it - please read it, if you have the time. Nice to meet you, by the way."

Great, I've just found out that I've been e-stalked for the last few months, lol j/k.

pakmannen
02-20-2007, 04:31 PM
Substr will indeed be a problem if you are using special characters such as å, ä, ö. The thing is, php counts the bytes of a string to determine its length. UTF-8 encoding, however, uses up to 4 bytes for each character.

Quote from wikipedia:
"Only one byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F). Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in other planes of Unicode."

This means that å, ä and ö will be counted by php as two characters each, which will affect substr and other string functions. If you have to use it together with special characters, use mb_substr instead. http://se2.php.net/manual/en/function.mb-substr.php

On a page encoded with utf-8:

substr('åka', 1, 3); would return: '?ka'.
mb_substr('åka', 1, 3, 'utf-8'); would return: 'ka'.
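
As for the specific "Revie" check posted earlier: it looks safe as long as the prefix being compared is plain ASCII, because in UTF-8 an ASCII byte never appears inside a multi-byte character. The sketch below uses made-up page names (written as byte escapes) and assumes the mbstring extension is available. It is an illustration, not a statement about how the actual site names its pages.

```php
<?php
// Hypothetical page names; Å is the two bytes C3 85, å is C3 A5.
$asciiStart   = "Review-\xC3\xA5ka.php"; // "Review-åka.php"
$swedishStart = "\xC3\x85ka-review.php"; // "Åka-review.php"

// Byte-based comparison against an ASCII-only prefix: reliable, because
// the first five bytes of the first name really are "Revie".
$matchA = (substr($asciiStart, 0, 5) === "Revie");   // true
$matchB = (substr($swedishStart, 0, 5) === "Revie"); // false

// Trouble starts when the cut falls inside a multi-byte character.
$brokenFirst  = substr($swedishStart, 0, 1);             // half of Å, one stray byte
$correctFirst = mb_substr($swedishStart, 0, 1, 'UTF-8'); // the whole Å

var_dump($matchA, $matchB, strlen($brokenFirst), strlen($correctFirst));
```

So the byte-based functions are fine for comparing against ASCII text, while anything that counts or slices by position in text containing å, ä or ö is better done with the mb_ functions.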

cfructose
02-20-2007, 04:50 PM
Understood. Great! And thank you again.