Character sets and transmission
ghell
02-13-2008, 02:35 PM
How should character sets work with XML?
For example, when sending an XML string from an application over a network, should the whole XML string always be sent as something like UTF-8 (so that the tags etc. are UTF-8), with <?xml ... encoding="UTF-16"?> used so that the content inside the tags can be in a different encoding such as UTF-16?
I think one of the points that may clear this confusion up for me is this: is <?xml ... encoding="..." ?> used JUST for the textual content of tags, or is it also used for the tags themselves? What about binary data that is base64 encoded? What about attributes?
If the questions are still unclear, please tell me so that I can try to clarify.
liorean
02-13-2008, 04:02 PM
The entire document, markup and content alike, should be encoded using the same character encoding scheme. Realistically, UTF-8 for the entire document makes the most sense: XML markup usually uses pure ASCII, so if the proportion of markup is in the same ballpark as the proportion of content, UTF-8 will be more compact than UTF-16. UTF-16 is the best alternative, though. Just to clarify, I mean that of all the alternatives to UTF-8, UTF-16 is the best; UTF-8 is still almost always the better choice. UTF-16 is the only encoding that is not an ASCII superset that you can serve XML as without sending the charset as part of the Content-Type HTTP header.
And if you need something that compresses better than UTF-16 or UTF-8, then it's a better idea to gzip content than to change to another encoding.
ghell
02-13-2008, 04:28 PM
Thanks for the reply.
I'm writing a format that may later be used in Asian countries so I am just a bit worried about catering for everyone.
It is not a problem for me to send the whole XML string in UTF-16 and have the encoding specified in the document, I was just more worried about how well other XML parsers would handle it, where the encoding actually gets used. Surely it needs to be decoded before the parser can actually read the "encoding" section anyway so I was wondering where it was used.
I am not doing anything to do with HTTP.
Yes, for compression I will probably end up using gzip streams (or bzip2 or whatever), but that is outside of the scope of my question, as it will be the application handling the compression and I am interested in the encoding that is used before the compression and after the decompression, so in essence the XML parser shouldn't even know whether or not it has been compressed.
liorean
02-13-2008, 06:02 PM
I'm writing a format that may later be used in Asian countries so I am just a bit worried about catering for everyone.
In UTF-16, every character is two or four bytes. All characters that take four bytes in UTF-16 also take four bytes in UTF-8, so discount those.
Then you have the rest:
The western and central European languages use one byte in UTF-8 for most of the text, occasionally two bytes, which means UTF-8 is always better for these languages than UTF-16, which always uses two bytes (UTF-16/UTF-8 = ~2, so UTF-16 is about twice as large).
Chinese, Japanese and Korean use three bytes per character in UTF-8 while UTF-16 uses only two, thus for text in these languages UTF-16 is better (UTF-16/UTF-8 = ~0.67, so UTF-16 is about a third smaller).
For most languages in between, UTF-8 uses an average of two bytes per character, with dispersed single-byte characters (whitespace, "arabic" numbers etc.), while UTF-16 uses two bytes for all characters, which means UTF-8 is very slightly better than UTF-16 (UTF-16/UTF-8 = ~1.0-1.4 depending on language).
However, you're sending markup, not pure text. Almost all markup languages are written in English, using the ASCII subset. All those ASCII characters that constitute markup, newlines and whitespace use only a single byte in UTF-8, but two bytes in UTF-16. That means, using JavaScript:
var
  portionmarkup = 0.50, // Share of the document's characters that are ASCII. A decimal number, not a per cent value.
  portionCJK = 1-portionmarkup, // The rest is assumed to be CJK text, three bytes per character in UTF-8.
  sizecomparison = 2/(portionmarkup + 3*portionCJK); // UTF-16 uses two bytes for every one of these characters.
alert('This content, if UTF-16 encoded would be '+sizecomparison+' times the size it would be if UTF-8 encoded.');
If you fiddle with the constants, you see that UTF-16 is only preferable to UTF-8 for CJK if fewer than half of the characters in the document are ASCII markup, newlines, whitespace, digits etc.
So as you see, UTF-16 is only smaller than UTF-8 for those languages if the content characters outnumber the markup characters in the document.
It is not a problem for me to send the whole XML string in UTF-16 and have the encoding specified in the document, I was just more worried about how well other XML parsers would handle it, where the encoding actually gets used.
All XML engines can handle both UTF-8 and UTF-16 perfectly fine. Whether they can handle any other encoding, however, is entirely up to the separate engine makers.
Yes, for compression I will probably end up using gzip streams (or bzip2 or whatever), but that is outside of the scope of my question, as it will be the application handling the compression and I am interested in the encoding that is used before the compression and after the decompression, so in essence the XML parser shouldn't even know whether or not it has been compressed.
Well, the thing is that the encoding matters mostly when it comes to one of two things: storage and transport. And if you GZip your content, you'll find there is nearly no benefit at all to using UTF-16 (or almost any other encoding) over using UTF-8.
Second: <?xml ... encoding="..."?> is not necessary for UTF-8 or UTF-16 encoded content. It IS necessary for any other ASCII-compatible encoding. It is redundant for any non-ASCII-compatible encoding since you need an external declaration of document encoding for those encodings to even begin to parse the document.
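A rough way to check the size argument in this post is to build sample documents and count the encoded bytes. The sketch below assumes Node.js (for Buffer and the built-in zlib module); the element names, the sample CJK text and the helper names buildSample and measure are made up for illustration.

```javascript
const zlib = require("zlib");

// Build a sample document: 8 ASCII markup characters ("<m>", "</m>", newline)
// wrapped around a chosen number of CJK characters, repeated many times.
function buildSample(cjkCharsPerElement, elements) {
  const text = "今日は世界"
    .repeat(Math.ceil(cjkCharsPerElement / 5))
    .slice(0, cjkCharsPerElement);
  return "<r>" + ("<m>" + text + "</m>\n").repeat(elements) + "</r>";
}

// Byte counts of one string in both encodings, raw and gzipped.
function measure(doc) {
  const utf8 = Buffer.from(doc, "utf8");
  const utf16 = Buffer.from(doc, "utf16le"); // no BOM; a BOM would add 2 bytes
  return {
    utf8: utf8.length,
    utf16: utf16.length,
    utf8Gzip: zlib.gzipSync(utf8).length,
    utf16Gzip: zlib.gzipSync(utf16).length,
  };
}

// Few, equal and many CJK characters relative to the 8 markup characters.
for (const cjkChars of [2, 8, 40]) {
  const m = measure(buildSample(cjkChars, 200));
  console.log(cjkChars, m, "UTF-16/UTF-8 =", (m.utf16 / m.utf8).toFixed(2));
}
```

The sample text is highly repetitive, so the gzipped figures only hint at the trend; real documents should be measured with real data.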
ghell
02-13-2008, 11:37 PM
I am aware of the differences in character sets, my question was more like this:
Say I wanted to write some content that contains a lot of languages, for example
<greetings>Hello, Bonjour, 今日は, 你好, السلام عليكم</greetings>
(English, French, Japanese, Chinese, Arabic)
Assuming there was some encode function in an arbitrary language, would it be
utf16encode(...<greetings>...</greetings>...)
or would it be
...<greetings>utf16encode(...)</greetings>...
I think this was answered in the very last part of your last post about encoding="..." being redundant for UTF, as this would probably imply that the entire thing would need to be encoded and the encoding would have to be defined outside the XML. Can you confirm this?
liorean
02-13-2008, 11:48 PM
I think this was answered in the very last part of your last post about encoding="..." being redundant for UTF, as this would probably imply that the entire thing would need to be encoded and the encoding would have to be defined outside the XML. Can you confirm this?
Yes. It's the contents of the XML file as a whole you specify the encoding for. XML has no way of including content in mixed encodings.
XML has mandated support and detection for UTF-8 and UTF-16. That means those encodings both work without being specified in the XML prologue and with no charset specified in Content-Type headers (or associated through file type, file metadata, MIME type etc.).
All other encodings need to be specified, either in the XML prologue (which requires that the encoding is ASCII compatible) or externally.
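To illustrate how a parser can find the encoding before it has decoded anything, here is a minimal sketch that looks at the first bytes of a document, loosely following the autodetection notes in Appendix F of the XML specification. It assumes Node.js; the function name sniffXmlEncoding and its return strings are hypothetical, and UTF-32 and EBCDIC detection are left out for brevity.

```javascript
// Guess the encoding of an XML document from its first bytes.
function sniffXmlEncoding(bytes) {
  const b = (i) => bytes[i];
  // Byte order marks
  if (b(0) === 0xfe && b(1) === 0xff) return "UTF-16BE (BOM)";
  if (b(0) === 0xff && b(1) === 0xfe) return "UTF-16LE (BOM)";
  if (b(0) === 0xef && b(1) === 0xbb && b(2) === 0xbf) return "UTF-8 (BOM)";
  // "<?" as 16-bit code units without a BOM
  if (b(0) === 0x00 && b(1) === 0x3c && b(2) === 0x00 && b(3) === 0x3f) return "UTF-16BE (no BOM)";
  if (b(0) === 0x3c && b(1) === 0x00 && b(2) === 0x3f && b(3) === 0x00) return "UTF-16LE (no BOM)";
  // ASCII-compatible encodings: the declaration itself is readable as ASCII.
  const head = Buffer.from(bytes.subarray(0, 200)).toString("latin1");
  const m = /^<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  return m ? m[1] : "UTF-8 (default)";
}

// The whole document, markup and content alike, is encoded in one go.
const doc = '<?xml version="1.0"?><greetings>Hello, 今日は</greetings>';
const asUtf8 = Buffer.from(doc, "utf8");
const asUtf16 = Buffer.concat([Buffer.from([0xff, 0xfe]), Buffer.from(doc, "utf16le")]);

console.log(sniffXmlEncoding(asUtf8));
console.log(sniffXmlEncoding(asUtf16));
```

Reading the declaration before decoding works because every ASCII-compatible encoding represents the characters of the XML declaration with the same bytes.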
oesxyl
02-14-2008, 05:27 AM
I am aware of the differences in character sets, my question was more like this:
Say I wanted to write some content that contains a lot of languages, for example
<greetings>Hello, Bonjour, 今日は, 你好, السلام عليكم</greetings>
(English, French, Japanese, Chinese, Arabic)
Assuming there was some encode function in an arbitrary language, would it be
utf16encode(...<greetings>...</greetings>...)
or would it be
...<greetings>utf16encode(...)</greetings>...
I think this was answered in the very last part of your last post about encoding="..." being redundant for UTF, as this would probably imply that the entire thing would need to be encoded and the encoding would have to be defined outside the XML. Can you confirm this?
The XML spec says that if you don't specify any encoding in the XML prolog, the default is UTF-8. :)
It is a matter of schema design for your XML file. If you use the greetings element in more than one language, it is a better choice to use the xml:lang attribute for each language.
<greetings xml:lang="en">Hello</greetings>
<greetings xml:lang="fr">Bonjour</greetings>
<greetings xml:lang="??">今日は</greetings>
<greetings xml:lang="??">你好</greetings>
<greetings xml:lang="??">السلام عليكم</greetings>
Also I want to mention that XML has its own compression mode with 8 levels (I think it is gzip, but I am not sure which algorithm), and using bzip or bzip2 can create problems over the net, in the sense that it can make transfer and manipulation of the file more difficult and non-standard.
I hope this helps. :)
best regards
liorean
02-14-2008, 06:50 AM
the xml spec say that if you don't specify any encoding in xml prolog default is utf-8. :)
<uri:http://www.w3.org/TR/xml/#charsets>
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1 [Unicode3]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.
Also I want to mention that xml have it's own compression mode on 8 levels, I thing is gzip not sure what alghoritm, and using bzip or bzip2 can create problem over the net in sense that can make transfer and manipulation of the file more difficult and non standard.
XML does not have its own compression model. Using GZip only negatively affects networking clients that don't implement GZip at all. You just need to check the Accept-* headers.
oesxyl
02-14-2008, 07:21 AM
<uri:http://www.w3.org/TR/xml/#charsets>
XML does not have its own compression model. Using GZip only negatively affects networking clients that don't implement GZip at all. You just need to check the Accept-* headers.
Both observations are accurate, thank you for correcting me.
Usually I don't post just to confirm that I was wrong or inaccurate, unless I notice it myself and nobody has corrected what I said; it is of no use to the original poster for me to explain why. :)
This time I am posting because when I replied I had in mind one of the most used backends on the net, as far as I know: libxml and libxml2. I think it is more effective to use gzip compression, which is already implemented, than to reimplement compression using bzip or bzip2.
Using headers is a matter of transmission and I agree with you about using the correct header, but I don't agree about the negative effect; I'm thinking more specifically of bandwidth. I guess there are only a few user agents that do not support gzip. If you have time to spend on this subject, I would like you to give more details.
To avoid any confusion: libxml, libxml2 and libxslt are compliant with W3C standards, at least regarding encoding in the prolog. I am not using this as an excuse regarding liorean's first observation; I have nothing to add about that.
best regards
ghell
02-14-2008, 11:19 AM
Thanks everyone.
There is nothing wrong with bzip2. bzip2 has no patent issues of the kind LZW compression had, much as PNG was developed to replace GIF and its patents. gzip is just more widely used because it is a lot older. bzip2 usually gives better text compression but is more expensive to compress and decompress.
There is no need to reimplement bzip2 any more than there is to reimplement gzip. There are plenty of libraries that do them, and if there weren't you would need to reimplement gzip as well, so it is no different.
The only con of using compression (pretty much any compression) for me is that very small data (under a few hundred bytes) actually ends up bigger when compressed due to the overhead. There is obviously also a computational time issue. For example:
$ cat text.txt
Hello
$ ls -l
-rw-r--r-- 1 6 2008-02-14 11:09 text.txt
-rw-r--r-- 1 44 2008-02-14 11:09 text.txt.bz2
-rw-r--r-- 1 35 2008-02-14 11:09 text.txt.gz
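The same overhead can be reproduced in code. This is a small sketch assuming Node.js and its built-in zlib module; bzip2 is not part of Node's standard library, so only gzip is shown, and the larger sample document is made up for comparison.

```javascript
const zlib = require("zlib");

// The same 6 bytes as text.txt above.
const tiny = Buffer.from("Hello\n", "utf8");
const tinyGz = zlib.gzipSync(tiny);
console.log("tiny:", tiny.length, "bytes raw,", tinyGz.length, "bytes gzipped");

// A larger, repetitive document compresses well despite the fixed overhead.
const big = Buffer.from("<greetings>Hello</greetings>\n".repeat(500), "utf8");
const bigGz = zlib.gzipSync(big);
console.log("big:", big.length, "bytes raw,", bigGz.length, "bytes gzipped");
```

Every gzip stream carries at least 18 bytes of header and trailer, so a 6-byte input cannot come out smaller; an application could simply skip compression below some size threshold.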
Again, I have no content-encoding or accept headers or anything like that. What I am doing has absolutely nothing to do with HTTP in the slightest and they are both HTTP headers. I also do not need to worry about http user agents (web browsers) and their support levels, as what I am doing has nothing to do with the www and will never be viewed in a browser. I am using XML to transfer data between networked applications that may reside on the internet, but not the web.
My structure puts no limit on language. My example was just there to show that a single tag may contain multiple languages (I just used "Hello" as an example in all of them, they are not necessarily the same). For example, one may send a message in Chinese and then send a reply in English (or any other language), quoting part of the original Chinese message. This type of message is more common than you may think because people may be able to read the Chinese but struggle to write it, particularly if they only speak one dialect but can just about read another such as Hokkien. This means that I need a wide character set to be able to hold a wide range of characters at the same time.
xml have it's own compression mode on 8 levels
What did you mean by this? (XML does not have its own but you probably based this on something)
Now that that is all cleared up, I have another couple of questions:
When data is transmitted using 7 bits, binary data must be represented in base64. If I am transmitting the entire message in UTF-16, is there any better way of encoding binary data? Ignoring compression, binary data will end up very large if base64 encoding is used over UTF-16.
One more question: if existing XML parsers must support UTF-16, do they have any specific requirements about endianness? Can I safely transmit data in any one of little endian, big endian and byte order mark and just assume that any existing XML parser that could be used in the receiving application will be able to decode it, or must it be a certain endianness?
liorean
02-14-2008, 02:43 PM
When data is transmitted using 7 bits, binary data must be represented in base64. If I am transmitting the entire message in UTF-16, is there any better way of encoding binary data, as ignoring compression, binary data will end up very large if base64 encoding is used over UTF-16.
Again a great argument for using UTF-8 instead of UTF-16! :)
You could consider using Content-Type "multipart/related" (or MHTML) with CID URIs though. It adds some boilerplate but removes the need to base64 encode stuff.
One more question - If existing XML parsers must support UTF-16, do they have any specific requirements about endian? Can I safely transmit data in any one of little endian, big endian and byte order mark and just assume that any existing XML parser that could be used in the receiving application will be able to decode it, or must it be a certain endian?
They must support UTF-16 with a byte order mark as the first two bytes in the document. They must support UTF-8. They do not have to support other related encodings like CESU-8 (like UTF-8, but encoding the four-byte UTF-16 characters as two separate three-byte UTF-8-like characters instead of as a single four-byte UTF-8 character) or byte-order-mark-less UTF-16-BE or UTF-16-LE. Nor do they need to support the less frequent UTF-7, UTF-32, UTF-1, UTF-9, UTF-18 etc. encodings.
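As a sketch of what "UTF-16 with a byte order mark" means on the wire, the following encodes a string in either byte order with a leading BOM and decodes it again by inspecting that BOM. It assumes Node.js; the helper names encodeUtf16WithBom and decodeUtf16WithBom are hypothetical, not part of any XML library.

```javascript
// Encode a string as UTF-16 with a byte order mark, in either byte order.
function encodeUtf16WithBom(text, bigEndian) {
  // U+FEFF first, encoded little-endian: bytes FF FE, then the text.
  const body = Buffer.from("\ufeff" + text, "utf16le");
  // swap16 flips every 2-byte unit in place, giving FE FF and big-endian text.
  return bigEndian ? body.swap16() : body;
}

// Decode UTF-16 bytes, using the BOM to pick the byte order.
function decodeUtf16WithBom(bytes) {
  if (bytes.length < 2 || bytes.length % 2 !== 0) {
    throw new Error("not a whole number of UTF-16 code units");
  }
  const copy = Buffer.from(bytes); // avoid mutating the caller's buffer
  if (copy[0] === 0xfe && copy[1] === 0xff) {
    copy.swap16(); // big-endian input: convert to little-endian
  } else if (!(copy[0] === 0xff && copy[1] === 0xfe)) {
    throw new Error("no UTF-16 byte order mark");
  }
  return copy.toString("utf16le", 2); // skip the two BOM bytes
}

const message = "<greetings>你好</greetings>";
const le = encodeUtf16WithBom(message, false);
const be = encodeUtf16WithBom(message, true);
console.log(le.subarray(0, 2), be.subarray(0, 2));
console.log(decodeUtf16WithBom(le) === message, decodeUtf16WithBom(be) === message);
```

With the BOM present, the receiver does not need to be told the byte order out of band, so either order can be sent.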
oesxyl
02-14-2008, 03:02 PM
What did you mean by this? (XML does not have its own but you probably based this on something)
libxml has internal support for gzip: the parser detects a compressed file and decompresses it on load, and compressing an XML file is reduced to setting the level of compression, between 0 and 8, with the same meaning as for gzip, bzip or bzip2.
Now that that is all cleared up, I have another couple of questions:
When data is transmitted using 7 bits, binary data must be represented in base64. If I am transmitting the entire message in UTF-16, is there any better way of encoding binary data, as ignoring compression, binary data will end up very large if base64 encoding is used over UTF-16.
I don't know too much about this, maybe liorean could help.
I know there is a W3C recommendation about serialization of binary data:
http://www.w3.org/TR/xop10/
but I don't know how this works in real life. :)
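For a sense of scale, the cost of base64 inside a UTF-8 or UTF-16 document can be worked out directly. This is a minimal sketch assuming Node.js; base64WireSize is a hypothetical helper and the 3000-byte payload is an arbitrary stand-in for binary data.

```javascript
// Bytes needed to carry n bytes of binary data as base64 text.
function base64WireSize(n) {
  const chars = 4 * Math.ceil(n / 3); // padded base64, no line breaks
  return { raw: n, utf8: chars, utf16: 2 * chars };
}

// Cross-check the formula against a real encoding of a sample payload.
const payload = Buffer.alloc(3000, 0xab);
const b64 = payload.toString("base64");
console.log(base64WireSize(payload.length));
console.log(Buffer.byteLength(b64, "utf8"), Buffer.byteLength(b64, "utf16le"));
```

Before compression, base64 therefore costs about 4/3 of the raw size in UTF-8 and about 8/3 in UTF-16.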
One more question - If existing XML parsers must support UTF-16, do they have any specific requirements about endian? Can I safely transmit data in any one of little endian, big endian and byte order mark and just assume that any existing XML parser that could be used in the receiving application will be able to decode it, or must it be a certain endian?
As far as I know this is solved at the TCP/IP level. Anyway, if this is somehow not true, it's easy to solve using the htonl/ntohl and htons/ntohs functions; see man htonl. This is at the transport level, before the parser gets involved; it is like sending a JPG file. :)
best regards