Character encoding, cleaning CMS input etc.

08-20-2007, 02:10 PM
I just don't understand this character encoding thing. Never have. ANSI, ASCII, UTF, Unicode... it may as well be Greek...

So: my simple CMS lets admin users type content into a textarea. Before it goes into the database, I take care of any dodgy chars (e.g. single quotes) with escape(). When it comes out, I replace "%0D%0A" with a couple of <br/> tags, then unescape() it and dump it all on the page.

Generally, that's fine. However - one client uses a Mac to update her site. It's not causing me any problems as such, although the double quotes look a bit...odd. But she's saying that some chars are getting replaced with those "I don't know what char this is supposed to be" question mark symbols.

To clarify (hopefully - I hope the forum software doesn't do exactly what I'm trying to do and fix the dodgy char):

- Client pastes a “ into textarea.
- I escape() it. Apparently <%=escape("“")%> returns %E2%u20AC%u0153. :confused:
- <%=asc("“")%> returns 226
- I try to fix it with output = replace(output, "“", "&ldquo;") - which, it seems, does nothing. :confused:
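The numbers in the steps above are consistent with a UTF-8 curly quote being read as Windows-1252. The following Python sketch (not ASP) reproduces them, assuming the pasted character is a left curly double quote (U+201C). The helper js_style_escape is a hypothetical, rough imitation of escape(), written only for this illustration.

```python
# Assumption: the pasted character is U+201C, LEFT DOUBLE QUOTATION MARK.
curly = "\u201c"

# UTF-8 stores that one character as three bytes: E2 80 9C.
utf8_bytes = curly.encode("utf-8")

# Reading those three bytes as Windows-1252 yields three separate characters.
misread = utf8_bytes.decode("cp1252")

# Characters that escape() leaves untouched.
SAFE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./")

def js_style_escape(text):
    """Rough imitation of escape(): %XX below 256, %uXXXX above (BMP only)."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in SAFE:
            out.append(ch)
        elif code < 256:
            out.append("%%%02X" % code)
        else:
            out.append("%%u%04X" % code)
    return "".join(out)

escaped = js_style_escape(misread)   # matches the string reported in the post
first_code = ord(misread[0])         # matches the asc() result reported in the post

# Undoing the wrong decoding recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")

print(escaped, first_code)
```

If this reading is right, the replace() does nothing because the string holds three characters, not the single curly quote being searched for.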

So... can anyone explain to me, preferably in words of two syllables or less, what the nuts is going on and how to fix it? Is it character encodings? Is it locale IDs? Is it ANSI or Unicode? What is it? :confused:

How the chuff do I find these things and replace them with something... standard?

08-20-2007, 02:34 PM
Welcome to MS Word as an HTML editor...

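MS Word and Mac Word automatically substitute typographic characters such as curly quotes (Unicode characters outside ASCII), which produces the effect that you are experiencing.

How do you fix it? Use UTF-8 everywhere. Note that Windows servers do not use UTF-8 by default: classic ASP falls back to the system ANSI code page (typically Windows-1252 on Western systems) unless you set the code page to 65001 (UTF-8).

Set that encoding scheme on your form page.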

To see the characters that they are using, go to START > RUN > charmap (or, Start > All Programs > Accessories > Character Map)

Font: Times New Roman

The first character to look at is double quotes, first line, second character in.

Now use the GO TO UNICODE Box: Type in 02DD, 201C, 201D, and 2033.

This will show you all the different types of double quotes (although not all are named 'double quotes').
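The same lookup can be done without Character Map. This Python sketch uses the standard unicodedata module to name the four code points listed above. The replacement table ENTITY_FOR and the function quotes_to_entities are hypothetical helpers, shown only to illustrate swapping curly quotes for HTML entities.

```python
import unicodedata

# Code points from the post, written as typed into the "Go to Unicode" box.
CODE_POINTS = ["02DD", "201C", "201D", "2033"]

def describe(hex_code):
    """Return (character, official Unicode name) for a hex code point string."""
    ch = chr(int(hex_code, 16))
    return ch, unicodedata.name(ch)

# Hypothetical replacement table: curly double quotes to named HTML entities.
ENTITY_FOR = {"\u201c": "&ldquo;", "\u201d": "&rdquo;"}

def quotes_to_entities(text):
    """Replace each curly double quote with its named entity."""
    for ch, entity in ENTITY_FOR.items():
        text = text.replace(ch, entity)
    return text

for code in CODE_POINTS:
    # Print only the code and the name, so the output stays plain ASCII.
    print(code, describe(code)[1])
```

A replacement like this only works if the text was decoded correctly first, so that each quote really is one character in the string.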

08-20-2007, 04:59 PM
It's getting a little clearer, thanks :)

So... what you're saying is that I need to take the user input and UTF-8 encode it?

The web seems awash with UTF-8 encoding functions: here are some I found at CodeToad (http://www.codetoad.com/asp_utf8.asp):

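function DecodeUTF8(s)
  dim i
  dim c
  dim n
  i = 1
  do while i <= len(s)
    c = asc(mid(s,i,1))
    ' A byte with the high bit set starts a multi-byte UTF-8 sequence.
    if c and &H80 then
      ' Count the continuation bytes (10xxxxxx) that follow.
      ' <= (not <) so that a sequence at the very end of the string is handled.
      n = 1
      do while i + n <= len(s)
        if (asc(mid(s,i+n,1)) and &HC0) <> &H80 then
          exit do
        end if
        n = n + 1
      loop
      if n = 2 and ((c and &HE0) = &HC0) then
        ' Two-byte sequence: rebuild a character code in the range &H80-&HFF.
        c = asc(mid(s,i+1,1)) + &H40 * (c and &H01)
      else
        ' Anything longer (e.g. a three-byte curly quote) becomes chr(191).
        c = 191
      end if
      s = left(s,i-1) + chr(c) + mid(s,i+n)
    end if
    i = i + 1
  loop
  DecodeUTF8 = s
end function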

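function EncodeUTF8(s)
  dim i
  dim c
  i = 1
  do while i <= len(s)
    c = asc(mid(s,i,1))
    ' Only single-byte codes &H80-&HFF are handled; each becomes two bytes.
    if c >= &H80 then
      s = left(s,i-1) + chr(&HC2 + ((c and &H40) / &H40)) + chr(c and &HBF) + mid(s,i+1)
      ' Skip the extra byte that was just inserted.
      i = i + 1
    end if
    i = i + 1
  loop
  EncodeUTF8 = s
end function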
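For comparison, here is a minimal Python sketch of UTF-8 encoding for any single code point, not just the two-byte range that the functions above cover. The name utf8_encode_code_point is made up for this example. It does not reject surrogate code points, so treat it as an illustration of the bit layout rather than a production encoder.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point (an int) as UTF-8 bytes by hand."""
    if cp < 0x80:
        # One byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

two_byte = utf8_encode_code_point(0x00E9)    # e with acute accent
three_byte = utf8_encode_code_point(0x201C)  # left curly double quote
print(len(two_byte), len(three_byte))
```

The curly quotes fall in the three-byte range, which is why a two-byte-only routine cannot round-trip them.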

That look about right to you? If so... integrating this into my current code would be something like:

- take user input
- UTF-8 encode
- escape()
- drop into database

... and exactly the same in reverse for displaying on a page?
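One possible alternative to the encode-then-escape() pipeline above is sketched below in Python rather than ASP. It is a different technique, not the poster's method: the text is stored unchanged through a parameterised query, and HTML escaping happens only on output. The table name pages and the functions save_content and render_content are invented for this illustration, and SQLite stands in for whatever database the CMS actually uses.

```python
import html
import sqlite3

def save_content(conn, body):
    """Store the text as-is; the placeholder lets the driver handle quotes."""
    conn.execute("INSERT INTO pages (body) VALUES (?)", (body,))

def render_content(body):
    """Escape HTML-special characters first, then turn line breaks into tags."""
    safe = html.escape(body, quote=True)
    return safe.replace("\r\n", "<br/>").replace("\n", "<br/>")

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (body TEXT)")

original = "It's a \u201cquote\u201d\r\nsecond line"
save_content(conn, original)

stored = conn.execute("SELECT body FROM pages").fetchone()[0]
page_html = render_content(stored)
```

With this arrangement the single quote never needs manual escaping on the way in, and the curly quotes pass through untouched as long as the page is served as UTF-8.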

God knows why I've never come up against this one before...

ps. Just to clarify, all pages (admin forms and front-end display) have the following:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

08-20-2007, 05:07 PM
You have probably not come up against this before because TEXTAREAs are not the same as WYSIWYG editors.

If the client/user is using XML schemas at all, like in Office 2000 and above, WYSIWYG editors use said XML schemas and they can screw up your input. COPY AND PASTE is a blessing and a curse.

XML schemas, unless specified otherwise, are UNICODE. Textareas submit their text in the encoding of the page that contains the form (i.e., UTF-8 or whatever you tell IIS to serve).
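When the encoding of submitted text is uncertain, one common defensive approach is to try strict UTF-8 first and fall back to Windows-1252. The Python sketch below shows the idea. The function name decode_submission is hypothetical, and the choice of Windows-1252 as the fallback is an assumption about where the stray bytes come from.

```python
def decode_submission(raw):
    """Decode submitted bytes: strict UTF-8 first, then Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; bytes undefined in Windows-1252 become U+FFFD.
        return raw.decode("cp1252", errors="replace")

from_utf8_page = decode_submission("\u201c".encode("utf-8"))  # bytes E2 80 9C
from_ansi_page = decode_submission(b"\x93")                   # curly quote in Windows-1252
```

Both calls give the same curly quote, whichever of the two encodings the browser actually sent.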

Happened to me the first time I created one, and I haven't looked back since.

Your code looks right, but you may be able to use the ASP Server.HTMLEncode method to do the work for you.

You might want to try that, but I cannot guarantee that will work.
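A related idea, sketched in Python, is to escape the HTML-special characters and then write everything outside ASCII as numeric character references, so the output no longer depends on the page encoding. This is only similar in spirit to the suggestion above and makes no claim about what Server.HTMLEncode itself outputs. The name to_ascii_html is invented for the example.

```python
import html

def to_ascii_html(text):
    """Escape HTML-special characters, then emit non-ASCII as &#NNNN; references."""
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

safe_output = to_ascii_html("\u201cA & B\u201d")
print(safe_output)
```

The result is pure ASCII, so it should display the same whether the page is read as UTF-8 or as an ANSI code page.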

08-20-2007, 05:13 PM
More data for you:


08-20-2007, 05:41 PM
Ok, thanks for the info. Will look into altering the Codepage. Have taken interim measure of emailing client with "stop pasting stuff out of Word, it's screwing everything up".