I just don't understand this character encoding thing. Never have. ANSI, ASCII, UTF, Unicode... it may as well be Greek...
So: my simple CMS lets admin users type content into a textarea. Before it goes into the database, I take care of any dodgy chars (e.g. single quotes) with escape(). When it comes out, I replace "%0D%0A" with a couple of <br/> tags, then unescape() it and dump it all on the page.
Generally, that's fine. However - one client uses a Mac to update her site. It's not causing me any problems as such, although the double quotes look a bit...odd. But she's saying that some chars are getting replaced with those "I don't know what char this is supposed to be" question mark symbols.
To clarify (hopefully - I hope the forum software doesn't do exactly what I'm trying to do and fix the dodgy char):
- Client pastes a “ into textarea.
- I escape() it. Apparently <%=escape("“")%> returns %E2%u20AC%u0153.
- <%=asc("“")%> returns 226
- I try to fix it with output= replace(output,"“", "“") - which does, it seems, nothing.
So... can anyone explain to me, preferably in words of two syllables or less, what the nuts is going on and how to fix it? Is it character encodings? Is it locale IDs? Is it ANSI or Unicode? What is it?
How the chuff do I find these things and replace them with something... standard?
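A rough sketch of what those numbers suggest, written in Python rather than ASP purely to show the bytes. It assumes the browser sent the curly quote as UTF-8 and the server then read those bytes as Windows-1252; that is an assumption, not something confirmed in this thread. Under that assumption the three UTF-8 bytes become three separate characters, which lines up with the %E2%u20AC%u0153 from escape() and the 226 from asc().

```python
# Sketch: a left curly quote sent as UTF-8 but read as Windows-1252.
# Assumption: cp1252 is the server's ANSI codepage (not confirmed in the thread).

curly = "\u201c"                        # the “ character

utf8_bytes = curly.encode("utf-8")      # what a UTF-8 page would submit
misread = utf8_bytes.decode("cp1252")   # same bytes, wrong codepage

print(utf8_bytes.hex())                    # e2809c
print([hex(ord(ch)) for ch in misread])    # ['0xe2', '0x20ac', '0x153']
print(ord(misread[0]))                     # 226

# Reading the same bytes with the right encoding gives the quote back.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == curly)                   # True
```

If that is what is happening, a replace() on the curly quote finds nothing because the string no longer contains one; it contains the three misread characters instead.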
So... what you're saying is that I need to take the user input and UTF-8 encode it?
The web seems awash with UTF-8 encoding functions: here's some I found at CodeToad:
Code:
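<%
' Decodes two-byte UTF-8 sequences for characters 128 to 255 back to single characters.
' Any other sequence with the high bit set becomes chr(191), an inverted question mark.
function DecodeUTF8(s)
    dim i
    dim c
    dim n
    i = 1
    do while i <= len(s)
        c = asc(mid(s,i,1))
        if c and &H80 then
            ' n counts this byte plus the continuation bytes (10xxxxxx) after it
            n = 1
            ' <= rather than <, so a continuation byte in the last position is counted
            do while i + n <= len(s)
                if (asc(mid(s,i+n,1)) and &HC0) <> &H80 then
                    exit do
                end if
                n = n + 1
            loop
            if n = 2 and ((c and &HE0) = &HC0) then
                ' lead byte C2 keeps the trail byte as-is, C3 adds &H40 to it
                c = asc(mid(s,i+1,1)) + &H40 * (c and &H01)
            else
                c = 191
            end if
            s = left(s,i-1) + chr(c) + mid(s,i+n)
        end if
        i = i + 1
    loop
    DecodeUTF8 = s
end function

' Encodes characters 128 to 255 as two-byte UTF-8 sequences; ASCII passes through.
function EncodeUTF8(s)
    dim i
    dim c
    i = 1
    do while i <= len(s)
        c = asc(mid(s,i,1))
        if c >= &H80 then
            ' lead byte is C2 or C3, trail byte is the character with bit 6 cleared
            s = left(s,i-1) + chr(&HC2 + ((c and &H40) / &H40)) + chr(c and &HBF) + mid(s,i+1)
            ' skip the extra byte just inserted
            i = i + 1
        end if
        i = i + 1
    loop
    EncodeUTF8 = s
end function
%>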
That look about right to you? If so... integrating this into my current code would be something like:
- take user input
- UTF-8 encode
- escape()
- drop into database
... and exactly the same in reverse for displaying on a page?
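The steps above can be sketched as below, in Python rather than ASP, just to make the round trip concrete. `store` and `display` are hypothetical names, and `urllib.parse.quote` only stands in for escape(): it is not the same function, since it always percent-encodes the UTF-8 bytes whereas escape() produced %uXXXX sequences earlier in this post.

```python
from urllib.parse import quote, unquote


def store(text: str) -> str:
    """Hypothetical 'before the database' step: UTF-8 encode, then percent-escape."""
    return quote(text, safe="")  # quote() encodes as UTF-8 by default


def display(stored: str) -> str:
    """Hypothetical 'on the way out' step: line breaks to <br/> tags, then unescape."""
    return unquote(stored.replace("%0D%0A", "<br/><br/>"))


saved = store("\u201cHi\u201d\r\nthere")
print(saved)           # %E2%80%9CHi%E2%80%9D%0D%0Athere
print(display(saved))  # “Hi”<br/><br/>there
```

The line-break replacement has to happen before unescaping, as in the original description, because "%0D%0A" only exists in the escaped form.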
God knows why I've never come up against this one before...
ps. Just to clarify, all pages (admin forms and front-end display) have the following:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
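On the CodeToad EncodeUTF8 function quoted above: a small Python sketch (not ASP) of the same arithmetic, to show what range it covers. `encode_two_byte` is a hypothetical name. The comparison with Python's own encoder assumes the input value is a Latin-1 code point, which is not necessarily what asc() returns under every server codepage.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Same arithmetic as EncodeUTF8, for one code point in 0x80..0xFF."""
    lead = 0xC2 + ((code_point & 0x40) >> 6)  # 0xC2 for 0x80-0xBF, 0xC3 for 0xC0-0xFF
    trail = code_point & 0xBF                 # clear bit 6, leaving 10xxxxxx
    return bytes([lead, trail])


# Matches Python's own encoder across the whole range the function handles.
print(all(encode_two_byte(cp) == chr(cp).encode("utf-8")
          for cp in range(0x80, 0x100)))      # True

# A curly quote is outside that range: it needs three bytes, not two.
print(len("\u201c".encode("utf-8")))          # 3
```

So the function covers characters 128 to 255 and nothing beyond; a curly quote (U+201C) is out of its reach.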
Last edited by Spudhead; 08-20-2007 at 04:05 PM..
Reason: clarifimification
You have probably not come up against this before because TEXTAREAs are not the same as WYSIWYG editors.
If the client/user is using XML schemas at all, like in Office 2000 and above, WYSIWYG editors use said XML schemas and they can screw up your input. COPY AND PASTE is a blessing and a curse.
XML schemas, unless specified otherwise, are UNICODE. Textareas use the server's encoding (i.e., UTF-8 or whatever you tell IIS to use).
Happened to me the first time I created one, and I haven't looked back since.
Your code looks right, but you may be able to use the ASP Server.HTMLEncode method to do the work for you.
You might want to try that, but I cannot guarantee that will work.
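Whether Server.HTMLEncode does this for curly quotes on a given codepage is not something this sketch establishes. It only shows the general idea, in Python rather than ASP: anything outside ASCII is written as a numeric character reference, so the output no longer depends on the charset the page is served with. `to_ascii_html` is a hypothetical helper name.

```python
import html


def to_ascii_html(text: str) -> str:
    """Escape markup characters, then write non-ASCII characters as &#NNNN; references."""
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html("\u201cFish & chips\u201d"))
# &#8220;Fish &amp; chips&#8221;
```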
Ok, thanks for the info. Will look into altering the Codepage. Have taken interim measure of emailing client with "stop pasting stuff out of Word, it's screwing everything up".