Unicode in forms & PHP

03-26-2007, 10:33 PM
I'm not sure whether my problem is with HTML forms or PHP, but I hope someone here can help.

I need to create a web form that accepts words with diacritical marks (accented letters). I have a text field that accepts these characters, and I can even successfully write them into and read them from a database. However, whenever I re-populate the form from $_POST data after submitting it (so that the text input persists in the form), the diacritical marks are not retained.

Here is a skeleton PHP program that illustrates the problem. It is a standalone script that creates a form, allows the user to input a word or phrase, then when submitted, simply recreates the form -- I use the template whenever I need to create a form-based application.

Try this out and input a word like "têst". I always get "têst" back out...


<?php
// Print the HTML header

print '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> ';
print '<html> ';
print '<head> ';
print '<meta http-equiv="content-type" content="text/html;charset=utf-8"> ';
print '<title>Form Test</title> ';
print '</head> ';
print '<body> ';

// Check the dummy/hidden form field to see if we are entering this page for the first time
// or, as a result of clicking the submit button on the form. This allows us to use a single file
// to both present the initial form, and process the form.

if (!empty($_POST['_submit_check'])) {
    // If validate_form() returns errors, pass them to show_form()
    if ($form_errors = validate_form()) {
        show_form($form_errors);
    } else {
        process_form(); // Process the form with data coming from form
    }
} else {
    show_form(); // First visit: show the empty form
}

// Construct the form

function show_form($errors = '') {

    $defaults = $_POST;

    if ($errors) {
        print 'Please correct these errors: <ul><li>';
        print implode('</li><li>', $errors);
        print '</li></ul>';
    }

    print ' <form accept-charset="utf-8" id="FormTest" action="' . $_SERVER['PHP_SELF'] . '" method="post" name="FormTest">';

    print ' <label>Enter text:</label>';
    form_text("test", $defaults);

    form_submit("submitButton", "Go");

    print ' <input type="hidden" name="_submit_check" value="1"> ';

    print ' </form> ';
}


// Process a submitted form

function process_form() {
    // This skeleton does no real processing; it simply recreates the form
    show_form();
}


// Check for errors in the form and do some security checking

function validate_form() {

    $errors = array();

    // Trim leading or trailing white space

    $_POST['test'] = trim($_POST['test']);

    // Remove any (probably malicious) HTML markup

    $_POST['test'] = strip_tags($_POST['test']);

    // Return the possibly empty array of errors

    return $errors;
}

// Form Helpers

//print a text box
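function form_text($element_name, $values) {
    // Fall back to an empty string when the field has not been submitted yet
    $value = isset($values[$element_name]) ? $values[$element_name] : '';
    print '<input type="text" name="' . $element_name .'" value="';
    print htmlentities($value) . '">';
}

//print a submit button
function form_submit($element_name, $label) {
    print '<input type="submit" name="' . $element_name .'" value="';
    print htmlentities($label) .'"/>';
}

//print a textarea
function form_textarea($element_name, $values) {
    // Fall back to an empty string when the field has not been submitted yet
    $value = isset($values[$element_name]) ? $values[$element_name] : '';
    print '<textarea name="' . $element_name .'">';
    print htmlentities($value) . '</textarea>';
}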

// Print the footers

print ' </body> ';
print ' </html> ';


03-26-2007, 11:10 PM
I didn't try it myself on your particular problem, but I believe the function htmlentities() (http://us3.php.net/manual/en/function.htmlentities.php) will help you.

html_entity_decode() (http://us3.php.net/manual/en/function.html-entity-decode.php) is the reverse.

03-26-2007, 11:18 PM
The problem has nothing to do with that.

The problem is that PHP is working in single-byte character mode, and it received a multibyte string.

Hence it turns the multibyte character into two single-byte characters, à and ª.

I forgot what configures PHP to properly deal with these, as I've had the problem before as well.
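
The two-character result can be reproduced directly. This is only a minimal sketch: it assumes the submitted text arrives as UTF-8 and that the iconv extension is available, and it writes the bytes as escape sequences so that the encoding of the script file itself doesn't matter.

```php
<?php
// "ê" (U+00EA) encoded as UTF-8 is the two bytes 0xC3 0xAA.
$utf8 = "\xC3\xAA";

// strlen() counts bytes, not characters, so it reports 2 here.
$byteCount = strlen($utf8);

// Treat each byte as one ISO-8859-1 character and re-encode as UTF-8:
// 0xC3 is "Ã" and 0xAA is "ª", which is the "ê" seen in the form.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);

print $byteCount . "\n";        // 2
print bin2hex($misread) . "\n"; // c383c2aa, the UTF-8 bytes of "ê"
```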

03-26-2007, 11:25 PM
Encoding and decoding the string would solve the problem.

03-27-2007, 12:01 AM
You can fill a bucket of water by going to the ocean, and coming back.

Or you can walk over to the tap :P

I think it's better to solve the root cause, instead of working around the problem.

03-27-2007, 12:31 AM
I think the problem is that you are using htmlentities(). It converts more than just ampersands, double quotes and angle brackets into entities. By default it reads your string as ISO-8859-1, one byte at a time (unless you supply an extra argument specifying the encoding). Check your generated HTML source and you'll see what I mean.

Instead, use htmlspecialchars(). It only converts ampersands, double quotes and angle brackets, so your HTML won't break.
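
A minimal sketch of the difference, with the charset argument spelled out so the result doesn't depend on whatever default the installed PHP version uses. The input is the UTF-8 string "têst", written with byte escapes; the variable names are only for illustration.

```php
<?php
$input = "t\xC3\xAAst"; // "têst" as UTF-8 bytes

// htmlentities() told that the input is ISO-8859-1: each of the two
// bytes of "ê" is converted to its own entity.
$broken = htmlentities($input, ENT_QUOTES, 'ISO-8859-1');

// htmlspecialchars() only escapes &, <, > and quotes, so the UTF-8
// bytes pass through untouched.
$safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Markup is still neutralised.
$escaped = htmlspecialchars('<b>"a" & b</b>', ENT_QUOTES, 'UTF-8');

print $broken . "\n";  // t&Atilde;&ordf;st
print $escaped . "\n"; // &lt;b&gt;&quot;a&quot; &amp; b&lt;/b&gt;
```

The browser renders `&Atilde;&ordf;` as "ê", which matches what comes back in the form.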

03-27-2007, 12:39 AM
htmlentities() was the problem. I used it in my formhelpers. If I remove it, all works fine, even the stuff in the db that was encoded is coming out okay. It was left over from when I wrote the formhelpers before Unicode...

03-27-2007, 12:41 AM
Whoops, htmlspecialchars(), yeah that's what I meant :p
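
For reference, the corrected helper might look something like the sketch below. The name form_text_utf8() is hypothetical, it returns the markup instead of printing it so that the result is easy to check, and the isset() guard is an addition that wasn't in the original helper.

```php
<?php
// Hypothetical UTF-8-safe variant of form_text().
function form_text_utf8($element_name, $values) {
    // Use an empty string when the field has not been submitted yet.
    $value = isset($values[$element_name]) ? $values[$element_name] : '';

    // htmlspecialchars() escapes only &, <, > and quotes, so accented
    // UTF-8 characters survive the round trip.
    return '<input type="text" name="'
        . htmlspecialchars($element_name, ENT_QUOTES, 'UTF-8')
        . '" value="'
        . htmlspecialchars($value, ENT_QUOTES, 'UTF-8')
        . '">';
}

print form_text_utf8('test', array('test' => "t\xC3\xAAst")) . "\n";
```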