[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Sunday, October 16, 2005

PHP 4.3, DHTML and i18n for select boxes

I'm not a web developer, and I'm not always aware of the quirks that experienced web developers are aware of. Like this one for example.

Populating list boxes through javascript

The standard way to populate a list box using javascript (or at least the way that I learnt, is to create Option objects, and add them to the list, like this:
var o = new Option('text', 'value');
var s = document.getElementById("my-selectbox");
s.options[s.options.length] = o;

Pretty simple, and when you're doing AJAX, it's just easy to return a javascript array, or maybe even the code that needs to be executed, and just eval that in the calling script.

HTML entities in a javascript populated list box

I'd been doing this for a while, when I got a bug report from someone saying that their single quotes were showing up as &#039;. Checked up, and found out that while populating a select box using javascript, you've got to pass it the raw text — barring the primary html entities that is; <, > and & (and possibly ")). So, I set about unescaping all text before handing it to the array that was to populate the list box.

Not too much trouble. I was doing it in PHP, and this is the code I had to put in:
?>
var sel = document.getElementById("my-select-box");
sel.options.length = 0;   // delete all elements from sel
<?php
while($o = mysql_fetch_array($result, MYSQL_ASSOC))
{
$on = preg_replace("/(['\\\\])/", '\\\\$1',
html_entity_decode($o['name'], ENT_QUOTES));
?>
sel.options[sel.options.length] = new Option('<?php print $on ?>', '<?php print $o['id'] ?>'));
<?php
}
?>

Now don't get confused with the mixture of code in there. The stuff inside the <?php ... ?> blocks are php code, and the stuff outside is javascript.

What that code basically does is, fetch a bunch of records from the database. The names are stored htmlescaped in the database (because that's how we were dealing with them initially). After fetching, we need to unescape them because we're populating a select box using javascript, but because we're enclosing our javascript strings in single quotes, we need the funny looking extra preg_replace to escape single quotes in the string.

This code worked perfectly (although I should really be puttin unescaped data into the database anyway), until, that is, someone tried putting japanese characters into the name field.

What happened was a series of things. To start with, the japanese characters were showing up url encoded everywhere. Now url encoding is different from html encoding. Url encoding is where a character is converted into its hex code and prefixed by a % sign. Japanese characters in utf-8 are two bytes long, and hence take up 4 hex digits (hexits?). The url encoded representation of them was something like %u3c4A, and I had a whole bunch of strings that were filled with that.

URL encoding to HTML entities

A simple regex converted the url encoded data to an html entity:
$text = preg_replace("/%u([a-z0-9A-Z]+)/", "&#$1;", $text);

Note that there may be major flaws in this code. For starters, I pick the longest string of text after the %u that matches a hex digit. This may not always be correct. I prolly need to test that it's a valid utf8 character first, but I'll leave that for later.

Ok, so the url encoded text is converted to an html entity, and I can do this before it goes to the db, and everyone is happy. Not really. Remember that the list box populated via javascript doesn't like html entities either. It needs plain text.

My code above should have fixed that, except that it wasn't prepared for utf-8 strings. The default character set used by php 4's entity functions is iso-8859-1. Looked up the docs (I'm not a PHP programmer), and found that the third parameter to html_entity_decode was the charset, so, I changed my code to this:
html_entity_decode($o['name'], ENT_QUOTES, 'UTF-8')

But, instead of working, I got a whole bunch of errors telling me that PHP 4.3 didn't support UTF-8 conversions for MBCS (Multi byte character strings). This was a combined WTF and Gaaah moment. I'd promised I18N support, and now either PHP or javascript were playing spoilt sport.

The exact error message I received (for the search engines to index) is this:
Warning: cannot yet handle MBCS in html_entity_decode()!

Tried several things. Found out that javascript's unescape function could in fact handle the string, and I tried wrapping my text with a call to unescape(), but it didn't work.

The Eureka moment

The Eureka moment was when I realised that it was only list boxes populated via javascript that had this problem. If I had a list box completely populated with html, then there was no problem.

I'd been using javascript arrays all along because it's far cheaper to send them across a network and process on the client side than it is to send complete rendered html. Still, it seemed like I had little choice.

I decided to take advantage of the innerHTML property of all container elements, and add the options to the select element as an HTML block. This didn't work, and made the select box have one single long option.

My next thought was to tackle the select element's parent and dump the entire list box as its innerHTML. There were other problems with this approach though. Some select boxes were in a container element along with other elements, and I didn't want to have to draw all of them. One of the select boxes had javascript event handlers attached to it from the window's onload handler, and deleting and recreating the select box would effectively wipe out those handlers. Still, this was the best approach.

So, I decided to first surround all my selects by a div that did nothing else, and build the entire select box and options list as an html block, and stick that into the innerHTML property of the div. Finally, I had to reattach any event handlers that had existed before. This was simply a matter of saving the event handler before wiping out the select box, and adding it again later.

Here's my final code:
var sel = document.getElementById("my-select-box");
var fn = sel.ondblclick;
var sp = sel.parentNode;

// sel is useless after this point because we basically wipe it from memory

// get all html up to the start of the first option
// just in case we have more than just this select box in the parent
var shtml = sp.innerHTML.substr(0, sp.innerHTML.indexOf("<option"));

// get all html from the closing select onwards
// for the same reason as above
var ehtml = sp.innerHTML.substr(sp.innerHTML.indexOf("</select>"));
<?php
while($o = mysql_fetch_array($result, MYSQL_ASSOC))
{
$on = preg_replace('/%u([0-9A-F]+)/i', '&#x$1;', 
preg_replace("/(['\\\\])/", '\\\\$1', $o['name']));
?>
shtml += '\n<option value="<?php print $o['id'] ?>"><?php print $on ?></option>';
<?php
}
?>

sp.innerHTML = shtml + ehtml;

// need to reget the element because its location in memory has changed
// reset the ondblclick event handler
document.getElementById("my-select-box").ondblclick = fn;

Finally, code that works, and shows all characters correctly, and lets me go home to have lunch.

I'd also tried stuff with the DOM, like using createElement, createTextNode and appendChild, but it had the same problems as before - javascript want unescaped strings.

Now, maybe this is just a problem with firefox, I'm not sure as I hadn't tested in IE, but anyway, I found it and fixed it for me, maybe it will help someone else.

Ciao.

...===...