[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Monday, March 17, 2025

On Migrating Character Encodings

Several discussions I've had with friends and colleagues recently reminded me of an incident we faced several years ago at Yahoo!

Now Yahoo! as a company was made up of many different local offices around the world, each responsible for content in their locale. Since there was a lot of user generated content, this meant users in a particular locale could easily enter content (blog posts, restaurant reviews, etc.) in their local language script.

Everyone was happy!

From about 2005 onwards, the company was looking to unify some of the platforms used around the world. For example, we had something like 4 or 5 different platforms to do ratings and reviews, and it didn't make sense to have different architectures, database layouts, BCP setups, and a separate team managing each of these, so we started unifying. Building a common architecture was the easy part. I worked on several of these projects. Getting front end teams to migrate was also not terribly hard. Migrating content, though, was tough, because each region had content in its own locale and MySQL didn't let you set multiple character encodings on text columns.

So the i18n team started working with teams across Y! to move everything to utf-8. The easy part was changing HTTP headers and &lt;meta&gt; tags. Content was a little harder, but doable with iconv(1), since in most cases we knew the source character encoding and the destination was always utf-8. In some cases we had to guess, but it generally worked...
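In Python terms, that conversion step looks something like the sketch below. The encoding names and helper are illustrative (the real migration used iconv(1) and covered many source encodings), but the decode-then-encode round trip is the same idea:

```python
# A minimal sketch of what "iconv -f EUC-KR -t UTF-8" does to a byte stream.
# The function name and the EUC-KR example are illustrative assumptions.
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from their original encoding, re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# Content as it might have been stored pre-migration:
euc_kr_bytes = "안녕하세요".encode("euc-kr")
utf8_bytes = to_utf8(euc_kr_bytes, "euc-kr")
assert utf8_bytes.decode("utf-8") == "안녕하세요"
```

This only works cleanly when you know the source encoding; guessing it (as we sometimes had to) is where conversions go wrong.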

...until at one point we also decided to do it for authentication.

One of the things that was localized was authentication, because it allowed users in, for example, South Korea, to use Hangul characters in their passwords. Usernames were always restricted to just alphanumeric characters and underscores (if I remember correctly).

Passwords are stored, as they should be, salted and hashed, so the character encoding of the database column was always us-ascii, which is compatible with utf-8. No biggie... except that the character encoding the browser used for input was based on the HTTP headers or META tags of the page, and the transfer encoding was based on the enctype of the login FORM.

Prior to this move, these were all set to a character encoding that made sense locally, so Korea used EUC-KR and Taiwan used Big5, and so the hashed passwords were computed from the byte sequences that resulted from treating the input as one of these encodings.

After the move, the user would still type in the same password, but when we converted it to bytes, we used utf-8, which produced a different byte sequence than the original encoding did. Hashing this new sequence of bytes resulted in a different hash, and users could no longer log in. Well, only users whose passwords contained non-ASCII characters.
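You can see the failure mode in a few lines. The hash algorithm and salt below are stand-ins (I'm not describing Yahoo!'s actual scheme), but the point holds for any salted hash: same keystrokes, different bytes, different hash:

```python
import hashlib

password = "암호123"   # a password containing Hangul (illustrative)
salt = b"somesalt"     # illustrative; real systems use per-user salts

# Pre-migration: the browser submitted EUC-KR bytes, and those were hashed.
old_hash = hashlib.sha256(salt + password.encode("euc-kr")).hexdigest()

# Post-migration: the same keystrokes now arrive as UTF-8 bytes.
new_hash = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()

# The byte sequences differ, so the hashes differ, so login fails.
assert password.encode("euc-kr") != password.encode("utf-8")
assert old_hash != new_hash
```

An all-ASCII password encodes to the same bytes in both encodings, which is why only users with non-ASCII passwords were locked out.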

I forget what the actual fix was, but there were several options on the table. One was to revert the character encoding changes on the login page and to re-encode all passwords after a successful login. Another was to generate two hashes, one using utf-8 and another using the pre-migration character encoding for the region, and to accept a match on either.
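The second option, checking the submitted password under both encodings, can be sketched like this. Again, the hash function, salt handling, and function names are assumptions for illustration, not the scheme we actually used:

```python
import hashlib

def verify(password: str, salt: bytes, stored_hash: str,
           legacy_encoding: str = "euc-kr") -> bool:
    """Accept a login if the password matches under UTF-8 *or* the
    region's pre-migration encoding (hypothetical sketch)."""
    for encoding in ("utf-8", legacy_encoding):
        try:
            candidate = hashlib.sha256(salt + password.encode(encoding)).hexdigest()
        except UnicodeEncodeError:
            continue  # password can't be represented in this encoding
        if candidate == stored_hash:
            return True
    return False
```

A nice property of this approach is that a successful legacy-encoding match is also the moment you hold the plaintext, so you can re-hash under utf-8 right there and retire the old hash over time.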

...===...