i18n HTML: Bring the Pain

I have to stay up a little later this evening than I’d planned, so as a result I’m finally going through all the tabs and browser windows I’ve had open on my personal laptop. I think some of these have been “open” for months (yes, there have been browser restarts, but they’re always there when the session restores). One that I’ve meant to blog is Wil Clouser’s post on string substitution in .po files. It’s actually [at least] his second post on the subject, recanting his prior advice, coming around to what others told him previously: don’t use substitution strings in .po files.

I wasn’t aware of Chris’s previous advice, but had I read it when first published, I would have nodded my head vigorously; after all, that’s how we did it. Er, that’s how we, uh, do it. And we’re not really in a position to change that at the moment, although we’ve certainly looked pretty hard at the issue.

A bit of background: One of the core pieces of technology we’ve built at Creative Commons is the license chooser. It’s a relatively simple application, with a few wrinkles that make it interesting. It manages a lot of requests, a lot of languages, and has to spit out the right license (type, version, and jurisdiction) based on what the user provides. The really interesting thing it generates is some XHTML with RDFa that includes the license badge, name, and any additional information the user gives us; it’s this metadata that we use to generate the copy and paste attribution HTML on the deed. So what does this have to do with internationalization? The HTML is internationalized. And it contains substitutions. Yikes.

To follow in the excellent example of AMO and Gnome, we’d start using English as our msgids, leaving behind the current symbolic keys of the past. Unfortunately it’s not quite so easy. Every time we look at this issue (and for my first year as CTO we really looked; Asheesh can atest we looked at it again and again) and think we’ve got it figured out, we realize there’s another corner case that doesn’t quite work.

The real issue with the HTML is the HTML: zope.i18n, our XSLT selectors, the ZPT parse tree: none of them really play all that well with HTML msgids. The obvious solution would be to get rid of the HTML in translation, and we’ve tried doing that, although we keep coming back to our current approach. I guess we’re always seduced by keeping all the substitution in one place, and traumatized by the time we tried assembling the sentences from smaller pieces.

So if we accept that we’re stuck with the symbolic identifiers, what do we do? Build tools, of course. This wasn’t actually an issue until we started using a “real” translation tool — Pootle, to be specific. Pootle is pretty powerful, but some of the features depend on having “English” msgids. Luckily it has no qualms about HTML in those msgids, it has decent VCS support, and we know how to write post-commit hooks.

To support Pootle and provide a better experience for our translators, we maintain two sets of PO files: the “CC style” symbolic msgid files, and the “normal” English msgid files. We keep a separate “master” PO file where the msgid is the “CC style” msgid, and the “translation” is the English msgid. It’s this file that we update when we need to make changes, and luckily using that format actually makes the extraction work the way it’s supposed to. Or close. And when a user commits their work from Pootle (to the “normal” PO file), a post-commit hook keeps the other version in sync.

While we’ve gotten a lot better at this and have learned to live with this system, it’s far from perfect. The biggest imperfection is its custom nature: I’m still the “expert”, so when things go wrong, I get called first. And when people want to work on the code, it takes some extra indoctrination before they’re productive. My goal is still to get to a single set of PO files, but for now, this is what we’ve got. Bring the pain.

For a while, at least. We’re working on a new version of the chooser driven by our the license RDF. This will be better for re-use, but not really an improvement in this area.

This works great in English, but in languages where gender is more strongly expressed in the word forms, uh, not so much.

date:2010-03-01 23:21:20
category:cc, development
tags:cc, i18n, license engine, zope

Unicode output from Zope 3

The Creative Commons licene engine has gone through several iterations, the most recent being a Zope 3 / Grok application. This has actually been a great implementation for us[1]_, but since the day it was deployed there’s been a warning in `README.txt <http://code.creativecommons.org/svnroot/cc.engine/trunk/README.txt>`_:

If you get a UnicodeDecodeError from the cc.engine (you’ll see this if it’srunning in the foreground) when you try to access the http://host:9080/license/then it’s likely that the install of python you are using is set to use ASCIIas it’s default output.  You can change this to UTF-8 by creating the file/usr/lib/python<version>/sitecustomize.py and adding these lines:

  import sys

This always struck me as a bit inelegant — having to muck with something outside my application directory. After all, this belief that the application should be self-contained is the reason I use zc.buildout and share Jim’s belief in the evil of the system Python. Like a lot of inelegant things, though, it never rose quite to the level of annoyance needed to motivate me to do it right.

Today I was working on moving the license engine to a different server[2]_ and ran into this problem again. I decided to dig in and see if I could track it down. In fact I did track down the initial problem — I was making a comparison between an encoded Unicode string and without specifying an explicit codec to use for the decode. Unfortunately once I fixed that I found it was turtles all the way down.

Turns out the default Zope 3 page template machinery uses `StringIO <http://www.python.org/doc/lib/module-StringIO.html>`_ to collect the output. StringIO uses, uh, strings — strings with the default system encoding. Reading the module documentation, it would appear that mixing String and Unicode input in your StringIO will cause this sort of issue.

Andres suggested marking my templates as UTF-8 XML using something like:

< ?xml version="1.0" encoding="UTF-8" ?>

but even after doing this and fixing the resulting entity errors, there’s still obviously some 8 bit Strings leaking into the output. In conversations on IRC the question was then asked: “is there a reason you don’t want a reasonable system wide encoding if your locale can support it?”

I guess not[3]_.

UPDATE Martijn has a tangentially related post which sheds some light on why Python does/should ship with ascii as the default codec. At least people smarter than me have problems with this sort of thing, too.

[1]Yes, I may be a bit biased — I wrote the Zope3/Grok implementation. Of course, I wrote the previous implementation, too, and I can say without a doubt it was… “sub-optimal”.
[2]We’re doing a lot of shuffling lately to complete a 32 to 64 bit conversion; see the CC Labs blog post for the harrowing details.
[3]So the warning remains.
date:2008-07-19 12:57:33
category:cc, development
tags:cc, development, license engine, python, zope