I have to stay up a little later this evening than I’d planned, so
I’m finally going through all the tabs and browser windows I’ve
had open on my personal laptop. I think some of these have been “open”
for months (yes, there have been browser restarts, but they’re always
there when the session restores). One that I’ve meant to blog about is Wil
Clouser’s post on string substitution in .po files.
It’s actually [at least] his second post on the subject, recanting his
prior advice and coming around to what others told him previously: don’t use
substitution strings in .po files.
I wasn’t aware of Chris’s previous advice, but had I read it when first
published, I would have nodded my head vigorously; after all, that’s how
we did it. Er, that’s how we, uh, do it. And we’re not really in a
position to change that at the moment, although we’ve certainly looked
pretty hard at the issue.
A bit of background: One of the core pieces of technology we’ve built at
Creative Commons is the license chooser. It’s a relatively simple
application, with a few wrinkles that make it interesting. It manages a
lot of requests, a lot of languages, and has to spit out the right
license (type, version, and jurisdiction) based on what the user
provides. The really interesting thing it generates is some XHTML with
RDFa that includes the license badge, name, and any additional
information the user gives us; it’s this metadata that we use to
generate the copy and paste attribution HTML on the deed. So what does
this have to do with internationalization?
The HTML is internationalized. And it contains substitutions. Yikes.
To follow the excellent example of AMO and GNOME, we’d start using
English as our msgids and leave the symbolic keys behind.
Unfortunately it’s not quite so easy. Every time we look at
this issue (and for my first year as CTO we really looked;
Asheesh can attest we looked at it again and
again) and think we’ve got it figured out, we realize there’s another
corner case that doesn’t quite work.
The real issue with the HTML is the HTML:
zope.i18n, our XSLT selectors†, the ZPT parse tree: none of them really
play all that well with HTML msgids. The obvious solution would be to get
rid of the HTML in translation, and we’ve tried doing that, although we
keep coming back to our current approach. I guess we’re always seduced
by keeping all the substitution in one place, and traumatized by the
time we tried assembling the sentences from smaller pieces‡.
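To make the trade-off concrete, here’s a minimal sketch in Python. It uses the standard library’s `string.Template` as a stand-in for `${name}`-style interpolation; the catalog contents, the key name `license.work_licensed`, the helper names, and the Spanish strings are all made up for illustration and are not taken from our actual PO files.

```python
from string import Template

# Hypothetical catalog keyed by a symbolic msgid. The translator sees the
# whole sentence, HTML and placeholders included, and can reorder freely.
CATALOG_ES = {
    "license.work_licensed": Template(
        'Esta obra está bajo una <a rel="license" href="${license_url}">'
        "licencia ${license_name}</a>."
    ),
}


def render(catalog, msgid, **mapping):
    """Look up a whole-sentence message and fill in its placeholders."""
    return catalog[msgid].substitute(mapping)


# The alternative: translate small fragments and glue them together.
# "This" has to be translated once, without knowing the noun's gender.
PIECES_ES = {"This": "Este", "video": "video", "image": "imagen"}


def assemble(pieces, noun):
    """Join separately translated fragments in English word order."""
    return f"{pieces['This']} {pieces[noun]}"


html = render(
    CATALOG_ES,
    "license.work_licensed",
    license_url="https://creativecommons.org/licenses/by/3.0/",
    license_name="Atribución 3.0",
)
print(html)
print(assemble(PIECES_ES, "video"))  # Este video
print(assemble(PIECES_ES, "image"))  # Este imagen (should be "Esta imagen")
```

Keeping the substitution in one message lets each language handle agreement and word order itself, at the price of putting HTML in front of the translator.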
So if we accept that we’re stuck with the symbolic identifiers, what do
we do? Build tools, of course. This wasn’t actually an issue until we
started using a “real” translation tool: Pootle, to be
specific. Pootle is pretty powerful, but some of its features depend on
having “English” msgids. Luckily it has no qualms about HTML in
those msgids, it has decent VCS support, and we
know how to write post-commit hooks.
To support Pootle and provide a better experience for our translators,
we maintain two sets of PO files: the “CC style” symbolic msgid
files, and the “normal” English msgid files. We keep a separate
“master” PO file where the msgid is the “CC style” msgid, and the
“translation” is the English msgid. It’s this file that we update
when we need to make changes, and luckily using that format actually
makes the extraction work the way it’s supposed to. Or close. And when a
user commits their work from Pootle (to the “normal” PO file), a
post-commit hook keeps the other version in sync.
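The re-keying at the heart of that sync can be sketched roughly as follows. This is not our actual hook: it models each PO file as a plain Python dict (msgid to msgstr) instead of parsing real files, and the function names and sample msgids are hypothetical.

```python
def to_symbolic(master, english_catalog):
    """Re-key an English-msgid catalog by symbolic msgid.

    master maps symbolic msgid -> English text (the "master" PO file).
    english_catalog maps English text -> translation (what Pootle edits).
    Untranslated entries come back as empty strings, as in a PO file.
    """
    return {
        key: english_catalog.get(english, "")
        for key, english in master.items()
    }


def to_english(master, symbolic_catalog):
    """Go the other way: symbolic-msgid catalog -> English-msgid catalog."""
    return {
        english: symbolic_catalog.get(key, "")
        for key, english in master.items()
    }


# Hypothetical sample data.
master = {
    "util.view_legal_code": "View Legal Code",
    "license.work_type": "Format of your work",
}
pootle_es = {"View Legal Code": "Ver el código legal"}

symbolic_es = to_symbolic(master, pootle_es)
print(symbolic_es["util.view_legal_code"])  # Ver el código legal
print(repr(symbolic_es["license.work_type"]))  # ''
```

One limitation of this sketch: if two symbolic msgids share the same English text, they collapse into a single entry in the English-keyed form, so a real tool would need some way to tell them apart.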
While we’ve gotten a lot better at this and have learned to live with
this system, it’s far from perfect. The biggest imperfection is its
custom nature: I’m still the “expert”, so when things go wrong, I get
called first. And when people want to work on the code, it takes some
extra indoctrination before they’re productive. My goal is still to get
to a single set of PO files, but for now, this is what we’ve got. Bring
the pain.
† For a while, at least. We’re working on a new version of the
chooser driven by the license RDF. This will be better for re-use,
but not really an improvement in this area.
‡ This works great in English, but in languages where gender is
more strongly expressed in the word forms, uh, not so much.
date: | 2010-03-01 23:21:20 |
wordpress_id: | 1501 |
layout: | post |
slug: | i18n-html-bring-the-pain |
comments: | |
category: | cc, development |
tags: | cc, i18n, license engine, zope |