Narrow Python

2007-01-03

I needed to investigate character code points beyond the Unicode basic multilingual plane. As usual Python was the tool I reached for first — well, not quite first, since I’d already leafed through the introductory sections of the Unicode book, in which I noticed the following encouraging words from Python’s inventor:

“Modern programs must handle Unicode — Python has excellent support for Unicode, and will keep getting better.” — Guido van Rossum

I’m not sure I can fully agree with the excellent support bit of this quotation: in this case, I had to put in the batteries myself.

Legacy Systems

Incidentally, I agree with the BDFL and the many others who are on record as saying that Unicode is both necessary and great. It’s just a shame it didn’t happen sooner, because we now have any number of legacy systems which make a poor fist of things, with C++’s built-in wchar_t being a typically half-baked solution.

(Questions:

  • Is a wchar_t suitable for Unicode characters?
  • Can a std::wstring help us write international applications in a portable way?
  • What’s the best way to handle text data in a C++ program?

Answers:

  • Maybe.
  • Probably not.
  • Watch this space.)

Narrow builds

As I write this, the Python installed on my machine — and indeed on all the machines I have access to — behaves as follows:

narrow python problems
>>> help(unichr)
Help on built-in function unichr in module __builtin__:

unichr(...)
    unichr(i) -> Unicode character

    Return a Unicode string of one character with
    ordinal i; 0 <= i <= 0x10ffff.

>>> uc = unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

The error message suggested I’d need to rebuild Python to make it behave. I didn’t really want to do this, so I tried a bit of googling in case it found me a more favourable answer (it didn’t). I had a closer look at what python -h had to tell me in case I could supply a -wide option (I couldn’t). I even wondered if the pythonw which lives alongside python might be just what I wanted (it wasn’t).
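For reference, a narrow build stores its strings as 16-bit code units, so a code point beyond the basic multilingual plane has to be represented as a UTF-16 surrogate pair. The sketch below shows only that arithmetic; the function names are my own invention for illustration and are not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print("%04X %04X" % (high, low))
```

Joining the two halves back together recovers the code point, which is handy as a sanity check on the bit-twiddling.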

Wide builds

Finally, I decided I’d have to rebuild Python after all. For the record, here’s what you do:

Building a wide version of Python
$ tar xjf Python-2.5.tar.bz2
$ cd Python-2.5
$ ./configure --enable-unicode=ucs4 && make

No, it’s not that hard, once you’ve read PEP 261, which explains the configure options. You’ll have to work out for yourself if and where you want to install this new and flabby version of Python, which — shock!, horror!! — doubles the memory used for most Unicode strings.
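To check which kind of interpreter you ended up with, sys.maxunicode is the thing to look at: it is 0xFFFF on a narrow build and 0x10FFFF on a wide one. The sketch below wraps that check and does the back-of-envelope memory sum; build_width and character_data_bytes are hypothetical helper names, and the byte counts cover the character data only, not the string object's overhead.

```python
import sys

# Bytes per stored code unit: UCS-2 on a narrow build, UCS-4 on a wide one.
BYTES_PER_UNIT = {"narrow": 2, "wide": 4}


def build_width():
    """Report 'narrow' or 'wide' for the running interpreter."""
    # An interpreter with no narrow/wide distinction reports 0x10FFFF here.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"


def character_data_bytes(n_code_units, width):
    """Rough size of the character data alone, ignoring object overhead."""
    return n_code_units * BYTES_PER_UNIT[width]


print(build_width())
print("%d %d" % (character_data_bytes(1000, "narrow"),
                 character_data_bytes(1000, "wide")))
```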

Loss of power

For once, I’m disappointed in Python. The default build provides weakened support for Unicode, which in some ways is worse than no support for Unicode. Why? Because the language appears to support Unicode, but is likely to let you down if you ever venture past the safe region of the basic multilingual plane — the kind of unwelcome surprise which experienced programmers rightly fear. And because the behaviour you see on a wide build differs from the behaviour you get on a narrow build. Worse again, Python is perfectly able to support the full Unicode standard, if you’re prepared to trade in a bit of memory for compliance. This is the kind of trade-off Python users are usually more than happy to accept. If and when they need to get closer to the silicon, they just use C.
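The difference in behaviour is easy to show with len(). On a narrow build a character beyond the basic multilingual plane counts as two, because len() counts 16-bit code units, while on a wide build it counts as one. The sketch below simulates both answers on any build by counting encoded units; the helper names are assumptions of mine, not library functions.

```python
def utf16_code_units(text):
    """Count UTF-16 code units, which is what len() reports on a narrow build."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Count whole code points, which is what len() reports on a wide build."""
    return len(text.encode("utf-32-le")) // 4


clef = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF, beyond the BMP

print(utf16_code_units(clef))
print(code_points(clef))
```

Slicing and indexing disagree in the same way, so code which walks a string character by character can split a surrogate pair on one build and not on the other.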

Hope

Of course Guido van Rossum did say:

“… Python has excellent support for Unicode, and will keep getting better.”

(Emphasis mine, on “will keep getting better”.) This looks like one particular area where support could be better. I’ve seen hints that C++0X (which may well end up becoming C++1X) will place improved Unicode support into the standard language, but I’d bet Python will stay ahead by a comfortable margin.
