Narrow Python

2007-01-03 • Python

I needed to investigate character code points beyond the Unicode basic multilingual plane. As usual Python was the tool I reached for first — well, not quite first, since I’d already leafed through the introductory sections of the Unicode book, in which I noticed the following encouraging words from Python’s inventor:

“Modern programs must handle Unicode — Python has excellent support for Unicode, and will keep getting better.” — Guido van Rossum

I’m not sure I can fully agree with the excellent support bit of this quotation: in this case, I had to put in the batteries myself.

Legacy Systems

Incidentally, I agree with the BDFL and the many others who are on record as saying that Unicode is both necessary and great. It’s just a shame it didn’t happen sooner, because we now have any number of legacy systems which make a poor fist of things, with C++’s built-in wchar_t being a typically half-baked solution.

(Questions:

Is a wchar_t suitable for Unicode characters?
Can a std::wstring help us write international applications in a portable way?
What’s the best way to handle text data in a C++ program?

Answers:

Maybe.
Probably not.
Watch this space.)

Narrow builds

As I write this, the Python installed on my machine — and indeed on all the machines I have access to — behaves as follows:

narrow python problems

>>> help(unichr)
Help on built-in function unichr in module __builtin__:

unichr(...)
    unichr(i) -> Unicode character

    Return a Unicode string of one character with
    ordinal i; 0 <= i <= 0x10ffff.

>>> uc = unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

The error message suggested I’d need to rebuild Python to make it behave. I didn’t really want to do this, so I tried a bit of googling in case it found me a more favourable answer (it didn’t). I had a closer look at what python -h had to tell me in case I could supply a -wide option (I couldn’t). I even wondered if the pythonw which lives alongside python might be just what I wanted (it wasn’t).
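For what it's worth, a narrow build can still hold a code point beyond the BMP as a UTF-16 surrogate pair, so a small helper can stand in for unichr without a rebuild. The following is only a sketch under that assumption: the names is_narrow_build, surrogate_pair and wide_unichr are hypothetical, not part of Python, and on a narrow build the returned string has length 2 rather than 1.

```python
import sys

try:
    unichr            # Python 2 spelling
except NameError:
    unichr = chr      # Python 3: chr() already covers the full range


def is_narrow_build():
    """True when sys.maxunicode says the build stops at the BMP."""
    return sys.maxunicode == 0xFFFF


def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above 0xFFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def wide_unichr(code_point):
    """Like unichr(), but falls back to a surrogate pair on a narrow build."""
    if code_point > 0xFFFF and is_narrow_build():
        high, low = surrogate_pair(code_point)
        return unichr(high) + unichr(low)
    return unichr(code_point)
```

On a wide build wide_unichr simply defers to unichr, so the same call works on either kind of interpreter.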

Wide builds

Finally, I decided I’d have to rebuild Python after all. For the record, here’s what you do:

Building a wide version of Python

$ tar xjf Python-2.5.tar.bz2
$ cd Python-2.5
$ ./configure --enable-unicode=ucs4 && make

No, it’s not that hard, once you’ve read PEP 261, which explains the configure options. You’ll have to work out for yourself if and where you want to install this new and flabby version of Python, which — shock!, horror!! — doubles the memory used for most Unicode strings.
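The size of that trade-off can be estimated without rebuilding anything. A narrow build stores two bytes per code unit and a wide build four bytes per code point, so encoding a string as UTF-16 and as UTF-32 gives a rough picture of each. This is a sketch only: storage_bytes is a hypothetical helper, and the figures ignore per-object overhead.

```python
def storage_bytes(text):
    """Approximate payload size of text on a narrow and on a wide build."""
    narrow = len(text.encode("utf-16-le"))  # 2 bytes per code unit, 4 per surrogate pair
    wide = len(text.encode("utf-32-le"))    # always 4 bytes per code point
    return narrow, wide


print(storage_bytes(u"hello"))        # (10, 20): BMP-only text doubles in size
print(storage_bytes(u"\U0001D11E"))   # (4, 4): beyond the BMP the cost is the same
```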

Loss of power

For once, I’m disappointed in Python. The default build provides weakened support for Unicode, which in some ways is worse than no support for Unicode. Why? Because the language appears to support Unicode, but is likely to let you down if you ever venture past the safe region of the basic multilingual plane — the kind of unwelcome surprise which experienced programmers rightly fear. And because the behaviour you see on a wide build differs from the behaviour you get on a narrow build. Worse again, Python is perfectly able to support the full Unicode standard, if you’re prepared to trade in a bit of memory for compliance. This is the kind of trade-off Python users are usually more than happy to accept. If and when they need to get closer to the silicon, they just use C.
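To make the difference concrete: on a narrow build len() and indexing count UTF-16 code units, so a single character beyond the BMP reports a length of 2, whereas a wide build reports 1. Code which has to give the same answer on both could walk the string and pair up surrogates itself. Here is a minimal sketch of that idea; code_points is a hypothetical helper, and unpaired surrogates are passed through unchanged rather than treated as errors.

```python
def code_points(text):
    """Yield integer code points, pairing up UTF-16 surrogates where present."""
    pending_high = None
    for ch in text:
        unit = ord(ch)
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # High and low surrogate together encode one code point.
                yield 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                pending_high = None
                continue
            yield pending_high  # unpaired high surrogate, passed through as-is
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        else:
            yield unit
    if pending_high is not None:
        yield pending_high


clef = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF
print(len(list(code_points(clef))))  # 1 on both narrow and wide builds
```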

Hope

Of course Guido van Rossum did say:

“… Python has excellent support for Unicode, and will keep getting better.”

(Emphasis mine). This looks like one particular area where support could be better. I’ve seen hints that C++0X (which may well end up becoming C++1X) will place improved Unicode support into the standard language, but I’d bet Python will stay ahead by a comfortable margin.

Feedback

Jez 2007-01-05

This is so intriguing!

Guido has many splendid qualities, but identifying (or at least admitting to) weaknesses in Python isn't one of them. That said, I don't think any language that I've used supports Unicode in an "excellent" way, certainly not outside the BMP.

I'm extremely curious as to what you were doing out beyond the BMP. Student of ancient history perhaps? Or merely curious?

I'll be watching for whatever you propose for text handling in C++ too. Years ago, I was involved in proposing a book with almost exactly that title - "International text handling in C++". It didn't proceed, which is probably a good thing because I'm pretty sure now I'd have completely ballsed it up. I've pursued this subject with vigour since then, but it's still something I'm interested in. If you've got code, I'd love to see it.

Thomas Guest 2007-01-05

“Guido has many splendid qualities, but identifying (or at least admitting to) weaknesses in Python isn't one of them.”

I disagree. The whole Python 3000 thing is about cleaning up design errors before they become entrenched.

“That said, I don't think any language that I've used supports Unicode in an "excellent" way, certainly not outside the BMP.”

I've never had to explore there before, so I couldn't really comment. What I saw of Java looked very good, though. Surely a newer language such as C# also gets things right?

“I'll be watching for whatever you propose for text handling in C++ too. Years ago, I was involved in proposing a book with almost exactly that title - "International text handling in C++". It didn't proceed, which is probably a good thing because I'm pretty sure now I'd have completely ballsed it up. I've pursued this subject with vigour since then, but it's still something I'm interested in. If you've got code, I'd love to see it.”

Sorry, I have no proposal and no code I can show. I just recall reading somewhere -- probably the CVu standards report -- that improved support for Unicode was on the C++0X agenda.

Jez 2007-01-06

Ah, I've misunderstood - I thought you had some text processing ace up your sleeve :)

Thomas Guest 2007-01-06

Apologies for raising your hopes. When I wrote "watch this space" I should really have said "wait and see what the C++ standards committee come up with".

Actually, Python's support for Unicode is starting to grow on me, though I still think the variant builds are unfortunate. Support for various other text encodings is extremely good.

My general recommendation for text conversion utilities and similar in C++ is to use Python to generate the required C++. Here's a simple example of what I mean.

Markus Guest 2007-03-16

At least the Python version shipped with Debian testing seems to be a wide build.

Word Aligned

space sensitive programming
