Narrow Python
I needed to investigate character code points beyond the Unicode basic multilingual plane. As usual Python was the tool I reached for first — well, not quite first, since I’d already leafed through the introductory sections of the Unicode book, in which I noticed the following encouraging words from Python’s inventor:
“Modern programs must handle Unicode — Python has excellent support for Unicode, and will keep getting better.” — Guido van Rossum
I’m not sure I can fully agree with the excellent support bit of this quotation: in this case, I had to put in the batteries myself.
Legacy Systems
Incidentally, I agree with the BDFL and the many others who are on record as saying that Unicode is both necessary and great. It’s just a shame it didn’t happen sooner, because we now have any number of legacy systems which make a poor fist of things, C++’s built-in wchar_t being a typically half-baked solution.
(Questions:
- Is a wchar_t suitable for Unicode characters?
- Can a std::wstring help us write international applications in a portable way?
- What’s the best way to handle text data in a C++ program?
Answers:
- Maybe.
- Probably not.
- Watch this space.)
Narrow builds
As I write this, the Python installed on my machine — and indeed on all the machines I have access to — behaves as follows:
>>> help(unichr)
Help on built-in function unichr in module __builtin__:

unichr(...)
    unichr(i) -> Unicode character

    Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.

>>> uc = unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
The error message suggested I’d need to rebuild Python to make it behave. I didn’t really want to do this, so I tried a bit of googling in case it found me a more favourable answer (it didn’t). I had a closer look at what python -h had to tell me in case I could supply a -wide option (I couldn’t). I even wondered if the pythonw which lives alongside python might be just what I wanted (it wasn’t).
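For what it's worth, there is a quicker way to find out which kind of build you have than provoking a ValueError: sys.maxunicode reports the largest code point a single character can hold. A minimal sketch follows; the helper name is_narrow_build is made up for illustration and is not part of Python.

```python
import sys

def is_narrow_build():
    """Return True if this interpreter stores Unicode as 16-bit code units."""
    # Narrow (UCS-2) builds report 0xFFFF; wide (UCS-4) builds, and every
    # Python from 3.3 onwards, report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

print("narrow" if is_narrow_build() else "wide")
```

This only diagnoses the problem, of course. It does nothing to make a narrow build accept unichr(0x10000).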
Wide builds
Finally, I decided I’d have to rebuild Python after all. For the record, here’s what you do:
$ tar xjf Python-2.5.tar.bz2
$ cd Python-2.5
$ ./configure --enable-unicode=ucs4 && make
No, it’s not that hard, once you’ve read PEP 261, which explains the configure options. You’ll have to work out for yourself if and where you want to install this new and flabby version of Python, which (shock!, horror!!) doubles the memory used for most Unicode strings.
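To get a rough feel for that doubling, you can compare the size of the same text encoded as UTF-16 and as UTF-32, which approximate the internal storage of narrow and wide builds respectively. This is only a sketch: it ignores per-object overhead, it assumes Python 2.6 or later (where the utf-32 codecs exist), and the helper name storage_bytes is hypothetical.

```python
def storage_bytes(text):
    """Return (narrow, wide) payload sizes in bytes for a Unicode string."""
    narrow = len(text.encode("utf-16-le"))  # 2 bytes per UTF-16 code unit
    wide = len(text.encode("utf-32-le"))    # 4 bytes per code point
    return narrow, wide

print(storage_bytes(u"hello"))  # (10, 20)
```

Text made up entirely of characters beyond the BMP would cost the same either way, since each one needs two 16-bit code units on a narrow build.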
Loss of power
For once, I’m disappointed in Python. The default build provides weakened support for Unicode, which in some ways is worse than no support for Unicode. Why? Because the language appears to support Unicode, but is likely to let you down if you ever venture past the safe region of the basic multilingual plane — the kind of unwelcome surprise which experienced programmers rightly fear. And because the behaviour you see on a wide build differs from the behaviour you get on a narrow build. Worse again, Python is perfectly able to support the full Unicode standard, if you’re prepared to trade in a bit of memory for compliance. This is the kind of trade-off Python users are usually more than happy to accept. If and when they need to get closer to the silicon, they just use C.
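To make the narrow/wide difference concrete: a narrow build represents a code point beyond the BMP as a UTF-16 surrogate pair, so such a character has length 2 there and length 1 on a wide build. The arithmetic below is standard UTF-16; the function name to_surrogate_pair is a hypothetical one chosen for this sketch.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x10000)
print("%04X %04X" % (high, low))  # D800 DC00
```

On a narrow build, len(u'\U00010000') is 2 because the literal is stored as exactly this pair; on a wide build it is 1. Any code that indexes, slices or counts characters can therefore give different answers on the two builds.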
Hope
Of course Guido van Rossum did say:
“… Python has excellent support for Unicode, and will keep getting better.”
(Emphasis mine). This looks like one particular area where support could be better. I’ve seen hints that C++0X (which may well end up becoming C++1X) will place improved Unicode support into the standard language, but I’d bet Python will stay ahead by a comfortable margin.
Feedback
-
This is so intriguing!
Guido has many splendid qualities, but identifying (or at least admitting) to weaknesses in Python isn't one of them. That said, I don't think any language that I've used supports Unicode in an "excellent" way, certainly not outside the BMP.
I'm extremely curious as to what you were doing out beyond the BMP. Student of ancient history perhaps? Or merely curious?
I'll be watching for whatever you propose for text handling in C++ too. Years ago, I was involved in proposing a book with almost exactly that title - "International text handling in C++". It didn't proceed, which is probably a good thing because I'm pretty sure now I'd have completely ballsed it up. I've pursued this subject with vigour since then, but it's still something I'm interested in. If you've got code, I'd love to see it.
-
“Guido has many splendid qualities, but identifying (or at least admitting) to weaknesses in Python isn't one of them.”
I disagree. The whole Python 3000 thing is about cleaning up design errors before they become entrenched.
“That said, I don't think any language that I've used supports Unicode in an "excellent" way, certainly not outside the BMP.”
I've never had to explore there before, so I couldn't really comment. What I saw of Java looked very good, though. Surely a newer language such as C# also gets things right?
“I'll be watching for whatever you propose for text handling in C++ too. Years ago, I was involved in proposing a book with almost exactly that title - "International text handling in C++". It didn't proceed, which is probably a good thing because I'm pretty sure now I'd have completely ballsed it up. I've pursued this subject with vigour since then, but it's still something I'm interested in. If you've got code, I'd love to see it.”
Sorry, I have no proposal and no code I can show. I just recall reading somewhere -- probably the CVu standards report -- that improved support for Unicode was on the C++0X agenda.
-
Ah, I've misunderstood - I thought you had some text processing ace up your sleeve :)
-
Apologies for raising your hopes. When I wrote "watch this space" I should really have said "wait and see what the C++ standards committee come up with".
Actually, Python's support for Unicode is starting to grow on me, though I still think the variant builds are unfortunate. Support for various other text encodings is extremely good.
My general recommendation for text conversion utilities and similar in C++ is to use Python to generate the required C++. Here's a simple example of what I mean.
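A minimal sketch of that approach, assuming the thing wanted is a lookup table from Latin-1 bytes to UTF-8 strings. The function name generate_latin1_table and the C++ table name latin1_to_utf8 are hypothetical, and the code assumes Python 2.6 or later for bytearray.

```python
def generate_latin1_table(name="latin1_to_utf8"):
    """Return C++ source for a table mapping Latin-1 bytes to UTF-8 strings."""
    lines = [
        "// Generated by a Python script: do not edit by hand.",
        "static const char * const %s[256] = {" % name,
    ]
    for byte in range(256):
        # Latin-1 byte values coincide with code points U+0000 to U+00FF.
        utf8 = bytes(bytearray([byte])).decode("latin-1").encode("utf-8")
        # Emit every byte as a hex escape so the C++ source stays pure ASCII.
        escaped = "".join("\\x%02x" % b for b in bytearray(utf8))
        lines.append('    "%s",  // U+%04X' % (escaped, byte))
    lines.append("};")
    return "\n".join(lines)

print(generate_latin1_table())
```

Python's codecs do the conversion work, and the C++ side only has to index into a constant array.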
-
At least the Python version shipped with Debian testing seems to be a wide build.