Tony Hoare’s vision, car crashes, and Alan Turing
Yesterday afternoon, at Europython 2009, David Jones addressed the subject “What sucks about Python?” Despite this provocative title, David Jones had lots of good things to say about Python, and the two topics which really roused the audience (the global interpreter lock and the over-crowded Python packaging space) had more to do with the Python-the-platform than Python-the-language. He also failed to mention the thing I miss most when working with Python: via Peter Norvig’s Python IAQ, quoting Bjarne Stroustrup:
“If I were to design a language from scratch, I would follow the Algol68 path and make every statement and declaration an expression that yields a value.”
Of course what really matters to an engineer is Python-the-platform rather than Python-the-language. Python famously comes with batteries included, and, stretching the metaphor, it also excels at integrating with the other batteries used in modern software applications. The packaging confusion is a side-effect of Python-the-platform’s success. More on over-stretched batteries later …
The day had started with Bruce Eckel’s keynote on Software Archeology. Bruce Eckel is a relaxed and engaging speaker but I found his presentation rather flimsy. Its substance could (and did!) fit on to a couple of David Jones’ slides and the remainder dwelt a little too long on Bruce Eckel, his Java and C++ books, blogs, and www.mindviewinc.com.
During the rest of the day, I had a choice of 4 or 5 different presentations at any one time, generally on the subject of Python modules or frameworks. The common theme I took away is that people turn to Python to get things done, and they’re reluctant to turn back. As the afternoon drew on everyone regathered in the Adrian Boult Hall to listen to a brilliant series of lightning talks which developed on this same theme. We also heard a wonderful short story about how best to wreck cars and software systems.
Partitioning with Python
Sums and Splits
On the subject of hunting for eodermdromes, here are a couple of semi-related partitioning problems.
- for a positive integer, N, find the positive integer sequences which sum to N
- for a sequence, S, find the distinct partitions of that sequence
As an example of the first, the 16 distinct integer sequences which sum to 5 are:
5 4 + 1 3 + 1 + 1 3 + 2 2 + 1 + 2 2 + 1 + 1 + 1 2 + 2 + 1 2 + 3 1 + 1 + 3 1 + 1 + 2 + 1 1 + 1 + 1 + 1 + 1 1 + 1 + 1 + 2 1 + 2 + 2 1 + 2 + 1 + 1 1 + 3 + 1 1 + 4
and of the second, the 8 distinct ways of partitioning the sequence ABCD are:
ABCD A BCD AB CD ABC D A B CD A BC D AB C D A B C D
Note that I’ve counted 2 + 1 + 2, 2 + 2 + 1, and 1 + 2 + 2 as distinct sums totalling 5. That happens to be the formulation of the problem which interested me.
Oulipo and the Eodermdrome challenge
SHOES ON HENS
SAMSON MOANS
DRAB RED BEAD
Oulipo
At the Mathematics and Fiction workshop held last weekend in Oxford I particularly enjoyed David Bellos’ wonderful talk about Oulipo, the world’s longest running literary movement. The Oulipo is a group of writers interested in exploring the application of mathematical structures, patterns and algorithms to writing.
As an example, poet and novelist Raymond Queneau unleashed the exponential power of combinatorics to write a small book of sonnets which he hadn’t finished reading himself!
Run-length encoding in Python
Recently I discussed run-length encoding and DEFLATE compression. I never actually showed a Python implementation of a run-length encoder, so here’s one now.
import itertools as its
def ilen(it):
'''Return the length of an iterable.
>>> ilen(range(7))
7
'''
return sum(1 for _ in it)
def runlength_enc(xs):
'''Return a run-length encoded version of the stream, xs.
The resulting stream consists of (count, x) pairs.
>>> ys = runlength_enc('AAABBCCC')
>>> next(ys)
(3, 'A')
>>> list(ys)
[(2, 'B'), (3, 'C')]
'''
return ((ilen(gp), x) for x, gp in its.groupby(xs))
The decoder is equally simple. Itertools.repeat expands a (count, value) pair into an iterable which will generate count elements. Itertools.chain flattens these iterables into a single stream.
DEFLATE: run-length encoding, but better
Run-length encoding
Run-length encoding is a simple compression scheme in which runs of equal values are represented by the value and a repeat count. For example, a supermarket cashier might process this line of shopping
as
- 4 bananas
- 3 apples
- 2 bananas
- 1 pineapple
- 3 apples
Unix packs in its very own run length encoder, uniq -c. It works just fine — so long as the values you want to encode are newline separated byte strings, that is.
Let’s use a sequence of coin tosses as an example stream. $RANDOM generates random numbers. and we use the least significant bit of these numbers as an index into an array containing the values heads, tails.
$ HT=(heads tails)
$ toss() { echo ${HT[$RANDOM&1]}; }
$ toss; toss; toss
heads
tails
tails
$ tosses() { while [ 1 ]; do toss; done; }
$ tosses | head
tails
tails
tails
heads
tails
heads
heads
heads
tails
tails
Copy, load, redirect and tee using C++ streambufs
out << std::setw(3) << place << ". "
<< "Name " << name
<< ", Score "
<< std::fixed << std::setprecision(2)
<< score << '\n';
fprintf(out, "%3d. Name %s, Score %.2f\n",
place, name, score);
Cout or printf?
Towards the end of an interview I was asked:
“Cout or printf?”
The question made me smile. It signalled that I’d done well enough in the technical section and now we’d be moving on to a more relaxed discussion. Similar tailing-off questions include:
- Which editor do you use?
- Tabs or spaces?
- Are singletons evil?
(Come to think of it, the correct answer to the cout vs. printf question is probably “No!” Both provide global access to the standard output stream, and globals, like singletons, are dangerous. Parametrise from above, pass down ostreams or FILE * handles as needed. Do the right thing! Anyway…)
I answered something suitably vague. The interview moved on. The truth is, the C++ iostream library may be technically superior to its C predecessor — safer to use, easier to extend — but it’s failed to supplant it. Does anyone truly find all those chained shift operators and manipulators easier to read than a succinct format string? Does the char, wchar_t, <your-char> dimension really help? Even Nicolai Josuttis’ excellent book, The C++ Standard Library: A Tutorial and Reference, shies away:
This chapter does not attempt to discuss all aspects of the IOStream library in detail; to do that would take an entire book by itself.
Exposed buffers
I’m not the only person to find iostream formatting somewhat clunky, but input/output is more than just formatting. There’s also buffering and synchronising, for example. One under-appreciated feature of the iostream library is that the low-level read and write operations are delegated to separate stream buffer objects.
C++ streams allow direct access to their underlying buffers. You can customise these buffers. You can swap them around. In some ways, C++ goes in at a lower level than C. Poke around in the streambuf class, and you’ll find the member function names even sound like assembler instructions: egptr, xsputn, pbump, epptr, for example.
The remainder of this article works through some examples which use std::streambufs to copy, load, redirect and tee streams.
Generic documentation
/** * @brief This does what you think it does. * @param a A thing of arbitrary type. * @param b Another thing of arbitrary type. .... */
The Rings of Saturn
On board The Nellie waiting for the turn of the tide, Marlow entertains his companions with the tale of a mission he undertook as an employee of a Belgian trading company, traveling up the Congo river. This entertainment, the framed story in Joseph Conrad’s Heart of Darkness, quickly becomes very dark indeed, as Marlow presses on through disease, destruction and death, increasingly preoccupied with tracing the elusive Mr Kurtz, until eventually, when it seems he can go no further, he comes to a clearing. Looking through his glasses he sees the slope of a hill on which stands a house. There was no enclosure or fence of any kind, says Marlow, but there had been one apparently, for near the house half-a-dozen slim posts remained in a row, roughly trimmed, and with their upper ends ornamented with round carved balls. The occupant of this house, a harlequin figure, a jester at the court of King Kurtz if you like, also appears in Terry Jones’ fluidinfo blog, and it was this blog article which sent me back to Conrad, his character Marlow, and his terrible journey.
Software development checklist += 3
Various people better qualified than me have created checklists for healthy software development environments. I won’t be offering my own such list, but I would like to mention a couple of items which deserve a place on it.
- Does your company have a good library? Can anyone order a book for this library, easily?
- Do team members attend conferences? Do they give presentations at conferences?
Review: Expert Python Programming
If you follow Python blogs you’ll have seen more than one review of Tarek Ziadé’s book, Expert Python Programming. As a result of these reviews I’d decided that, although the book looked interesting, I wouldn’t be investing in a copy for myself; but when the publishers contacted me directly with the offer of a free review copy, I accepted.
Patience sort and the Longest increasing subsequence
Play cards and watch TV
In 1999 The Bulletin of the American Mathematical Society published a paper by David Aldous and Persi Diaconis entitled: “Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem”.
In case that sounds heavy going, the authors kick off with a card game.
Take a deck of cards labeled 1, 2, 3, … , n. The deck is shuffled, cards are turned up one at a time and dealt into piles on the table, according to the rule
- A low card may be placed on a higher card (e.g. 2 may be placed on 7), or may be put into a new pile to the right of the existing piles.
At each stage we see the top card on each pile. If the turned up card is higher than the cards showing, then it must be put into a new pile to the right of the others. The object of the game is to finish with as few piles as possible.
And if this still sounds too mathematical (when did you last find a deck of cards labelled 1 through n?) Aldous and Diaconis suggest.
To play with real cards one needs to linearly order the 52 cards, e.g. by putting suits in the bridge-bidding order ♣ ♦ ♥ ♠. This mindless form of solitaire is then quite playable, perhaps while watching television.
As a target they recommend aiming for 9 piles, which, combined with an optimal strategy, gives you roughly a 5% chance of winning. In what follows I’ll also be adopting the bridge convention that Aces are high.
A greedy strategy turns out to be optimal — otherwise you’d probably need to switch off the TV in order to concentrate. What’s more, this same strategy discovers the longest increasing subsequence of cards contained within the shuffled deck.
OCR. Wrong characters, right meaning! (chuckles)
Run this image through the tesseract OCR engine and it gets the characters wrong but the meaning right.
$ curl wordaligned.org/images/chuckles.tif > chuckles.tif $ tesseract chuckles.tif ocr-chuckles && cat ocr-chuckles.txt [HEHE]
At first I assumed I’d cracked an easter egg but now I’m not so sure. Crop to the region of interest and all is well.
$ curl wordaligned.org/images/chuckles-cropped.tif > cropped.tif $ tesseract cropped.tif ocr-cropped && cat ocr-cropped.txt (chuckles)
Just in case you were wondering … the graphic appears in the subtitles of a TV advert featuring Rolf Harris and the Churchill dog. Rolf is the one who’s chuckling.
This oddity happened using tesseract 2.03 built on OS X, untrained. The grayscale images shown on this page are PNGs, not TIFFs because — much to my surprise — browser support for TIFFs is limited. Tesseract only accepts TIFF images, and the file extension has to be .tif.
Good maths, bad computers
Do you like maths and computers? If so, I recommend Mark Chu-Carroll’s Good Math, Bad Math blog. Mathematicians strive for precision and clarity, which creates a natural tension with the more pragmatic world of computer programming. As Donald Knuth famously wrote:
Beware of bugs in the above code; I have only proved it correct, not tried it.
Yesterday Mark Chu-Carroll shared a precise formula for solving that most slippery computing problem of them all — figuring out a project schedule.
[One of my favorite managers] said that when a programmer gives you an estimate of how long something should take, multiply it by two and increase the unit. So if they say it’ll take a day, assume two weeks. If they say a week, assume two months. In my experience, it’s actually a really good predictor.
And if they say it’ll take 5 months to complete a novel digital typesetting program, expect delivery 10 years later…
Certainly Mark Chu-Carroll’s metric is of more practical use than Hofstadter’s self-referential law. I’d like to see it built into project planning tools.
Mathematicians are equally precise about random numbers, a notorious source of computing issues. I also recommend John D. Cook’s blog, The Endeavour, which covers a broad range of mathematical and programming topics. In a recent article he points out why quasi-random numbers can sometimes be better than truly random numbers.
Longest common subsequence
I pinched the idea for this animation from Ned Batchelder’s clever visualisation of Román Cortés’ brilliant CSS Homer Simpson. It depends on Javascript, CSS and a font containing upwards, leftwards and north west arrows. If it doesn’t work in your user agent, my apologies. If you’d like to find out why I’m comparing humans with chimpanzees, read on!
Ordered sublists. A brute force approach
Younger runners
Recently I posed a puzzle:
Starting with a list of runners ordered by their finishing time in a race, select a sublist of runners who are getting younger. What is the longest such sublist?
Below, I’ve highlighted just such a sublist within the list of 8 runners who completed last year’s New York marathon in under two and a quarter hours. As you can see, MARILSON GOMES DO SANTOS, 31, is older and faster than RONO, 30, who is older and faster in turn than ROHATINSKY, 26.
- MARILSON GOMES DOS SANTOS, 31, M, 2:08:43
- ABDERRAHIM GOUMRI, 32, M, 2:09:07
- DANIEL RONO, 30, M, 2:11:22
- PAUL TERGAT, 39, M, 2:13:10
- ABDERRAHIME BOURAMDANE, 30, M, 2:13:33
- ABDI ABDIRAHMAN, 31, M, 2:14:17
- JOSH ROHATINSKY, 26, M, 2:14:23
- JASON LEHMKUHLE, 31, M, 2:14:30
This is a deceptively tricky problem. Even on such a small input list it’s hard to be absolutely sure our solution is optimal. Certainly there are other ordered triples with decreasing ages, but might there be a quartet? And even if we’re convinced Gomes, Rono, Rohatinsky do form a longest ordered sublist, I think it’s already clear that as more runners finish, there may be no longer be a longest ordered sublist which starts with these three.
Dumb computers
Rather than invent a clever strategy to find an optimal solution, why not get a computer to exhaust the possibilities? If we generate all possible sublists then filter out the ones whose age fields decrease, then our answer will be the longest of these.












