Tony Hoare’s vision, car crashes, and Alan Turing

2009-07-02, Comments

Europython Logo

Yesterday afternoon, at Europython 2009, David Jones addressed the subject “What sucks about Python?” Despite this provocative title, David Jones had lots of good things to say about Python, and the two topics which really roused the audience (the global interpreter lock and the over-crowded Python packaging space) had more to do with the Python-the-platform than Python-the-language. He also failed to mention the thing I miss most when working with Python: via Peter Norvig’s Python IAQ, quoting Bjarne Stroustrup:

“If I were to design a language from scratch, I would follow the Algol68 path and make every statement and declaration an expression that yields a value.”

Of course what really matters to an engineer is Python-the-platform rather than Python-the-language. Python famously comes with batteries included, and, stretching the metaphor, it also excels at integrating with the other batteries used in modern software applications. The packaging confusion is a side-effect of Python-the-platform’s success. More on over-stretched batteries later …

The day had started with Bruce Eckel’s keynote on Software Archeology. Bruce Eckel is a relaxed and engaging speaker but I found his presentation rather flimsy. Its substance could (and did!) fit on to a couple of David Jones’ slides and the remainder dwelt a little too long on Bruce Eckel, his Java and C++ books, blogs, and www.mindviewinc.com.

During the rest of the day, I had a choice of 4 or 5 different presentations at any one time, generally on the subject of Python modules or frameworks. The common theme I took away is that people turn to Python to get things done, and they’re reluctant to turn back. As the afternoon drew on everyone regathered in the Adrian Boult Hall to listen to a brilliant series of lightning talks which developed on this same theme. We also heard a wonderful short story about how best to wreck cars and software systems.

Sir Tony Hoare

Partitioning with Python

2009-06-17, , Comments

Sums and Splits

On the subject of hunting for eodermdromes, here are a couple of semi-related partitioning problems.

  1. for a positive integer, N, find the positive integer sequences which sum to N
  2. for a sequence, S, find the distinct partitions of that sequence

As an example of the first, the 16 distinct integer sequences which sum to 5 are:

5
4 + 1
3 + 1 + 1
3 + 2
2 + 1 + 2
2 + 1 + 1 + 1
2 + 2 + 1
2 + 3
1 + 1 + 3
1 + 1 + 2 + 1
1 + 1 + 1 + 1 + 1
1 + 1 + 1 + 2
1 + 2 + 2
1 + 2 + 1 + 1
1 + 3 + 1
1 + 4

and of the second, the 8 distinct ways of partitioning the sequence ABCD are:

ABCD
A BCD
AB CD
ABC D
A B CD
A BC D
AB C D
A B C D

Note that I’ve counted 2 + 1 + 2, 2 + 2 + 1, and 1 + 2 + 2 as distinct sums totalling 5. That happens to be the formulation of the problem which interested me.

Oulipo and the Eodermdrome challenge

2009-06-05Comments

SHOES ON HENS

SAMSON MOANS

DRAB RED BEAD

Oulipo

At the Mathematics and Fiction workshop held last weekend in Oxford I particularly enjoyed David Bellos’ wonderful talk about Oulipo, the world’s longest running literary movement. The Oulipo is a group of writers interested in exploring the application of mathematical structures, patterns and algorithms to writing.

Queneau sonnets

As an example, poet and novelist Raymond Queneau unleashed the exponential power of combinatorics to write a small book of sonnets which he hadn’t finished reading himself!

Run-length encoding in Python

2009-06-01, Comments

Recently I discussed run-length encoding and DEFLATE compression. I never actually showed a Python implementation of a run-length encoder, so here’s one now.

import itertools as its

def ilen(it):
    '''Return the length of an iterable.
    
    >>> ilen(range(7))
    7
    '''
    return sum(1 for _ in it)

def runlength_enc(xs):
    '''Return a run-length encoded version of the stream, xs.
    
    The resulting stream consists of (count, x) pairs.
    
    >>> ys = runlength_enc('AAABBCCC')
    >>> next(ys)
    (3, 'A')
    >>> list(ys)
    [(2, 'B'), (3, 'C')]
    '''
    return ((ilen(gp), x) for x, gp in its.groupby(xs))

The decoder is equally simple. Itertools.repeat expands a (count, value) pair into an iterable which will generate count elements. Itertools.chain flattens these iterables into a single stream.

DEFLATE: run-length encoding, but better

2009-05-21, , , , Comments

Run-length encoding

Run-length encoding is a simple compression scheme in which runs of equal values are represented by the value and a repeat count. For example, a supermarket cashier might process this line of shopping

Fruit salad

as

  • 4 bananas
  • 3 apples
  • 2 bananas
  • 1 pineapple
  • 3 apples

Unix packs in its very own run length encoder, uniq -c. It works just fine — so long as the values you want to encode are newline separated byte strings, that is.

Let’s use a sequence of coin tosses as an example stream. $RANDOM generates random numbers. and we use the least significant bit of these numbers as an index into an array containing the values heads, tails.

$ HT=(heads tails)
$ toss() { echo ${HT[$RANDOM&1]}; }
$ toss; toss; toss
heads
tails
tails
$ tosses() { while [ 1 ]; do toss; done; }
$ tosses | head
tails
tails
tails
heads
tails
heads
heads
heads
tails
tails

tails tails tails heads tails heads heads heads tails tails

Copy, load, redirect and tee using C++ streambufs

2009-05-13, Comments

out << std::setw(3) << place << ". "
    << "Name " << name
    << ", Score " 
    << std::fixed << std::setprecision(2) 
    << score << '\n';

fprintf(out, "%3d. Name %s, Score %.2f\n", 
        place, name, score);

Cout or printf?

Towards the end of an interview I was asked:

“Cout or printf?”

The question made me smile. It signalled that I’d done well enough in the technical section and now we’d be moving on to a more relaxed discussion. Similar tailing-off questions include:

(Come to think of it, the correct answer to the cout vs. printf question is probably “No!” Both provide global access to the standard output stream, and globals, like singletons, are dangerous. Parametrise from above, pass down ostreams or FILE * handles as needed. Do the right thing! Anyway…)

The C++ Standard Library: A Tutorial and Reference cover

I answered something suitably vague. The interview moved on. The truth is, the C++ iostream library may be technically superior to its C predecessor — safer to use, easier to extend — but it’s failed to supplant it. Does anyone truly find all those chained shift operators and manipulators easier to read than a succinct format string? Does the char, wchar_t, <your-char> dimension really help? Even Nicolai Josuttis’ excellent book, The C++ Standard Library: A Tutorial and Reference, shies away:

This chapter does not attempt to discuss all aspects of the IOStream library in detail; to do that would take an entire book by itself.

Exposed buffers

I’m not the only person to find iostream formatting somewhat clunky, but input/output is more than just formatting. There’s also buffering and synchronising, for example. One under-appreciated feature of the iostream library is that the low-level read and write operations are delegated to separate stream buffer objects.

C++ streams allow direct access to their underlying buffers. You can customise these buffers. You can swap them around. In some ways, C++ goes in at a lower level than C. Poke around in the streambuf class, and you’ll find the member function names even sound like assembler instructions: egptr, xsputn, pbump, epptr, for example.

The remainder of this article works through some examples which use std::streambufs to copy, load, redirect and tee streams.

Generic documentation

2009-04-30, Comments

/**
 *  @brief This does what you think it does.
 *  @param  a  A thing of arbitrary type.
 *  @param  b  Another thing of arbitrary type.
    ....
 */

The Rings of Saturn

2009-04-28, , , Comments

Roi des Belges, the ship Conrad used to travel up the Congo

On board The Nellie waiting for the turn of the tide, Marlow entertains his companions with the tale of a mission he undertook as an employee of a Belgian trading company, traveling up the Congo river. This entertainment, the framed story in Joseph Conrad’s Heart of Darkness, quickly becomes very dark indeed, as Marlow presses on through disease, destruction and death, increasingly preoccupied with tracing the elusive Mr Kurtz, until eventually, when it seems he can go no further, he comes to a clearing. Looking through his glasses he sees the slope of a hill on which stands a house. There was no enclosure or fence of any kind, says Marlow, but there had been one apparently, for near the house half-a-dozen slim posts remained in a row, roughly trimmed, and with their upper ends ornamented with round carved balls. The occupant of this house, a harlequin figure, a jester at the court of King Kurtz if you like, also appears in Terry Jones’ fluidinfo blog, and it was this blog article which sent me back to Conrad, his character Marlow, and his terrible journey.

Software development checklist += 3

2009-04-21, , Comments

Various people better qualified than me have created checklists for healthy software development environments. I won’t be offering my own such list, but I would like to mention a couple of items which deserve a place on it.

  1. Does your company have a good library? Can anyone order a book for this library, easily?
  2. Do team members attend conferences? Do they give presentations at conferences?

Review: Expert Python Programming

2009-04-16, Comments

Expert Python Programming cover Tarek Ziadé

If you follow Python blogs you’ll have seen more than one review of Tarek Ziadé’s book, Expert Python Programming. As a result of these reviews I’d decided that, although the book looked interesting, I wouldn’t be investing in a copy for myself; but when the publishers contacted me directly with the offer of a free review copy, I accepted.

Patience sort and the Longest increasing subsequence

2009-03-26, , , Comments

Play cards and watch TV

Ace of Spades 10 of Spades 6 of Spades 7 of Spades 5 of Spades King of Spades 9 of Spades Queen of Spades 2 of Spades 8 of Spades 4 of Spades 3 of Spades Knave of Spades

In 1999 The Bulletin of the American Mathematical Society published a paper by David Aldous and Persi Diaconis entitled: “Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem”.

In case that sounds heavy going, the authors kick off with a card game.

Take a deck of cards labeled 1, 2, 3, … , n. The deck is shuffled, cards are turned up one at a time and dealt into piles on the table, according to the rule

  • A low card may be placed on a higher card (e.g. 2 may be placed on 7), or may be put into a new pile to the right of the existing piles.

At each stage we see the top card on each pile. If the turned up card is higher than the cards showing, then it must be put into a new pile to the right of the others. The object of the game is to finish with as few piles as possible.

And if this still sounds too mathematical (when did you last find a deck of cards labelled 1 through n?) Aldous and Diaconis suggest.

To play with real cards one needs to linearly order the 52 cards, e.g. by putting suits in the bridge-bidding order ♣ ♦ ♥ ♠. This mindless form of solitaire is then quite playable, perhaps while watching television.

As a target they recommend aiming for 9 piles, which, combined with an optimal strategy, gives you roughly a 5% chance of winning. In what follows I’ll also be adopting the bridge convention that Aces are high.

A greedy strategy turns out to be optimal — otherwise you’d probably need to switch off the TV in order to concentrate. What’s more, this same strategy discovers the longest increasing subsequence of cards contained within the shuffled deck.

OCR. Wrong characters, right meaning! (chuckles)

2009-03-19, Comments

Chuckles graphic

Run this image through the tesseract OCR engine and it gets the characters wrong but the meaning right.

$ curl wordaligned.org/images/chuckles.tif > chuckles.tif
$ tesseract chuckles.tif ocr-chuckles && cat ocr-chuckles.txt
[HEHE]

At first I assumed I’d cracked an easter egg but now I’m not so sure. Crop to the region of interest and all is well.

Cropped chuckles graphic
$ curl wordaligned.org/images/chuckles-cropped.tif > cropped.tif
$ tesseract cropped.tif ocr-cropped && cat ocr-cropped.txt
(chuckles)

Just in case you were wondering … the graphic appears in the subtitles of a TV advert featuring Rolf Harris and the Churchill dog. Rolf is the one who’s chuckling.

Rolf Harris and Churchill


This oddity happened using tesseract 2.03 built on OS X, untrained. The grayscale images shown on this page are PNGs, not TIFFs because — much to my surprise — browser support for TIFFs is limited. Tesseract only accepts TIFF images, and the file extension has to be .tif.

Good maths, bad computers

2009-03-17Comments

Do you like maths and computers? If so, I recommend Mark Chu-Carroll’s Good Math, Bad Math blog. Mathematicians strive for precision and clarity, which creates a natural tension with the more pragmatic world of computer programming. As Donald Knuth famously wrote:

Beware of bugs in the above code; I have only proved it correct, not tried it.

Yesterday Mark Chu-Carroll shared a precise formula for solving that most slippery computing problem of them all — figuring out a project schedule.

[One of my favorite managers] said that when a programmer gives you an estimate of how long something should take, multiply it by two and increase the unit. So if they say it’ll take a day, assume two weeks. If they say a week, assume two months. In my experience, it’s actually a really good predictor.

And if they say it’ll take 5 months to complete a novel digital typesetting program, expect delivery 10 years later…

Certainly Mark Chu-Carroll’s metric is of more practical use than Hofstadter’s self-referential law. I’d like to see it built into project planning tools.

Mathematicians are equally precise about random numbers, a notorious source of computing issues. I also recommend John D. Cook’s blog, The Endeavour, which covers a broad range of mathematical and programming topics. In a recent article he points out why quasi-random numbers can sometimes be better than truly random numbers.

Longest common subsequence

2009-03-11, , , , Comments

H
U
M
A
N
C
0
0
0
0
0
H
1
1
1
1
1
I
1
1
1
1
1
M
1
1
2
2
2
P
1
1
2
2
2
A
1
1
2
3
3
N
1
1
2
3
4
Z
1
1
2
3
4
E
1
1
2
3
4
E
1
1
2
3
4
N
A
M
H

I pinched the idea for this animation from Ned Batchelder’s clever visualisation of Román Cortés’ brilliant CSS Homer Simpson. It depends on Javascript, CSS and a font containing upwards, leftwards and north west arrows. If it doesn’t work in your user agent, my apologies. If you’d like to find out why I’m comparing humans with chimpanzees, read on!

Ordered sublists. A brute force approach

2009-03-09, , Comments

Younger runners

Recently I posed a puzzle:

Starting with a list of runners ordered by their finishing time in a race, select a sublist of runners who are getting younger. What is the longest such sublist?

Below, I’ve highlighted just such a sublist within the list of 8 runners who completed last year’s New York marathon in under two and a quarter hours. As you can see, MARILSON GOMES DO SANTOS, 31, is older and faster than RONO, 30, who is older and faster in turn than ROHATINSKY, 26.

  1. MARILSON GOMES DOS SANTOS, 31, M, 2:08:43
  2. ABDERRAHIM GOUMRI, 32, M, 2:09:07
  3. DANIEL RONO, 30, M, 2:11:22
  4. PAUL TERGAT, 39, M, 2:13:10
  5. ABDERRAHIME BOURAMDANE, 30, M, 2:13:33
  6. ABDI ABDIRAHMAN, 31, M, 2:14:17
  7. JOSH ROHATINSKY, 26, M, 2:14:23
  8. JASON LEHMKUHLE, 31, M, 2:14:30

This is a deceptively tricky problem. Even on such a small input list it’s hard to be absolutely sure our solution is optimal. Certainly there are other ordered triples with decreasing ages, but might there be a quartet? And even if we’re convinced Gomes, Rono, Rohatinsky do form a longest ordered sublist, I think it’s already clear that as more runners finish, there may be no longer be a longest ordered sublist which starts with these three.

Dumb computers

Rather than invent a clever strategy to find an optimal solution, why not get a computer to exhaust the possibilities? If we generate all possible sublists then filter out the ones whose age fields decrease, then our answer will be the longest of these.