Ordered sublists. A brute force approach

2009-03-09, , Comments

Younger runners

Recently I posed a puzzle:

Starting with a list of runners ordered by their finishing time in a race, select a sublist of runners who are getting younger. What is the longest such sublist?

Below, I’ve highlighted just such a sublist within the list of 8 runners who completed last year’s New York marathon in under two and a quarter hours. As you can see, MARILSON GOMES DO SANTOS, 31, is older and faster than RONO, 30, who is older and faster in turn than ROHATINSKY, 26.

  1. MARILSON GOMES DOS SANTOS, 31, M, 2:08:43
  2. ABDERRAHIM GOUMRI, 32, M, 2:09:07
  3. DANIEL RONO, 30, M, 2:11:22
  4. PAUL TERGAT, 39, M, 2:13:10
  5. ABDERRAHIME BOURAMDANE, 30, M, 2:13:33
  6. ABDI ABDIRAHMAN, 31, M, 2:14:17
  7. JOSH ROHATINSKY, 26, M, 2:14:23
  8. JASON LEHMKUHLE, 31, M, 2:14:30

This is a deceptively tricky problem. Even on such a small input list it’s hard to be absolutely sure our solution is optimal. Certainly there are other ordered triples with decreasing ages, but might there be a quartet? And even if we’re convinced Gomes, Rono, Rohatinsky do form a longest ordered sublist, I think it’s already clear that as more runners finish, there may be no longer be a longest ordered sublist which starts with these three.

Dumb computers

Rather than invent a clever strategy to find an optimal solution, why not get a computer to exhaust the possibilities? If we generate all possible sublists then filter out the ones whose age fields decrease, then our answer will be the longest of these.

Exhaustive search
from itertools import combinations as sublists
from functools import partial

def is_ordered(xs, comp):
    '''Return True if the sequence xs is ordered, False otherwise.
    
    >>> from operator import gt
    >>> is_ordered((3, 2, 1), gt)
    True
    '''
    return all(comp(xs[i], xs[i+1]) for i in range(len(xs)-1))

def longest_ordered_sublist(xs, comp):
    '''Find a longest sublist of "xs" which is ordered by "comp"
    
    (Note: there may be no unique solution, and this function makes
    no guarantees about which longest sublist is returned.)
    
    >>> from operator import lt
    >>> items = 2, 1, 5, 2, 2, 3, 1
    >>> longest_ordered_sublist(items, lt)
    (1, 2, 3)
    '''
    in_order = partial(is_ordered, comp=comp)
    xss = (xss for n in range(len(xs)) for xss in sublists(xs, n))
    return max(filter(in_order, xss), key=len)

The heavy lifting here is done by a recent addition to the itertools module, itertools.combinations.

itertools.combinations(iterable, r)

Return r length subsequences of elements from the input iterable.

Generating all possible sublists is as easy (too easy!) as looping over r. All that remains is to filter the ordered sublists, then use the max builtin function keyed by length.

Let’s apply this function to our top 8 finishers.

>>> results = (
... ("Gomes", 31), ("Goumri", 32), ("Rono", 30),
... ("Tergat", 39), ("Bouramdane", 30), ("Abdirahman", 31),
... ("Rohatinsky", 26), ("Lehmkuhle", 31))
>>> def older(runner_a, runner_b):
...     return runner_a[1] > runner_b[1]
>>> longest_ordered_sublist(results, older)
(('Gomes', 31), ('Rono', 30), ('Rohatinsky', 26))

So, the longest ordered sublist has length 3, and Gomes, Rono, Rohatinsky is such a sublist.

Adapting longest_ordered_sublist to return all longest ordered sublists is easy enough:

  • replace max(filter(in_order, xss), key=len) with sorted(filter(in_order, xss), key=len)
  • feed the sorted results through itertools.groupby
  • and capture the final group.

It turns out there are no fewer than 7 longest subsequences ordered by decreasing age.

All longest ordered subsequences
(('Gomes', 31), ('Rono', 30), ('Rohatinsky', 26))
(('Gomes', 31), ('Bouramdane', 30), ('Rohatinsky', 26))
(('Goumri', 32), ('Rono', 30), ('Rohatinsky', 26))
(('Goumri', 32), ('Bouramdane', 30), ('Rohatinsky', 26))
(('Goumri', 32), ('Abdirahman', 31), ('Rohatinsky', 26))
(('Tergat', 39), ('Bouramdane', 30), ('Rohatinsky', 26))
(('Tergat', 39), ('Abdirahman', 31), ('Rohatinsky', 26))

Dumber programmers

Did you spot the defects in longest_ordered_sublist()? It gets the answer wrong for empty sequences and totally ordered sequences.

>>> import operator
>>> longest_ordered_sublist([], operator.lt)
Traceback (most recent call last):
....
ValueError: max() arg is an empty sequence
>>> longest_ordered_sublist((1, 2, 3), operator.lt)
(1, 2)

Fixing the code is easy enough, but what should we do about the documentation, and indeed about testing these edge cases? Certainly edge cases should be tested, and it’s tempting to add a couple more examples to the function’s docstring and let doctest confirm correctness.

def longest_ordered_sublist(xs, comp):
    '''Find a longest sublist of "xs" which is ordered by "comp"
    
    (Note: there may be no unique solution, and this function makes
    no guarantees about which longest sublist is returned.)
    
    >>> from operator import lt
    >>> items = 2, 1, 5, 2, 2, 3, 1
    >>> longest_ordered_sublist(items, lt)
    (1, 2, 3)
    >>> longest_ordered_sublist((1, 2, 3), lt)
    (1, 2, 3)
    >>> longest_ordered_sublist((), lt)
    ()
    '''
    in_order = partial(is_ordered, comp=comp)
    xss = (xss for n in range(len(xs) + 1) for xss in sublists(xs, n))
    return max(filter(in_order, xss), key=len)

Tempting, yes. Good idea? Not really!

As Ned Batchelder puts it:

Docstrings, and the long sequence of code they encourage, may be good ways to explain what code does, but explaining and testing are two different tasks, and the code you write for each will be different. So why try to serve two masters at once?

Either your expository text will be cluttered with uninformative edge cases, or your tests will merely skim the surface of what your code can do.

In the case of the revised version of longest_ordered_sublist(), the first example in the docstring explains clearly what the function does, but the second and third count as clutter.

Ignorant machines

Returning to our combinatorial algorithm, this kind of exhaustive search approach is often referred to as brute force. When you’ve got a machine which can do billions of things every second without breaking a sweat, it’s a great technique.

In this case, though, brute force turns out to be machine ignorance. If a list has N items, then each item will be either in or out of any particular combination, giving a total of 2N possible combinations. On my machine it took over 5 seconds to confirm the longest ordered subsequence of the first 20 runners in the New York marathon was of length 7.

5 seconds isn’t so very long but every additional runner doubles up the time, and we can predict this particular algorithm would fail to process a result list of just 32 entries within the 2 and a bit hours it took Marílson Gomes dos Santos to complete the course.

We need to do better.

Marilson Gomes dos Santos

We can do better! This problem has been studied and some particularly interesting solutions have been found. All talk here about races, ages, etc. only exists to disguise the real problem and encourage you to consider it afresh. If you want an answer now, search for “longest ordered subsequence” or “longest increasing subsequence”. If you’d like to have a crack at the problem yourself, the 2008 New York marathon results can be found online.

I get an answer of 1724 for the length of the longest sublist of these results formed of runners whose ages are non-increasing, and 60 for the longest sublist formed of runners whose ages are strictly decreasing.

Alternatively, stick around — I’ll be writing up my own notes in the next couple of articles.


Python 3.0 notes

As I mentioned at the start of the year, I intend to use Python 3.0 for all new code examples posted here on Word Aligned. I’ll also be making notes about what I discover about using this new version of Python, starting right here.

  • my emacs python editing mode stopped working when I moved to Python 3.0. (I was using whatever came as standard with Carbon Emacs 22). The fix was to use Python-3.0/Misc/python-mode.el straight out of the Python 3.0 tarball. Happily this also works with Python 2.6 and earlier.
  • the code presented runs out of time before running out of memory since, in Python 3.0, the filter builtin function generates elements on demand. Use future_builtins.filter or itertools.ifilter if you want this behaviour with 2.6.
  • while not strictly Python 3.0 (it appears in 2.6 as well) itertools.combinations is worth mentioning again, as is another related member of the itertools module. Check out Raymond Hettinger’s clever solution of the eight Queens puzzle, which uses itertools.permutations to shuffle pieces around a chessboard. Unlike other itertools, combinations and permutations won’t cope with infinite streams.
Another way to exhaust your computer
>>> from itertools import combinations, count
>>> pairs = combinations(count(), 2)