Binary search gets a sort key

2022-02-15, Comments

Suppose you have an list of distinct elements which has been sorted and rotated. How would you look up an element within that list?

For example, the list:

[7, 11, 13, 19, 2, 3, 5]

is sorted (the first 7 primes, in order) and rotated (to put 7 first).

With this list as input, then:

  • look up 13 returns 2 since 13 is at index 2
  • look up 2 returns 4
  • look up 4 returns the sentinel value -1

The obvious technique is to just search the list:

def lookup(values, v):
    try:
        return values.index(v)
    except IndexError:
        return -1

This is a linear algorithm which processes the entire list. Is there a way to take advantage of its sorted+rotated-ness?

If the list was sorted then we could apply a binary search for a logarithmic lookup. And in fact, by applying a custom ordering, the list is sorted.

How can we apply a custom ordering in Python?

The way to do this has changed as Python has developed. The table below shows the evolution of standard Python functions which sort and compare.

Version Year Functions Notes
2.0 2000 max, min, list.sort([cmp]), bisect.bisect Note that [cmp] slows the sorting process down considerably
2.3 2003 heapq Also known as the priority queue algorithm
2.4 2004 sorted(iterable[, cmp[, key[, reverse]]]), list.sort([cmp[, key[, reverse]]]), itertools.groupby(iterable[, key) In general, the key and reverse conversion processes are much faster than specifying an equivalent cmp function
2.4 2004 heapq.nlargest, heapq.nsmallest Heapq extended
2.5 2006 max([key]), min([key]), heapq.nlargest(…[, key]), heapq.smallest(…[, key])
2.6 2008 heapq.merge Heapq extended again
3.0 2008 sorted(iterable, key[, reverse]), list.sort(key[, reverse]) No more cmp
3.5 2015 heapq.merge(…, key)
3.10 2021 bisect.bisect(…, key) etc That leaves the low-level heapq functions

The earliest versions of Python allowed you to sort lists, and the sort was customised using a cmp function — though the documentation warned there would be a performance penalty[*]. The builtin min and max functions could not be cutomised, and nor could the comparison used in the bisect module — which is Python’s binary search implementation.

At 2.3 the heapq module appeared, but, like bisect, there was no way to customise the ordering of heap elements.

2.4 introduced the key argument to customise ordering, noting this should be preferred to cmp. Unlike cmp which compares two elements, key takes a single element and returns a sort rank for that element. The key argument was added alongside cmp in list.sort and the new sorted builtin function; and key was the only way to customise the grouping in itertools.groupby.

2.5 added key to min and max, and also to heapq.nlargest, heapq.nsmallest. Although these heapq functions now accept a key, the lower level heap functions to heapify, push, pop and replace do not.

2.6 introduced heapq.merge, a handy function to merge sorted inputs using a heap, but with no option to specify a sort key.

3.0 got rid of cmp, making key the only way to customise sorting. To migrate from Python 2 to 3 any cmp functions need converting to key functions. As with Python 2.5, at 3.0 you could apply a key to list.sort, sorted, min, max, itertools.groupby, heapq.nlargest, heapq.nsmallest — but not to bisect or other heapq functions.

3.5 added key to heapq.merge, aligning it with heapq.nlargest and heapq.nsmallest, though it remained impossible to use a sort key with the lower level heap functions.

The next change came in 3.10, when the key parameter was added to the bisect module. As far as I can tell that means the only part of standard Python which doesn’t let you fully customise ordering is the heapq module.

Bisect with a search key

So, to return to the original puzzle, consider our example sorted+rotated list:

>>> values = [7, 11, 13, 19, 2, 3, 5]

The first four elements are greater than or equal to the first element, 7. The last three elements are less than 7.

>>> [v < values[0] for v in values]
[False, False, False, False, True, True, True]

Since False < True the result of the list comprehension above is sorted. Extending this idea, the following comprehension is also sorted.

>>> [(v < values[0], v) for v in values]
[(False, 7), (False, 11), (False, 13), (False, 19), (True, 2), (True, 3), (True, 5)]

Now we have a logarithmic lookup using binary search:

from bisect import bisect_left

def lookup(values, v):
    key = lambda x: (x < values[0], x)
    i = bisect_left(values, key(v), key=key)
    return i if (i < len(values) and values[i] == v) else -1

Don’t search for v, search for its key

I was surprised to realise the value v being looked up has to have the key function applied before calling bisect_left. That is, to find where v should go in values to maintain the sort order, we pass key(v) to bisect_left. This doesn’t match the interface to binary search in other languages. It also means in our example we have to handle empty lists as a special case.

def lookup(values, v):
    if not values: return -1
    key = lambda x: (x < values[0], x)
    i = bisect_left(values, key(v), key=key)
    return i if (i < len(values) and values[i] == v) else -1

[*] Before the introduction of the sort key, the standard pattern was decorate-sort-undecorate.