Binary search gets a sort key
Suppose you have an list of distinct elements which has been sorted and rotated. How would you look up an element within that list?
For example, the list:
[7, 11, 13, 19, 2, 3, 5]
is sorted (the first 7 primes, in order) and rotated (to put 7 first).
With this list as input, then:
- look up
13returns2since13is at index2 - look up
2returns4 - look up
4returns the sentinel value-1
The obvious technique is to just search the list:
def lookup(values, v):
try:
return values.index(v)
except IndexError:
return -1
This is a linear algorithm which processes the entire list. Is there a way to take advantage of its sorted+rotated-ness?
If the list was sorted then we could apply a binary search for a logarithmic lookup. And in fact, by applying a custom ordering, the list is sorted.
How can we apply a custom ordering in Python?
The way to do this has changed as Python has developed. The table below shows the evolution of standard Python functions which sort and compare.
| Version | Year | Functions | Notes |
|---|---|---|---|
| 2.0 | 2000 | max, min, list.sort([cmp]), bisect.bisect | Note that [cmp] slows the sorting process down considerably |
| 2.3 | 2003 | heapq | Also known as the priority queue algorithm |
| 2.4 | 2004 | sorted(iterable[, cmp[, key[, reverse]]]), list.sort([cmp[, key[, reverse]]]), itertools.groupby(iterable[, key) | In general, the key and reverse conversion processes are much faster than specifying an equivalent cmp function |
| 2.4 | 2004 | heapq.nlargest, heapq.nsmallest | Heapq extended |
| 2.5 | 2006 | max([key]), min([key]), heapq.nlargest(…[, key]), heapq.smallest(…[, key]) | |
| 2.6 | 2008 | heapq.merge | Heapq extended again |
| 3.0 | 2008 | sorted(iterable, key[, reverse]), list.sort(key[, reverse]) | No more cmp |
| 3.5 | 2015 | heapq.merge(…, key) | |
| 3.10 | 2021 | bisect.bisect(…, key) etc | That leaves the low-level heapq functions |
The earliest versions of Python allowed you to sort lists, and
the sort was customised using a cmp function — though the documentation
warned there would be a performance penalty[*]. The builtin min and max
functions could not be cutomised, and nor could the comparison used in
the bisect module — which is Python’s binary search implementation.
At 2.3 the heapq module appeared, but, like bisect, there was no way
to customise the ordering of heap elements.
2.4 introduced the key argument to customise ordering,
noting this should be preferred to cmp. Unlike cmp which compares two
elements, key takes a single element and returns a sort rank for that element.
The key argument was added alongside cmp in list.sort and the new sorted
builtin function; and key was the only way to customise the grouping in
itertools.groupby.
2.5 added key to min and max, and also to heapq.nlargest, heapq.nsmallest.
Although these heapq functions now accept a key, the lower level heap functions
to heapify, push, pop and replace do not.
2.6 introduced heapq.merge, a handy function to merge sorted inputs
using a heap, but with no option to specify a sort key.
3.0 got rid of cmp, making key the only way to customise
sorting. To migrate from Python 2 to 3 any cmp functions need
converting to key functions. As with Python 2.5, at 3.0 you could
apply a key to list.sort, sorted, min, max,
itertools.groupby, heapq.nlargest, heapq.nsmallest — but not to
bisect or other heapq functions.
3.5 added key to heapq.merge, aligning it with heapq.nlargest
and heapq.nsmallest, though it remained impossible to use a sort key
with the lower level heap functions.
The next change came in 3.10, when the key parameter was
added to the bisect module. As far as I can tell that means the only part of
standard Python which doesn’t let you fully customise ordering is the heapq module.
Bisect with a search key
So, to return to the original puzzle, consider our example sorted+rotated list:
>>> values = [7, 11, 13, 19, 2, 3, 5]
The first four elements are greater than or equal to the first element, 7. The last three elements are less than 7.
>>> [v < values[0] for v in values]
[False, False, False, False, True, True, True]
Since False < True the result of the list comprehension above is sorted.
Extending this idea, the following comprehension is also sorted.
>>> [(v < values[0], v) for v in values]
[(False, 7), (False, 11), (False, 13), (False, 19), (True, 2), (True, 3), (True, 5)]
Now we have a logarithmic lookup using binary search:
from bisect import bisect_left
def lookup(values, v):
key = lambda x: (x < values[0], x)
i = bisect_left(values, key(v), key=key)
return i if (i < len(values) and values[i] == v) else -1
Don’t search for v, search for its key
I was surprised to realise the value v being looked up has to have the key
function applied before calling bisect_left. That is, to find where v
should go in values to maintain the sort order, we pass key(v) to bisect_left.
This doesn’t match the interface to binary search in other languages. It also means
in our example we have to handle empty lists as a special case.
def lookup(values, v):
if not values: return -1
key = lambda x: (x < values[0], x)
i = bisect_left(values, key(v), key=key)
return i if (i < len(values) and values[i] == v) else -1
[*] Before the introduction of the sort key, the standard pattern was
decorate-sort-undecorate.