Binary search gets a sort key
Suppose you have an list of distinct elements which has been sorted and rotated. How would you look up an element within that list?
For example, the list:
[7, 11, 13, 19, 2, 3, 5]
is sorted (the first 7 primes, in order) and rotated (to put 7 first).
With this list as input, then:
- look up
13
returns2
since13
is at index2
- look up
2
returns4
- look up
4
returns the sentinel value-1
The obvious technique is to just search the list:
def lookup(values, v):
try:
return values.index(v)
except IndexError:
return -1
This is a linear algorithm which processes the entire list. Is there a way to take advantage of its sorted+rotated-ness?
If the list was sorted then we could apply a binary search for a logarithmic lookup. And in fact, by applying a custom ordering, the list is sorted.
How can we apply a custom ordering in Python?
The way to do this has changed as Python has developed. The table below shows the evolution of standard Python functions which sort and compare.
Version | Year | Functions | Notes |
---|---|---|---|
2.0 | 2000 | max, min, list.sort([cmp]), bisect.bisect | Note that [cmp] slows the sorting process down considerably |
2.3 | 2003 | heapq | Also known as the priority queue algorithm |
2.4 | 2004 | sorted(iterable[, cmp[, key[, reverse]]]), list.sort([cmp[, key[, reverse]]]), itertools.groupby(iterable[, key) | In general, the key and reverse conversion processes are much faster than specifying an equivalent cmp function |
2.4 | 2004 | heapq.nlargest, heapq.nsmallest | Heapq extended |
2.5 | 2006 | max([key]), min([key]), heapq.nlargest(…[, key]), heapq.smallest(…[, key]) | |
2.6 | 2008 | heapq.merge | Heapq extended again |
3.0 | 2008 | sorted(iterable, key[, reverse]), list.sort(key[, reverse]) | No more cmp |
3.5 | 2015 | heapq.merge(…, key) | |
3.10 | 2021 | bisect.bisect(…, key) etc | That leaves the low-level heapq functions |
The earliest versions of Python allowed you to sort lists, and
the sort was customised using a cmp
function — though the documentation
warned there would be a performance penalty[*]. The builtin min
and max
functions could not be cutomised, and nor could the comparison used in
the bisect
module — which is Python’s binary search implementation.
At 2.3 the heapq
module appeared, but, like bisect
, there was no way
to customise the ordering of heap elements.
2.4 introduced the key
argument to customise ordering,
noting this should be preferred to cmp
. Unlike cmp
which compares two
elements, key
takes a single element and returns a sort rank for that element.
The key
argument was added alongside cmp
in list.sort
and the new sorted
builtin function; and key
was the only way to customise the grouping in
itertools.groupby
.
2.5 added key
to min
and max
, and also to heapq.nlargest
, heapq.nsmallest
.
Although these heapq
functions now accept a key
, the lower level heap functions
to heapify, push, pop and replace do not.
2.6 introduced heapq.merge
, a handy function to merge sorted inputs
using a heap, but with no option to specify a sort key.
3.0 got rid of cmp
, making key
the only way to customise
sorting. To migrate from Python 2 to 3 any cmp
functions need
converting to key
functions. As with Python 2.5, at 3.0 you could
apply a key to list.sort
, sorted
, min
, max
,
itertools.groupby
, heapq.nlargest
, heapq.nsmallest
— but not to
bisect
or other heapq
functions.
3.5 added key
to heapq.merge
, aligning it with heapq.nlargest
and heapq.nsmallest
, though it remained impossible to use a sort key
with the lower level heap functions.
The next change came in 3.10, when the key
parameter was
added to the bisect
module. As far as I can tell that means the only part of
standard Python which doesn’t let you fully customise ordering is the heapq
module.
Bisect with a search key
So, to return to the original puzzle, consider our example sorted+rotated list:
>>> values = [7, 11, 13, 19, 2, 3, 5]
The first four elements are greater than or equal to the first element, 7. The last three elements are less than 7.
>>> [v < values[0] for v in values]
[False, False, False, False, True, True, True]
Since False
< True
the result of the list comprehension above is sorted.
Extending this idea, the following comprehension is also sorted.
>>> [(v < values[0], v) for v in values]
[(False, 7), (False, 11), (False, 13), (False, 19), (True, 2), (True, 3), (True, 5)]
Now we have a logarithmic lookup using binary search:
from bisect import bisect_left
def lookup(values, v):
key = lambda x: (x < values[0], x)
i = bisect_left(values, key(v), key=key)
return i if (i < len(values) and values[i] == v) else -1
Don’t search for v, search for its key
I was surprised to realise the value v
being looked up has to have the key
function applied before calling bisect_left
. That is, to find where v
should go in values
to maintain the sort order, we pass key(v)
to bisect_left
.
This doesn’t match the interface to binary search in other languages. It also means
in our example we have to handle empty lists as a special case.
def lookup(values, v):
if not values: return -1
key = lambda x: (x < values[0], x)
i = bisect_left(values, key(v), key=key)
return i if (i < len(values) and values[i] == v) else -1
[*] Before the introduction of the sort key
, the standard pattern was
decorate-sort-undecorate.