Top Ten Tags

2008-02-19

Reworking this website reminded me of another classic sorting algorithm. The sidebar on the front page now has a Top Tags node which lists, in order, the 10 most frequently used tags for articles on this site. What’s the best way to find these?

More generally:

How do you select the N largest items, in order, from a collection?

Maximum

When N is 1, the standard maximum function does the job. That would be std::max_element in C++ or simply max in Python. Python’s max has an optional key parameter, allowing you to supply a function which extracts a comparison key; C++ similarly has an overload of max_element which accepts a comparison predicate.
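As a sketch of the C++ version — the tag data and the by_count function here are my own inventions, for illustration only — we might find the single most popular tag like this:

// Sketch: find the single most frequently used tag using
// std::max_element with a comparison predicate.
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> tag_count;

// Compare tag counts by frequency of use.
bool by_count(tag_count const & lhs, tag_count const & rhs)
{
    return lhs.second < rhs.second;
}

int main()
{
    std::vector<tag_count> tags;
    tags.push_back(tag_count("c++", 30));
    tags.push_back(tag_count("python", 42));
    tags.push_back(tag_count("shell", 7));

    std::vector<tag_count>::const_iterator const top =
        std::max_element(tags.begin(), tags.end(), by_count);
    std::cout << top->first << "\n"; // prints "python"
    return 0;
}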

Sort and Slice

If N isn’t 1, you could sort the whole collection then slice the N largest elements from the end. In Python:

nlargest = sorted(collection)[-N:]

And in shell:

$ sort FILE | tail -$N

Partial Sort

A full sort isn’t required if you just need the top 10, though. For large collections and small N, gains can be had from partially sorting the collection. C++ provides an algorithm, std::partial_sort, which does just that, shuffling the collection in place until the first N elements are ordered and at the front of that collection.

Here’s a complete program based on the C++ partial sort algorithm. It reads integers from standard input into memory then writes the 10 largest of them, largest first, to standard output.

// Program reads integer values from standard input and
// writes the N largest of these values, largest first,
// to standard output.
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    typedef std::vector<long> numbers;
    typedef numbers::iterator iter;
    typedef std::istream_iterator<numbers::value_type> in;
    typedef std::ostream_iterator<numbers::value_type> out;

    // Read numbers from standard input
    numbers results;
    std::copy(in(std::cin), in(), std::back_inserter(results));

    // Make sure we cope with N > size of results.
    numbers::size_type const N = 10u;
    numbers::size_type const n = std::min(results.size(), N);

    // Find the N largest (hence the "greater" predicate)
    iter const first = results.begin();
    iter const middle = first + n;
    iter const last = results.end();
    std::partial_sort(first, middle, last, std::greater<long>());

    // Copy these to standard out
    std::copy(first, middle, out(std::cout, " "));
    return 0;
}

The C++ standard guarantees the complexity of partial_sort:

It takes approximately (last - first) * log(middle - first) comparisons.

The corresponding complexity for a full sort is:

Approximately (last - first) * log(last - first).

So the speed-up is theoretically of the order of log(S)/log(N); with S = 10 million and N = 10, for example, that predicts a factor of about 7. Logarithms grow slowly so the gains aren’t spectacular, but they may well be worth having. I ran some tests on collections of 31 bit numbers generated by the standard C random() function, with the collection size, S, ranging between 2 million and 10 million.

S/million    Partial/ms    Full/ms    Full/Partial
        2             4        217              54
        4             8        451              56
        6            12        697              58
        8            16        944              56
       10            29       1200              41

As you can see, for these test cases the partial sort runs around 50 times more quickly than the full sort; better than expected or predicted!

It’s also worth noting that we can find the top N elements without altering or (fully) copying the original collection: see std::partial_sort_copy() for details.
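By way of illustration, here’s a sketch of a top_n function — the name is mine — built on partial_sort_copy, which leaves the input untouched and returns the N largest values, largest first, in a fresh vector:

// Sketch: copy the n largest elements out of an (unmodified)
// input vector into a separate, smaller vector.
#include <algorithm>
#include <functional>
#include <vector>

std::vector<long> top_n(std::vector<long> const & values,
                        std::vector<long>::size_type n)
{
    std::vector<long> top(std::min(values.size(), n));
    // partial_sort_copy fills [top.begin(), top.end()) with the
    // largest values, in descending order, leaving values intact.
    std::partial_sort_copy(values.begin(), values.end(),
                           top.begin(), top.end(),
                           std::greater<long>());
    return top;
}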

C++’s in-place partial sort works well with a paging model. To sketch the idea, partial_sort(first, first + N, last) yields the first page of results, then, if required, partial_sort(first + N, first + 2 * N, last) yields the second page, and so on. Of course, if we anticipate paging through a large portion of the entire collection, a full sort gets the job done up front.
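A pager along these lines — the function name and page numbering are my own, and it assumes pages are requested in order, so earlier pages have already been moved into place — might read:

// Sketch: order the pth page of N results in place. Uses the
// default ascending order; pass a predicate such as
// std::greater<long>() to partial_sort for largest-first paging.
#include <algorithm>
#include <vector>

void sort_page(std::vector<long> & results,
               std::vector<long>::size_type page, // 0-based page number
               std::vector<long>::size_type N)    // results per page
{
    typedef std::vector<long>::iterator iter;
    typedef std::vector<long>::size_type size_type;

    size_type const lo = std::min(results.size(), page * N);
    size_type const hi = std::min(results.size(), lo + N);
    iter const first = results.begin() + lo;
    iter const middle = results.begin() + hi;
    std::partial_sort(first, middle, results.end());
}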

The complexity guarantee for partial_sort is the same as for sort in the limiting case, when middle equals last. So an implementation could, I think, claim conformance by implementing sort on top of partial_sort.
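In outline, such a conforming sort might be no more than this sketch:

// Sketch: a full sort as the limiting case of a partial sort,
// with middle equal to last.
#include <algorithm>

template <typename RanIt>
void sort_via_partial_sort(RanIt first, RanIt last)
{
    std::partial_sort(first, last, last);
}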

Partial Sorting with Heaps

In fact, the partial and full sort functions use quite different algorithms. Partial sort is based on the heap data structure and, on my present platform, is implemented largely in terms of the standard heap functions. Here’s the important part of the code, which I’ve reformatted for the purposes of this article. Please compare it against the same function in your own implementation.

template<typename RanIt>
void partial_sort(RanIt first, RanIt middle, RanIt last)
{
    typedef typename iterator_traits<RanIt>::value_type V;

    std::make_heap(first, middle);
    for (RanIt i = middle; i < last; ++i)
        if (*i < *first)
            std::__pop_heap(first, middle, i, V(*i));
    std::sort_heap(first, middle);
}

This function starts by making the half open range [first, middle) into a heap, which has the result that *first is the largest element in this range.

It then iterates through the elements [middle, last). Each time an element is smaller than *first — that is, smaller than the largest element of [first, middle) — it calls the implementation’s private std::__pop_heap() function. This in turn swaps the values at positions first and i and adjusts the range [first, middle) to once more be a heap. Again, look in your standard library for details.

Loosely speaking, every time we see an element in the tail of the collection which is smaller than the largest element in the head of the collection, we swap these elements.

More precisely, the loop invariant is that [first, middle) is a heap, and all the elements in the range [middle, i) are at least as large as all the elements in [first, middle). It’s subtle, efficient, and dazzlingly clever!

Once the iterator i gets to the end of the range (last, that is), the container has been partitioned so the smallest N elements are at its front. All that remains is to sort these elements; and since the front of the container has already been heapified, we can just sort_heap it.

Partitioning with Heaps?

Note the distinction between finding the ordered top ten items in a collection and finding the ten largest items in a collection: the ten largest elements needn’t be ordered.

You may have spotted that if we pull out of the partial_sort() implementation shown above before applying the final sort_heap(), then we’ve partitioned the collection so that items in the range [first, middle) are larger than items in the range [middle, last).

In fact, there’s a better way of partitioning the collection to put the N largest elements at the front. It doesn’t use heaps, and, amazingly, can be achieved with a linear algorithm. The C++ standard library provides just such an algorithm, filed under the slightly misleading name of std::nth_element.
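As a sketch — the function name is mine — partitioning the n largest values to the front of a vector might read:

// Sketch: move the n largest values to the front of a vector,
// in no particular order, using std::nth_element, which runs
// in linear time on average.
#include <algorithm>
#include <functional>
#include <vector>

void partition_largest(std::vector<long> & values,
                       std::vector<long>::size_type n)
{
    n = std::min(values.size(), n);
    std::nth_element(values.begin(), values.begin() + n, values.end(),
                     std::greater<long>());
    // Now [values.begin(), values.begin() + n) holds the n largest
    // values, unordered.
}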

Sorting with Heaps

I claimed earlier that C++ sort implementers could reuse a special case of partial sort and still meet the C++ Standard’s complexity guarantee. It would be a hard trick to pull off, though, since the constant factors differ. Sort is likely to be based on quicksort, generally acknowledged to be the most efficient general purpose sorting algorithm. Partial sort, as already mentioned, is based on a heap sort.

On my platform, std::sort() in fact delegates to an introsort — a hybrid algorithm which starts with a quicksort and bottoms out to std::partial_sort() once a heuristically determined recursion depth is exceeded.

I ran a full partial sort — that is, partial_sort with middle set to last — head to head against standard sort on my machine, feeding both algorithms large-ish (size up to 10 million) arrays of 31 bit numbers generated using the standard C random() function. The results indicate sort runs around four times faster than partial sort; someone’s probably got a theoretical proof of the exact multiplier.

[Chart: Full Sort vs Full Partial Sort]

N Largest in Python

Python makes no complexity guarantees, but the location of the nlargest function in the heapq module gives a pretty big hint about its implementation! Note that nlargest returns its results in order; it’s more than just a partitioning. Note too that it’s generous enough to handle the case when N is larger than the size of the collection.

Here’s a Python script which imitates our earlier C++ program:

from sys import stdin
from heapq import nlargest

numbers = map(int, stdin.read().split())
top_ten = nlargest(10, numbers)
print "\n".join(map(repr, top_ten))

For the purpose of comparison, I timed the nlargest() part of this script. I also timed a full (Python) sort of the numbers. Again, I ran on random collections of size S ranging from 2 to 10 million.

S/million    Partial/ms    Full/ms    Full/Partial
        2           100       2590              26
        4           190       5800              31
        6           290       9300              32
        8           410      12880              31
       10           510      16670              33

This time, the partial sort ran about 30 times more quickly than the full sort. C++ proved about 13 times quicker than Python for the full sort, and 24 times quicker for partial sort.

N Largest in Shell

The Python script shown relies on being able to read the entire file into memory (that’s not a limitation of Python, just of the rather simplistic approach taken by the script). The C++ solution only needs space for the numbers, the input buffering being nicely handled by the iostreams framework. For sizable inputs — of the order of a GB, say, on a modern computer — we’d need to use secondary storage.

The Unix shell pipeline shown earlier has no such limitation. Given enough time and secondary storage, the following command finds the 10 largest numbers in BIGFILE, even if we can’t hold all these numbers in RAM.

$ sort -r -n BIGFILE | head

Executing this command on a ~9GB input file holding one billion 31 bit random numbers took over an hour and a half on my machine.

Parallel Algorithm Analysis

The language used in this article for discussing algorithm analysis works best for a single process running a single uninterrupted thread of execution. If we want to budget time for an algorithm which makes N * log(N) comparisons we plug in N, divide by the processor speed, and multiply by the number of cycles required for each comparison.
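To make that concrete with some rough, invented numbers: fully sorting 10 million items costs around 10^7 * log2(10^7), or about 2.3 * 10^8, comparisons; at, say, 10 cycles per comparison on a 10^9 Hz processor, that budgets to roughly two and a half seconds.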

I wonder how well this language will survive in a world where processors have multiple cores. Will a new family of algorithms evolve, ones better equipped to use the new hardware?

This evolution is underway already. In a sequence of articles published in Dr. Dobb’s Journal, Herb Sutter teaches programmers the traditional C++ way of doing things: a low-level, platform-dependent approach based on forking threads and locking resources. I’ve come to regard these techniques as a sure route to subtle bugs. On the systems I’ve worked on, a more C-style approach has worked well. At its simplest, a Unix pipeline distributes the load; this archetype generalises to a multi-process architecture, where we develop and prove each (single-threaded!) component in isolation.

[Image: Erlang logo]

There are higher level languages though. Why limit ourselves to a single machine if we can devise a language which blurs the distinction between multiple processors on a single machine and multiple processors on a network? And why not build in some regulation of low level failures? When a task is distributed between workers, it’s natural to ask what should happen if a worker fails, or simply lags behind.

Functional programming turns out to have much to offer in this new, parallel world — Google’s Map-Reduce framework, for example — and it’s nice to know the fundamental ideas are far from being new: rather, their time has come.

Choosing an algorithm

When discussing algorithms it’s all too easy to fret about what happens when inputs grow massive. If we’ve used the standard libraries then resource use for sorting — both memory and CPU cycles — may not be a concern. The code in this article demonstrates highly efficient general purpose sorting routines; and in any final system it’s likely we could use full and partial sorting interchangeably without noticeably affecting overall performance.

What is always a concern, though, is readability. If it’s the largest few elements of a collection we want, calling std::partial_sort() in C++ or heapq.nlargest() in Python nicely expresses that desire.