Binary search returns … ?

2010-05-12 • Algorithms, C, C++ • Comments

In an article inspired by Jon Bentley’s classic book, Programming Pearls, Mike Taylor invites his readers to implement the binary search algorithm. To spice things up, he requests we work:

without reference to any existing implementation
without calling any library routine, such as bsearch
without writing tests.

Mike Taylor doesn’t formally specify the problem. He’s confident his readers will know what a binary search is, and if not, the description he quotes from Programming Pearls should suffice:

Binary search solves the problem [of searching within a pre-sorted array] by keeping track of a range within the array in which T [i.e. the sought value] must be if it is anywhere in the array. Initially, the range is the entire array. The range is shrunk by comparing its middle element to T and discarding half the range. The process continues until T is discovered in the array, or until the range in which it must lie is known to be empty.

So could our binary search implementation simply return a binary result, true if T is in the array, false otherwise? Well, Yes. And No. A binary search can provide more information, as Mike Taylor hints when he mentions bsearch.

Jon Bentley and Mike Taylor are primarily interested in how often programmers make a mess of what appears to be a simple assignment and in how to avoid this mess. In this article, I’d like to point out that the problem specification needs attention too.

Bsearch in C

The C library’s bsearch function returns the location of T, if found, or a sentinel value otherwise. We might use the array index of T or -1 as location and sentinel. Standard C uses pointers:

NAME
    bsearch -- binary search of a sorted table

SYNOPSIS
    #include <stdlib.h>

    void *
    bsearch(const void *key, const void *base, size_t nel, 
        size_t width,
        int (*compar) (const void *, const void *));

DESCRIPTION 
    The bsearch() function searches an array of `nel` objects, 
    the initial member of which is pointed to by `base`, for a member
    that matches the  object pointed to by `key`.  The size (in bytes)
    of each member of the array is specified by `width`.

    The contents of the array should be in ascending sorted order 
    according to the comparison function referenced by `compar`.  The 
    `compar` routine is expected to have two arguments which point to
    the `key` object and to an array member, in that order.  It should 
    return an integer which is less than, equal to, or greater than
    zero if the `key` object is found, respectively, to be less than,
    to match, or be greater than the array member.

RETURN VALUES
    The bsearch() function returns a pointer to a matching member
    of the array, or a null pointer if no match is found.  If two members
    compare as equal, which member is matched is unspecified.

Void pointers, function pointers, raw memory — generic functions in C aren’t pretty. How would this function look in a language with better support for generic programming?

Binary search in C++

C++ programmers can of course use bsearch directly since C++ includes the standard C library. The C++ counterpart would seem to be std::binary_search.

At first glance std::binary_search appears to be a weakened version of bsearch. Like bsearch, it searches for a value. Unlike bsearch, it simply returns a boolean result: true if the value is found, false otherwise. Nonetheless, it can tell us more than bsearch in some circumstances.

Let’s return to Mike Taylor’s second constraint, the one about implementing functions which already exist in standard libraries. In a follow up article he explains:

… sometimes you do need to write a binary search, and the library routines won’t get the job done. Or if they will, they’re grotesquely inefficient. For example, suppose you have a 64-bit integer, and you need to find out whether it’s among the nine billion 64-bit integers that are stored in ascending order in a 72 Gb file. The naive solution is to read the file into memory, making an array (or, heaven help us, an Array) of nine billion elements, then invoke the library search function. And of course that just plain won’t work — the array won’t fit in memory.

Agreed! We should know how our wheels work and be ready to reinvent them when necessary: but C++’s std::binary_search will solve this problem efficiently. All we need is a suitable iterator over the file, in this case one which:

increments in 8 byte steps
uses file seeks for larger steps
is dereferenced by reading 8 byte values from the file
stores file position, for use in ordering and distance operations

I include an implementation of just such an iterator towards the end of this article. My aging laptop didn’t have enough disk space for a 72GB data file but I found room for a 5GB one. Std::binary_search() took milliseconds to test the presence of values in this file, and the times improved dramatically on repeat runs; using a linear search, the time extended to minutes, and repeat runs showed no such improvements.

Std::binary_search() requirements

It’s fair to suggest that creating a custom iterator just so we could use std::binary_search merely moves the problem. The iterator’s implementation is longer and arguably more fiddly than any custom binary search function would be. Why couldn’t we use a standard input stream iterator with the standard binary search algorithm?

The reason is that std::istream_iterators are input iterators, suitable only for single pass algorithms. Binary search doesn’t need to take any backwards steps but it does need to be able copy its iterators and advance them repeatedly. As a minimum, then, it requires forwards iterators.

Note the algorithm’s complexity!

The number of comparisons is logarithmic: at most log(last - first) + 2. If ForwardIterator is a Random Access Iterator then the number of steps through the range is also logarithmic; otherwise, the number of steps is proportional to last - first.

In the case of our large file of numbers, comparisons are cheap; there’s little point minimising them if we’re going to take billions of short steps through the file. This is why we created a random access file iterator 1.

A more subtle point is that binary search deals with equivalence rather than equality: it only requires a less-than operator (or a comparison function), and returns true if it can find an element x which satisfies !(x < t) && !(t < x).

The point I’m making is that C++ does a nice job of separating algorithms and containers, which is why the same algorithm can be used on vectors, files, arrays etc. It also carefully defines minimum requirements on the types used by algorithms 2.

Std::binary_search() limitations

We noted earlier that std::binary_search delivers nothing more than a binary result. Is the element there or not? From the SGI STL documentation:

Note that this is not necessarily the information you are interested in!

Even bsearch tells us where it found the match; or rather, where it found a match, since there could be several. This imprecision is one of bsearch’s failings — but it really lets us down when it can’t find the element: in this case, it subdivides the range until it finds where the element would be if it were there, realises there is no match, then throws all positional information away and returns a null pointer.

Suppose our large file represents a set of numbers and we want to know where our test number should go in this file, if it isn’t already present? A C++ binary search algorithm can do this, but it isn’t std::binary_search.

Locating missing elements

Here’s another problem binary search can solve. Suppose we want to know how many runners finished the 2010 London marathon in a time between 3 and 4 hours. Let’s suppose we’ve already loaded the ordered finishing times into an array.

We might try using bsearch to find the position of the runners who finished with a time of exactly 3 hours and with a time of exactly 4 hours. Then the answer would be the difference between these two positions.

There are two problems with this approach:

what if no one finished with a time of exactly 3 or 4 hours?
what if more than one runner finished with a time of exactly 3 or 4 hours?

In the first case bsearch returns a null pointer and we can’t complete our calculation. In the second case, bsearch makes no guarantees about which of the equally-placed runners it will find, and even if we can make our calculation, we cannot be sure it is correct.

Bsearch is not much use, then, but a binary search can give us our answer.

Imagine we had a late result for the race, a runner who recorded a time of exactly 3 hours. What’s the first position in the array at which we could place this runner, whilst maintaining the array ordering? Similarly, where’s the first position at which we could insert a runner with a time of 4 hours, maintaining the array ordering. Both these positions are well defined and precise — even if everyone finished the race in less than 3 hours, or even if no one ran the race — and the correct answer is their difference.

Lower_bound

C++ supplies just such an algorithm. It goes by the name of std::lower_bound, but really it’s good old binary search. We want to find the first place our target element could go, whilst maintaining the ordering, which we do by repeatedly splitting the range.

while the range is non-empty
look at the element in the middle of the range
is its value less than the target value?
if so, continue looking in the top half of the range
if not, continue looking in the bottom half of the range

The while loop exits when the range has been reduced to a single point and this point is what we return. On my platform, the code itself reads a bit like:

template<typename fwd_it, typename t>
fwd_it
lower_bound(fwd_it first, fwd_it last, const t & val)
{
    typedef typename iterator_traits<fwd_it>::difference_type distance;

    distance len = std::distance(first, last);
    distance half;
    fwd_it middle;

    while (len > 0)
    {
        half = len >> 1;
        middle = first;
        std::advance(middle, half);
        if (*middle < val)
        {
            first = middle;
            ++first;
            len = len - half - 1;
        }
        else
            len = half;
    }
    return first;
}

I think this version of binary search is yet another gem from the C++ standard library. As Jon Bentley and Mike Taylor eloquently point out, the implementation is subtle — in particular, if (*middle < val) we must eliminate middle or risk an infinite loop — but by tightening the problem specification and paring back the requirements we’ve created a function which is far more useful than bsearch and arguably simpler to code.

For comparison, here’s the bsearch implemented by glibc, version 2.11.

/* Perform a binary search for KEY in BASE which has NMEMB elements
   of SIZE bytes each.  The comparisons are done by (*COMPAR)().  */
void *
bsearch (const void *key, const void *base, size_t nmemb, size_t size,
         int (*compar) (const void *, const void *))
{
  size_t l, u, idx;
  const void *p;
  int comparison;

  l = 0;
  u = nmemb;
  while (l < u)
    {
      idx = (l + u) / 2;
      p = (void *) (((const char *) base) + (idx * size));
      comparison = (*compar) (key, p);
      if (comparison < 0)
        u = idx;
      else if (comparison > 0)
        l = idx + 1;
      else
        return (void *) p;
    }

  return NULL;
}

Binary search variants

On my platform, std::binary_search is built directly on std::lower_bound. Here’s the code.

template<typename fwd_it, typename t>
fwd_it
lower_bound(fwd_it first, fwd_it last, const t & val)
{
    fwd_it i = std::lower_bound(first, last, val);
    return i != last && !(val < *i);
}

Std::upper_bound searches a sorted range to find the last position an item could be inserted without changing the ordering.

Std::equal_range returns a pair of iterators, logically equal to make_pair(lower_bound(...), upper_bound(...)).

Iterating over numbers in a file

The iterator class I created to use std::binary_search on an file containing fixed width binary formatted numbers appears below. To determine whether the file numbers.bin contains the target value 288230376151711744, we would write something like:

#include <algorithm>

....

    typedef binary_file_number_iter<long long, 8> iter;
    long long target = 288230376151711744LL;

    bool found = std::binary_search(iter("numbers.bin", iter::begin),
                                    iter("numbers.bin", iter::end),
                                    target);

To test the performance of these iterators I created a 5GB binary file packed with 8 byte numbers. These numbers were multiples of 3:

0, 3, 6, 9, ..., 2015231997

I then timed how long it took to search this file for 10 interesting numbers (and to confirm the returned results were as expected).

-1, 0, 1, 2, 1007616000, 1007616001, 1007616002, 1007616003, 2015231997, 2015232000

Binary_search() recorded a time of 0.308 seconds on a rather old MacBook^[3]. Using a hand-coded linear search the run time was just over 38 minutes. That is, the binary search ran 7000 times faster on this sample.

Interestingly, repeated runs of the binary search test using the same input file and the same targets ran in an average time of just 0.030 seconds, a 10-fold times speed up over the first run. Similarly repeating the linear search showed no such improvement. I’m attributing this to operating system file caching, but I don’t pretend to know exactly what’s going on here. (My thanks to Michal Mocny for his explanation in the comments below).

Binary file number iterator

#ifndef BINARY_FILE_NUMBER_ITERATOR_HPP_INCLUDED
#define BINARY_FILE_NUMBER_ITERATOR_HPP_INCLUDED

#include <fstream>
#include <ios>
#include <iostream>
#include <iterator>
#include <stdexcept>

// File read error, thrown when low level file access fails.
class file_read_error : public std::runtime_error
{
public:
    file_read_error(std::string const & what)
        : std::runtime_error(what)
    {
    }
};


// This iterator class is used for numbers packed into a file
// using a fixed width binary format. Numbers must be packed
// most significant byte first.
//
// The file is not read into memory. Iterators are moved by
// file seeking and dereferenced by reading from the file.
//
// These iterators declare themselves to be random access
// iterators but a file is not usually a random access device.
// For example, advancing an iterator a large distance may well
// take longer than advancing a small distance.
template <typename number, int number_size>
class binary_file_number_iter
{
    typedef binary_file_number_iter<number, number_size> iter;

private: // Sanity
    // Check things are OK, closing the stream and throwing an error on failure.
    void check(bool ok, std::string const & what)
    {
        if (!ok)
        {
            close();
            throw file_read_error(what);          
        }
    }

public: // Traits typedefs, which make this class usable with
        // algorithms which need a random access iterator.
    typedef std::random_access_iterator_tag iterator_category;
    typedef number value_type;
    typedef std::streamoff difference_type;
    typedef number * pointer;
    typedef number & reference;

public:
    static int const number_width = number_size;

public: // Enum used to construct begin, end iterators
    enum start_pos { begin, end };

public: // Lifecycle
    binary_file_number_iter(std::string const & filename,
                            start_pos where = begin)
        : filename(filename)
        , pos(-1)
    {
        open();
        if (where == end)
        {
            seek_end();
        }
    }

    binary_file_number_iter()
        : pos(-1)
    {
    }

    binary_file_number_iter(iter const & other)
        : filename(other.filename)
    {
        open();
        setpos(other.pos);
    }

    ~binary_file_number_iter()
    {
        close();
    }

    iter & operator=(iter const & other)
    {
        close();
        filename = other.filename;
        open();
        setpos(other.pos);
        return *this;
    } 

public: // Comparison
        // Note: it is an error to compare iterators into different files.
    bool operator<(iter const & other) const
    {
        return pos < other.pos;
    }

    bool operator>(iter const & other) const
    {
        return pos > other.pos;
    }

    bool operator==(iter const & other) const
    {
        return pos == other.pos;
    }

    bool operator!=(iter const & other) const
    {
        return pos != other.pos;
    }

public: // Iteration
    iter & operator++()
    {
        return *this += 1;
    }

    iter & operator--()
    {
        return *this -= 1;
    }

    iter operator++(int)
    {
        iter tmp(*this);
        ++(*this);
        return tmp;
    }

    iter operator--(int)
    {
        iter tmp(*this);
        --(*this);
        return tmp;
    }

public: // Step
    iter & operator+=(difference_type n)
    {
        advance(n);
        return *this;
    }

    iter & operator-=(difference_type n)
    {
        advance(-n);
        return *this;
    }

    iter operator+(difference_type n)
    {
        iter result(*this);
        return result += n;
    }

    iter operator-(difference_type n)
    {
        iter result(*this);
        return result -= n;
    }

public: // Distance
    difference_type operator-(iter & other)
    {
        return (pos - other.pos) / number_size;
    }

public: // Access
    value_type operator*()
    {
        return (*this)[0];
    }

    value_type operator[](difference_type n)
    {
        std::streampos restore = getpos();
        advance(n);
        value_type const result = read();
        setpos(restore);
        return result;
    }

    // Allow access to the underlying stream position
    std::streampos getpos()
    {
        std::streampos s = in.tellg();
        check(in, "getpos failed");
        return s;
    }
private: // Implementation details
    void open()
    {
        in.open(filename.c_str(), std::ios::binary);
        check(in, "open failed");
        pos = getpos();
    }

    void close()
    {
        if (in.is_open())
        {
            in.close();
        }
    }

    void advance(difference_type n)
    {
        check(in.seekg(n * number_size, std::ios_base::cur), "advance failed");
        pos = getpos();
    }

    void seek_end()
    {
        check(in.seekg(0, std::ios_base::end), "seek_end failed");
        pos = getpos();
    }

    void setpos(std::streampos newpos)
    {
        check(in.seekg(newpos), "setpos failed");
        pos = newpos;
    }

    value_type read()
    {
        number n = 0;
        unsigned char buf[number_size];
        check(in.read((char *)buf, number_size), "read failed");

        for (int i = 0; i != number_size; ++i)
        {
            n <<= 8;
            n |= buf[i];
        }
        return n;
    }

private: // State
    std::string filename;
    std::ifstream in;
    std::streampos pos;
};

#endif // BINARY_FILE_NUMBER_ITERATOR_HPP_INCLUDED

Here are some basic tests for the binary file number iterator.

Test binary file number iterator

#include <assert.h>
#include <fstream>
#include <algorithm>
#include <ext/algorithm> // For Gnu's non-standard is_sorted
#include <iostream>

#include "binary_file_number_iterator.hpp"

typedef binary_file_number_iter<long long, 8> iter8;
typedef binary_file_number_iter<int, 4> iter4;
typedef binary_file_number_iter<short, 2> iter2;
typedef binary_file_number_iter<char, 1> iter1;

template <typename fwd_it>
bool is_sorted(fwd_it beg, fwd_it end)
{
    return __gnu_cxx::is_sorted(beg, end);
}

char const * empty_test_file(char const * name)
{
    std::ofstream ofile;
    ofile.open(name);
    ofile.close();
    return name;
}

/*
  Create a small test file containing numbers, in ascending order,
  for number sizes 1, 2, 4 and 8 bytes.

  A hex view of the file looks like:

  0000 0000 0000 0000 0303 0303 0303 0303
  0606 0606 0606 0606 0909 0909 0909 0909
  0c0c 0c0c 0c0c 0c0c 0f0f 0f0f 0f0f 0f0f
*/
char const * basic_test_file(char const * name)
{
    std::ofstream ofile;
    ofile.open(name);
    for (unsigned char i = 0; i != 18; i += 3)
        for (unsigned j = 0; j != 8; ++j)
            ofile << i;            
    ofile.close();
    return name;
}

void empty_file_tests()
{
    char const * empty_file = empty_test_file("empty_test_file");
    iter1 beg(empty_file, iter1::begin);
    iter1 end(empty_file, iter1::end);
    assert(beg == end);
    assert(std::lower_bound(beg, end, -1) == end);
    assert(std::upper_bound(beg, end, -1) == end);
    assert(!std::binary_search(beg, end, 0));
    assert(std::equal_range(beg, end, -1) == std::make_pair(beg, beg));
}

template <typename value_type>
value_type repeat(int v, int w)
{
    value_type result = 0;
    while (w-- != 0)
    {
        result <<= 8;
        result |= v;
    }
    return result;
}

template <typename iter>
void basic_file_tests()
{
    char const * basic_file = basic_test_file("basic_test_file");

    typedef typename iter::value_type value_t;
    typedef typename std::pair<iter, iter> range;
    int const w = iter::number_width;

    iter beg(basic_file, iter::begin);
    iter end(basic_file, iter::end);
    assert(beg < end);
    assert(!(beg > end));
    assert(!(beg == end));
    assert(beg != end);
    assert(end - beg == 48 / w);

    iter mid = beg;
    assert(mid[0] == 0);
    assert(mid[8/w] == repeat<value_t>(3, w));
    assert(*mid == 0);
    assert(*mid++ == 0);
    assert(*--mid == 0);
    assert(*(mid += 16/w) == repeat<value_t>(6, w));
    assert(mid < end);
    assert(mid > beg);

    assert(is_sorted(beg, end));
    assert(std::lower_bound(beg, mid, -1) == beg);
    assert(std::lower_bound(beg, mid, 0) == beg);
    assert(std::upper_bound(beg, mid, 0) == beg + 8/w);
    assert(std::upper_bound(beg, mid, 1) == beg + 8/w);
    assert(std::binary_search(beg, end, 0));
    assert(std::binary_search(beg, end, repeat<value_t>(0xf, w)));

    mid = beg + 8/w;
    assert(std::equal_range(beg, end, 0) == std::make_pair(beg, mid));
    assert(std::equal_range(beg, end, 1) == std::make_pair(mid, mid));
}

int main()
{
    empty_file_tests();
    basic_file_tests<iter1>();
    basic_file_tests<iter2>();
    basic_file_tests<iter4>();
    basic_file_tests<iter8>();
    return 0;
}

Triple fail

In this article we’ve discussed binary search:

referring to existing implementations
calling library routines, such as std::binary_search
and written some tests.

Despite this indiscipline, we never even bothered to roll our own binary search: we’ve tackled the exact opposite of the problem which Mike Taylor set. Programming tasks rarely start with a clear specification, and even if they do, the specification needs questioning.

Thanks

My thanks to pinprick for the cheese photos, and for this delicious description.

morbier is a soft-ripened, washed rind cheese. the tradition of bathing the rinds in salty water (or strong ale) goes back to trappist monks, who perfected the art. washing the rind makes it tougher, protecting the cheese and making it last longer. washing the rind also makes it a place where a certain bacteria, b. linens, love to hang out. while they work their magic, making the cheese inside smooth and creamy and silky, they also make the outside stinky. there isn’t any good way to put it. however, most stinky cheese taste amazing, and once you realize that, you find that you love the smell of stinky cheese. stink on the outside means gold on the inside! — pinprick

How long before flickr implements scratch and sniff?

1: Well, something which masquerades as a random access iterator. Files are not usually random access devices, and the time taken by a seek operation may well vary with the seek offset. By supplying random access scaffolding, we at least ensure that a single, efficient, seek operation is used each time we advance the file position.

2: The C++ standard describes the requirements on types in some detail. Unfortunately C++ implementations provide little support for enforcing these requirements. Violations are likely to be punished by grotesque compiler warnings.

[3]: The laptop specification:

Hardware Overview:
  Model Name:               MacBook
  Model Identifier:     MacBook1,1
  Processor Name:       Intel Core Duo
  Processor Speed:      2 GHz
  Number Of Processors:     1
  Total Number Of Cores:    2
  L2 Cache (per processor): 2 MB
  Memory:                   2 GB
  Bus Speed:                667 MHz

Word Aligned

sweating the small stuff