Unleash the test army
Are the tests adequate?
Recently I described a solution to the problem of dividing a list into evenly sized chunks. It’s a simple enough problem with just two inputs: the list (or other sliceable container) xs and the number of chunks n. Nonetheless, there are traps to avoid and special cases to consider — what if n is larger than the list, for example? Must the chunks comprise contiguous elements from the original list?
The tests I came up with are straightforward and uninspiring. They were developed within the context of my own assumptions about the solution and the special cases I could imagine. They were written after the implementation — which is to say, development wasn’t driven by tests. They are whitebox tests, designed to cover the various paths through the code based on my insider knowledge.
Are these tests adequate? Certainly they don’t accurately represent the data which will hit the algorithm in practice. Can we be sure we haven’t missed anything? Would the tests still cover all paths if the implementation changed?
Property based testing
David R MacIver described another, complementary, approach at a talk I attended at ACCU 2016. In the talk abstract he characterises the (class of) tests I’d written as anecdotal — “let me tell you about this time I called a function … and then it returned this … and then it raised an exception … etc. etc.”
How about if the test suite instead describes the properties required of the system under test, and then conjures up inputs designed to see if these properties hold under stress? So, rather than our test suite being a limited set of input/output pairs, it becomes an executable specification validated by a robot army.
Hypothesis
This approach sounds compelling but I had my doubts. I also had my doubts about the adequacy of both my code and tests. A perfect opportunity, then, to try out Hypothesis, an open source property-based testing library developed by David MacIver.
I used the Python version of the library, which is the primary implementation. The rest of this article describes my experience of using hypothesis for the first time: I’m not claiming expertise.
First impressions
Excellent!
Installation was the usual pip invocation. The documentation is written with care. It’s evident the library is mature, supported and actively developed. It’s licensed under the Mozilla Public License.
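For anyone following along, that invocation really is a one-liner:

pip install hypothesis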
My first test
Recall that the code I wanted to test reads:
def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    L = len(xs)
    assert 0 < n <= L
    s, r = divmod(L, n)
    t = s + 1
    return ([xs[p:p+t] for p in range(0, r*t, t)] +
            [xs[p:p+s] for p in range(r*t, L, s)])
I also proposed a second itertools based version:
from itertools import accumulate, chain, repeat, tee

def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)
    return [xs[s] for s in map(slice, b, e)]
The first thing you notice when thinking about a property based test is that the specification — the function’s docstring — doesn’t describe the exact form of the output. In fact, as a comment on the article pointed out, my own interpretation of the specification is not the only one, and allowing the chunks to be formed from non-contiguous items permits a particularly elegant solution.
Also, if the list doesn’t divide exactly into n chunks, what should the result be? Well, although I’d have been happy with any evenly-chunked solution, my conventional unit tests assumed an implementation which placed the larger chunks first.
def test_chunk():
    assert chunk('', 1) == ['']
    assert chunk('ab', 2) == ['a', 'b']
    assert chunk('abc', 2) == ['ab', 'c']
    xs = list(range(8))
    assert chunk(xs, 2) == [[0, 1, 2, 3], [4, 5, 6, 7]]
    assert chunk(xs, 3) == [[0, 1, 2], [3, 4, 5], [6, 7]]
    assert chunk(xs, 5) == [[0, 1], [2, 3], [4, 5], [6], [7]]
    rs = range(1000000)
    assert chunk(rs, 2) == [range(500000), range(500000, 1000000)]
Notice, by the way, that although the docstring only mentions lists, I can’t resist demonstrating the algorithm also behaves for strings and ranges — for any sliceable sequence, in fact.
Here’s what I started with when I tried specifying the “evenly sized” property using hypothesis.
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
This first test case defines “evenly sized”, stating that the result comprises n chunks, that the set of chunk lengths contains either 1 element (all chunks the same size) or 2, and that the maximum chunk length is equal to or one more than the minimum chunk length.
This doesn’t fully specify the function. We also need assertions which confirm that recombining the chunks produces the original sequence.
def test_combining_chunks(xs_n):
    pass  # We'll come back to this!
We’ll come back to this later!
Now, test_evenly_chunked() looks quite like a conventional test function. It just needs some input values. Rather than poke the function with some hand-chosen values, we can let hypothesis have a go.
Based on a read of the Quick start guide I tried this:
import hypothesis as ht
import hypothesis.strategies as st

@ht.given(xs=st.lists(), n=st.integers())
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
As you can see, the test function pre-conditions are encapsulated in a hypothesis.given decorator, which specifies the use of hypothesis.strategies.lists() and hypothesis.strategies.integers() to generate test values for xs and n respectively.
The result was a lengthy but helpful failure, which printed out the documentation of the lists() strategy and the usage error:
hypothesis.errors.InvalidArgument: Cannot create non-empty lists without an element type
OK then. The function doesn’t really care about the element type. Integers will do.
@ht.given(xs=st.lists(st.integers()), n=st.integers())
def test_evenly_chunked(xs, n):
    ....
This gave me an error, along with a minimal test case which produces it.
xs = [], n = 0

    def chunk(xs, n):
        '''Split the list, xs, into n evenly sized chunks'''
        L = len(xs)
>       assert 0 < n <= L
E       AssertionError
Our function, chunk(), requires the value n to be in the range 0 < n <= len(xs). Looking more closely at the failure, we can see that the function under test isn’t great: we won’t be able to split an empty list into any number of chunks, since in that case L is zero and no value of n satisfies 0 < n <= L.
At this point I had to make some choices:
- should my tests confirm chunk() was checking pre-conditions (by catching the AssertionError)? (A sketch of such a test follows this list.)
- should my function handle the case when n > L? It’s not the intended use of the function, but it can be handled.
- what about when n == 0? Splitting a non-empty list into 0 chunks is impossible, but I guess an empty list can be split into 0 chunks.
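For what it’s worth, here’s a rough sketch of what the first option might have looked like, had I taken it. It isn’t part of my final suite, test_chunk_checks_preconditions is just an illustrative name, and it targets the original implementation, which asserts 0 < n <= len(xs).

import pytest

def test_chunk_checks_preconditions():
    # The original chunk() asserts 0 < n <= len(xs),
    # so out-of-range values of n should raise AssertionError.
    with pytest.raises(AssertionError):
        chunk([1, 2, 3], 0)    # n == 0 rejected
    with pytest.raises(AssertionError):
        chunk([1, 2, 3], 4)    # n > len(xs) rejected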
My second test
I made some decisions.
- I decided not to test the pre-condition assertions. Instead, I’d modify the test strategy to pass in valid inputs.
- I decided I’d go with the itertools chunk function which naturally handles n > L.
- I decided my function needn’t handle n == 0, even when xs == [].
Here’s the modified test code:
@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1))
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

....
When I tried running the tests again, they appeared to hang until I interrupted them.
> py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
....
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py C-c C-c
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace
Now, I had a suspicion that chunk() couldn’t really handle any input integers — it was designed for a value of n equal to multiprocessing.cpu_count() — but I wanted to see what would happen with no upper limits. Here was my answer. Running and interrupting again with --fulltrace set, I got several pages of output ending with the test inputs:
xs = [50], n = 67108865
Evidently my code was taking a while to create a list comprising a single list [50] and over 67 million empty lists [], [], [], ....
Once again, I had a decision to make. Perhaps unsurprisingly, it’s a decision I’d already faced. I could make chunk() a generator function, yielding the chunks one at a time — a trivial and natural change to the itertools based implementation — or I could constrain the tests to more suitable values of n.
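For the record, here’s a sketch of what the generator route might look like (it isn’t the version I went with, and chunk_lazily is just an illustrative name): the itertools implementation needs only its final list comprehension swapping for a generator expression.

from itertools import accumulate, chain, repeat, tee

def chunk_lazily(xs, n):
    '''Sketch: yield n evenly sized chunks of xs, one at a time.'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)
    # A generator expression rather than a list comprehension: callers
    # consume chunks lazily, so 67 million empty chunks never pile up in memory.
    return (xs[s] for s in map(slice, b, e))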
In this case I decided to stick with what I had: my function would accept a list and return a list (of lists). In an attempt to get some passing tests, I set a maximum value on n.
@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    ....
At last, I had a passing test.
py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py .
========================== 1 passed in 0.23 seconds ===========================
Building on this success, I wanted to confirm the function also handled other sliceable types — strings and bytes specifically. Hypothesis provides a one_of strategy for combining other strategies.
@ht.given(xs=st.one_of(st.text(),
                       st.binary(),
                       st.lists(st.integers())),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
Again, the test passes.
py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py .
========================== 1 passed in 0.30 seconds ===========================
This output is rather inscrutable. Generally, passing tests shouldn’t draw attention to themselves, but what inputs had my test strategies generated? Were they sufficient?
A commandline switch provides a little more detail.
py.test --hypothesis-show-statistics
....
test_chunk.py::test_evenly_chunked:

  - 200 passing examples, 0 failing examples, 0 invalid examples
  - Typical runtimes: < 1ms
  - Stopped because settings.max_examples=200
It’s also possible to peek at examples produced by the test strategy.
>>> s=st.one_of(st.text(), st.binary(), st.lists(st.integers()))
>>> s.example()
b''
>>> s.example()
b'\xc2\xfd6['
>>> s.example()
':\nú&\U000ea7e8'
>>> s.example()
b'\xe7z'
>>> s.example()
''
>>> s.example()
[184, 36, -205, 1486638017]
Complete test suite
Here’s my final test suite. Rather than hard code a maximum value for n, I used a composite strategy which adapts n to the size of xs. I’ve also added a test which confirms the result does comprise chunks of the input sequence.
import functools

import hypothesis as ht
import hypothesis.strategies as st

@st.composite
def items_and_chunk_count(draw):
    xs = draw(st.one_of(st.text(),
                        st.binary(),
                        st.lists(st.integers())))
    n = draw(st.integers(min_value=1, max_value=max(1, len(xs))))
    return xs, n

@ht.given(xs_n=items_and_chunk_count())
def test_evenly_chunked(xs_n):
    '''Verify there are n evenly sized chunks'''
    xs, n = xs_n
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

@ht.given(xs_n=items_and_chunk_count())
def test_combining_chunks(xs_n):
    '''Verify recombining the chunks reproduces the original sequence.'''
    xs, n = xs_n
    chunks = chunk(xs, n)
    assert functools.reduce(lambda x, y: x+y, chunks) == xs
Quality of failure
In the comments to my original article Mike Edey put forward an elegant solution to the original problem of evenly subdividing a sequence into an exact number of chunks:
def chunk(xs, n):
    return [xs[index::n] for index in range(n)]
This is a delightful piece of code, and an approach I simply hadn’t considered. If the input list xs represents a number of tasks to be distributed amongst n workers, this does the job evenly. In my actual motivating example, though, the input sequence was a document which caused a problem, and what I wanted to do was split that document up into a number of sections and see which of these exhibited the same problem: that is, I needed the chunks to be contiguous blocks of text from the original document. This is the property which test_combining_chunks() checks.
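To make the contrast concrete, here’s what the striding version produces for a small list (my own worked example, written out by hand rather than captured from a session): the chunks are evenly sized, but interleaved rather than contiguous.

>>> chunk(list(range(8)), 3)    # Mike Edey's striding implementation
[[0, 3, 6], [1, 4, 7], [2, 5]]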
Running Mike Edey’s implementation through the test suite, we get:
py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 2 items

test_chunk.py .F

================================== FAILURES ===================================
____________________________ test_combining_chunks ____________________________

    @ht.given(xs_n=items_and_chunk_count())
>   def test_combining_chunks(xs_n):

test_chunk.py:29:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:634: in wrapped_test
    state.run()
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:531: in run
    print_example=True, is_final=True
d:\venvs\slackbot\lib\site-packages\hypothesis\executors.py:58: in default_new_style_executor
    return function(data)
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:113: in run
    return test(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

xs_n = ('001', 2)

    @ht.given(xs_n=items_and_chunk_count())
    def test_combining_chunks(xs_n):
        '''Verify recombining the chunks reproduces the original sequence.'''
        xs, n = xs_n
        chunks = chunk(xs, n)
>       assert functools.reduce(lambda x, y: x+y, chunks) == xs
E       AssertionError: assert '010' == '001'
E         - 010
E         + 001

test_chunk.py:33: AssertionError
--------------------------------- Hypothesis ----------------------------------
Falsifying example: test_combining_chunks(xs_n=('001', 2))
===================== 1 failed, 1 passed in 0.52 seconds ======================
Hypothesis has discovered a minimal failing example: the string '001' splits into 2 chunks as '01', '0'.
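Working that falsifying example through by hand (my own check) shows why it fails: the striding slices interleave the characters, so recombining them reorders the string.

>>> ['001'[index::2] for index in range(2)]   # chunk('001', 2), striding version
['01', '0']
>>> '01' + '0'                                # recombined: not the original '001'
'010'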
Conclusions
Hypothesis worked well for this particular example.
- it forced me to pin down the function specification
- it made me consider the special cases: would the function behave for all logically permissible inputs, and not just the ones I had in mind when I wrote it?
- it increased my confidence that the function was correct
- and, particularly appealing in this case, the tests were not tied to a detail of the implementation, and would continue to work if, for example, the larger chunks were to appear at the end of the results.
More generally, I found the hypothesis library solid. It’s well designed and documented, as well as being a fine example of how to use Python decorators.
I’d say property-based testing complements example-based testing. Example-based unit tests show you how a function is used, for instance; with Hypothesis, this useful demonstration happens behind the scenes (though note that your Hypothesis tests can include explicit @example cases). Example-based unit tests are also typically one or two orders of magnitude quicker to execute. It’s not a problem if a couple of tests take half a second to run, but what if you have a couple of thousand tests?
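As a sketch of that behind-the-scenes point (my own illustration, not part of the article’s test suite), Hypothesis’s example decorator lets you pin an explicit, always-run case alongside the generated ones:

import hypothesis as ht
import hypothesis.strategies as st

@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1, max_value=100))
@ht.example(xs=[0, 1, 2, 3, 4, 5, 6, 7], n=3)   # always tried; doubles as documentation
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n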
In my case the built-in strategies were good enough to generate inputs to my function. I can imagine that’s not the case for functions higher up a software stack. Test setup functions can be hard work and I suspect test setup strategies would be harder.
In closing, I’d like to quote from the section of the Hypothesis documentation which describes its purpose.
Software is, as they say, eating the world. Software is also terrible. It’s buggy, insecure and generally poorly thought out. This combination is clearly a recipe for disaster.
And the state of software testing is even worse. Although it’s fairly uncontroversial at this point that you should be testing your code, can you really say with a straight face that most projects you’ve worked on are adequately tested?
A lot of the problem here is that it’s too hard to write good tests. Your tests encode exactly the same assumptions and fallacies that you had when you wrote the code, so they miss exactly the same bugs that you missed when you wrote the code.