Unleash the test army
Are the tests adequate?
Recently I described a solution to the problem of dividing a list into evenly sized chunks. It’s a simple enough problem with just two inputs: the list (or other sliceable container) xs and the number of chunks n. Nonetheless, there are traps to avoid and special cases to consider — what if n is larger than the list, for example? Must the chunks comprise contiguous elements from the original list?
The tests I came up with are straightforward and uninspiring. They were developed within the context of my own assumptions about the solution and the special cases I could imagine. They were written after the implementation — which is to say, development wasn’t driven by tests. They are whitebox tests, designed to cover the various paths through the code based on my insider knowledge.
Are these tests adequate? Certainly they don’t accurately represent the data which will hit the algorithm in practice. Can we be sure we haven’t missed anything? Would the tests still cover all paths if the implementation changed?
Property based testing
David R MacIver described another, complementary, approach at a talk I attended at ACCU 2016. In the talk abstract he characterises the (class of) tests I’d written as anecdotal — “let me tell you about this time I called a function … and then it returned this … and then it raised an exception … etc. etc.”
How about if the test suite instead describes the properties required of the system under test, and then conjures up inputs designed to see if these properties hold under stress? So, rather than our test suite being a limited set of input/output pairs, it becomes an executable specification validated by a robot army.
Hypothesis
This approach sounds compelling but I had my doubts. I also had my doubts about the adequacy of both my code and tests. A perfect opportunity, then, to try out Hypothesis, an open source property-based testing library developed by David MacIver.
I used the Python version of the library, which is the primary implementation. The rest of this article describes my experience of using hypothesis for the first time: I’m not claiming expertise.
First impressions
Excellent!
Installation was the usual pip invocation. The documentation is written with care. It’s evident the library is mature, supported and actively developed. It’s licensed under the Mozilla Public License.
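For anyone following along, that invocation really is a one-liner:

pip install hypothesis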
My first test
Recall that the code I wanted to test reads:
def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    L = len(xs)
    assert 0 < n <= L
    s, r = divmod(L, n)
    t = s + 1
    return ([xs[p:p+t] for p in range(0, r*t, t)] +
            [xs[p:p+s] for p in range(r*t, L, s)])
I also proposed a second itertools based version:
from itertools import accumulate, chain, repeat, tee

def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)
    return [xs[s] for s in map(slice, b, e)]
The first thing you notice when thinking about a property based test is that the specification — the function’s docstring — doesn’t describe the exact form of the output. In fact, as a comment on the article pointed out, my own interpretation of the specification is not the only one, and allowing the chunks to be formed from non-contiguous items permits a particularly elegant solution.
Also, if the list doesn’t divide exactly into n chunks, what should the result be? Well, although I’d have been happy with any evenly-chunked solution, my conventional unit tests assumed an implementation which placed the larger chunks first.
def test_chunk():
    assert chunk('', 1) == ['']
    assert chunk('ab', 2) == ['a', 'b']
    assert chunk('abc', 2) == ['ab', 'c']
    xs = list(range(8))
    assert chunk(xs, 2) == [[0, 1, 2, 3], [4, 5, 6, 7]]
    assert chunk(xs, 3) == [[0, 1, 2], [3, 4, 5], [6, 7]]
    assert chunk(xs, 5) == [[0, 1], [2, 3], [4, 5], [6], [7]]
    rs = range(1000000)
    assert chunk(rs, 2) == [range(500000), range(500000, 1000000)]
Notice, by the way, that although the docstring only mentions lists, I can’t resist demonstrating the algorithm also behaves for strings and ranges — for any sliceable sequence, in fact.
Here’s what I started with when I tried specifying the “evenly sized” property using hypothesis.
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
This first test case defines “evenly sized”, stating that the result comprises n chunks, that the set of chunk lengths contains either 1 element (all chunks the same size) or 2, and that the maximum chunk length is equal to or one more than the minimum chunk length.
This doesn’t fully specify the function. We also need assertions which confirm that recombining the chunks produces the original sequence.
def test_combining_chunks(xs_n):
    pass  # We'll come back to this!
We’ll come back to this later!
Now, test_evenly_chunked() looks quite like a conventional test function. It just needs some input values. Rather than poke the function with some hand-chosen values, we can let hypothesis have a go.
Based on a read of the Quick start guide I tried this:
import hypothesis as ht
import hypothesis.strategies as st

@ht.given(xs=st.lists(), n=st.integers())
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
As you can see, the test function pre-conditions are encapsulated in a hypothesis.given decorator, which specifies the use of hypothesis.strategies.lists() and hypothesis.strategies.integers() to generate test values for xs and n respectively.
The result was a lengthy but helpful failure, which printed out the documentation of the lists() strategy and the usage error:
hypothesis.errors.InvalidArgument: Cannot create non-empty lists without an element type
OK then. The function doesn’t really care about the element type. Integers will do.
@ht.given(xs=st.lists(st.integers()), n=st.integers())
def test_evenly_chunked(xs, n):
    ....
This gave me an error, along with a minimal test case which produces it.
xs = [], n = 0

    def chunk(xs, n):
        '''Split the list, xs, into n evenly sized chunks'''
        L = len(xs)
>       assert 0 < n <= L
E       AssertionError
Our function, chunk(), requires the value n to be in the range 0 < n <= len(xs). Looking more closely at the failure, we can see that the function under test isn’t great: we won’t be able to split an empty list into any number of chunks, since in that case L is zero and no value of n satisfies 0 < n <= L.
At this point I had to make some choices:
- should my tests confirm chunk() was checking pre-conditions (by catching the AssertionError)? (A sketch of such a test follows this list.)
- should my function handle the case when n > L? It’s not the intended use of the function, but it can be handled.
- what about when n == 0? Splitting a non-empty list into 0 chunks is impossible, but I guess an empty list can be split into 0 chunks.
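For what it’s worth, here’s a rough sketch of what the first option might have looked like, had I taken it. It isn’t part of my final suite, test_chunk_checks_preconditions is just an illustrative name, and it targets the original implementation, which asserts 0 < n <= len(xs).

import pytest

def test_chunk_checks_preconditions():
    # The original chunk() asserts 0 < n <= len(xs),
    # so out-of-range values of n should raise AssertionError.
    with pytest.raises(AssertionError):
        chunk([1, 2, 3], 0)    # n == 0 rejected
    with pytest.raises(AssertionError):
        chunk([1, 2, 3], 4)    # n > len(xs) rejected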
My second test
I made some decisions.
- I decided not to test the pre-condition assertions. Instead, I’d modify the test strategy to pass in valid inputs.
- I decided I’d go with the itertools chunk function which naturally handles n > L.
- I decided my function needn’t handle n == 0, even when xs == [].
Here’s the modified test code:
@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1))
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

....
When I tried running the tests again, they appeared to hang until I interrupted them.
> py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
....
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py C-c C-c
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace
Now, I had a suspicion that chunk() couldn’t really handle any input integers — it was designed for a value of n equal to multiprocessing.cpu_count() — but I wanted to see what would happen with no upper limits. Here was my answer. Running and interrupting again with --fulltrace set, I got several pages of output ending with the test inputs:
xs = [50], n = 67108865
Evidently my code was taking a while to create a list comprising a single list [50] and over 67 million empty lists [], [], [], ....
Once again, I had a decision to make. Perhaps unsurprisingly, it’s a decision I’d already faced. I could make chunk() a generator function, yielding the chunks one at a time — a trivial and natural change to the itertools based implementation — or I could constrain the tests to more suitable values of n.
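For the record, here’s a sketch of what the generator route might look like (it isn’t the version I went with, and chunk_lazily is just an illustrative name): the itertools implementation needs only its final list comprehension swapping for a generator expression.

from itertools import accumulate, chain, repeat, tee

def chunk_lazily(xs, n):
    '''Sketch: yield n evenly sized chunks of xs, one at a time.'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)
    # A generator expression rather than a list comprehension: callers
    # consume chunks lazily, so 67 million empty chunks never pile up in memory.
    return (xs[s] for s in map(slice, b, e))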
In this case I decided to stick with what I had: my function would accept a list and return a list (of lists). In an attempt to get some passing tests, I set a maximum value on n.
@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    ....
At last, I had a passing test.
py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py .
========================== 1 passed in 0.23 seconds ===========================
Building on this success, I wanted to confirm the function also handled other sliceable types — strings and bytes specifically. Hypothesis provides a one_of strategy for combining other strategies.
@ht.given(xs=st.one_of(st.text(),
                       st.binary(),
                       st.lists(st.integers())),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
Again, the test passes.
py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py .
========================== 1 passed in 0.30 seconds ===========================
This output is rather inscrutable. Generally, passing tests shouldn’t draw attention to themselves, but what inputs had my test strategies generated? Were they sufficient?
A commandline switch provides a little more detail.
py.test --hypothesis-show-statistics
....
test_chunk.py::test_evenly_chunked:

  - 200 passing examples, 0 failing examples, 0 invalid examples
  - Typical runtimes: < 1ms
  - Stopped because settings.max_examples=200
It’s also possible to peek at examples produced by the test strategy.
>>> s=st.one_of(st.text(), st.binary(), st.lists(st.integers()))
>>> s.example()
b''
>>> s.example()
b'\xc2\xfd6['
>>> s.example()
':\nú&\U000ea7e8'
>>> s.example()
b'\xe7z'
>>> s.example()
''
>>> s.example()
[184, 36, -205, 1486638017]
Complete test suite
Here’s my final test suite. Rather than hard code a maximum value for n, I used a composite strategy which adapts n to the size of xs. I’ve also added a test which confirms the result does comprise chunks of the input sequence.
import functools

import hypothesis as ht
import hypothesis.strategies as st

@st.composite
def items_and_chunk_count(draw):
    xs = draw(st.one_of(st.text(),
                        st.binary(),
                        st.lists(st.integers())))
    n = draw(st.integers(min_value=1, max_value=max(1, len(xs))))
    return xs, n

@ht.given(xs_n=items_and_chunk_count())
def test_evenly_chunked(xs_n):
    '''Verify there are n evenly sized chunks'''
    xs, n = xs_n
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

@ht.given(xs_n=items_and_chunk_count())
def test_combining_chunks(xs_n):
    '''Verify recombining the chunks reproduces the original sequence.'''
    xs, n = xs_n
    chunks = chunk(xs, n)
    assert functools.reduce(lambda x, y: x+y, chunks) == xs
Quality of failure
In the comments to my original article Mike Edey put forward an elegant solution to the original problem of evenly subdividing a sequence into an exact number of chunks:
def chunk(xs, n):
    return [xs[index::n] for index in range(n)]
This is a delightful piece of code, and an approach I simply hadn’t considered. If the input list xs represents a number of tasks to be distributed amongst n workers, this does the job evenly. In my actual motivating example, though, the input sequence was a document which caused a problem, and what I wanted to do was split that document up into a number of sections and see which of these exhibited the same problem: that is, I needed the chunks to be contiguous blocks of text from the original document. This is the property which test_combining_chunks() checks.
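To make the contrast concrete, here’s what the striding version produces for a small list (my own worked example, written out by hand rather than captured from a session): the chunks are evenly sized, but interleaved rather than contiguous.

>>> chunk(list(range(8)), 3)    # Mike Edey's striding implementation
[[0, 3, 6], [1, 4, 7], [2, 5]]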
Running Mike Edey’s implementation through the test suite, we get:
py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 2 items

test_chunk.py .F

================================== FAILURES ===================================
____________________________ test_combining_chunks ____________________________

    @ht.given(xs_n=items_and_chunk_count())
>   def test_combining_chunks(xs_n):

test_chunk.py:29:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:634: in wrapped_test
    state.run()
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:531: in run
    print_example=True, is_final=True
d:\venvs\slackbot\lib\site-packages\hypothesis\executors.py:58: in default_new_style_executor
    return function(data)
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:113: in run
    return test(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

xs_n = ('001', 2)

    @ht.given(xs_n=items_and_chunk_count())
    def test_combining_chunks(xs_n):
        '''Verify recombining the chunks reproduces the original sequence.'''
        xs, n = xs_n
        chunks = chunk(xs, n)
>       assert functools.reduce(lambda x, y: x+y, chunks) == xs
E       AssertionError: assert '010' == '001'
E         - 010
E         + 001

test_chunk.py:33: AssertionError
--------------------------------- Hypothesis ----------------------------------
Falsifying example: test_combining_chunks(xs_n=('001', 2))
===================== 1 failed, 1 passed in 0.52 seconds ======================
Hypothesis has discovered a minimal failing example: the string '001' splits into 2 chunks as '01', '0'.
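Working that falsifying example through by hand (my own check) shows why it fails: the striding slices interleave the characters, so recombining them reorders the string.

>>> ['001'[index::2] for index in range(2)]   # chunk('001', 2), striding version
['01', '0']
>>> '01' + '0'                                # recombined: not the original '001'
'010'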
Conclusions
Hypothesis worked well for this particular example.
- it forced me to pin down the function specification
- it made me consider the special cases: would the function behave for all logically permissible inputs, and not just the ones I had in mind when I wrote it?
- it increased my confidence that the function was correct
- and, particularly appealing in this case, the tests were not tied to a detail of the implementation, and would continue to work if, for example, the larger chunks were to appear at the end of the results.
More generally, I found the hypothesis library solid. It’s well designed and documented, as well as being a fine example of how to use Python decorators.
I’d say property-based testing complements example-based testing. Example-based unit tests show you how a function is used, for instance; with Hypothesis, this useful demonstration happens behind the scenes (though note that your Hypothesis tests can include explicit @example cases). Example-based unit tests are also typically one or two orders of magnitude quicker to execute. It’s not a problem if a couple of tests take half a second to run, but what if you have a couple of thousand tests?
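As a sketch of that behind-the-scenes point (my own illustration, not part of the article’s test suite), Hypothesis’s example decorator lets you pin an explicit, always-run case alongside the generated ones:

import hypothesis as ht
import hypothesis.strategies as st

@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1, max_value=100))
@ht.example(xs=[0, 1, 2, 3, 4, 5, 6, 7], n=3)   # always tried; doubles as documentation
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n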
In my case the built-in strategies were good enough to generate inputs to my function. I can imagine that’s not the case for functions higher up a software stack. Test setup functions can be hard work and I suspect test setup strategies would be harder.
In closing, I’d like to quote from the section of the Hypothesis documentation which describes its purpose.
Software is, as they say, eating the world. Software is also terrible. It’s buggy, insecure and generally poorly thought out. This combination is clearly a recipe for disaster.
And the state of software testing is even worse. Although it’s fairly uncontroversial at this point that you should be testing your code, can you really say with a straight face that most projects you’ve worked on are adequately tested?
A lot of the problem here is that it’s too hard to write good tests. Your tests encode exactly the same assumptions and fallacies that you had when you wrote the code, so they miss exactly the same bugs that you missed when you wrote the code.