Running Sums in Python

2008-06-24

Suppose we want to generate the running sum series r formed by sums of n consecutive elements taken from a series s. For example, to sum consecutive pairs taken from the first 6 integers:

>>> n = 2
>>> s = 0, 1, 2, 3, 4, 5
>>> running_sum(s, n)
[1, 3, 5, 7, 9]

One approach would be to combine the sum built-in function with list slices and comprehensions.

>>> def running_sum(s, n):
...     return [sum(s[lo:lo + n]) for lo in range(len(s) - n + 1)]
... 
>>> running_sum([0, 1, 2, 3, 4, 5], 2)
[1, 3, 5, 7, 9]

This is fine if:

  1. s is finite
  2. s supports slice access (i.e. s[lo:hi] doesn’t raise a TypeError)
  3. n isn’t too big
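
Point 2 bites as soon as we feed in something like a generator, which supports neither len() nor slicing. A quick sketch (Python 3 syntax, with the slice-based definition repeated for completeness):

```python
def running_sum(s, n):
    # the slice-based version from above
    return [sum(s[lo:lo + n]) for lo in range(len(s) - n + 1)]

gen = (x for x in range(6))   # generators have no len() and can't be sliced
try:
    running_sum(gen, 2)
except TypeError:
    pass                      # len(s) fails with a TypeError
```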

With just a little extra thought we can address all these issues.

To deal with the first two points we return to the specification. What exactly do we require of s in order to generate r? Well, all that’s really needed is for s to be iterable — which is to say we can advance along it — then our running sum function can arrange to buffer n items from s and yield their sums. For maximum flexibility the result series r should also be iterable, allowing clients to choose how to consume it[1].

In Python an object, o, becomes iterable by implementing the iterator protocol: o.__iter__() should return an iterator, i, over the container, which i.next() advances, raising a StopIteration exception when done.

>>> r = [1, 2, 3]
>>> i = r.__iter__()
>>> i.next()
1
>>> i.next()
2
>>> i.next()
3
>>> i.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
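
For the record, under Python 3 the protocol method is spelled __next__ and the built-in next() function is the idiomatic way to advance an iterator, so the session above would look like this:

```python
r = [1, 2, 3]
i = iter(r)              # equivalent to r.__iter__()
assert next(i) == 1      # next() calls i.__next__() under the hood
assert next(i) == 2
assert next(i) == 3
try:
    next(i)
except StopIteration:
    pass                 # exhausted, just as before
```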

Python programs typically don’t expose this protocol directly since we can build more convenient looping constructs on top of it. By using the yield statement our running sum filter needn’t implement the iterator protocol directly either. Here’s a generator function which uses itertools.islice in place of the original list slices.

Running sum for infinite series
import itertools

def running_sum(s, n):
    while True:
        r, s = itertools.tee(s)              # fork the stream in two
        yield sum(itertools.islice(r, n))    # sum the next n items of one copy
        s.next()                             # advance the other copy by one

As you can see, objects returned by this function implement the iterator protocol.

# Running sum of pairs from 0, 1, 2, 3, ...
>>> rs = running_sum(itertools.count(), 2)
>>> i = rs.__iter__()
>>> i.next()
1
>>> i.next()
3
>>> for total in rs: print total
... 
5
7
9
11
13
....
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyboardInterrupt

We have to kill the for loop by interrupting it since rs, being infinite in this case, never raises StopIteration. In fact, this particular version of running_sum() fails badly on finite inputs, for reasons we’ll touch on later.

I won’t dwell on this flawed variant of running_sum(), except to note in passing that the built-in sum function doesn’t buffer n items from its input stream: it consumes them one at a time, accumulating as it goes. There’s no magic here, though. Behind the scenes, as the teed iterators r and s diverge, the in-between values must be stored somewhere!
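
To make “stored somewhere” concrete, here’s a rough sketch of a two-way tee, modelled on the pure-Python equivalent in the itertools documentation rather than CPython’s actual implementation. Each copy has its own queue: values the leading copy has consumed but the trailing copy hasn’t sit waiting in the trailing copy’s queue.

```python
import collections

def tee2(iterable):
    # Sketch of a two-way tee: each copy has a queue of pending values.
    it = iter(iterable)
    queues = (collections.deque(), collections.deque())

    def gen(q):
        while True:
            if not q:                   # nothing buffered for this copy:
                try:
                    val = next(it)      # pull one value from the source...
                except StopIteration:
                    return
                for d in queues:        # ...and append a copy to both queues
                    d.append(val)
            yield q.popleft()

    return tuple(gen(q) for q in queues)

a, b = tee2(range(5))
[next(a) for _ in range(3)]   # a runs ahead; b's queue now buffers 0, 1, 2
```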

Each slice of items from s overlaps the one before: if we visualise the sliced range sliding along the series, at each stage an element gets pushed in at the top and an element gets popped out from the bottom. Rather than repeatedly summing all the elements of these slices, we can calculate a single sum at the start of the series then adjust it as we progress[2].

def running_sum(s, n):
    lo, hi = itertools.tee(s)             # trailing and leading iterators
    rs = sum(itertools.islice(hi, n))     # sum of the first window
    while True:
        yield rs
        rs += hi.next() - lo.next()       # push one in at the top, pop one out at the bottom

Before we sign this function off, there’s a bug to fix. What if n is larger than the length of the input series? We’d expect the output series to be empty, but:

>>> list(running_sum([1, 2, 3], 4))
[6]

Oops! The problem is that itertools.islice stops silently when its underlying stream runs out, so sum() has no way of knowing whether it received n items, fewer because the stream ended early, or indeed both.

>>> i = itertools.islice([1, 2, 3], 4)
>>> i.next(), i.next(), i.next()
(1, 2, 3)
>>> i.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

A fix is to pass sum a list comprehension. If n is too big, hi.next() raises a StopIteration before sum ever sees the list; since we’re inside a generator, that exception simply ends the output series, which is exactly the empty result we want.

def running_sum(s, n):
    lo, hi = itertools.tee(s)
    rs = sum([hi.next() for _ in range(n)])
    while True:
        yield rs
        rs += hi.next() - lo.next()

As a final tweak, we can bind lo and hi to the iterators’ next methods rather than to the iterators themselves, saving an attribute lookup on each call.

def running_sum(s, n):
    '''Generate the series of running sums of n elements of s.
    
    >>> list(running_sum([1, 2, 3, 4], 2))
    [3, 5, 7]
    >>> rs = running_sum(itertools.count(), 3)
    >>> rs.next(), rs.next(), rs.next()
    (3, 6, 9)
    >>> list(running_sum([1, 2], 3))
    []
    '''
    lo, hi = [i.next for i in itertools.tee(s)]
    rs = sum([hi() for _ in range(n)])
    while True:
        yield rs
        rs += hi() - lo()

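For completeness, here’s a sketch of how the same function might look under Python 3, where iterators are advanced with the built-in next() and PEP 479 turns a StopIteration escaping a generator body into a RuntimeError, so exhaustion has to be handled explicitly:

```python
import itertools

def running_sum3(s, n):
    # Python 3 sketch of running_sum: explicit about exhaustion, since a
    # stray StopIteration inside a generator is an error under PEP 479.
    lo, hi = itertools.tee(iter(s))
    window = list(itertools.islice(hi, n))
    if len(window) < n:        # fewer than n items: the result is empty
        return
    rs = sum(window)
    while True:
        yield rs
        try:
            nxt = next(hi)     # push one element in at the top...
        except StopIteration:
            return             # ...or stop when the source runs dry
        rs += nxt - next(lo)   # ...and pop one out from the bottom
```

Checking len(window) replaces the list-comprehension trick: islice still stops silently when the input is too short, but now we notice and return an empty series.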

Thanks to doubtingthomas for pointing out an off-by-one error in the original version of this article.

[1] In a recent post on his blog, Jake clears up some misconceptions about Haskell. In doing so he analyses the difference between lazy and strict types with clarity and insight. Recommended reading!

    This parallel indicates pretty clearly that recursively operating on each element of a stream is an infinite loop. Stream is a control structure. It doesn’t exist to persist data across many parts of a program. It exists to feed data into a function one element at a time. In contrast, a Vector exists to persist an intermediate or final result of some 3-dimensional computation. In short, Stream is for controlling computation, and Vector is for storing data. This generalizes to lazy and strict types, respectively.

[2] There may be situations where we really want each running sum to be generated directly from n consecutive elements of the source stream: for example, if we are dealing with a series of floating point numbers, then addition is not exact and we must take care to avoid accumulated errors.
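
To see why, here’s a small illustration (Python 3 syntax; the input values are contrived to force rounding). The incremental update loses the two 1.0s when they are absorbed into 1e16, so its second window disagrees with a direct per-window sum:

```python
import itertools

def running_sum_incremental(s, n):
    # the incremental update from the article, in Python 3 syntax
    lo, hi = itertools.tee(iter(s))
    rs = sum(itertools.islice(hi, n))
    while True:
        yield rs
        try:
            nxt = next(hi)
        except StopIteration:
            return
        rs += nxt - next(lo)

def running_sum_direct(s, n):
    # re-sum each window of n elements from scratch
    s = list(s)
    return [sum(s[i:i + n]) for i in range(len(s) - n + 1)]

s = [1e16, 1.0, 1.0, -1e16]
incremental = list(running_sum_incremental(s, 3))
direct = running_sum_direct(s, 3)
# the window [1.0, 1.0, -1e16] sums exactly when computed directly, but the
# incremental path lost both 1.0s rounding into 1e16: the results differ by 2
assert incremental != direct
```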