Animated pair streams

2008-01-13, , , , , , , Comments

Name that Series

What’s going on in the animation below?

Pair stream slideshow

Bounded Pairs

While you’re waiting for the graphic to load, recall a problem we encountered while hunting for maximum subsequences:

Given a number n, generate all the pairs (i, j) such that i and j are in the range 0 to n and j is greater than i.

A Python solution reads:

def generate_pairs(n):
    "Generate all pairs (i, j) such that 0 <= i <= j < n"
    for i in range(n):
        for j in range(i, n):
            yield i, j

We can extend this to solve a more general version of the problem:

Given two sequences (S, T), generate all pairs (Si, Tj) such that j is greater than i.

def generate_pairs(s, t):
    "Generate all pairs s[i], t[j]  such that 0 <= i <= j"
    s_count = len(s)
    t_count = len(t)
    for i in xrange(s_count):
        for j in xrange(i, t_count):
            yield s[i], t[j]

Note:

  1. This function requires complete and bounded collections as inputs since it asks for their lengths and accesses elements by index — think “lists” or “tuples”. Yet it generates results lazily, giving clients control over just how many pairs they require. This is why I’ve chosen xrange in preference to range, since it too generates numbers, rather than returning a complete list. (Python 3000 simplifies things: range becomes xrange and xrange disappears.)

  2. There’s no need for extra logic to handle S being shorter than T or vice-versa: xrange(start, stop) is smart enough to handle the case when stop is less than start. I’d say adding such extra logic would count as premature optimisation.

Here’s an example of this function in action. We run the generator to exhaustion collecting its output in a list.

>>> pairs = list(generate_pairs('ABC', (1, 2, 3, 4)))
>>> print pairs
[('A', 1), ('A', 2), ('A', 3), ('A', 4), ('B', 2), ('B', 3), ('B', 4), ('C', 3), ('C', 4)]

A second example builds a generator expression from generate_pairs(). We plug this expression into the built in any function to determine whether the pair ('A', 2) is in the pair sequence. Note that any only pulls two elements from the sequence to produce an answer.

>>> any((p == 'A', 2) for p in generate_pairs('ABC', (1, 2, 3, 4)))
True

So far, so what? For the toy examples shown, lazy evaluation isn’t an important consideration. But what if our input sequences were very large? What if they were infinite?

Streams of Pairs

Generate_pairs() yields results row by row. Here’s a slideshow for the case when S and T have 5 and 20 elements respectively. Brightness indicates age, so the white circle represents the current head of the generated sequence, and the greying circles represent previous elements — the older the greyer. Remember, it’s up to clients just how many grey pairs to hold: the generator only ever yields a shiny white new pair.

Pair list slideshow

A text representation would be as follows, where, as usual, we read top to bottom, left to right.

S0T0, S0T1, S0T2 ...                             S0T20
      S1T1, S1T2, ...                            S1T20
            S2T2, ...                           S2T20
                  ...

It’s clear this mode of traversal isn’t suitable for generating pairs from very large inputs: we don’t advance through S quickly enough. In fact, if T were infinite, we’d be stuck on the top row, forever generating pairs of the form S0Ti.

A Recursive Scheme

“Structure and Interpretation of Computer Programs” describes an elegant recursive scheme for generating pairs:

Call the general stream of pairs (pairs S T), and consider it to be composed of three parts: the pair (S0,T0), the rest of the pairs in the first row, and the remaining pairs:

Stream of pairs structure

Based on this insight, we can generate our stream of pairs:

  1. yielding the pair, (S0,T0)
  2. yield pairs from the first row combined with the stream of remaining pairs

Note the recursion: the definition of the pairs stream refers to itself. That’s not a problem so long as we’re lazy about evaluating the tail of the stream. Since we always know how to pull the first pair from the stream, we can both bootstrap the process and keep going forever.

All we now need is a suitable method of combining two the first row and the remaining pairs.

Interleaving

The simplest way to combine the two streams is just to interleave them, pulling elements from each in turn. A simple (too simple!) implementation of interleave() reads:

Simple interleave
def interleave(s, t):
    while True:
        yield s.next()
        yield t.next()

(The problem here is that interleave() stops as soon as either of the the input streams is exhausted, which isn’t an issue for infinite streams, but isn’t ideal in the case when either of the inputs are bounded. You get the idea. I’ll provide better implementation later.)

We can now code our pairs generator:

Pairs generator
import itertools

def pairs(s, t):
    """ Generate a stream of pairs taken from s, t.

    Yields a stream of pairs (si, tj) where i <= j.
    The input streams may be finite or bounded, but must be iterable.
    """
    first = s.next(), t.next()
    yield first
    t, t_top = itertools.tee(t)
    s0 = first[0]
    top_row = ((s0, tt) for tt in t_top)
    for st in interleave(top_row, pairs(s, t)):
        yield st

Here’s a slideshow of this algorithm in action.

Pair stream slideshow

Even though Python (CPython, to be specific) and recursion don’t generally get along well (unlike in a proper functional programming language), this implementation doesn’t appear to fall foul of CPython’s maximum recursion limits even though pairs() repeatedly calls itself.

No maximum recursion depth exceeded error here!
>>> integers = itertools.count
>>> for ij in pairs(integers, integers):
...     print ij
... 
(0, 0)
(0, 1)
(1, 1)
(0, 2)
(1, 2)
(0, 3)
(2, 2)
....

When I dug a little deeper, I started to understand why: the function can’t run indefinitely since it eats up memory. It continually tees up iterators over t and creates new generator expressions.

That said, I was surprised just how quickly it consumes memory. After all, generator expressions and iterators shouldn’t take up much space. For the purposes of comparison, I compared how it fared against a Scheme program designed to do the same thing. Both had consumed a gigabyte of memory after running just a few minutes (and, for what it’s worth, the Python program had generated an order of magnitude more pairs by this point).

Scheme pair generator
(require (lib "40.ss" "srfi")) ;; Scheme stream support

(define (interleave s1 s2)
  (if (stream-null? s1)
      s2
      (stream-cons (stream-car s1)
                   (interleave s2 (stream-cdr s1)))))

(define (pairs s t)
  (stream-cons
   (list (stream-car s) (stream-car t))
   (interleave
    (stream-map (lambda (x) (list (stream-car s) x))
                (stream-cdr t))
    (pairs (stream-cdr s) (stream-cdr t)))))

(define ones (stream-cons 1 ones))

(define (stream-add s t) (stream-map + s t))

(define integers
  (stream-cons 0 (stream-add ones integers)))

(stream-for-each
  (lambda (x) (newline) (display x))
  (pairs integers integers))

Double Trouble

Let’s look a little further into the stream. Here’s the genesis of the first 120 pairs.

Pair stream slideshow

Now the pattern becomes clear. Evidently the interleave strategy gives a heavy bias towards advancing rightwards; downwards motion is relatively slow.

Let’s filter out the i == j diagonal to try and quantify this.

>>> ints = itertools.count
>>> izip = itertools.izip
>>> pp = pairs(ints(), ints())
>>> diagonal = ((ix, (i, j))
...             for ix, (i, j) in izip(ints(1), pp) if i == j))
>>> for d in itertools.islice(diagonal, 10):
...     print d
...
(1, (0, 0))
(3, (1, 1))
(7, (2, 2))
(15, (3, 3))
(31, (4, 4))
(63, (5, 5))
(127, (6, 6))
(255, (7, 7))
(511, (8, 8))
(1023, (9, 9))

The output shows that pair (0, 0) is the 1st element of the stream, (1, 1) the 3rd, (2, 2) the 7th, and so on.

Any computer scientist ought to quickly spot the pattern, and any mathematician ought to be able to prove why diagonal advances are exponentially infrequent.

Loosely speaking, if Di is the stream of positions in the pairs stream of elements which lie on the diagonal (i.e. the stream 1, 3, 7, 15, … shown in the interpreted Python session), then the elements of D must satisfy the recurrence relation:

Di+1 = 1 + 2×Di

This relation derives directly from the recursive definition of pairs(): yield a single element, then alternate the rest of yourself with the generated top row.

Here are some slides showing the points at which the first 5 diagonal elements appear. Watch the recursive pattern emerge.

diagonal pairs image diagonal pairs image diagonal pairs image diagonal pairs image diagonal pairs image

The solution to the recurrence relation is Di = 2i − 1. These diagonal advances correspond to recursive calls to pairs(). So if our Python session has a recursion limit set to 1000 (try sys.getrecursionlimit()), and our pairs generator were to yield a result every nanosecond, and we had a machine with infinite memory, it would take quite a few millenia before we run into this limit.

How many millenia until we hit the recursion limit?
>>> (2 ** 1000 - 1)/(1.e9 * 60 * 60 * 24 * 365 * 1000) 
3.3977315042689856e+281

Memory Use

It should also now be clear why this particular algorithm burns memory. Teeing an iterator doesn’t magically mean we don’t have to store any elements we’ve iterated past. The greater the distance between the teed iterators, the more that need storing.

If that’s not clear, run the following Python session and keep an eye on memory consumption (and be ready to interrupt the infinite loop!). You’ll see memory use rapidly increase as i and j diverge, since j needs to remember everything i produces.

>>> from itertools import *
>>> i, j = tee(repeat(0))
>>> for x in i: pass

By contrast, if we zip through i and j in parallel, memory use is stable.

>>> i, j = tee(repeat(0))
>>> for ij in izip(i, j): pass

Our pair generator has to remember more and more elements from the T stream (and also, to a lesser extent, from the S stream). That’s why it’s so hungry. Any general purpose infinite pair generator function must suffer this problem, I think, since we have to accumulate increasing portions of the streams into memory, but if it’s just integer pairs we want we can work around the issue:

def integer_pairs(i=0):
    "Generate a stream of integer pairs (i, j) with i <= j"
    yield i, i
    top_row = ((i, j) for j in itertools.count(i + 1))
    for ij in interleave(top_row, integer_pairs(i + 1)):
        yield ij

Here, rather than tee() streams of integers, we use itertools.count() to create new ones as required. This pair generator will run happily for extended periods. It will very gradually ask for more memory, as numbers get bigger and the number of rows increases. As before, it won’t hit any embarrassing recursion limits in the lifetime of this galaxy.

Problem Solved?

So, does integer_pairs() solve the problem of generating an infinite stream of pairs for the special case when both input streams are integers?

Not really, no!

The interleave strategy for combining the top row with the remaining pairs has the great merit of simplicity, but is otherwise far from ideal. I don’t think a one-size-fits-all strategy for streaming pairs exists, so we should allow clients to supply one. And that will have to be the subject of another article.

Tying up

It’s not hard to fix the problem we noticed with interleave().

def interleave(s, t):
    """Generate an interleaved stream of elements from s and t.

    If one of the streams finishes, elements continue to be generated
    from the other until it too is finished. 
    """
    for ss, tt in itertools.izip(s, t):
        yield ss
        yield tt
    for s_or_t in itertools.chain(s, t):
        yield s_or_t

Here’s the script I used to generate the animated GIFs. Although it’s very much geared to doing the single job which it was designed to do, it’s short enough to be used as a basis for similar tasks. I hope this article helps demonstrate the merits of visualising an algorithm in action.

pairs-animation.py
#! /usr/bin/env python
"""Script which accepts a list of (i, j) pairs, creating slides and
   an animated GIF showing the positions of these pairs.

Example use:
$ echo '((0, 0) (0, 1) (1, 1) (0, 2) (1, 2) (0, 3) (2, 2)' | pairs-animation.py

Dependencies:
- Python Imaging Library (PIL), http://effbot.org/imagingbook/
- aggdraw, anti-aliased graphics drawing package,
  http://effbot.org/zone/pythondoc-aggdraw.htm
- ImageMagick, http://www.imagemagick.org
"""
import Image
import aggdraw
import itertools

# Script wide drawing configuration
block_w = 16
block_pad = 2

def fading_brushes():
    "Generate a sequence of fading grey brushes."
    darkgrey = itertools.repeat(16)
    greyscale = itertools.chain(range(255, 0, -16), darkgrey)
    def brush(g):
        "Return a grey brush."
        return aggdraw.Brush('rgb(%d,%d,%d)' % (g, g, g))
    return itertools.imap(brush, greyscale)

def block_start(i):
    "Return the coordinate (X or Y) at which the ith block starts."
    return (block_w + block_pad) * i + block_pad

def read_ij(data):
    """Return pairs of integers found in the input string.

    Note: this function is extremely unfussy about the string format!
    >>> read_ij('1 2 3 4')
    [(1, 2), (3, 4)]
    >>> read_ij('(1, 2), (3, 4)')
    [(1, 2), (3, 4)]
    """
    import re
    ij = itertools.imap(int, re.compile(r'-?\d+').findall(data))
    return zip(ij, ij)

def dimensions(ij):
    "Return dimensions of an image big enough to contain blocks at ij positions."
    import operator
    max_i = max(map(operator.itemgetter(0), ij))
    max_j = max(map(operator.itemgetter(1), ij))
    return block_start(max_j + 1), block_start(max_i + 1)

def block_xy(ij):
    "Return the left, top, right, bottom coords of a block at ij."
    ylo, xlo = map(block_start, ij)
    return xlo, ylo, xlo + block_w, ylo + block_w

def draw_blocks(image, blocks):
    "Draws the supplied blocks onto the input image."
    draw = aggdraw.Draw(image)
    brushes = fading_brushes()
    # Reverse the blocks so the newest are brightest
    for block, brush in itertools.izip(reversed(blocks), brushes):
        # The 'ellipse' will be circular since the block is square
        draw.ellipse(block_xy(block), brush)
    draw.flush()

def main(ij_data):
    "Creates slides and an animation from the ij block positions."
    import os
    ij = read_ij(ij_data)
    dims = dimensions(ij)
    # Create the slides
    slides = ['pairs-%d.gif' % n for n in range(len(ij) + 1)]
    for n, slide in enumerate(slides):
        image = Image.new('L', dims)
        draw_blocks(image, ij[:n])
        image.save(slide)
    # Use ImageMagick to create an animation
    # -loop 0 => loop forever, -delay units are .01 seconds
    os.system('convert -delay 100 -loop 0 %s animation.gif'
              % ' '.join(slides))

if __name__ == '__main__':
    import sys
    main(sys.stdin.read())

Credits

The inspiration for this article comes directly from “Structure and Interpretation of Computer Programs”, which presents some great examples of how to develop and use these pair streams.