What’s going on in the animation below?
While you’re waiting for the graphic to load, recall a problem we encountered while hunting for maximum subsequences:
Given a number n, generate all the pairs (i, j) such that i and j are in the range 0 to n and j is greater than i.
A Python solution reads:
def generate_pairs(n): "Generate all pairs (i, j) such that 0 <= i <= j < n" for i in range(n): for j in range(i, n): yield i, j
We can extend this to solve a more general version of the problem:
Given two sequences (S, T), generate all pairs (Si, Tj) such that j is greater than i.
def generate_pairs(s, t): "Generate all pairs s[i], t[j] such that 0 <= i <= j" s_count = len(s) t_count = len(t) for i in xrange(s_count): for j in xrange(i, t_count): yield s[i], t[j]
This function requires complete and bounded collections as inputs since it asks for their lengths and accesses elements by index — think “lists” or “tuples”. Yet it generates results lazily, giving clients control over just how many pairs they require. This is why I’ve chosen
xrangein preference to
range, since it too generates numbers, rather than returning a complete list. (Python 3000 simplifies things:
There’s no need for extra logic to handle
Sbeing shorter than
xrange(start, stop)is smart enough to handle the case when
stopis less than
start. I’d say adding such extra logic would count as premature optimisation.
Here’s an example of this function in action. We run the generator to exhaustion collecting its output in a list.
>>> pairs = list(generate_pairs('ABC', (1, 2, 3, 4))) >>> print pairs [('A', 1), ('A', 2), ('A', 3), ('A', 4), ('B', 2), ('B', 3), ('B', 4), ('C', 3), ('C', 4)]
A second example builds a generator expression from
generate_pairs(). We plug this expression into the built in
any function to determine whether the pair
('A', 2) is in the pair sequence. Note that
any only pulls two elements from the sequence to produce an answer.
>>> any((p == 'A', 2) for p in generate_pairs('ABC', (1, 2, 3, 4))) True
So far, so what? For the toy examples shown, lazy evaluation isn’t an important consideration. But what if our input sequences were very large? What if they were infinite?
Generate_pairs() yields results row by row. Here’s a slideshow for the case when
T have 5 and 20 elements respectively. Brightness indicates age, so the white circle represents the current head of the generated sequence, and the greying circles represent previous elements — the older the greyer. Remember, it’s up to clients just how many grey pairs to hold: the generator only ever yields a shiny white new pair.
A text representation would be as follows, where, as usual, we read top to bottom, left to right.
S0T0, S0T1, S0T2 ... S0T20 S1T1, S1T2, ... S1T20 S2T2, ... S2T20 ...
It’s clear this mode of traversal isn’t suitable for generating pairs from very large inputs: we don’t advance through S quickly enough. In fact, if T were infinite, we’d be stuck on the top row, forever generating pairs of the form
“Structure and Interpretation of Computer Programs” describes an elegant recursive scheme for generating pairs:
Call the general stream of pairs (pairs S T), and consider it to be composed of three parts: the pair (S0,T0), the rest of the pairs in the first row, and the remaining pairs:
Based on this insight, we can generate our stream of pairs:
- yielding the pair, (S0,T0)
- yield pairs from the first row combined with the stream of remaining pairs
Note the recursion: the definition of the pairs stream refers to itself. That’s not a problem so long as we’re lazy about evaluating the tail of the stream. Since we always know how to pull the first pair from the stream, we can both bootstrap the process and keep going forever.
All we now need is a suitable method of combining two the first row and the remaining pairs.
The simplest way to combine the two streams is just to interleave them, pulling elements from each in turn. A simple (too simple!) implementation of
def interleave(s, t): while True: yield s.next() yield t.next()
(The problem here is that
interleave() stops as soon as either of the the input streams is exhausted, which isn’t an issue for infinite streams, but isn’t ideal in the case when either of the inputs are bounded. You get the idea. I’ll provide better implementation later.)
We can now code our pairs generator:
import itertools def pairs(s, t): """ Generate a stream of pairs taken from s, t. Yields a stream of pairs (si, tj) where i <= j. The input streams may be finite or bounded, but must be iterable. """ first = s.next(), t.next() yield first t, t_top = itertools.tee(t) s0 = first top_row = ((s0, tt) for tt in t_top) for st in interleave(top_row, pairs(s, t)): yield st
Here’s a slideshow of this algorithm in action.
Even though Python (CPython, to be specific) and recursion don’t generally get along well (unlike in a proper functional programming language), this implementation doesn’t appear to fall foul of CPython’s maximum recursion limits even though pairs() repeatedly calls itself.
>>> integers = itertools.count >>> for ij in pairs(integers, integers): ... print ij ... (0, 0) (0, 1) (1, 1) (0, 2) (1, 2) (0, 3) (2, 2) ....
When I dug a little deeper, I started to understand why: the function can’t run indefinitely since it eats up memory. It continually tees up iterators over
t and creates new generator expressions.
That said, I was surprised just how quickly it consumes memory. After all, generator expressions and iterators shouldn’t take up much space. For the purposes of comparison, I compared how it fared against a Scheme program designed to do the same thing. Both had consumed a gigabyte of memory after running just a few minutes (and, for what it’s worth, the Python program had generated an order of magnitude more pairs by this point).
(require (lib "40.ss" "srfi")) ;; Scheme stream support (define (interleave s1 s2) (if (stream-null? s1) s2 (stream-cons (stream-car s1) (interleave s2 (stream-cdr s1))))) (define (pairs s t) (stream-cons (list (stream-car s) (stream-car t)) (interleave (stream-map (lambda (x) (list (stream-car s) x)) (stream-cdr t)) (pairs (stream-cdr s) (stream-cdr t))))) (define ones (stream-cons 1 ones)) (define (stream-add s t) (stream-map + s t)) (define integers (stream-cons 0 (stream-add ones integers))) (stream-for-each (lambda (x) (newline) (display x)) (pairs integers integers))
Let’s look a little further into the stream. Here’s the genesis of the first 120 pairs.
Now the pattern becomes clear. Evidently the interleave strategy gives a heavy bias towards advancing rightwards; downwards motion is relatively slow.
Let’s filter out the
i == j diagonal to try and quantify this.
>>> ints = itertools.count >>> izip = itertools.izip >>> pp = pairs(ints(), ints()) >>> diagonal = ((ix, (i, j)) ... for ix, (i, j) in izip(ints(1), pp) if i == j)) >>> for d in itertools.islice(diagonal, 10): ... print d ... (1, (0, 0)) (3, (1, 1)) (7, (2, 2)) (15, (3, 3)) (31, (4, 4)) (63, (5, 5)) (127, (6, 6)) (255, (7, 7)) (511, (8, 8)) (1023, (9, 9))
The output shows that pair (0, 0) is the 1st element of the stream, (1, 1) the 3rd, (2, 2) the 7th, and so on.
Any computer scientist ought to quickly spot the pattern, and any mathematician ought to be able to prove why diagonal advances are exponentially infrequent.
Loosely speaking, if Di is the stream of positions in the pairs stream of elements which lie on the diagonal (i.e. the stream 1, 3, 7, 15, … shown in the interpreted Python session), then the elements of D must satisfy the recurrence relation:
Di+1 = 1 + 2×Di
This relation derives directly from the recursive definition of
pairs(): yield a single element, then alternate the rest of yourself with the generated top row.
Here are some slides showing the points at which the first 5 diagonal elements appear. Watch the recursive pattern emerge.
The solution to the recurrence relation is
Di = 2i − 1. These diagonal advances correspond to recursive calls to
pairs(). So if our Python session has a recursion limit set to 1000 (try
sys.getrecursionlimit()), and our pairs generator were to yield a result every nanosecond, and we had a machine with infinite memory, it would take quite a few millenia before we run into this limit.
>>> (2 ** 1000 - 1)/(1.e9 * 60 * 60 * 24 * 365 * 1000) 3.3977315042689856e+281
It should also now be clear why this particular algorithm burns memory. Teeing an iterator doesn’t magically mean we don’t have to store any elements we’ve iterated past. The greater the distance between the teed iterators, the more that need storing.
If that’s not clear, run the following Python session and keep an eye on memory consumption (and be ready to interrupt the infinite loop!). You’ll see memory use rapidly increase as
j diverge, since
j needs to remember everything
>>> from itertools import * >>> i, j = tee(repeat(0)) >>> for x in i: pass
By contrast, if we zip through
j in parallel, memory use is stable.
>>> i, j = tee(repeat(0)) >>> for ij in izip(i, j): pass
Our pair generator has to remember more and more elements from the
T stream (and also, to a lesser extent, from the
S stream). That’s why it’s so hungry. Any general purpose infinite pair generator function must suffer this problem, I think, since we have to accumulate increasing portions of the streams into memory, but if it’s just integer pairs we want we can work around the issue:
def integer_pairs(i=0): "Generate a stream of integer pairs (i, j) with i <= j" yield i, i top_row = ((i, j) for j in itertools.count(i + 1)) for ij in interleave(top_row, integer_pairs(i + 1)): yield ij
Here, rather than
tee() streams of integers, we use
itertools.count() to create new ones as required. This pair generator will run happily for extended periods. It will very gradually ask for more memory, as numbers get bigger and the number of rows increases. As before, it won’t hit any embarrassing recursion limits in the lifetime of this galaxy.
integer_pairs() solve the problem of generating an infinite stream of pairs for the special case when both input streams are integers?
Not really, no!
The interleave strategy for combining the top row with the remaining pairs has the great merit of simplicity, but is otherwise far from ideal. I don’t think a one-size-fits-all strategy for streaming pairs exists, so we should allow clients to supply one. And that will have to be the subject of another article.
It’s not hard to fix the problem we noticed with
def interleave(s, t): """Generate an interleaved stream of elements from s and t. If one of the streams finishes, elements continue to be generated from the other until it too is finished. """ for ss, tt in itertools.izip(s, t): yield ss yield tt for s_or_t in itertools.chain(s, t): yield s_or_t
Here’s the script I used to generate the animated GIFs. Although it’s very much geared to doing the single job which it was designed to do, it’s short enough to be used as a basis for similar tasks. I hope this article helps demonstrate the merits of visualising an algorithm in action.
#! /usr/bin/env python """Script which accepts a list of (i, j) pairs, creating slides and an animated GIF showing the positions of these pairs. Example use: $ echo '((0, 0) (0, 1) (1, 1) (0, 2) (1, 2) (0, 3) (2, 2)' | pairs-animation.py Dependencies: - Python Imaging Library (PIL), http://effbot.org/imagingbook/ - aggdraw, anti-aliased graphics drawing package, http://effbot.org/zone/pythondoc-aggdraw.htm - ImageMagick, http://www.imagemagick.org """ import Image import aggdraw import itertools # Script wide drawing configuration block_w = 16 block_pad = 2 def fading_brushes(): "Generate a sequence of fading grey brushes." darkgrey = itertools.repeat(16) greyscale = itertools.chain(range(255, 0, -16), darkgrey) def brush(g): "Return a grey brush." return aggdraw.Brush('rgb(%d,%d,%d)' % (g, g, g)) return itertools.imap(brush, greyscale) def block_start(i): "Return the coordinate (X or Y) at which the ith block starts." return (block_w + block_pad) * i + block_pad def read_ij(data): """Return pairs of integers found in the input string. Note: this function is extremely unfussy about the string format! >>> read_ij('1 2 3 4') [(1, 2), (3, 4)] >>> read_ij('(1, 2), (3, 4)') [(1, 2), (3, 4)] """ import re ij = itertools.imap(int, re.compile(r'-?\d+').findall(data)) return zip(ij, ij) def dimensions(ij): "Return dimensions of an image big enough to contain blocks at ij positions." import operator max_i = max(map(operator.itemgetter(0), ij)) max_j = max(map(operator.itemgetter(1), ij)) return block_start(max_j + 1), block_start(max_i + 1) def block_xy(ij): "Return the left, top, right, bottom coords of a block at ij." ylo, xlo = map(block_start, ij) return xlo, ylo, xlo + block_w, ylo + block_w def draw_blocks(image, blocks): "Draws the supplied blocks onto the input image." draw = aggdraw.Draw(image) brushes = fading_brushes() # Reverse the blocks so the newest are brightest for block, brush in itertools.izip(reversed(blocks), brushes): # The 'ellipse' will be circular since the block is square draw.ellipse(block_xy(block), brush) draw.flush() def main(ij_data): "Creates slides and an animation from the ij block positions." import os ij = read_ij(ij_data) dims = dimensions(ij) # Create the slides slides = ['pairs-%d.gif' % n for n in range(len(ij) + 1)] for n, slide in enumerate(slides): image = Image.new('L', dims) draw_blocks(image, ij[:n]) image.save(slide) # Use ImageMagick to create an animation # -loop 0 => loop forever, -delay units are .01 seconds os.system('convert -delay 100 -loop 0 %s animation.gif' % ' '.join(slides)) if __name__ == '__main__': import sys main(sys.stdin.read())
The inspiration for this article comes directly from “Structure and Interpretation of Computer Programs”, which presents some great examples of how to develop and use these pair streams.