Animated pair streams
Contents
Name that Series
What’s going on in the animation below?
Bounded Pairs
While you’re waiting for the graphic to load, recall a problem we encountered while hunting for maximum subsequences:
Given a number n, generate all the pairs (i, j) such that i and j are in the range 0 to n and j is greater than i.
A Python solution reads:
def generate_pairs(n): "Generate all pairs (i, j) such that 0 <= i <= j < n" for i in range(n): for j in range(i, n): yield i, j
We can extend this to solve a more general version of the problem:
Given two sequences (S, T), generate all pairs (S_{i}, T_{j}) such that j is greater than i.
def generate_pairs(s, t): "Generate all pairs s[i], t[j] such that 0 <= i <= j" s_count = len(s) t_count = len(t) for i in xrange(s_count): for j in xrange(i, t_count): yield s[i], t[j]
Note:
This function requires complete and bounded collections as inputs since it asks for their lengths and accesses elements by index — think “lists” or “tuples”. Yet it generates results lazily, giving clients control over just how many pairs they require. This is why I’ve chosen
xrange
in preference torange
, since it too generates numbers, rather than returning a complete list. (Python 3000 simplifies things:range
becomesxrange
andxrange
disappears.)There’s no need for extra logic to handle
S
being shorter thanT
or vice-versa:xrange(start, stop)
is smart enough to handle the case whenstop
is less thanstart
. I’d say adding such extra logic would count as premature optimisation.
Here’s an example of this function in action. We run the generator to exhaustion collecting its output in a list.
>>> pairs = list(generate_pairs('ABC', (1, 2, 3, 4))) >>> print pairs [('A', 1), ('A', 2), ('A', 3), ('A', 4), ('B', 2), ('B', 3), ('B', 4), ('C', 3), ('C', 4)]
A second example builds a generator expression from generate_pairs()
. We plug this expression into the built in any
function to determine whether the pair ('A', 2)
is in the pair sequence. Note that any
only pulls two elements from the sequence to produce an answer.
>>> any((p == 'A', 2) for p in generate_pairs('ABC', (1, 2, 3, 4))) True
So far, so what? For the toy examples shown, lazy evaluation isn’t an important consideration. But what if our input sequences were very large? What if they were infinite?
Streams of Pairs
Generate_pairs()
yields results row by row. Here’s a slideshow for the case when S
and T
have 5 and 20 elements respectively. Brightness indicates age, so the white circle represents the current head of the generated sequence, and the greying circles represent previous elements — the older the greyer. Remember, it’s up to clients just how many grey pairs to hold: the generator only ever yields a shiny white new pair.
A text representation would be as follows, where, as usual, we read top to bottom, left to right.
S_{0}T_{0}, S_{0}T_{1}, S_{0}T_{2} ... S_{0}T_{20} S_{1}T_{1}, S_{1}T_{2}, ... S_{1}T_{20} S_{2}T_{2}, ... S_{2}T_{20} ...
It’s clear this mode of traversal isn’t suitable for generating pairs from very large inputs: we don’t advance through S quickly enough. In fact, if T were infinite, we’d be stuck on the top row, forever generating pairs of the form S_{0}T_{i}
.
A Recursive Scheme
“Structure and Interpretation of Computer Programs” describes an elegant recursive scheme for generating pairs:
Call the general stream of pairs (pairs S T), and consider it to be composed of three parts: the pair (S_{0},T_{0}), the rest of the pairs in the first row, and the remaining pairs:
Based on this insight, we can generate our stream of pairs:
- yielding the pair, (S_{0},T_{0})
- yield pairs from the first row combined with the stream of remaining pairs
Note the recursion: the definition of the pairs stream refers to itself. That’s not a problem so long as we’re lazy about evaluating the tail of the stream. Since we always know how to pull the first pair from the stream, we can both bootstrap the process and keep going forever.
All we now need is a suitable method of combining two the first row and the remaining pairs.
Interleaving
The simplest way to combine the two streams is just to interleave them, pulling elements from each in turn. A simple (too simple!) implementation of interleave()
reads:
def interleave(s, t): while True: yield s.next() yield t.next()
(The problem here is that interleave()
stops as soon as either of the the input streams is exhausted, which isn’t an issue for infinite streams, but isn’t ideal in the case when either of the inputs are bounded. You get the idea. I’ll provide better implementation later.)
We can now code our pairs generator:
import itertools def pairs(s, t): """ Generate a stream of pairs taken from s, t. Yields a stream of pairs (si, tj) where i <= j. The input streams may be finite or bounded, but must be iterable. """ first = s.next(), t.next() yield first t, t_top = itertools.tee(t) s0 = first[0] top_row = ((s0, tt) for tt in t_top) for st in interleave(top_row, pairs(s, t)): yield st
Here’s a slideshow of this algorithm in action.
Even though Python (CPython, to be specific) and recursion don’t generally get along well (unlike in a proper functional programming language), this implementation doesn’t appear to fall foul of CPython’s maximum recursion limits even though pairs() repeatedly calls itself.
>>> integers = itertools.count >>> for ij in pairs(integers, integers): ... print ij ... (0, 0) (0, 1) (1, 1) (0, 2) (1, 2) (0, 3) (2, 2) ....
When I dug a little deeper, I started to understand why: the function can’t run indefinitely since it eats up memory. It continually tees up iterators over t
and creates new generator expressions.
That said, I was surprised just how quickly it consumes memory. After all, generator expressions and iterators shouldn’t take up much space. For the purposes of comparison, I compared how it fared against a Scheme program designed to do the same thing. Both had consumed a gigabyte of memory after running just a few minutes (and, for what it’s worth, the Python program had generated an order of magnitude more pairs by this point).
(require (lib "40.ss" "srfi")) ;; Scheme stream support (define (interleave s1 s2) (if (stream-null? s1) s2 (stream-cons (stream-car s1) (interleave s2 (stream-cdr s1))))) (define (pairs s t) (stream-cons (list (stream-car s) (stream-car t)) (interleave (stream-map (lambda (x) (list (stream-car s) x)) (stream-cdr t)) (pairs (stream-cdr s) (stream-cdr t))))) (define ones (stream-cons 1 ones)) (define (stream-add s t) (stream-map + s t)) (define integers (stream-cons 0 (stream-add ones integers))) (stream-for-each (lambda (x) (newline) (display x)) (pairs integers integers))
Double Trouble
Let’s look a little further into the stream. Here’s the genesis of the first 120 pairs.
Now the pattern becomes clear. Evidently the interleave strategy gives a heavy bias towards advancing rightwards; downwards motion is relatively slow.
Let’s filter out the i == j
diagonal to try and quantify this.
>>> ints = itertools.count >>> izip = itertools.izip >>> pp = pairs(ints(), ints()) >>> diagonal = ((ix, (i, j)) ... for ix, (i, j) in izip(ints(1), pp) if i == j)) >>> for d in itertools.islice(diagonal, 10): ... print d ... (1, (0, 0)) (3, (1, 1)) (7, (2, 2)) (15, (3, 3)) (31, (4, 4)) (63, (5, 5)) (127, (6, 6)) (255, (7, 7)) (511, (8, 8)) (1023, (9, 9))
The output shows that pair (0, 0) is the 1^{st} element of the stream, (1, 1) the 3^{rd}, (2, 2) the 7^{th}, and so on.
Any computer scientist ought to quickly spot the pattern, and any mathematician ought to be able to prove why diagonal advances are exponentially infrequent.
Loosely speaking, if D_{i} is the stream of positions in the pairs stream of elements which lie on the diagonal (i.e. the stream 1, 3, 7, 15, … shown in the interpreted Python session), then the elements of D must satisfy the recurrence relation:
D_{i+1} = 1 + 2×D_{i}
This relation derives directly from the recursive definition of pairs()
: yield a single element, then alternate the rest of yourself with the generated top row.
Here are some slides showing the points at which the first 5 diagonal elements appear. Watch the recursive pattern emerge.
The solution to the recurrence relation is D_{i} = 2^{i} − 1
. These diagonal advances correspond to recursive calls to pairs()
. So if our Python session has a recursion limit set to 1000 (try sys.getrecursionlimit()
), and our pairs generator were to yield a result every nanosecond, and we had a machine with infinite memory, it would take quite a few millenia before we run into this limit.
>>> (2 ** 1000 - 1)/(1.e9 * 60 * 60 * 24 * 365 * 1000) 3.3977315042689856e+281
Memory Use
It should also now be clear why this particular algorithm burns memory. Teeing an iterator doesn’t magically mean we don’t have to store any elements we’ve iterated past. The greater the distance between the teed iterators, the more that need storing.
If that’s not clear, run the following Python session and keep an eye on memory consumption (and be ready to interrupt the infinite loop!). You’ll see memory use rapidly increase as i
and j
diverge, since j
needs to remember everything i
produces.
>>> from itertools import * >>> i, j = tee(repeat(0)) >>> for x in i: pass
By contrast, if we zip through i
and j
in parallel, memory use is stable.
>>> i, j = tee(repeat(0)) >>> for ij in izip(i, j): pass
Our pair generator has to remember more and more elements from the T
stream (and also, to a lesser extent, from the S
stream). That’s why it’s so hungry. Any general purpose infinite pair generator function must suffer this problem, I think, since we have to accumulate increasing portions of the streams into memory, but if it’s just integer pairs we want we can work around the issue:
def integer_pairs(i=0): "Generate a stream of integer pairs (i, j) with i <= j" yield i, i top_row = ((i, j) for j in itertools.count(i + 1)) for ij in interleave(top_row, integer_pairs(i + 1)): yield ij
Here, rather than tee()
streams of integers, we use itertools.count()
to create new ones as required. This pair generator will run happily for extended periods. It will very gradually ask for more memory, as numbers get bigger and the number of rows increases. As before, it won’t hit any embarrassing recursion limits in the lifetime of this galaxy.
Problem Solved?
So, does integer_pairs()
solve the problem of generating an infinite stream of pairs for the special case when both input streams are integers?
Not really, no!
The interleave strategy for combining the top row with the remaining pairs has the great merit of simplicity, but is otherwise far from ideal. I don’t think a one-size-fits-all strategy for streaming pairs exists, so we should allow clients to supply one. And that will have to be the subject of another article.
Tying up
It’s not hard to fix the problem we noticed with interleave()
.
def interleave(s, t): """Generate an interleaved stream of elements from s and t. If one of the streams finishes, elements continue to be generated from the other until it too is finished. """ for ss, tt in itertools.izip(s, t): yield ss yield tt for s_or_t in itertools.chain(s, t): yield s_or_t
Here’s the script I used to generate the animated GIFs. Although it’s very much geared to doing the single job which it was designed to do, it’s short enough to be used as a basis for similar tasks. I hope this article helps demonstrate the merits of visualising an algorithm in action.
#! /usr/bin/env python """Script which accepts a list of (i, j) pairs, creating slides and an animated GIF showing the positions of these pairs. Example use: $ echo '((0, 0) (0, 1) (1, 1) (0, 2) (1, 2) (0, 3) (2, 2)' | pairs-animation.py Dependencies: - Python Imaging Library (PIL), http://effbot.org/imagingbook/ - aggdraw, anti-aliased graphics drawing package, http://effbot.org/zone/pythondoc-aggdraw.htm - ImageMagick, http://www.imagemagick.org """ import Image import aggdraw import itertools # Script wide drawing configuration block_w = 16 block_pad = 2 def fading_brushes(): "Generate a sequence of fading grey brushes." darkgrey = itertools.repeat(16) greyscale = itertools.chain(range(255, 0, -16), darkgrey) def brush(g): "Return a grey brush." return aggdraw.Brush('rgb(%d,%d,%d)' % (g, g, g)) return itertools.imap(brush, greyscale) def block_start(i): "Return the coordinate (X or Y) at which the ith block starts." return (block_w + block_pad) * i + block_pad def read_ij(data): """Return pairs of integers found in the input string. Note: this function is extremely unfussy about the string format! >>> read_ij('1 2 3 4') [(1, 2), (3, 4)] >>> read_ij('(1, 2), (3, 4)') [(1, 2), (3, 4)] """ import re ij = itertools.imap(int, re.compile(r'-?\d+').findall(data)) return zip(ij, ij) def dimensions(ij): "Return dimensions of an image big enough to contain blocks at ij positions." import operator max_i = max(map(operator.itemgetter(0), ij)) max_j = max(map(operator.itemgetter(1), ij)) return block_start(max_j + 1), block_start(max_i + 1) def block_xy(ij): "Return the left, top, right, bottom coords of a block at ij." ylo, xlo = map(block_start, ij) return xlo, ylo, xlo + block_w, ylo + block_w def draw_blocks(image, blocks): "Draws the supplied blocks onto the input image." draw = aggdraw.Draw(image) brushes = fading_brushes() # Reverse the blocks so the newest are brightest for block, brush in itertools.izip(reversed(blocks), brushes): # The 'ellipse' will be circular since the block is square draw.ellipse(block_xy(block), brush) draw.flush() def main(ij_data): "Creates slides and an animation from the ij block positions." import os ij = read_ij(ij_data) dims = dimensions(ij) # Create the slides slides = ['pairs-%d.gif' % n for n in range(len(ij) + 1)] for n, slide in enumerate(slides): image = Image.new('L', dims) draw_blocks(image, ij[:n]) image.save(slide) # Use ImageMagick to create an animation # -loop 0 => loop forever, -delay units are .01 seconds os.system('convert -delay 100 -loop 0 %s animation.gif' % ' '.join(slides)) if __name__ == '__main__': import sys main(sys.stdin.read())
Credits
The inspiration for this article comes directly from “Structure and Interpretation of Computer Programs”, which presents some great examples of how to develop and use these pair streams.