Takewhile drops one

2008-04-23, , , Comments

Here’s some naughty code.

from itertools import takewhile
def take_some(pred, xs):
    while True:
        for x in takewhile(pred, xs):
            yield x

This code abuses the “iterator building block” foundations of Python’s itertools module. Once you’ve chopped a stream’s head off using takewhile you can’t resume processing its tail … Or can you?

A casual inspection of this function suggests it does little more than heat up the machine: we return elements, x, from a stream, xs, for which pred(x) holds, then we spin at the first element for which the predicate does not hold.

When we actually run the code, things turn out rather differently:

>>> from itertools import count, islice
>>> def is_even(x):
... 	return x % 2 == 0
>>> xs = take_some(is_even, count())
>>> xs.next()
>>> xs.next()
>>> xs.next()
>>> list(islice(xs, 10))
[6, 8, 10, 12, 14, 16, 18, 20, 22, 24]

Dropwhile, ifilter, izip

Nothing overheats. In fact take_some behaves suspiciously like ifilter. Let’s explore that hypothesis by zipping together an ifilter stream and a take_some stream and seeing if they diverge.

>>> from itertools import dropwhile, ifilter, izip
>>> xs = take_some(is_even, count())
>>> ys = ifilter(is_even, count())
>>> diverge = dropwhile(lambda xy: xy[0] == xy[1], izip(xs, ys))
>>> diverge.next()
  C-c C-cTraceback (most recent call last):

Here itertools.dropwhile iterates through the zipped stream yielding items as soon as it detects a difference in the first and second element of a pair. This time, as you can see, we do start spinning, and we have to interrupt execution to regain control.

Small print

Our casual interpretation of take_some was wrong. The actual documentation for itertools.takewhile reads:

takewhile(predicate, iterable)

Make an iterator that returns elements from the iterable as long as the predicate is true. Equivalent to:

     def takewhile(predicate, iterable):
         for x in iterable:
             if predicate(x):
                 yield x

There you have it! Once a stream returned by takewhile has run its course, the original iterable is poised to yield the element immediately after the first element for which the predicate fails. That is, we drop the first element for which the predicate fails. So repeatedly applying takewhile to a stream drops the elements for which the predicate doesn’t hold, which is to say it generates the elements for which the predicate holds, which is of course ifilter.

Bug fixes

Yes, kind of. I could point out a couple of bugs in take_some. First, it doesn’t work for lists. Give it a list and each application of takewhile resumes iteration from the beginning of the list, meaning take_some either repeats the first element of the list forever, or it spins without yielding anything:

>>> ys = take_some(is_even, [1, 2, 3, 4])
>>> ys.next()
>>> ys = take_some(is_even, [0, 1, 2, 3])
>>> ys.next()
>>> ys.next()
>>> set(islice(ys, 1000000))

We can fix that defect easily by applying iter to the input iterable, but that exposes the second bug, that take_some only works for infinite streams. Once we bang into the end of an iterable, we stay there, stuck in the while loop. To fix both defects we might end up with something like:

from itertools import takewhile, tee
def take_some(pred, xs):
    while True:
        xs, ys = tee(xs)
        except StopIteration:
        for x in takewhile(pred, xs):
            yield x

The real bug fix

Actually, the real bug, which I admitted to at the outset, is in our thinking. This code abuses the iterator-building-blocks paradigm at the heart of the itertools module. Takewhile converts one stream into another stream; the original stream has gone and if we wanted it we should have teed it first.

Picture of a crab

The Unix shell embeds this concept at the core of the language to great effect. Once again our building block is the stream but our connector, the pipeline operator, |, doesn’t allow this kind of abuse; all you can do is put a stream to its left and another to its right. The syntax won’t allow you to get the head and tail of the same stream in a single pipeline.

Here’s an awkless variant of the recent shell history meme which shows a shell pipeline in action.

$ history | tr -s ' ' | cut -f 3 -d ' ' | sort | uniq -c | sort -rn
    172 cd
    147 svn
     73 bin/mheg
     57 make
     54 ls
     40 emacs
     37 pwd

Here’s a slightly more interesting variant which only shows commands appearing after a pipeline operator. (It’s not bombproof, but it’ll do for now.)

$ history | grep -Eo '\| *\w+' | tr -d '| ' | sort | uniq -c | sort -rn
     10 head
      8 cut
      7 grep
      6 tr
      5 xargs
      4 sort
      3 wc
      3 uniq
      3 less

Pipe Links

By way of an apology for wasting your time, here are some solid gold links.

  1. “Generator Tricks for Systems Programmers”, a presentation made by David M. Beazley at PyCon’08. I wasn’t there, but for once the slides (PDF) standalone well, and despite the title it’s neither tricksy nor just for systems programmers. Experienced Python programmers might choose to skip over the first few slides; by the end of the presentation, the material gets much more advanced[1].

  2. “Shell-like data processing” by Maxim Krikun in the online Python Cookbook, which overloads the bitwise or operator, |, to implement a Pythonic pipeline, an idea you can find extended in “Assembly Line Syntax” by Patrick Roberts and revised by Michael Foord, this time using the right shift operator as a connector.

Pipelined Python

Apache httpd log -  ... "GET /ply/ HTTP/1.1" 200 7587 -  ... "GET /favicon.ico HTTP/1.1" 404 133 -  ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903 -  ... "GET /ply/ply.html HTTP/1.1" 200 97238 -  ... "GET /ply/example.html HTTP/1.1" 200 2359 - ... "GET /index.html HTTP/1.1" 200 4447 

In his presentation David Beazley shows some elegant and idiomatic Python code to sum the total number of bytes transferred in an Apache httpd server log (the final field on each line of the log file shown above). You’ll notice how clean and declarative it is. Each generator expression builds upon the one on the preceding line. The source of the stream, wwwlog, is a file object which, in the iterable context shown here, yields lines on demand. Nothing really happens until the final reduction, sum, at which point data flows smoothly through. Stream elements — lines, words, ints — are processed one at a time, and nothing accumulates except the final total.

wwwlog     = open("access-log") 
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) 
bytes      = (int(x) for x in bytecolumn if x != '-') 
print "Total", sum(bytes)

Here’s an alternative using the Python pipeline approach mentioned in the previous section. Note that in my server access logs it’s the 9th field (whitespace separated, counting from zero) which gives the number of bytes transferred, and for variety I’m pattern matching this field to a string of digits.

wwwlog = open("access-log") 
bytes = wwwlog | cut(9) | grep(r'\d+') | xlate(int)
print "Total", sum(bytes)

Cut, grep and xlate are simple classes which implement the numeric __ror__ method.

import itertools
import re

class xlate(object):
    "Translate the input stream by applying a function to each item". 
    def __init__(self, fn):
        self.fn = fn
    def __ror__(self, stream):
        return itertools.imap(self.fn, stream)
class cut(xlate):
    "Cuts a whitespace separated column from a stream of lines."
    def __init__(self, column):
        super(cut, self).__init__(lambda s: s.split()[column])

class grep(object):
    "Grep lines which match an re from a stream of lines."
    def __init__(self, pattern):
        self.match = re.compile(pattern).match
    def __ror__(self, stream):
        return itertools.ifilter(self.match, stream)

[1] It could be that I’m reading too much into the pipe metaphor, but I’m intrigued by the caption to the photo on David M. Beazley’s homepage. What can he mean?

David Beazley

Dave working on his latest project — “you know, it’s a series of tubes.”