## Takewhile drops one

Here’s some naughty code.

```
from itertools import takewhile

def take_some(pred, xs):
    while True:
        for x in takewhile(pred, xs):
            yield x
```

This code abuses the “iterator building block” foundations of Python’s itertools module. Once you’ve chopped a stream’s head off using `takewhile` you can’t resume processing its tail … Or can you?

A casual inspection of this function suggests it does little more than heat up the machine: we return elements, `x`, from a stream, `xs`, for which `pred(x)` holds, then we spin at the first element for which the predicate does not hold.

When we actually run the code, things turn out rather differently:

```
>>> from itertools import count, islice
>>> def is_even(x):
...     return x % 2 == 0
...
>>> xs = take_some(is_even, count())
>>> xs.next()
0
>>> xs.next()
2
>>> xs.next()
4
>>> list(islice(xs, 10))
[6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
```

### Dropwhile, ifilter, izip

Nothing overheats. In fact `take_some` behaves suspiciously like `ifilter`. Let’s explore that hypothesis by zipping together an `ifilter` stream and a `take_some` stream and seeing if they diverge.

```
>>> from itertools import dropwhile, ifilter, izip
>>> xs = take_some(is_even, count())
>>> ys = ifilter(is_even, count())
>>> diverge = dropwhile(lambda xy: xy[0] == xy[1], izip(xs, ys))
>>> diverge.next()
C-c C-c
Traceback (most recent call last):
...
KeyboardInterrupt
>>>
```

Here `itertools.dropwhile` runs through the zipped stream, discarding pairs while their two elements are equal and yielding from the first pair that differs. This time, as you can see, we really do spin, and we have to interrupt execution to regain control.
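The test does terminate when the streams really diverge. Here is a sketch in Python 3 spelling (`zip` is lazy there and `next` is a built-in); `first_divergence` is my own helper name, not part of itertools.

```python
from itertools import dropwhile

def first_divergence(xs, ys):
    # Drop pairs while the two streams agree; the first surviving
    # pair, if any, marks the point of divergence.
    pairs = dropwhile(lambda xy: xy[0] == xy[1], zip(xs, ys))
    return next(pairs, None)

print(first_divergence([0, 2, 4, 6], [0, 2, 4, 7]))  # (6, 7)
print(first_divergence([0, 2, 4], [0, 2, 4]))        # None
```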

### Small print

Our casual interpretation of `take_some` was wrong. The actual documentation for `itertools.takewhile` reads:

takewhile(predicate, iterable)

Make an iterator that returns elements from the iterable as long as the predicate is true. Equivalent to:

```
def takewhile(predicate, iterable):
    for x in iterable:
        if predicate(x):
            yield x
        else:
            break
```

There you have it! Once a stream returned by `takewhile` has run its course, the original `iterable` is poised to yield the element immediately after the first element for which the predicate fails. That is, we drop the first element for which the predicate fails. So repeatedly applying `takewhile` to a stream drops the elements for which the predicate doesn’t hold, which is to say it generates the elements for which the predicate holds, which is of course `ifilter`.
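We can put that equivalence to the test on a bounded prefix. This is a sketch in Python 3 spelling, where the built-in `filter` plays the role of `ifilter`:

```python
from itertools import count, islice, takewhile

def take_some(pred, xs):
    # The original generator, unchanged apart from Python 3 syntax.
    while True:
        for x in takewhile(pred, xs):
            yield x

def is_even(x):
    return x % 2 == 0

evens_a = list(islice(take_some(is_even, count()), 10))
evens_b = list(islice(filter(is_even, count()), 10))
assert evens_a == evens_b == [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```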

### Bug fixes

Yes, kind of. I could point out a couple of bugs in `take_some`. First, it doesn’t work for lists. Give it a list and each application of `takewhile` resumes iteration from the beginning of the list, meaning `take_some` either repeats the first element of the list forever, or it spins without yielding anything:

```
>>> ys = take_some(is_even, [1, 2, 3, 4])
>>> ys.next()
...
KeyboardInterrupt
>>> ys = take_some(is_even, [0, 1, 2, 3])
>>> ys.next()
0
>>> ys.next()
0
>>> set(islice(ys, 1000000))
set([0])
```

We can fix that defect easily by applying `iter` to the input iterable, but that exposes the second bug, that `take_some` only works for infinite streams. Once we bang into the end of an iterable, we stay there, stuck in the while loop. To fix both defects we might end up with something like:

```
from itertools import takewhile, tee

def take_some(pred, xs):
    while True:
        xs, ys = tee(xs)
        try:
            ys.next()
        except StopIteration:
            return
        for x in takewhile(pred, xs):
            yield x
```
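For completeness, the same fix spelled for Python 3, where `ys.next()` becomes the built-in `next(ys)`; finite inputs now terminate cleanly:

```python
from itertools import takewhile, tee

def take_some(pred, xs):
    while True:
        xs, ys = tee(xs)   # duplicate the stream ...
        try:
            next(ys)       # ... and probe one copy for exhaustion
        except StopIteration:
            return
        for x in takewhile(pred, xs):
            yield x

def is_even(x):
    return x % 2 == 0

print(list(take_some(is_even, [1, 2, 3, 4])))  # [2, 4]
print(list(take_some(is_even, [0, 1, 2, 3])))  # [0, 2]
```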

### The real bug fix

Actually, the real bug, which I admitted to at the outset, is in our thinking. This code abuses the iterator-building-blocks paradigm at the heart of the itertools module. `takewhile` converts one stream into another stream; the original stream has gone, and if we wanted it we should have teed it first.
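Teeing first looks something like this (a sketch in Python 3 spelling). Notice that `dropwhile` keeps the first element for which the predicate fails, so between them the two halves cover the whole stream:

```python
from itertools import dropwhile, takewhile, tee

def is_even(x):
    return x % 2 == 0

xs = iter([0, 2, 4, 5, 6, 7])
head_src, tail_src = tee(xs)                # duplicate before chopping
head = list(takewhile(is_even, head_src))   # the matching prefix
tail = list(dropwhile(is_even, tail_src))   # from the first failure on
print(head)  # [0, 2, 4]
print(tail)  # [5, 6, 7]
```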

The Unix shell embeds this concept at the core of the language to great effect. Once again our building block is the stream, but our connector, the pipeline operator `|`, doesn’t allow this kind of abuse: all you can do is put a stream to its left and another to its right. The syntax won’t let you take both the head and the tail of the same stream in a single pipeline.

Here’s an awkless variant of the recent shell history meme which shows a shell pipeline in action.

```
$ history | tr -s ' ' | cut -f 3 -d ' ' | sort | uniq -c | sort -rn
172 cd
147 svn
73 bin/mheg
57 make
54 ls
40 emacs
37 pwd
...
```

Here’s a slightly more interesting variant which only shows commands appearing after a pipeline operator. (It’s not bombproof, but it’ll do for now.)

```
$ history | grep -Eo '\| *\w+' | tr -d '| ' | sort | uniq -c | sort -rn
8 cut
7 grep
6 tr
5 xargs
4 sort
3 wc
3 uniq
3 less
...
```

By way of an apology for wasting your time, here are some solid gold links.

1. “Generator Tricks for Systems Programmers”, a presentation by David M. Beazley at PyCon’08. I wasn’t there, but for once the slides (PDF) stand alone well, and despite the title it’s neither tricksy nor just for systems programmers. Experienced Python programmers might choose to skip over the first few slides; by the end of the presentation the material gets much more advanced¹.

2. “Shell-like data processing” by Maxim Krikun in the online Python Cookbook, which overloads the bitwise or operator, `|`, to implement a Pythonic pipeline, an idea you can find extended in “Assembly Line Syntax” by Patrick Roberts and revised by Michael Foord, this time using the right shift operator as a connector.

### Pipelined Python

Apache httpd log
```
81.107.39.38 -  ... "GET /ply/ HTTP/1.1" 200 7587
81.107.39.38 -  ... "GET /favicon.ico HTTP/1.1" 404 133
81.107.39.38 -  ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
81.107.39.38 -  ... "GET /ply/ply.html HTTP/1.1" 200 97238
81.107.39.38 -  ... "GET /ply/example.html HTTP/1.1" 200 2359
66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447
...
```

In his presentation David Beazley shows some elegant and idiomatic Python code to sum the total number of bytes transferred in an Apache httpd server log (the final field on each line of the log file shown above). You’ll notice how clean and declarative it is. Each generator expression builds upon the one on the preceding line. The source of the stream, `wwwlog`, is a file object which, in the iterable context shown here, yields lines on demand. Nothing really happens until the final reduction, `sum`, at which point data flows smoothly through. Stream elements — lines, words, ints — are processed one at a time, and nothing accumulates except the final total.

```
wwwlog     = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
```
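To run that snippet without a log file on disk, the same pipeline can be fed from an in-memory sample. This sketch uses Python 3 print syntax; the byte counts come from the excerpt above, while the final line with a `-` byte field is my own addition to exercise the filter:

```python
import io

# Stand-in for open("access-log"); yields lines on demand just the same.
sample = io.StringIO(
    '81.107.39.38 -  ... "GET /ply/ HTTP/1.1" 200 7587\n'
    '81.107.39.38 -  ... "GET /favicon.ico HTTP/1.1" 404 133\n'
    '66.249.72.134 - ... "GET /index.html HTTP/1.1" 304 -\n'
)

wwwlog     = sample
bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)
nbytes     = (int(x) for x in bytecolumn if x != '-')
total      = sum(nbytes)
print("Total", total)  # Total 7720
```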

Here’s an alternative using the Python pipeline approach mentioned in the previous section. Note that in my server access logs it’s the 9th field (whitespace separated, counting from zero) which gives the number of bytes transferred, and for variety I’m pattern matching this field to a string of digits.

```
wwwlog = open("access-log")
bytes = wwwlog | cut(9) | grep(r'\d+') | xlate(int)
print "Total", sum(bytes)
```

`cut`, `grep` and `xlate` are simple classes which implement the reflected bitwise-or special method, `__ror__`.

```
import itertools
import re

class xlate(object):
    "Translate the input stream by applying a function to each item."
    def __init__(self, fn):
        self.fn = fn
    def __ror__(self, stream):
        return itertools.imap(self.fn, stream)

class cut(xlate):
    "Cuts a whitespace separated column from a stream of lines."
    def __init__(self, column):
        super(cut, self).__init__(lambda s: s.split()[column])

class grep(object):
    "Grep lines which match an re from a stream of lines."
    def __init__(self, pattern):
        self.match = re.compile(pattern).match
    def __ror__(self, stream):
        return itertools.ifilter(self.match, stream)
```
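Those classes lean on Python 2’s `itertools.imap` and `itertools.ifilter`. In Python 3, where `map` and `filter` are already lazy, a sketch of the same idea looks like this; the sample lines and column index are made up for illustration:

```python
import re

class xlate:
    "Translate the input stream by applying a function to each item."
    def __init__(self, fn):
        self.fn = fn
    def __ror__(self, stream):
        return map(self.fn, stream)        # lazy in Python 3

class cut(xlate):
    "Cut a whitespace separated column from a stream of lines."
    def __init__(self, column):
        super().__init__(lambda s: s.split()[column])

class grep:
    "Keep lines whose start matches a regular expression."
    def __init__(self, pattern):
        self.match = re.compile(pattern).match
    def __ror__(self, stream):
        return filter(self.match, stream)  # also lazy

# Lists lack __or__, so Python falls back to our __ror__ methods.
lines = ['GET /ply/ 200 7587', 'GET /x 404 -', 'GET /y 200 133']
total = sum(lines | cut(3) | grep(r'\d+') | xlate(int))
print(total)  # 7720
```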

¹ It could be that I’m reading too much into the pipe metaphor, but I’m intrigued by the caption to the photo on David M. Beazley’s homepage. What can he mean?

Dave working on his latest project — “you know, it’s a series of tubes.”