Run-length encoding in Python
Recently I discussed run-length encoding and DEFLATE compression. I never actually showed a Python implementation of a run-length encoder, so here’s one now.
import itertools as its def ilen(it): '''Return the length of an iterable. >>> ilen(range(7)) 7 ''' return sum(1 for _ in it) def runlength_enc(xs): '''Return a run-length encoded version of the stream, xs. The resulting stream consists of (count, x) pairs. >>> ys = runlength_enc('AAABBCCC') >>> next(ys) (3, 'A') >>> list(ys) [(2, 'B'), (3, 'C')] ''' return ((ilen(gp), x) for x, gp in its.groupby(xs))
The decoder is equally simple. Itertools.repeat
expands a (count, value)
pair into an iterable which will generate count
elements. Itertools.chain
flattens these iterables into a single stream.
def runlength_dec(xs): '''Expand a run-length encoded stream. Each element of xs is a pair, (count, x). >>> ys = runlength_dec(((3, 'A'), (2, 'B'))) >>> next(ys) 'A' >>> ''.join(ys) 'AABB' ''' return its.chain.from_iterable(its.repeat(x, n) for n, x in xs)
If you haven’t seen itertools.chain.from_iterable()
yet, it was introduced at Python 3.0/2.6. The important feature here is that it lazily works its way through a single iterable argument. If instead we’d written:
def runlength_dec(xs): .... return its.chain(*(its.repeat(x, n) for n, x in xs))
then our run-length decoder would need to consume all of xs
before yielding results (which is why we must interrupt the interpreter’s execution below).
>>> xs = its.cycle((3, 'A'), (2, 'B')) >>> runlength_dec(xs) C-c C-cTraceback (most recent call last): File "<stdin>", line 1, in <module> File "<string>", line 25, in runlength_dec File "<string>", line 25, in <genexpr> KeyboardInterrupt
Named tuples for clarity
Streams of pairs (as shown above) are perfectly Pythonic. If we run-length encode a stream of numbers, clients will just have to read the manual and remember that item[0]
is a repeat count and item[1]
is a value.
If this seems fragile, a new-ish member of the collections module can give the pair more structure.
>>> from collections import namedtuple >>> Run = namedtuple('Run', 'count value') >>> run1 = Run(count=10, value=2) >>> run2 = Run(value=2, count=10) >>> run1 Run(count=10, value=2) >>> run2 Run(count=10, value=2) >>> run1.count 10 >>> run1[0] 10
Here’s how we’d change runlength_enc()
to use the new type.
def runlength_enc(xs): '''Return a run-length encoded version of the stream, xs. >>> ys = runlength_enc('AAABBCCC') >>> next(ys) Run(count=3, value='A') >>> list(ys) [Run(count=2, value='B'), Run(count=3, value='C')] ''' return (Run(ilen(gp), x) for x, gp in its.groupby(xs))