Run-length encoding in Python

2009-06-01, 2 Comments

Recently I discussed run-length encoding and DEFLATE compression. I never actually showed a Python implementation of a run-length encoder, so here’s one now.

import itertools as its

def ilen(it):
   
'''Return the length of an iterable.

    >>> ilen(range(7))
    7
    '''

   
return sum(1 for _ in it)

def runlength_enc(xs):
   
'''Return a run-length encoded version of the stream, xs.

    The resulting stream consists of (count, x) pairs.

    >>> ys = runlength_enc('
AAABBCCC')
    >>> next(ys)
    (3, '
A')
    >>> list(ys)
    [(2, '
B'), (3, 'C')]
    '''

   
return ((ilen(gp), x) for x, gp in its.groupby(xs))

The decoder is equally simple. Itertools.repeat expands a (count, value) pair into an iterable which will generate count elements. Itertools.chain flattens these iterables into a single stream.

def runlength_dec(xs):
   
'''Expand a run-length encoded stream.

    Each element of xs is a pair, (count, x).

    >>> ys = runlength_dec(((3, '
A'), (2, 'B')))
    >>> next(ys)
    '
A'
    >>> ''.join(ys)
    '
AABB'
    '''

   
return its.chain.from_iterable(its.repeat(x, n) for n, x in xs)

If you haven’t seen itertools.chain.from_iterable() yet, it was introduced at Python 3.0/2.6. The important feature here is that it lazily works its way through a single iterable argument. If instead we’d written:

def runlength_dec(xs):
   
....
   
return its.chain(*(its.repeat(x, n) for n, x in xs))

then our run-length decoder would need to consume all of xs before yielding results (which is why we must interrupt the interpreter’s execution below).

>>> xs = its.cycle((3, 'A'), (2, 'B'))
>>> runlength_dec(xs)
  C
-c C-cTraceback (most recent call last):
 
File "<stdin>", line 1, in <module>
 
File "<string>", line 25, in runlength_dec
 
File "<string>", line 25, in <genexpr>
KeyboardInterrupt

Named tuples for clarity

Streams of pairs (as shown above) are perfectly Pythonic. If we run-length encode a stream of numbers, clients will just have to read the manual and remember that item[0] is a repeat count and item[1] is a value.

If this seems fragile, a new-ish member of the collections module can give the pair more structure.

>>> from collections import namedtuple
>>> Run = namedtuple('Run', 'count value')
>>> run1 = Run(count=10, value=2)
>>> run2 = Run(value=2, count=10)
>>> run1
Run(count=10, value=2)
>>> run2
Run(count=10, value=2)
>>> run1.count
10
>>> run1[0]
10

Here’s how we’d change runlength_enc() to use the new type.

def runlength_enc(xs):
   
'''Return a run-length encoded version of the stream, xs.

    >>> ys = runlength_enc('
AAABBCCC')
    >>> next(ys)
    Run(count=3, value='
A')
    >>> list(ys)
    [Run(count=2, value='
B'), Run(count=3, value='C')]
    '''

   
return (Run(ilen(gp), x) for x, gp in its.groupby(xs))