DEFLATE: run-length encoding, but better

2009-05-21, , , , Comments

Run-length encoding

Run-length encoding is a simple compression scheme in which runs of equal values are represented by the value and a repeat count. For example, a supermarket cashier might process this line of shopping

Fruit salad

as

  • 4 bananas
  • 3 apples
  • 2 bananas
  • 1 pineapple
  • 3 apples

Unix packs in its very own run length encoder, uniq -c. It works just fine — so long as the values you want to encode are newline separated byte strings, that is.

Let’s use a sequence of coin tosses as an example stream. $RANDOM generates random numbers. We use the least significant bit of these numbers as an index into an array containing the values heads, tails.

$ HT=(heads tails)
$ toss() { echo ${HT[$RANDOM&1]}; }
$ toss; toss; toss
heads
tails
tails
$ tosses() { while [ 1 ]; do toss; done; }
$ tosses | head
tails
tails
tails
heads
tails
heads
heads
heads
tails
tails

tails tails tails heads tails heads heads heads tails tails

Passing a fresh sample from this same stream through our run-length encoder we get:

$ tosses | uniq -c | head
   2 heads
   1 tails
   1 heads
   1 tails
   1 heads
   6 tails
   3 heads
   1 tails
   4 heads
   1 tails

An awk script can be used as a run-length decoder. (There must be a neater way, using sed maybe?)

$ runlendec() { awk '{ while ($1--) print $2 }'; }
$ tosses | head | tee orig.log | uniq -c | runlendec | tee encdec.log
heads
tails
heads
tails
heads
heads
tails
tails
heads
heads
$ diff orig.log encdec.log

Here, we toss a coin 10 times teeing the original sequence to a file. The next two links in the pipeline compress and decompress the sequence, teeing the results to another file. Finally, as a sanity check, we confirm the round trip results are the same.

Run-length encoding in Python

This Unix run-length codec is fun, but of limited practical use. One good feature, though, is the way it operates on streams of data (including infinite streams), leaving clients free to decide how best to slice and buffer these streams.

Python has a fine library of high-level stream transformation tools from which we can build a generic and flexible run-length codec in just a few lines. Since I want to progress from run-length coding to something more advanced, I’ll leave discussing how to implement this codec for now, but if you’d like to write your own version, here’s a description suitable for doctesting.

Import the run-length codec functions and compress a short string.
>>> from runlength import compress, decompress
>>> comp = compress('AABBBACC')

The returned compressor is a stream (an iterable).
>>> next(comp)
(2, 'A')

Pull the rest of the stream into memory.
>>> rest = list(comp)
>>> rest
[(3, 'B'), (1, 'A'), (2, 'C')]

Simple decompress example.
>>> concat = ''.join
>>> concat(decompress(rest))
'BBBACC'

Compress, decompress also work with infinite streams, like the 
a2b3 stream, which repeatedly cycles two pairs. 
>>> from itertools import cycle, islice
>>> a2b3 = cycle([(2, 'a'), (3, 'b')])
>>> dec = decompress(a2b3)

Pull 8 values from the decompressed stream.
>>> concat(islice(dec, 8))
'aabbbaab'

Now compress the decompressed stream, and explore a few items.
>>> comp = compress(dec)
>>> next(comp)
(2, 'b')
>>> list(islice(comp, 2))
[(2, 'a'), (3, 'b')]

DEFLATE

Chessboard

The Wikipedia page on run-length encoding identifies monochrome images as good candidates for run-length compression. The white and black pixels typically group into long runs. Indeed, any simple image using a limited palette should reduce well using this compression scheme.

The chessboard above is 256×256 pixels, each square being 32×32 pixels. We could run-length encode this 64K pixel image as 256×8 = 2K runs of 32 pixels, a decent saving. (Actually, we should do slightly better, noting that there are runs of length 64 at the chessboard rank boundaries, but you get the idea.)

(32,W)(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)(32,B),
(32,W)(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)(32,B),
....
(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)

Like a paletted image, a block of text — the web page you’re reading now, for example — employs a limited alphabet. Although the characters in this text don’t usually group into long runs there’s plenty of repetition, especially in the raw HTML: all the occurrences of <div>, <span> and class used for CSS styling, for example. The DEFLATE compression algorithm uses a clever twist on run-length encoding to remove this redundancy:

The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair \<length, backward distance>. (RFC-1951)

(In addition, a multiple-level dynamic Huffman encoding scheme reduces the space needed for the strings, distances and lengths themselves.)

There’s more to these pointer elements than first appears: the length can exceed the backward distance. Thus the sequence:

heads
heads
heads
heads
heads

can be deflated as the literal type heads\n followed by the pointer type <24, 6>.

If you’ve spotted the potential for recursion, good! The inflating stream can reference itself, which can reference itself, which can … Confusing?

Zipping pixels

PNG images use DEFLATE compression (as implemented by zlib) to save on pixel storage space. Here’s a binary view of the raw data in the chessboard graphic shown above, all 137 bytes of it. The 64K pixels themselves compress into a 88 byte IDAT chunk, of which the final 8 bytes are a checksum and (I think?) some padding. Maybe the image could be squeezed harder, but I’m impressed!

8950 4e47 0d0a 1a0a 0000 000d 4948 4452  .PNG........IHDR
0000 0100 0000 0100 0100 0000 0074 0995  .............t..
cb00 0000 5049 4441 5468 81ed ceb1 0d00  ....PIDATh......
200c 0341 f65f 1a58 803a 2f74 6e52 e424   ..A._.X.:/tnR.$
7bed 9b75 f3ba cf07 0000 df83 ca0e 0000  {..u............
7a60 ba1f 0080 2ea8 ec00 00a0 07a6 fb01  z`..............
00e8 82ca 0e00 007a 60ba 1f00 802e a8ec  .......z`.......
0000 2007 0e8a 69f0 e2b9 9471 c700 0000  .. ...i....q....
0049 454e 44ae 4260 82                   .IEND.B`.

Here’s a trace of how zlib inflates the compressed pixels in this IDAT chunk. (Source code available via anonymous SVN at http://wordaligned.org/svn/etc/zlib_trace.)

inflate: allocated
inflate: reset
inflate:   zlib header ok
inflate:     dynamic codes block (last)
inflate:       table sizes ok
inflate:       code lengths ok
inflate:       codes ok
inflate:         literal 0x00
inflate:         literal 0xff
inflate:         length 3
inflate:         distance 1
inflate:         literal 0x00
inflate:         length 3
inflate:         distance 1
inflate:         length 24
inflate:         distance 8
inflate:         length 25
inflate:         distance 25
inflate:         length 258
inflate:         distance 33
....

I’ve attempted to show the first few stages of the genesis of the uncompressed stream in the picture below. The way the stream recursively inflates itself is quite beautiful.

Inflating pixels

  1. put 00
  2. put ff
  3. go back 1 (to ff), put 3
  4. put 00
  5. go back 1 (to 00), put 3
  6. go back 8 (to 00 00 00 00 ff ff ff ff)
  7. put 24

Two elements later, and the repeat length has grown to 258. In fact, the entire chessboard is generated from just 3 literal and 43 pointer elements.

(Not all graphics have such a regular pattern, of course, so we can’t always achieve such dramatic compression.)

Deflated HTML

Web servers can and do save on band-width by transferring gzip compressed HTML to gzip capable clients. (Gzip is a simple wrapper around DEFLATE.) Any PNG images transferred will also have their pixels DEFLATE compressed.

$ curl http://wordaligned.org --head --compress
HTTP/1.1 200 OK
Date: Sun, 17 May 2009 17:41:53 GMT
Server: lighttpd | Word Aligned
Content-Type: text/html; charset=UTF-8
....
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20

The Word Aligned front page contains about 75Kb of HTML, which gzips to just 16Kb — a decent saving. Relevant lines from the lighttpd configuration file read:

lighttpd mod_compress
server.modules = (
    ....
    "mod_compress"
)
compress.cache-dir = basedir + "lighttpd/cache/compress/"
compress.filetype  = ("text/plain", "text/html", "text/css")

I uphold Gzip (built on zlib, which implements DEFLATE) as a hero of the web. As we’ve seen, it implements a powerful and elegant algorithm, but perhaps the best thing about it is that it’s free to use, a freedom worth fighting for. Check out this battle report from the FAQ.

What about patents?

gzip was developed as a replacement for compress because of the UNISYS and IBM patents covering the LZW algorithm used by compress.

I have probably spent more time studying data compression patents than actually implementing data compression algorithms. I maintain a list of several hundred patents on lossless data compression algorithms, and I made sure that gzip isn’t covered by any of them. In particular, the --fast option of gzip is not as fast it could, precisely to avoid a patented technique. — Jean-Loup Gailly, Gzip FAQ