DEFLATE: run-length encoding, but better
Run-length encoding
Run-length encoding is a simple compression scheme in which runs of equal values are represented by the value and a repeat count. For example, a supermarket cashier might process this line of shopping as:
- 4 bananas
- 3 apples
- 2 bananas
- 1 pineapple
- 3 apples
Unix packs in its very own run-length encoder, `uniq -c`. It works just fine — so long as the values you want to encode are newline-separated byte strings, that is.
Let’s use a sequence of coin tosses as an example stream. `$RANDOM` generates random numbers. We use the least significant bit of these numbers as an index into an array containing the values `heads` and `tails`.
```
$ HT=(heads tails)
$ toss() { echo ${HT[$RANDOM&1]}; }
$ toss; toss; toss
heads
tails
tails
$ tosses() { while [ 1 ]; do toss; done; }
$ tosses | head
tails
tails
tails
heads
tails
heads
heads
heads
tails
tails
```
Passing a fresh sample from this same stream through our run-length encoder we get:
```
$ tosses | uniq -c | head
   2 heads
   1 tails
   1 heads
   1 tails
   1 heads
   6 tails
   3 heads
   1 tails
   4 heads
   1 tails
```
An `awk` script can be used as a run-length decoder. (There must be a neater way, using `sed` maybe?)
```
$ runlendec() { awk '{ while ($1--) print $2 }'; }
$ tosses | head | tee orig.log | uniq -c | runlendec | tee encdec.log
heads
tails
heads
tails
heads
heads
tails
tails
heads
heads
$ diff orig.log encdec.log
```
Here, we toss a coin 10 times, teeing the original sequence to a file. The next two links in the pipeline compress and decompress the sequence, teeing the results to another file. Finally, as a sanity check, we confirm the round-trip results are the same.
Run-length encoding in Python
This Unix run-length codec is fun, but of limited practical use. One good feature, though, is the way it operates on streams of data (including infinite streams), leaving clients free to decide how best to slice and buffer these streams.
Python has a fine library of high-level stream transformation tools from which we can build a generic and flexible run-length codec in just a few lines. Since I want to progress from run-length coding to something more advanced, I’ll leave discussing how to implement this codec for now, but if you’d like to write your own version, here’s a description suitable for doctesting.
```
Import the run-length codec functions and compress a short string.

>>> from runlength import compress, decompress
>>> comp = compress('AABBBACC')

The returned compressor is a stream (an iterable).

>>> next(comp)
(2, 'A')

Pull the rest of the stream into memory.

>>> rest = list(comp)
>>> rest
[(3, 'B'), (1, 'A'), (2, 'C')]

Simple decompress example.

>>> concat = ''.join
>>> concat(decompress(rest))
'BBBACC'

Compress, decompress also work with infinite streams, like the a2b3
stream, which repeatedly cycles two pairs.

>>> from itertools import cycle, islice
>>> a2b3 = cycle([(2, 'a'), (3, 'b')])
>>> dec = decompress(a2b3)

Pull 8 values from the decompressed stream.

>>> concat(islice(dec, 8))
'aabbbaab'

Now compress the decompressed stream, and explore a few items.

>>> comp = compress(dec)
>>> next(comp)
(2, 'b')
>>> list(islice(comp, 2))
[(2, 'a'), (3, 'b')]
```
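If you’d like a starting point, here’s one possible implementation — a sketch built on `itertools.groupby`, not necessarily the codec the doctest was written against:

```python
from itertools import chain, groupby, repeat

def compress(stream):
    """Run-length encode a stream, yielding (count, value) pairs."""
    # groupby yields a (value, group) pair for each run of equal
    # adjacent values; we replace each group with its length.
    return ((sum(1 for _ in group), value)
            for value, group in groupby(stream))

def decompress(pairs):
    """Expand a stream of (count, value) pairs back into values."""
    return chain.from_iterable(repeat(value, count)
                               for count, value in pairs)
```

Both functions are lazy generators, so they cope with infinite streams: each pulls just enough of its input to produce the next item.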
DEFLATE
The Wikipedia page on run-length encoding identifies monochrome images as good candidates for run-length compression. The white and black pixels typically group into long runs. Indeed, any simple image using a limited palette should reduce well using this compression scheme.
The chessboard above is 256×256 pixels, each square being 32×32 pixels. We could run-length encode this 64K pixel image as 256×8 = 2K runs of 32 pixels, a decent saving. (Actually, we should do slightly better, noting that there are runs of length 64 at the chessboard rank boundaries, but you get the idea.)
```
(32,W)(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)(32,B),
(32,W)(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)(32,B),
....
(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)(32,B)(32,W)
```
Like a paletted image, a block of text — the web page you’re reading now, for example — employs a limited alphabet. Although the characters in this text don’t usually group into long runs, there’s plenty of repetition, especially in the raw HTML: all the occurrences of `<div>`, `<span>` and `class` used for CSS styling, for example. The DEFLATE compression algorithm uses a clever twist on run-length encoding to remove this redundancy:
The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. (RFC-1951)
(In addition, a multiple-level dynamic Huffman encoding scheme reduces the space needed for the strings, distances and lengths themselves.)
There’s more to these pointer elements than first appears: the length can exceed the backward distance. Thus the sequence:

```
heads
heads
heads
heads
heads
```

can be deflated as the literal element `heads\n` followed by the pointer element `<24, 6>`.
If you’ve spotted the potential for recursion, good! The inflating stream can reference itself, which can reference itself, which can … Confusing?
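A minimal sketch of how an inflater resolves such a self-referencing pointer — a hypothetical helper, not zlib’s actual code — is to copy byte by byte, so the output being written can itself become the source of the copy:

```python
def copy_pointer(out, distance, length):
    """Resolve a <length, distance> pointer, appending to out.

    Copying one byte at a time means the copy may read bytes it has
    just written -- which is exactly what happens when the length
    exceeds the backward distance.
    """
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])

out = bytearray(b"heads\n")   # the literal element
copy_pointer(out, 6, 24)      # the pointer element <24, 6>
assert out == b"heads\n" * 5
```

Six literal bytes plus a 24-byte overlapping copy reproduce all five lines.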
Zipping pixels
PNG images use DEFLATE compression (as implemented by zlib) to save on pixel storage space. Here’s a binary view of the raw data in the chessboard graphic shown above, all 137 bytes of it. The 64K pixels themselves compress into an 88-byte IDAT chunk, of which the final 8 bytes are checksums: the zlib stream’s Adler-32 followed by the chunk’s CRC-32. Maybe the image could be squeezed harder, but I’m impressed!
```
8950 4e47 0d0a 1a0a 0000 000d 4948 4452  .PNG........IHDR
0000 0100 0000 0100 0100 0000 0074 0995  .............t..
cb00 0000 5049 4441 5468 81ed ceb1 0d00  ....PIDATh......
200c 0341 f65f 1a58 803a 2f74 6e52 e424  ..A._.X.:/tnR.$
7bed 9b75 f3ba cf07 0000 df83 ca0e 0000  {..u............
7a60 ba1f 0080 2ea8 ec00 00a0 07a6 fb01  z`..............
00e8 82ca 0e00 007a 60ba 1f00 802e a8ec  .......z`.......
0000 2007 0e8a 69f0 e2b9 9471 c700 0000  .. ...i....q....
0049 454e 44ae 4260 82                   .IEND.B`.
```
Here’s a trace of how zlib inflates the compressed pixels in this IDAT chunk. (Source code available via anonymous SVN at http://wordaligned.org/svn/etc/zlib_trace.)
```
inflate: allocated
inflate: reset
inflate: zlib header ok
inflate: dynamic codes block (last)
inflate: table sizes ok
inflate: code lengths ok
inflate: codes ok
inflate: literal 0x00
inflate: literal 0xff
inflate: length 3
inflate: distance 1
inflate: literal 0x00
inflate: length 3
inflate: distance 1
inflate: length 24
inflate: distance 8
inflate: length 25
inflate: distance 25
inflate: length 258
inflate: distance 33
....
```
I’ve attempted to show the first few stages of the genesis of the uncompressed stream in the picture below. The way the stream recursively inflates itself is quite beautiful.
- put 00
- put ff
- go back 1 (to ff), put 3
- put 00
- go back 1 (to 00), put 3
- go back 8 (to ff ff ff ff 00 00 00 00), put 24
Two elements later, and the repeat length has grown to 258. In fact, the entire chessboard is generated from just 3 literal and 43 pointer elements.
(Not all graphics have such a regular pattern, of course, so we can’t always achieve such dramatic compression.)
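We can try the experiment for ourselves. The sketch below rebuilds what the chessboard’s raw scanline data plausibly looks like — assuming a 1-bit grayscale PNG where each of the 256 rows is a filter byte followed by 32 bytes of pixels, which matches the IHDR above — and squeezes it with zlib:

```python
import zlib

# Squares alternate every 4 bytes (32 pixels at 1 bit per pixel);
# each rank of the chessboard is 32 rows deep.
white, black = b'\xff' * 4, b'\x00' * 4
even_row = b'\x00' + (white + black) * 4   # leading 0 is the PNG filter byte
odd_row  = b'\x00' + (black + white) * 4
raw = (even_row * 32 + odd_row * 32) * 4   # 256 rows, 8448 bytes in all

packed = zlib.compress(raw, 9)
print(f'{len(raw)} bytes of scanlines deflate to {len(packed)} bytes')
```

The exact output size depends on the zlib version and settings, but the highly regular input collapses to a tiny fraction of its raw size, in the same ballpark as the real IDAT chunk.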
Deflated HTML
Web servers can and do save on bandwidth by transferring gzip-compressed HTML to gzip-capable clients. (Gzip is a simple wrapper around DEFLATE.) Any PNG images transferred will also have their pixels DEFLATE compressed.
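To see how thin the wrapper is, here’s a sketch using Python’s gzip and zlib modules: a gzip stream is a 10-byte header, a raw DEFLATE stream, then an 8-byte trailer holding a CRC-32 and the uncompressed length.

```python
import gzip
import zlib

data = b'heads\n' * 1000
gz = gzip.compress(data)

# Skip the 10-byte gzip header and inflate the raw DEFLATE stream
# directly; negative wbits tells zlib to expect no zlib framing.
raw_inflate = zlib.decompressobj(wbits=-15)
assert raw_inflate.decompress(gz[10:]) == data

# The unconsumed bytes are the gzip trailer: CRC-32 of the
# uncompressed data, then its length mod 2**32, both little-endian.
trailer = raw_inflate.unused_data
assert int.from_bytes(trailer[:4], 'little') == zlib.crc32(data)
assert int.from_bytes(trailer[4:], 'little') == len(data)
```

(The 10-byte header assumes no optional fields such as an embedded filename, which holds for `gzip.compress` output.)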
```
$ curl http://wordaligned.org --head --compressed
HTTP/1.1 200 OK
Date: Sun, 17 May 2009 17:41:53 GMT
Server: lighttpd | Word Aligned
Content-Type: text/html; charset=UTF-8
....
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20
```
The Word Aligned front page contains about 75Kb of HTML, which gzips to just 16Kb — a decent saving. Relevant lines from the lighttpd configuration file read:
```
server.modules = (
    ....
    "mod_compress"
)
compress.cache-dir = basedir + "lighttpd/cache/compress/"
compress.filetype  = ("text/plain", "text/html", "text/css")
```
I uphold Gzip (built on zlib, which implements DEFLATE) as a hero of the web. As we’ve seen, it implements a powerful and elegant algorithm, but perhaps the best thing about it is that it’s free to use, a freedom worth fighting for. Check out this battle report from the FAQ.
What about patents?
gzip was developed as a replacement for compress because of the UNISYS and IBM patents covering the LZW algorithm used by compress.
I have probably spent more time studying data compression patents than actually implementing data compression algorithms. I maintain a list of several hundred patents on lossless data compression algorithms, and I made sure that gzip isn’t covered by any of them. In particular, the `--fast` option of gzip is not as fast as it could be, precisely to avoid a patented technique. — Jean-Loup Gailly, Gzip FAQ