Part One - Design and Test

Motivation

The motivation for this article comes from The Art of UNIX Programming by Eric S. Raymond. This is one of the most inspiring books I've read on how to write software: although firmly rooted in the traditions of the UNIX operating system, the culture and philosophy it describes applies far more widely. It has reinforced my belief that software development can indeed be an art.

Having read this book, wanted to put some of its ideas into practice. So I set myself a mini-project.

An Idea for a Mini-Project

As a starting point, I'd like to summarise two of Raymond's most important lessons.

Data structures, not algorithms, are central to programming.
Prefer text file formats: they're human-readable and extensible. If you must use a binary format then invest in a tool which converts from this format to a textual one and back again. This will facilitate working with your data.

I work in a domain where the need for compression requires the use of binary file formats: namely digital television (DTV). Digital video is typically MPEG-2 encoded. This is a highly compressed encoding designed to squeeze as much content as possible into a limited bandwidth. MPEG-2 encoding also allows video and audio content to be combined with metadata: for example, a television programme might be accompanied by a text description of itself and of what's showing next.

The proper term for these particular items of metadata is Event Information, present and following. Since this article is not primarily about digital television I shall avoid such jargon as far as possible.

To get to grips with this metadata a conversion tool is required. The specific task I set myself, then, was to write a tool to convert MPEG-2 metadata from binary to text, and, if required, to reverse the process. Such an encode/decode tool is commonly referred to as a codec.

My project, then, was to implement a digital television codec.

The MPEG-2 Bit Stream Syntax: An Example of a Minilanguage

The metadata format is specified using the MPEG-2 bit stream syntax which is defined in ISO/IEC 13818-1. Data items are described by name and length in bits using a C-like procedural syntax. An example makes this clear:

program_map_section() {            
       table_id                        8 
       section_syntax_indicator        1 
       '0'                             1 
       reserved                        2 
       section_length                 12 
       program_number                 16 
       reserved                        2  
       version_number                  5 
       current_next_indicator          1  
       section_number                  8 
       last_section_number             8 
       reserved                        3  
       PCR_PID                        13 
       reserved                        4  
       program_info_length            12 
       for (i=0; i<N; i++) {             
           descriptor()                  
       }                                 
       for (i=0;i<N1;i++) {              
           stream_type                 8 
           reserved                    3  
           elementary_PID             13 
           reserved                    4  
           ES_info_length             12 
           for (i=0; i<N2; i++) {        
               descriptor()              
           }                             
       }                                 
       CRC_32                         32
   }

What we have here is the bit stream syntax for a section of the Program Map Table. The first 8 bits give the table id (which happens to be 2, for this particular table); the single bit which follows provides the section syntax indicator; the next bit is always set to zero; the next two bits are reserved (and should each be set to one); and so on, until we get to the 32 bit CRC.

I assume the reserved parts of the section are included to provide room for a degree of future extensibility. But not much room. This is one of the reasons why Raymond advocates text file formats: if you need a larger value in a text format, just write it.

To provide a little context: the Program Map Table (PMT) supplies basic information about the digital television services present in an MPEG-2 transport stream. A section of this table – as shown above – defines the elementary streams which comprise a single television service. For example, the PMT for the BBC1 digital television service consists of a video stream, an audio stream, a subtitle stream and some data streams. Of course, ISO/IEC 13818-1 defines the format of many other tables and sections, and related specifications – such as EN 300468 define many more.

This textual specification of a binary format is an example of what Raymond terms a mini-language. In fact we have a Turing-complete mini-language: that is, it allows for loops and conditionals. The particular example shown here does not include any conditionals, though we do have nested loops. Note also the referenced descriptor() items. To fully parse the program_map_section() we'll need the descriptor() format specified too:

descriptor() {              
       descriptor_tag         8
       descriptor_length      8
       for (i=0; i<N; i++) {   
            private_data_byte 8
       }                       
   }

Complications

The syntax is easy to read, particularly to anyone familiar with C. However, if we look more closely at the for-loop in the descriptor, although it's apparent that i must be an integral loop counter, it's less clear where N is defined. Similarly, in the program_map_section, how do we find the values of N1 and N2?

ISO/IEC 13818-1 answers these questions. In the descriptor, the descriptor_length data element encodes an unsigned integer which tells us how many bytes of data are to follow: so N is simply the value obtained by decoding descriptor_length. The program_map_section is not quite so simple. N is easy enough – it's as many variable-length descriptors as it takes to fill the total length specified by program_info_length. N2 is similarly the number of descriptors to fill the length specified by the most recent occurrence of ES_info_length. For N1, however, we have to keep decoding elementary streams until the following is true:

Sum of elementary stream lengths (in bytes)
     ==  section_length - 
         - (2 + 5 + 1 + 8 + 8 
            + 3 + 13 + 4 + 12) / 8
         - program_info_length
         - 32 / 8

We have to divide by 8 since field widths of values within the PMT section are given in bits – but length fields give values in bytes.

Despite these complications, the encoding is well-designed: we can parse these binary data structures sequentially without needing to look ahead; and we can skip over any bits we're not interested in.

In fact, the more closely we inspect our examples, the more we notice. This is good. Recall that data structures, not algorithms, are central to programming. Already we're getting stuck into the data.

While we're in this positive frame of mind, let's review the full range of control structures required by the ISO/IEC 13818-1 bitstream syntax:

ISO/IEC 13818-1 Control Structures

while (condition) { 
        data_element
        ...
    }   

    do {
        data_element
        ... 
    }
    while (condition)


    if (condition) {
        data_element
        ...
    }
    else {
        data_element
        ...
    }
    
    for (i=0;i<n;i++) {
        data_element
        ...
    }

So, we've got pretty much C's control structures, excepting switch, break, return and continue.

This is starting to look alarming. How complex can a condition be? How shall we handle three different looping constructs? Our mini-project has become rather bigger than we imagined.

Back to the Data

The thing to do at this point is to shelve these concerns and get back to specifics. So, I got hold of some PMT section data and parsed it by hand. I used two types of data:

PMT sections pulled out of recorded digital TV broadcasts
a simple PMT section synthesised by hand.

I shall spare you the details. Note though that in parsing by hand we're already starting work on our text output format. For example, given the binary contents of a synthesised descriptor:

0a 04 0a 0b 0a 0b

and recalling the descriptor syntax:

descriptor() {              
        descriptor_tag         8
        descriptor_length      8
        for (i=0; i<N; i++) {   
             private_data_byte 8
        }                       
    }

a suitable output might be:

descriptor() {              
        descriptor_tag         8 = 0x0a
        descriptor_length      8 = 0x04
        for (i=0; i<N; i++) {   
             private_data_byte 8 = 0x0a
             private_data_byte 8 = 0x0b
             private_data_byte 8 = 0x0a
             private_data_byte 8 = 0x0b
        }                       
    }

I have deliberately chosen an output format which closely resembles the syntax definition. The loop has been unrolled, but I have retained the loop control to indicate the structure and origin of the data. I have chosen a hexadecimal representation for the data values – always a good choice for binary data – and explicitly indicate the numeric base used by prefixing these values with the string 0x. Finally, I have retained the bit widths for convenience: this will mean that when converting from text to binary, there will be no need to refer back to the descriptor syntax.

Referrring back to our motivating reference, we see we have instinctively followed one of Raymond's recommendations:

When filtering, never throw away information you don't need to.

Here, the term filter is used in its Unix sense, and applies well to a codec; and the reasoning is that any discarded information can never be used in any program further down the Unix pipeline. In our example, we can see that the output includes all the information carried by both the descriptor syntax definition and by the example descriptor.

Handling Failures

Suppose our descriptor was too short:

0a 04 0a 0b 0a

What should our codec make of such data?

Maybe something like this:

descriptor() {              
       descriptor_tag         8 = 0x0a
       descriptor_length      8 = 0x04
       for (i=0; i<N; i++) {   
            private_data_byte 8 = 0x0a
            private_data_byte 8 = 0x0b
            private_data_byte 8 = 0x0a 
   >>> ERROR: end of data reached. descriptor() incomplete.

It's perhaps premature to tie down how errors should be handled, other than to say that they should draw attention to themselves, that they shouldn't crash the codec, and that they should provide useful diagnostics. But it certainly isn't premature to include some malformed data in our test set.

Another point Raymond makes about data conversion tools is that they should be generous in what they accept (as input) but rigorous in what they emit (as output). In our case, this means that a user might change the layout of a text version of a descriptor to read like this:

descriptor() 
    {   
        descriptor_tag 8 = 0xA
        descriptor_length 8 = 0x4
        for (i = 0; i < N; i++) 
        {    
             private_data_byte 8 = 0xA
             private_data_byte 8 = 0xB
             private_data_byte 8 = 0xA
             private_data_byte 8 = 0xB
        }                       
    }

and still expect the codec to convert this to binary as:

0a 04 0a 0b 0a 0b

One of the great benefits of having a codec is that we can generate binary data from an easy-to-edit textual form: it would be a severe limitation if the encoding process was over-sensitive about whitespace, layout, or the capitalisation of hexadecimal numbers.

Reducing Project Scope

Whilst tinkering around with my test data set, I've also been paging through the MPEG-2 bitstream syntax. The bad news is that the expressions which appear in conditionals may be quite complex, making use of all the usual C arithmetic, bitwise, logical and relational operators as well as a few domain-specific additions.

The good news is that we can make good progress if we restrict our scope as follows:

restrict the control structures to for() {...} and if (condition) {...} else {...}
restrict conditions to the form field == value

These restrictions will not make a lot of sense to end users. In end user terms, we can aim for a first release of our codec which will only support sections from the following four tables:

Program Association Table (PAT)
Conditional Access Table (CAT)
Network Information Table (NIT)
Program Map Table (PMT)

This reduced scope may seem rather limiting. Note however that these four tables – collectively termed the Program Specific Information (PSI) Tables – contain the necessary and sufficient information to demultiplex and present programs (ISO 13818-1). Note further that our syntactic restrictions will not stop us from extending our codec to handle the complete set of DVB Service Information (SI) tables, which contain just about all of the metadata used in European digital broadcasts. Note finally that we remain faithful to our aims: Raymond emphasises the need to release early and often, so that users can drive (and implement, even, in an open source world) future developments. By reducing scope, we allow for this early feedback. We must take care, though, to follow another Unix maxim, and keep our design extensible.

A Prototype Descriptor Decoder


/**
 * @brief Decode a descriptor.
 * @param begin The start of the descriptor data
 * @param end One past the end of the descriptor data
 * @param out Output stream for the decoded data
 */
bool 
decodeDescriptor(desc_iter begin, desc_iter end, std::ostream & out)
{
    out << "descriptor() {\n";

    if (begin != end) {
        out << "    descriptor_tag 8 = "
            << *begin++ << "\n";
    }
   
    // We don't know yet how much data we need to decode.
    // Use a special non-zero value to indicate this.
    unsigned int to_decode = 0xff;
   
    if (begin != end) {
        to_decode = *begin++;
        
        out << "    descriptor_length 8 = "
            << to_decode << "\n";
            
        out << "    for(i=0; i<N; i++) {\n";
       
        while (begin != end && to_decode != 0) {
            out << "        "
                << "private_data_byte 8 = "
                << *begin++ << "\n";
            --to_decode;
        }
        out << "    }\n"; 
    }
   
    if (begin != end) {
        out << "ERROR: descriptor() too long.\n";
    } else if (to_decode != 0) {
        out << "ERROR: end of data reached. "
            << "descriptor() incomplete.\n";
    } else {
        out << "}\n"; 
    }

    return begin == end && to_decode == 0;
}

The function above is a direct first attempt at writing a descriptor decoder. Whilst there may be some mileage in this approach, some weaknesses are apparent:

The indentation is fixed. This won't do if we are decoding a descriptor in the broader context of a PMT section, when it can appear at two different levels.
The error handling is clumsy.
Data – in this case, the descriptor's syntax – has become muddled with control flow.

Now is not the time to deal with these weaknesses. We shall simply note that the first is simple to fix and the second can easily be improved on. It's the third weakness which, in the longer term, will lead to code which is harder to understand, maintain and extend. On the other hand, this function demonstrably works on our test data set, which is encouraging; and it's not hard to see how the approach taken could be extended to decode PAT, CAT, NIT and PMT sections – which is all we've decided to do.

Design Alternatives

We are now in a good position to consider the design of our codec. Three alternatives spring to mind:

Implement a descriptor-codec, a pmt-codec, and so on as required. Here, each mini-codec understands its own designated part of the syntax. Then the dtv-codec simply farms out work as appropriate. This extends the direct approach described above.
Implement a general dtv-codec which understands the full bitstream syntax and can use it to parse an arbitrary section format. All that then remains is to prime this codec with the required section formats.
Devise a code generator which, given a section format, will generate a program to code/decode that particular format.

All three alternatives are good, and all seem in line with our motivating aims. The third, in particular, exemplifies Raymond's:

Rule of Generation:
Avoid hand-hacking; write programs to write programs when you can.

In chosing which route to take, we should remember the XP mantra: Do the simplest thing possible; which, in UNIX-speak, becomes the more prosaic: Keep it Simple, Stupid!

More Details

For those who want to implement their own codec, you can find a link zip archive here. This archive contains:

binary PAT, CAT, NIT and PMT sections
synthesised text sections
alternative text versions
malformed binary sections
the relevant section syntax definitions
table_id values
a README

For those who'd prefer to see my attempt; proceed to part two of this article.

Conclusions

Progress has been made, and without the need to compromise our artistic aims. Even before we've completed the project, we've started to receive the main benefit: of understanding our data.

The second part of this article is where we compromise, get our hands dirty, bite off more than we can chew – that is, we write some code.