The motivation for this article comes from The Art of UNIX Programming by Eric S. Raymond. This is one of the most inspiring books I've read on how to write software: although firmly rooted in the traditions of the UNIX operating system, the culture and philosophy it describes applies far more widely. It has reinforced my belief that software development can indeed be an art.
Having read this book, wanted to put some of its ideas into practice. So I set myself a mini-project.
As a starting point, I'd like to summarise two of Raymond's most important lessons.
I work in a domain where the need for compression requires the use of binary file formats: namely digital television (DTV). Digital video is typically MPEG-2 encoded. This is a highly compressed encoding designed to squeeze as much content as possible into a limited bandwidth. MPEG-2 encoding also allows video and audio content to be combined with metadata: for example, a television programme might be accompanied by a text description of itself and of what's showing next.
![]() |
To get to grips with this metadata a conversion tool is required. The specific task I set myself, then, was to write a tool to convert MPEG-2 metadata from binary to text, and, if required, to reverse the process. Such an encode/decode tool is commonly referred to as a codec.
My project, then, was to implement a digital television codec.
The metadata format is specified using the MPEG-2 bit stream syntax which is defined in ISO/IEC 13818-1. Data items are described by name and length in bits using a C-like procedural syntax. An example makes this clear:
program_map_section() {
table_id 8
section_syntax_indicator 1
'0' 1
reserved 2
section_length 12
program_number 16
reserved 2
version_number 5
current_next_indicator 1
section_number 8
last_section_number 8
reserved 3
PCR_PID 13
reserved 4
program_info_length 12
for (i=0; i<N; i++) {
descriptor()
}
for (i=0;i<N1;i++) {
stream_type 8
reserved 3
elementary_PID 13
reserved 4
ES_info_length 12
for (i=0; i<N2; i++) {
descriptor()
}
}
CRC_32 32
}
What we have here is the bit stream syntax for a section of the Program Map
Table
. The first 8 bits give the table id (which happens to be 2, for this
particular table); the single bit which follows provides the section syntax
indicator; the next bit is always set to zero; the next two bits are reserved
(and should each be set to one); and so on, until we get to the 32 bit CRC.
![]() |
To provide a little context: the Program Map Table (PMT)
supplies basic
information about the digital television services present in an MPEG-2
transport stream. A section of this table – as shown above – defines the
elementary streams which comprise a single television service. For example,
the PMT
for the BBC1 digital television service consists of a video stream,
an audio stream, a subtitle stream and some data streams. Of course,
ISO/IEC 13818-1 defines the format of many other tables and
sections, and related specifications – such as EN 300468
define many more.
This textual specification of a binary format is an example of what
Raymond terms a mini-language. In fact we have a Turing-complete
mini-language: that is, it allows for loops and conditionals. The particular
example shown here does not include any conditionals, though we do have nested
loops. Note also the referenced descriptor()
items. To fully parse the
program_map_section()
we'll need the descriptor()
format specified too:
descriptor() {
descriptor_tag 8
descriptor_length 8
for (i=0; i<N; i++) {
private_data_byte 8
}
}
The syntax is easy to read, particularly to anyone familiar with C. However,
if we look more closely at the for-loop in the descriptor
, although it's
apparent that i
must be an integral loop counter, it's less clear where N
is defined. Similarly, in the program_map_section
, how do we find the
values of N1
and N2
?
ISO/IEC 13818-1 answers these questions. In the descriptor
, the
descriptor_length
data element encodes an unsigned integer which tells us
how many bytes of data are to follow: so N
is simply the value obtained by
decoding descriptor_length
. The program_map_section
is not quite so simple. N
is easy
enough – it's as many variable-length descriptors as it takes to fill the
total length specified by program_info_length
. N2
is similarly the
number of descriptors to fill the length specified by the most recent
occurrence of ES_info_length
. For N1
, however, we have to keep decoding
elementary streams until the following is true:
Sum of elementary stream lengths (in bytes)
== section_length -
- (2 + 5 + 1 + 8 + 8
+ 3 + 13 + 4 + 12) / 8
- program_info_length
- 32 / 8
We have to divide by 8 since field widths of values within the PMT section are given in bits – but length fields give values in bytes.
Despite these complications, the encoding is well-designed: we can parse these binary data structures sequentially without needing to look ahead; and we can skip over any bits we're not interested in.
In fact, the more closely we inspect our examples, the more we notice. This is good. Recall that data structures, not algorithms, are central to programming. Already we're getting stuck into the data.
While we're in this positive frame of mind, let's review the full range of control structures required by the ISO/IEC 13818-1 bitstream syntax:
while (condition) {
data_element
...
}
do {
data_element
...
}
while (condition)
if (condition) {
data_element
...
}
else {
data_element
...
}
for (i=0;i<n;i++) {
data_element
...
}
So, we've got pretty much C's control structures, excepting switch
,
break
, return
and continue
.
This is starting to look alarming. How complex can a condition
be? How shall
we handle three different looping constructs? Our mini-project has become rather
bigger than we imagined.
The thing to do at this point is to shelve these concerns and get back to specifics. So, I got hold of some PMT section data and parsed it by hand. I used two types of data:
I shall spare you the details. Note though that in parsing by hand we're already starting work on our text output format. For example, given the binary contents of a synthesised descriptor:
0a 04 0a 0b 0a 0b
and recalling the descriptor syntax:
descriptor() {
descriptor_tag 8
descriptor_length 8
for (i=0; i<N; i++) {
private_data_byte 8
}
}
a suitable output might be:
descriptor() {
descriptor_tag 8 = 0x0a
descriptor_length 8 = 0x04
for (i=0; i<N; i++) {
private_data_byte 8 = 0x0a
private_data_byte 8 = 0x0b
private_data_byte 8 = 0x0a
private_data_byte 8 = 0x0b
}
}
I have deliberately chosen an output format which closely resembles the syntax
definition. The loop has been unrolled, but I have retained the loop control
to indicate the structure and origin of the data. I have chosen a hexadecimal
representation for the data values – always a good choice for binary data –
and explicitly indicate the numeric base used by prefixing these values with the
string 0x
. Finally, I have retained the bit widths for convenience: this
will mean that when converting from text to binary, there will be no need to
refer back to the descriptor syntax.
Referrring back to our motivating reference, we see we have instinctively followed one of Raymond's recommendations:
When filtering, never throw away information you don't need to.
Here, the term filter is used in its Unix sense, and applies well to a codec; and the reasoning is that any discarded information can never be used in any program further down the Unix pipeline. In our example, we can see that the output includes all the information carried by both the descriptor syntax definition and by the example descriptor.
Suppose our descriptor was too short:
0a 04 0a 0b 0a
What should our codec make of such data?
Maybe something like this:
descriptor() {
descriptor_tag 8 = 0x0a
descriptor_length 8 = 0x04
for (i=0; i<N; i++) {
private_data_byte 8 = 0x0a
private_data_byte 8 = 0x0b
private_data_byte 8 = 0x0a
>>> ERROR: end of data reached. descriptor() incomplete.
It's perhaps premature to tie down how errors should be handled, other than to say that they should draw attention to themselves, that they shouldn't crash the codec, and that they should provide useful diagnostics. But it certainly isn't premature to include some malformed data in our test set.
Another point Raymond makes about data conversion tools is that they should
be generous in what they accept (as input) but rigorous in what they emit (as
output). In our case, this means that a user might change the layout of a text
version of a descriptor
to read like this:
descriptor()
{
descriptor_tag 8 = 0xA
descriptor_length 8 = 0x4
for (i = 0; i < N; i++)
{
private_data_byte 8 = 0xA
private_data_byte 8 = 0xB
private_data_byte 8 = 0xA
private_data_byte 8 = 0xB
}
}
and still expect the codec to convert this to binary as:
0a 04 0a 0b 0a 0b
One of the great benefits of having a codec is that we can generate binary data from an easy-to-edit textual form: it would be a severe limitation if the encoding process was over-sensitive about whitespace, layout, or the capitalisation of hexadecimal numbers.
Whilst tinkering around with my test data set, I've also been paging through the MPEG-2 bitstream syntax. The bad news is that the expressions which appear in conditionals may be quite complex, making use of all the usual C arithmetic, bitwise, logical and relational operators as well as a few domain-specific additions.
The good news is that we can make good progress if we restrict our scope as follows:
for() {...}
and
if (condition) {...} else {...}
field == value
These restrictions will not make a lot of sense to end users. In end user terms, we can aim for a first release of our codec which will only support sections from the following four tables:
This reduced scope may seem rather limiting. Note however that these four tables – collectively termed the Program Specific Information (PSI) Tables – contain the necessary and sufficient information to demultiplex and present programs (ISO 13818-1). Note further that our syntactic restrictions will not stop us from extending our codec to handle the complete set of DVB Service Information (SI) tables, which contain just about all of the metadata used in European digital broadcasts. Note finally that we remain faithful to our aims: Raymond emphasises the need to release early and often, so that users can drive (and implement, even, in an open source world) future developments. By reducing scope, we allow for this early feedback. We must take care, though, to follow another Unix maxim, and keep our design extensible.
/**
* @brief Decode a descriptor.
* @param begin The start of the descriptor data
* @param end One past the end of the descriptor data
* @param out Output stream for the decoded data
*/
bool
decodeDescriptor(desc_iter begin, desc_iter end, std::ostream & out)
{
out << "descriptor() {\n";
if (begin != end) {
out << " descriptor_tag 8 = "
<< *begin++ << "\n";
}
// We don't know yet how much data we need to decode.
// Use a special non-zero value to indicate this.
unsigned int to_decode = 0xff;
if (begin != end) {
to_decode = *begin++;
out << " descriptor_length 8 = "
<< to_decode << "\n";
out << " for(i=0; i<N; i++) {\n";
while (begin != end && to_decode != 0) {
out << " "
<< "private_data_byte 8 = "
<< *begin++ << "\n";
--to_decode;
}
out << " }\n";
}
if (begin != end) {
out << "ERROR: descriptor() too long.\n";
} else if (to_decode != 0) {
out << "ERROR: end of data reached. "
<< "descriptor() incomplete.\n";
} else {
out << "}\n";
}
return begin == end && to_decode == 0;
}
The function above is a direct first attempt at writing a descriptor decoder. Whilst there may be some mileage in this approach, some weaknesses are apparent:
Now is not the time to deal with these weaknesses. We shall simply note that the first is simple to fix and the second can easily be improved on. It's the third weakness which, in the longer term, will lead to code which is harder to understand, maintain and extend. On the other hand, this function demonstrably works on our test data set, which is encouraging; and it's not hard to see how the approach taken could be extended to decode PAT, CAT, NIT and PMT sections – which is all we've decided to do.
We are now in a good position to consider the design of our codec. Three alternatives spring to mind:
All three alternatives are good, and all seem in line with our motivating aims. The third, in particular, exemplifies Raymond's:
Rule of Generation:
Avoid hand-hacking; write programs to write programs when you can.
In chosing which route to take, we should remember the XP mantra: Do the simplest thing possible; which, in UNIX-speak, becomes the more prosaic: Keep it Simple, Stupid!
For those who want to implement their own codec, you can find a link zip archive here. This archive contains:
For those who'd prefer to see my attempt; proceed to part two of this article.
Progress has been made, and without the need to compromise our artistic aims. Even before we've completed the project, we've started to receive the main benefit: of understanding our data.
The second part of this article is where we compromise, get our hands dirty, bite off more than we can chew – that is, we write some code.
Copyright © 2004 Thomas Guest |