This article appears in two parts. Part one described the preliminary stages of a miniproject to write a codec for a minilanguage, delivering:
Part two – this part – continues the project and presents the final implementation.
Part one of this article drew inspiration from The Art of UNIX Programming, by Eric S. Raymond. Part two continues to draw from this same source, which applies as readily to implementation as it did to design.
At this point, I can reveal a second motivating source, The Tale of a Struggling Template Programmer, by Stephan Heinzmann, which served to remind me how frustrating software development can be: sometimes the tools are to blame, sometimes the languages appear faulty, and sometimes the poor programmer takes a wrong turn. On a more personal note, it reminded me that I ought to experiment with modern C++.
![]() |
Anyone familiar with both sources will appreciate there's a degree of tension between them. In what follows, I document my attempts to resolve this tension. Along the way, we shall revisit the world of MPEG-2 video encoding and get started with the Boost Spirit library.
To briefly recap, then, our goal is to write a tool to convert the binary
format used in MPEG-2 digital video broadcasting into a textual form and
back again – to write a dvbcodec
. For example, we would like to convert a
section of the Program Association Table (PAT),
whose syntax is as follows:
program_association_section() {
table_id 8
section_syntax_indicator 1
'0' 1
reserved 2
section_length 12
transport_stream_id 16
reserved 2
version_number 5
current_next_indicator 1
section_number 8
last_section_number 8
for (i=0; i<N;i++) {
program_number 16
reserved 3
if(program_number == '0') {
network_PID 13
}
else {
program_map_PID 13
}
}
CRC_32 32
}
The numerical values here represent field widths in bits: the first byte of
the section encodes the table_id
, the next bit the
section_syntax_indicator
, and so on until the final four bytes which encode
the cyclic redundancy check.
The PAT is just one of the tables we would like to decode. There are many others, the next three most important being the the Conditional Access Table, the Program Map Table and the Network Information Table (CAT, PMT and NIT).
The textual output format we decided on should reflect the syntax description as follows:
program_association_section() {
table_id 8 = 0x0
section_syntax_indicator 1 = 0x1
'0' 1 = 0x0
...
CRC_32 32 = 0xcae52d9f
}
We came up with three possible
implementation strategies for our dvbcodec
:
The first strategy holds little appeal: it risks being a recipe for cut-and-paste code and boring repetition. We reject it.
The second and third strategies look to have more going from them,
particularly since we have restricted our scope to a subset of the full
bitstream syntax. Although these strategies appear rather different, they both
require us to parse syntax descriptions of the general form exemplified by the
program_association_section
.
So, we need a parser. We need one capable of handling conditionals and loops: one capable, that is, of handling a Turing-complete minilanguage. Raymond can advise. In general terms, he suggests:
C/C++
, and the library support
offered by these languages is often superior.
On the more specific subject of parsers, Raymond recommends lex
and yacc
though, in keeping with the Unix philosophy of documenting weaknesses, he
admits these tools are not perfect. He also suggests:
If you can implement your parser in a higher-level language than C (which we recommend you do ...) then look for equivalent facilities like Python's PLY ...
I tend to agree with Raymond but I'm not convinced PLY is the way to go here. Of course, it won't get me very far with my aim of finding out about modern C++, but it's also not part of the standard Python distribution. In fact, a web search reveals several other Python parser frameworks – it's unclear which will prevail.
The C++ standard library doesn't provide a parser either. We might make some
progress tokenising our data with std::strtok
or even std::sscanf
,
but this won't suffice. Lex
and yacc
are of course a time-tested
combination, but I'd rather not have to learn two more minilanguages.
The next place to look is in the next best thing to the C++ standard library, namely Boost. Three clicks from the front page takes us to the Spirit parser, which claims to be a scalable parser framework written in C++. We trust the source, the documentation is good, the examples work first time: let's try some code.
The code below is a complete program to recognise lines of the form:
reserved 2 = 0x3
this being the format we arrived at for fields of our text sections.
#include <boost/spirit/core.hpp>
#include <iostream>
#include <string>
using namespace boost;
/**
* @brief Parse a string representing a field
* @returns True if the field matches the format:
* <field_name> <bitwidth> = <value>, false otherwise.
*/
bool
parseField(std::string const & str) {
return spirit::parse(
str.begin(),
str.end(),
spirit::lexeme_d[+spirit::graph_p]
>> spirit::uint_p
>> '='
>> spirit::hex_p,
spirit::space_p).full;
}
int main() {
std::cout
<< "Enter text.\n"
<< "Lines will be matched against: \n"
<< "<field_name> <bitwidth> = <hexvalue>\n"
<< "Type 'q' to quit\n";
std::string str;
std::string const quit("q");
while (std::getline(std::cin, str) &&
str != quit) {
std::cout
<< (parseField(str) ? "hit" : "miss")
<< std::endl;
}
return 0;
}
Here, the action is concentrated in the function parseField()
, which
wraps a call to spirit::parse()
. spirit::parse()
accepts as arguments:
We have used spirit::space_p
directly as our skip parser. This primitive
parser recognizes whitespace and tells spirit::parse()
which characters it
should skip past in the input. A more sophisticated skip parser might be used
to skip comments.
The parser itself is a sequence of sub-parsers which can be read: recognise input consisting of a block one or more printable characters, followed by an unsigned integer, followed by the equals sign, followed by a hexadecimal integer.
Operator overloading is used by Spirit to make such expressions into readable
approximations of EBNF syntax descriptions (see also Antonsen for more on
this technique). Here, we see that operator>>()
has been overloaded as a
sequencing operator, prefix operator+()
has been overloaded to mean "one or
more of", and operator[]()
is overloaded to adapt the behaviour of a
sub-parser – in this case using spirit::lexeme_d
to turn off whitespace
skipping.
I should also mention that the '='
sub-parser is a shorthand for
spirit::ch_p('=')
, which in turn is a parser generator
returning the character literal parser spirit::chlit<char>('=')
.
Similarly, spirit::hex_p
and spirit::uint_p
are parser generator functions
which return suitable specialisations of the spirit::uint_parser
template
struct. The full template parameters of this struct are as follows:
template <
typename T = unsigned,
int Radix = 10,
unsigned MinDigits = 1,
int MaxDigits = -1>
struct uint_parser { /* */ };
The helper functions hex_p
and uint_p
are often good enough, but it's also
useful to have the full flexibility of the base parser. For example, if we
need to match larger hex values, and long long
is available, we could create
an alternative hex parser:
uint_parser<unsigned long long, 16> const
long_long_hex_p
= uint_parser<unsigned long long, 16>();
In fact, the uint_parser
should work with any user defined scalar type.
(You've probably noticed I'm now working in the boost::spirit
namespace. I
continue to do so for the remainder of this article.)
One thing I cannot do with the hex parser, unfortunately, is get it to accept
the 0x
we've used to prefix hex digits. We can fix the bug
in our program by introducing a new parser rule.
with_base_hex_p
= lexeme_d
[
as_lower_d["0x"]
>> hex_p
];
Note here:
as_lower_d
directive, which converts all characters from the input
to lowercase, and therefore recognising both 0x
and 0X
.
"0x"
, which in this context becomes yet another parser.
Simply recognising fields is not enough: we need to act on their contents.
That is, we must associate semantic actions with the sub-parsers. This can be
done using another overload of operator[]()
, which enables us to link an
action to a parser.
Here, then, is an encoder which will convert text versions of sections to
binary. I have omitted #include
directives etc. for brevity. The full
implementation is available with the source distribution for
this article.
typedef std::string::const_iterator iter;
/**
* @brief Put the input value to the output stream
* using the specified bitwidth
*/
void
putBits(std::ostream &, unsigned w, unsigned v)
{ /* */ }
/**
* @brief Parse a field of the form:
* <field_name> <bitwidth> = <value>
*/
bool
parseField(iter const & begin, iter const & end,
unsigned & bitwidth, unsigned & value) {
return parse(
begin,
end,
lexeme_d[+graph_p]
>> uint_p[assign_a(bitwidth)]
>> '='
>> lexeme_d
[
! as_lower_d["0x"]
>> hex_p[assign_a(value)]
]
,
space_p).full;
}
int main() {
std::string str;
int line = 0;
try {
while (std::getline(std::cin, str)) {
++line;
unsigned bitwidth, value;
if (parseField(str.begin(), str.end(),
bitwidth, value)) {
putBits(std::cout,
bitwidth,
value);
}
}
}
catch (std::exception & exc) {
std::cerr
<< "Error parsing line " << line
<< "\n" << str << "\n"
<< exc.what() << std::endl;
return -1;
}
return 0;
}
Note here:
0x
preceding hexadecimal values optional, using
Spirit's overload of operator!()
assign_a
actor for our semantic action. We could have used
any function accepting an unsigned integer or any functor providing
operator()(unsigned int) const
. Again, it's simpler to use one of Spirit's
off-the-peg actors.
std::cout
to binary mode.
Another important point to note about our simple encoder is the way it handles failure conditions using C++ exceptions rather than C-style error codes. There are plenty of failure conditions to handle: a value might not fit in the available bitwidth, the output stream might not be in a suitable state, and so on.
In this simple parser we might equally well have passed error codes around, but a more complex parser is likely to involve recursion and/or nested function calls. Exceptions perform well in both the simple and the complex case, offering a scalable solution.
The Spirit parser framework itself uses exceptions internally for similar reasons. To quote the documentation:
C++'s exception handling mechanism is a perfect match for Spirit due to its highly recursive functional nature. C++ exceptions are used extensively by this module for handling errors.
Like our program, Spirit should not leak any such exceptions to its users.
The simple encoder presented above follows Postel's prescription, to a degree . It doesn't mind too much about whitespace; it allows any sequence of printable characters as a field name; and it isn't fussy about the presentation of hexadecimal numbers.
Its main flaw is that it does not look at the text format of our sections as a
whole: it simply skips the lines which close blocks or start loops, for
example. This means the encoding will quietly do the wrong thing given input
where a new-line has gone missing, or where the data has been truncated. This
is dangerous. It also means the encoder cannot check the integrity of our text
data – for example, to confirm the section_length
field contains the actual
section length, or to validate a CRC.
When we start thinking along these lines, we realise that perhaps the encoder should calculate the values of these fields for us. We'll need a CRC generator anyway – why not embed it in the encoder?
These are important points. However, we never considered data validation when we planned our codec and I'm not going to worry about it just yet – I need to get started on the decoder. Data validation, though valuable, would need to be optional since an encoder must let us generate broken data for test purposes. Also, Raymond encourages us to limit options whenever possible: if we can release code earlier then our users can tell us which options they really want. Ideally, he suggests we make the release open source, and allow users to (submit patches which) implement these options.
We've used Spirit to write a micro-parser to drive the encoder. We're ready to start on the decoder. Spirit's scalability will be tested.
I decided to attempt the second implementation strategy: to develop a codec which understands the bitstream syntax and can use it to parse an arbitrary section format. I had no good reason for preferring this to the code-generator strategy.
As already noted, this is a parsing task. We will use Spirit to define the grammar used by the bitstream syntax. We can then parse our static program data – the section formats we're interested in – which gives us the basis we need to parse the run-time program inputs, that is, actual instances of binary encoded sections.
I do not propose to dwell on the practical use of Spirit for much longer: we've already seen enough of what it can do, so for full implementation details please refer to Spirit and the codec source distribution.
For the decoder, note that simply parsing the data once and associating semantic actions to the various lexical elements is not enough. For instance, to process descriptor loops we need to revisit the same parser node several times. Spirit provides abstract syntax trees for exactly this purpose.
I do think it is worth presenting here a portion of the section grammar. To me, this is a quite remarkable application of C++. For even more remarkable transcriptions of EBNF syntax definitions into Spirit grammars – including a C++ tokenizer and a C parser – I recommend a visit to the Spirit Applications Repository.
/**
* @brief MPEG-2 Section grammar defined using Boost Spirit.
*
* Reference:
* - ISO/IEC 13818-1, MPEG-2 Transport Stream
*/
struct Section :
public boost::spirit::grammar<Section>
{
template <typename ScannerT>
struct definition
{
definition(Section const & /*self*/)
{
section_
= section_ref_
>> section_body_
;
section_ref_
= text_id_
>> '('
>> ')'
;
text_id_
= leaf_node_d[
lexeme_d[
alpha_p
>> * (alnum_p | '_')
]
]
;
quoted_binary_
= leaf_node_d[
lexeme_d[
'\''
>> bin_p
>> '\''
]
]
;
section_body_
= ch_p('{')
>> *( field_
| loop_
| conditional_
| section_ref_
)
>> '}'
;
field_
= identifier_
>> uint_p
;
identifier_
= text_id_
| quoted_binary_
;
conditional_
= str_p("if")
>> condition_
>> section_body_
>> ! (str_p("else")
>> section_body_)
;
condition_
= inner_node_d['('
>> text_id_
>> "=="
>> quoted_binary_
>> ')'
]
;
loop_
= loop_control_
>> section_body_
;
loop_control_
= leaf_node_d[str_p("for")
>> '('
>> * (anychar_p - ')')
>> ')'
]
;
}
...
boost::spirit::rule<ScannerT> const &
start() const
{
return section_;
}
};
};
Many of the decisions taken when writing our naive encoder scale up to the decoder: limited options; exceptions used in preference to error codes; Spirit style guide for grammar definitions; typedefs for iterators.
Some decisions were ones we haven't yet faced. The main one was: where should we put section format definitions (for the PAT, CAT, PMT and NIT)? There are two obvious alternatives:
The second alternative is perhaps most faithful to our original aims. Program logic and program data are nicely separated, and extending the decoder to handle other sections is a simple matter of editing the text file. No recompilation necessary.
Despite these attractions, I went for the first option – partly because it's easier to implement and partly because I didn't want to work out where to put the text file.
The other corner I cut concerns determining how and when to exit loops. The
issue is fully described in the first part of this article (see the subsection
headed Complications. My resolution was to
notice that loops always exit when we've used up the number of bytes encoded
in a length
field – with the single exception of the outermost loop, which
may end four bytes early in order to leave space for a CRC. So, the decoder
maintains a stack of length
fields, testing the top value after each loop
iteration, popping it on loop exit. The first item to be stacked may need
adjusting to allow for the four byte CRC. Again, the source
distribution should make this clear.
I could find no official statement regarding what could be used as a field
name in the section format definitions: inspection suggested that these names
were rather similar to C identifiers, with the important addition of quoted
binary values for fixed fields (e.g. the '0'
which is the third named field in
the program_association_section
format definition).
A few more parse tree node directives might have resulted in a leaner decoder, but I wanted the syntax grammar to be as simple as possible. I am inexperienced in writing grammars and preferred a small amount of extra code in my application. The application is quite compact anyway.
Raymond has lots of practical advice on how to ship a source code distribution, going down to details of file naming conventions. Some of the suggestions I have followed are:
README
BUGS
file, listing known defects and limitations
The BUGS
file is a strangely satisfying thing to write, particularly if
you've ever delivered software which doesn't admit to defects, let alone
limitations (or indeed if you've ever used such software). In this case it is
essential to document both.
Version 0.1 of the dvbcodec
features the naive encoder described in the
preceding – really only of use for system testing (we can check that
decoding then encoding a file recreates the original file). The decoder is
rather better – in fact, I've extended it beyond the original specification
to handle a few more section formats: dvbcodec -l
gives details.
Having done the hard work and shipped our beta release, the rest of this part of the article is dedicated to some closing thoughts.
My criteria for language selection were somewhat artificial. If I had given myself a free hand I almost certainly would have been biased towards Python. However, having gone the C++ route – the modern C++ route, even – it would seem a good point to step back and review my selection.
Raymond has reservations about C++, which I summarise here:
(Incidentally, as already mentioned, Raymond is not knocking C++ to promote C. His recommendation is to adopt scripted and mixed language solutions.)
To fully address all these points is outside the scope of this article. Instead I shall consider each in the context of the development of our codec:
map
, vector
, stack
, string
etc – I have avoided a single direct call to operator new
. If I've got my
exception handling and my use of Spirit right, there should be no leaks.
Regarding C compatibility, I have benefited from the C standard library in a
few places.
RAII
class idiom is
usefully employed. The Spirit framework itself has moved with the times: what
was implemented with run-time polymorphic classes is now a complete rewrite
... using expression templates and static polymorphism.
![]() auto keyword for type deduced variables. See also
Colvin for a radical take on this issue. |
Despite all this, Spirit has proven itself flexible, scalable, capable of expressing grammars clearly and of writing efficient parsers without the need to step outside C++. Indeed, it could never have been done without C++. I am sure I will use Spirit again.
The future of Unix is Linux is open source. Raymond is passionate about software quality and argues convincingly that open source is the best way to deliver the highest quality software. I do not propose to rehearse these arguments here: Raymond's writings are available both in print form and online. (See, for example The Cathedral and the Bazaar).
What does interest me is that I cannot see how the full power of Spirit could be realised using anything other than a full source code distribution. It's all done with header files. Maybe with some reworking the implementation could be delivered in pre-built libraries, multiplied up by the various operating system, platform, version permutations – but wouldn't this make it even harder for compilers to build applications based on Spirit?
Of course, open source means more than just access to source: but it's still notable that this style of C++ favours open source distribution.
I'm still not sure if it would have been better to write a code generator, to generate our codec from the section formats.
Maybe I'll try using Spirit and C++ to generate a C codec.
Copyright © 2004 Thomas Guest |