A Generated Text Codec

The program which follows is a short but non-trivial Python script. It makes use of a couple of text codecs from the Python standard library to generate a C++ function. This C++ function converts a single character from ISO 8859-9 encoding into UTF-8 encoded Unicode.

def warnGenerated():
   '''Return a standard "generated code" warning.'''
   import sys, time
   return (
       '// generated by %s, %s' %
       (' '.join(sys.argv),
        time.asctime()))

def functionHeader(codec):
   '''Return the decode function header.'''
   # The generated function name, <codec>_to_utf8, is an assumed naming scheme.
   return '''\
/**
 * @brief Convert from %(codec)s into UTF-8 encoded Unicode
 * @param %(codec)s An %(codec)s encoded character
 * @param it Reference to an output iterator
 * @note If the input character is invalid, the Unicode
 * replacement character U+FFFD will be output.
 */
template <typename output_iterator>
void %(codec)s_to_utf8(
   unsigned char %(codec)s,
   output_iterator & it)''' % { 'codec' : codec }

def convertCh(ch, codec):
   '''Return the 'case' statement converting
   the input character using the supplied codec'''

   from unicodedata import name

   ucs = chr(ch).decode(codec, 'replace')
   utf = ucs.encode('utf-8')
   ucname = name(ucs, 'Control code')
   action = '; '.join(['*it++ = 0x%02x' % ord(c)
                       for c in utf])

   return '''case 0x%02x: // %s
   %s;
   break;''' % (ch, ucname, action)

def codeBlock(prefix, body, indent = ' ' * 4):
   '''Return an indented code block.

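   This code block will be formatted:
   prefix
   {
       body
   }'''
   import re
   # Matching '^' in MULTILINE mode inserts the indent at the start of every line.
   indent_re = re.compile('^', re.MULTILINE)
   return '''%s
{
%s
}''' % (prefix, indent_re.sub(indent, body))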

codec = 'iso8859_9'

print warnGenerated()

print codeBlock(
   functionHeader(codec),
   codeBlock(
       'switch(%s)' % codec,
       # iso8859-* encodings are 8-bit
       '\n'.join([convertCh(ch, codec)
                  for ch in range(0x100)]),
       indent = '' # don't indent case: labels
       )
   )

By now, it should go without saying that this script is a metaprogram. Before discussing why I think it's a good use of metaprogramming, some notes:

I like this script since it makes use of the standard Python library to create code we can use in a C++ program. The hard work goes on in the calls to encode() and decode() and we don't even have to look at the implementations of these functions, let alone maintain them. Their speed does not affect the speed of our C++ function and I am willing to trust their correctness, meaning I don't have to locate or purchase the ISO 8859 standards.
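To make the point concrete, here is a minimal sketch of the decode/encode round trip for a single byte. It is written in Python 3 syntax, unlike the Python 2 script above, and the helper name utf8_bytes is a hypothetical one introduced only for this illustration.

```python
import unicodedata

def utf8_bytes(ch, codec):
    """Return (Unicode name, list of UTF-8 byte values) for one 8-bit input.

    Hypothetical helper: mirrors the decode()/encode() step of the script,
    but in Python 3, where a single byte is built with bytes([ch]).
    """
    ucs = bytes([ch]).decode(codec, 'replace')
    # Control characters have no Unicode name, so supply a default.
    name = unicodedata.name(ucs, 'Control code')
    return name, list(ucs.encode('utf-8'))

name, utf8 = utf8_bytes(0xD0, 'iso8859_9')
print(name, ['0x%02x' % b for b in utf8])
# LATIN CAPITAL LETTER G WITH BREVE ['0xc4', '0x9e']
```

All of the character-set knowledge lives inside the two library calls; the rest is formatting.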

The second big win is that all the boilerplate code is generated without effort. If, at some point in the future, we need a fuller range of ISO 8859 text converters, then we tweak the script so the final section reads, for example:

codecs = ['iso8859_%d' % n for n in range(1, 10)]

print warnGenerated()

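for codec in codecs:
    print codeBlock(
        functionHeader(codec),
        codeBlock(
            'switch(%s)' % codec,
            '\n'.join([convertCh(ch, codec)
                       for ch in range(0x100)]),
            indent = ''))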

and let it run. And should we decide on a different strategy for handling invalid input data, again, the metaprogram is our friend.
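As a rough sketch of what changing the invalid-input strategy might look like, the error-handling argument passed to decode() can be varied. This is Python 3 syntax rather than the Python 2 of the script, and case_action is a hypothetical helper, not part of the original program. The example relies on byte 0xA5 being unassigned in ISO 8859-3.

```python
def case_action(ch, codec, errors='replace'):
    """Return the C++ statements for one input byte.

    Hypothetical helper. Returns None when the byte is invalid and the
    'strict' policy is in force, leaving the decision to the caller.
    """
    try:
        ucs = bytes([ch]).decode(codec, errors)
    except UnicodeDecodeError:
        return None
    return '; '.join('*it++ = 0x%02x' % b for b in ucs.encode('utf-8'))

# 0xA5 is unassigned in ISO 8859-3.
print(case_action(0xA5, 'iso8859_3', 'replace'))  # *it++ = 0xef; *it++ = 0xbf; *it++ = 0xbd
print(case_action(0xA5, 'iso8859_3', 'ignore'))   # empty string: emit nothing
print(case_action(0xA5, 'iso8859_3', 'strict'))   # None
```

With 'strict', the generator could emit an error return or a throw for that case label instead; which is appropriate depends on the calling C++ code.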

Copyright © 2005 Thomas Guest