Low level languages like C have little opinion about what goes in a string, which is simply a null-terminated sequence of bytes. Those bytes could be ASCII or UTF-8 encoded text, or they could be raw data — object code, for example. It’s quite possible and legal to have a C string with mixed content.
    char const * mixed = "EURO SIGN "          // ASCII
                         "UTF-8 \xE2\x82\xAC " // UTF-8 encoded EURO SIGN
                         "Latin-9 \xA4";       // Latin-9 encoded EURO SIGN
This might seem undisciplined and risky, but it can be useful. Environment variables, for example, are notionally text but actually C strings, meaning they can hold whatever data you want. Similarly, filenames and command line parameters are only loosely text.
A higher level language like Python makes a strict distinction between bytes and strings. Bytes objects contain raw data — a sequence of octets — whereas strings are Unicode sequences. Conversion between the two types is explicit: you encode a string to get bytes, specifying an encoding (which defaults to UTF-8); and you decode bytes to get a string. Clients of these functions should be aware that such conversions may fail, and should consider how failures are handled.
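For instance, a minimal sketch of the round trip and of a failing conversion (the byte values are purely illustrative):

```python
text = "café"

# str -> bytes: encode, UTF-8 unless another encoding is named
data = text.encode()
assert data == b"caf\xc3\xa9"

# bytes -> str: decode; the round trip restores the original
assert data.decode() == text

# Conversion can fail: 0xA4 is not a valid UTF-8 start byte
try:
    b"Latin-9 \xa4".decode()
except UnicodeDecodeError as error:
    print("decode failed:", error.reason)  # prints: decode failed: invalid start byte
```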
Simply put, a string in Python is a valid Unicode sequence. Real world text data may not be. Programmers need to take charge of reconciling any discrepancies.
We faced such problems recently at work. We’re in the business of extracting meaning from clinical narratives — text data stored on medical records systems in hospitals, for example. These documents may well have passed through a variety of systems. They may be unclear about their text encoding. They may not be encoded as they claim. So what? They can and do contain abbreviations, misspellings, jargon and colloquialisms. Refining the signal from such noise is our core business: if we can correctly interpret positional and temporal aspects of a sentence such as:
Previous fracture of left neck of femur
then we can surely deal with text which claims to be UTF-8 encoded but isn’t really.
Our application stack is server-based: a REST API to a Python application handles document ingest; lower down, a C++ engine does the actual document processing. The problem we faced was supporting a modern API capable of handling real world data.
It’s both undesirable and unnecessary to require clients to clean their text before submitting it. We want to make the ingest direct and idiomatic. Also, we shouldn’t penalise clients whose data is clean. Thus document upload is an HTTP POST request, and the document content is a JSON string — rather than, say, base64 encoded binary data. Our server, however, will be permissive about the contents of this string.
So far so good. Postel’s prescription advises:
Be liberal in what you accept, and conservative in what you send.
This would suggest accepting messy text data but presenting it in a cleaned up form. In our case, we do normalise the input data — a process which includes detecting and standardising date/time information, expanding abbreviations, fixing typos and so on — but this normalised form links back to a faithful copy of the original data. What gets presented to the user is their own text annotated with our findings. That is, we subscribe to a more primitive prescription than Postel’s:
Garbage in, garbage out
with the caveat that the garbage shouldn’t be damaged in transit.
Happily, there is a simple way to pass dodgy strings through Python. It’s used in the standard library to handle text data which isn’t guaranteed to be clean — those environment variables, command line parameters, and filenames for example.
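The filename helpers show this in action. A minimal sketch, assuming a POSIX system (where the filesystem encoding is applied with the surrogateescape handler); the byte string is invented for illustration:

```python
import os

raw = b"report-\xa4.txt"  # not valid UTF-8

# os.fsdecode applies the filesystem encoding with the
# surrogateescape error handler, so decoding cannot fail
name = os.fsdecode(raw)

# os.fsencode reverses it: the original bytes come back intact
assert os.fsencode(name) == raw
```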
The surrogateescape error handler smuggles non-decodable bytes into the (Unicode) Python string in such a way that the original bytes can be recovered on encode, as described in PEP 383:
On POSIX systems, Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.
This workaround is possible because Unicode surrogates are intended for use in pairs. Quoting the Unicode specification, they “have no interpretation on their own”. The lone trailing surrogate code, the half-a-pair, can only be the result of a surrogateescape error handler being invoked, and the original bytes can be recovered by using the same error handler on encode.
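A minimal sketch of that recovery (the byte value is illustrative): the non-decodable byte 0xA4 is smuggled in as the lone surrogate U+DCA4, i.e. 0xDC00 + 0xA4, and encoding with the same handler restores it.

```python
data = b"Latin-9 \xa4"  # \xA4 is not valid UTF-8

# decode smuggles the bad byte in as a lone trailing surrogate
text = data.decode("utf-8", errors="surrogateescape")
assert text[-1] == "\udca4"  # 0xDC00 + 0xA4

# encode with the same handler recovers the original bytes
assert text.encode("utf-8", errors="surrogateescape") == data
```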
In conclusion, text data is handled differently in C++ and Python, posing a problem for layered applications. The surrogateescape error handler provides a standard and robust way of closing the gap.
Unicode Surrogate Pairs
    >>> mixed = b"EURO SIGN \xE2\x82\xAC \xA4"
    >>> mixed
    b'EURO SIGN \xe2\x82\xac \xa4'
    >>> mixed.decode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 14: invalid start byte
    >>> help(mixed.decode)
    Help on built-in function decode:

    decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
        Decode the bytes using the codec registered for encoding.

        encoding
          The encoding with which to decode the bytes.
        errors
          The error handling scheme to use for the handling of decoding errors.
          The default is 'strict' meaning that decoding errors raise a
          UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
          as well as any other name registered with codecs.register_error that
          can handle UnicodeDecodeErrors.

    >>> mixed.decode(errors='surrogateescape')
    'EURO SIGN € \udca4'
    >>> s = mixed.decode(errors='surrogateescape')
    >>> s.encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 12: surrogates not allowed
    >>> s.encode(errors='surrogateescape')
    b'EURO SIGN \xe2\x82\xac \xa4'