TIMTOWTDI vs TSBO-APOO-OWTDI

2018-04-19, Comments

TIMTOWTDI

TIMTOWTDI stands for “There is more than one way to do it”, an approach promoted by the Perl community.

The mindset behind it gets explored in more detail by the language’s creator, Larry Wall, in a talk given in 1999: “Perl, the first postmodern computer language”. He attributes the slogan to his daughter, Heidi, who says it’s a strategy which works well in her maths class; and she associates it with another saying used at school: “Tsall Good”. This doesn’t mean everything is good, or even that everything has good bits. It means, overall, things are good. See the big picture.

Perl epitomises this. It’s eclectic and inclusive, supporting a variety of styles. One-liner? Fine! Like a shell script? Sure! Structured programming, object-oriented, functional? Why not! Tsall good.

I like that.

But do I feel that way about programming?

TSBO-APOO-OWTDI

A contrasting mantra appears in the Zen of Python, a list of aphorisms which summarise the guiding principles behind Python’s design. Item number 13 states “There should be one — and preferably only one — obvious way to do it.”

As if conceding that this sounds overly prescriptive, item 14 tempers it: “Although that way may not be obvious at first unless you’re Dutch.”

Guido van Rossum, Python’s BDFL — Benevolent Dictator For Life — would be the Dutch person who finds things obvious. That’s right: Dictator. Programmers don’t like being told what to do any more than two-year-olds do. How then has Python become so popular?

Maybe emphasis falls on should. There should be only one obvious way to do it: it’s just that — Dutch or otherwise — we haven’t got there yet.

TIMTOP

For example, there is more than one Python. Obviously there’s Python 2 and Python 3, but it’s less obvious which to use. Don’t forget PyPy. Increasingly Python comes packaged with data processing and visualisation extensions, served up as a Jupyter notebook.

TIMTOPOM

There is more than one program options module.

When I started with Python there was getopt, the one and only command line handler. Coming from a C/C++ background I was quite happy to use something resembling GNU’s getopt. Then optparse appeared. Now there’s argparse. All three ship in the standard library. Which should I use? Not optparse: that’s deprecated (unless I’m already using it and it works, that is). Regarding the other contenders, the documentation archly notes:

Users who are unfamiliar with the C getopt() function or who would like to write less code and get better help and error messages should consider using the argparse module instead.

There are other, non-standard, Python options for parsing a command line too: ones which derive a parser from the usage notes, or from inspecting the code you want to expose.
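To make the choice concrete, here’s the same toy interface written twice, first with getopt, then with argparse. It’s a sketch only: the option names are invented for illustration.

import sys
import getopt

# getopt: C-style option handling; you do the bookkeeping yourself.
opts, rest = getopt.getopt(sys.argv[1:], 'vo:', ['verbose', 'output='])
verbose, output = False, 'out.txt'
for opt, value in opts:
    if opt in ('-v', '--verbose'):
        verbose = True
    elif opt in ('-o', '--output'):
        output = value

import argparse

# argparse: declare the interface; parsing, --help and error messages come for free.
parser = argparse.ArgumentParser(description='The same toy interface')
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('-o', '--output', default='out.txt')
args = parser.parse_args()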

There is more than one way to do it.

TIMTOUTF

There is more than one unit test framework. The obvious one, unittest, like getopt, draws inspiration from elsewhere — in this case Java’s JUnit. Unfortunately the port is too faithful: you have to subclass TestCase and so on just to test something. I much prefer pytest, which flexes the language itself to deliver test assertions as plain asserts.

There’s also a doctest module in the standard library which executes and checks code found in strings (hold that thought!), and there are many other non-standard testing frameworks.
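To see the difference in the small, here’s the same trivial check written both ways. A sketch only: it tests nothing more exciting than sorted().

import unittest

class TestSorting(unittest.TestCase):
    # unittest: subclass TestCase and use the assert* methods
    def test_sort(self):
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])

# pytest: a plain function and a plain assert
def test_sort():
    assert sorted([3, 1, 2]) == [1, 2, 3]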

There is more than one way to do it.

TIMTOWOFS

There is more than one way of formatting strings.

As we’ve seen there’s more than one Python, and libraries are always up for reinvention. This is arguably evolution rather than a multiplicity of options. That is, the most recent way to do it should be preferred.

When it comes to string formatting, though, there has always been more than one way to do it, and more ways are still being added.

Do you use 'single' or "double" quotes for a string? """Triple""" quotes? Raw strings? Raw with an r or with an R? TIMTOWTDI.

What if you want to embed the value of a variable in a string? Users familiar with C’s printf() function might prefer % formatting. Fans of $shell $parameter $expansion can use template strings.

Advanced string formatting (str.format) appeared in Python 3.0, backported to Python 2.6. No doubt it has advantages over % formatting, but for me it’s a little more obscure and a little less obvious. Python 3.6 introduces f-strings which build on str.format and knock down my reservations. The syntax allows you to evaluate expressions in strings: evidently Python is heading in Perl’s direction.
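For the record, here’s the same greeting rendered four ways. The wording is mine; the mechanisms are Python’s.

from string import Template

name, count = 'World', 2
print('Hello, %s! You have %d messages.' % (name, count))      # % formatting
print(Template('Hello, $name!').substitute(name=name))         # template strings
print('Hello, {}! You have {} messages.'.format(name, count))  # str.format
print(f'Hello, {name}! You have {count + 1} now.')              # f-string, evaluating an expression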

APSDOTADIW

Let’s finish by returning to Perl, and to Larry Wall’s 1999 talk.

How many times have we heard the mantra that a program should do one thing and do it well?

Perl is not that program. Perl wants to do everything well. It integrates features and makes no attempt to homogenise them.

You’ve all heard the saying: If all you have is a hammer, everything starts to look like a nail.

Perl is no hammer: it has memorably been described as a Swiss army chainsaw, but Larry Wall likens it to a more conventional tool.

If all you have is duct tape, everything starts to look like a duct. Right. When’s the last time you used duct tape on a duct?

Python may aspire to offer a single obvious way to do something. It fails splendidly, being more duct tape than hammer.


I presented this blog post as a lightning talk at PyDiff a couple of days ago. The slides are here. The talk was recorded too: I appear about 24 minutes in.

DDD Wales, 2018

2018-03-25, Comments

The first ever DDD Wales was held yesterday at TechHub Swansea. It was a free-to-attend, one-day event comprising five one-hour session slots, each split into three parallel tracks: 15 sessions to choose from. Additionally, there were lightning talks in the lunch break.

I enjoyed Kevin Jones’ introduction to Kotlin, the more so since it was almost entirely coded live. Kevin ably demonstrated Kotlin to be “Java without the ceremony”. I could see connections with other modern compiled languages — Swift for example — languages which aren’t feature-shy, but which aim for a light, clean syntax; languages which build on existing systems and libraries. It was interesting to see his use of the JetBrains IDE as a teaching aid, and indeed to pick up on audience thoughts on the use of IDEs to flesh out code.

Chris Cundill’s talk on “release flow” was another highlight. You may not have heard of release flow but you’ll know what it is: a tried and tested strategy for code branching. Chris used his talk to challenge and call out some more recent alternatives — Gitflow being the prime target. The session got me thinking. One dimension Chris didn’t cover was people: personalities, roles and permissions. Who can merge to which branch? Which developers work in private then push bulk updates? Git has won the version control system battle. The fight has moved into surrounding areas: branching, merging, reviewing, continuous integration, and the competition is bringing improvements in tooling and best practice.

The final talk I attended was David Carboni’s session on creating a minimal Docker container to run a microservice written in Go. David started off by explaining why simplicity matters. I agree. I couldn’t agree more. The rest of the session was, again, live coding, replaying a demo which uses the techniques described in a couple of blog posts to whittle a Docker container down from a base size of ~700MB to a scratch size of ~7MB.

All in all, a great day. The split-level venue suited the three track conference well. The speakers delivered terrific sessions which the audiences engaged with. I’d like to thank the organisers, sponsors, speakers, and other attendees.

Perec @IgniteSwansea #3

2018-02-03, Comments

Puzzle

At Ignite Swansea #3 I spoke about Georges Perec’s masterpiece, Life A User’s Manual.

Perec was — and indeed still is — a member of OuLiPo, a Parisian literary group interested in exploring the effects of applying mathematical patterns to text. His work seemed an appropriate subject for a presentation constrained to fit the ignite formula:

20 slides × 15 seconds = 5 minutes

It’s material I’ve spoken about before, but the slides are new. The talk was recorded. I enjoyed attempting just-a-minute × 5, though on the night I could have sworn I was subject to a cruel PowerPoint bug which sped up the playback.

OuLiPo fans might like to see if they can find the clinamen. The moustache is what happens in the fourth week of Movember. My thanks to all @IgniteSwansea and Cinema & Co for putting on such a great evening.

Bugwards Compatible

2018-02-02, Comments

Chris Oldwood recently tweeted about “TFW you write some tests for a bit of legacy code before making a change and you unearth a bunch of bugs”

He doesn’t elaborate on what exactly “that feeling” is, but I’ll bet it’s not surprise. Writing tests for code almost invariably shakes something out — perhaps some undocumented assumptions about the inputs; perhaps a failure to match the documented behaviour; perhaps an access violation which will crash the application.

“That feeling” can include relief: the code is legacy code, and evidently the bugs have gone unnoticed or at least unreported. Such relief is often accompanied by a sense of wonder. The bugs may be so severe — the code so broken — that the maintainer questions how it ever worked.

“That feeling” may also be dismay, since the legacy code requires changing. If the existing behaviour is buggy but predictable it could well be that clients have adapted to this behaviour and wouldn’t welcome a fix. In other words, the change will have to be both backwards and bugwards compatible. Chris will have to tread carefully.

Such delicate decisions are not limited to the code’s runtime behaviour. It might seem that, once the code is under test, Chris can refactor mercilessly — renaming variables, updating idioms, tidying layout. Again, tread carefully! Make sure the code is under test. Be aware of the differences which reviewers must evaluate. Consider the wider context. Respect the original authors.

Meetup? Turn Up!

2018-01-18, Comments

Sell out

Monday’s Agile Bath & Bristol meetup was a sell out. All 50 available spaces were taken. I myself was lucky to get a place — I was on the waiting list until a spot opened up at 3pm on the day. And I was the speaker!

The reality turned out to be different: of the 50 who’d claimed spaces, roughly one in four actually showed up.

I know, it was Blue Monday. I know, there are bugs going round — my daughter has been running a fever and didn’t go in to college, and if I hadn’t been presenting I myself would have cancelled to stay in with her. I know, your work day over-runs, you’re hungry, something else comes up. I know, it’s a free event, so it’s not as though you’ve lost anything.

Despite all these things a 25% turnout reflects badly on us all. It’s unfair on the sponsors and organisers, especially when refreshments are offered. It’s impolite to those who turn up at the advertised time, and must then sit waiting in case more people who’ve said they’re coming actually show up. It’s tough on the speakers: planning a session for an audience of 50 is different to one you’d plan for 12.

I realise 25% is egregiously low, but — in my experience — 50% is far from unusual, and even considered acceptable. I think it’s shabby. The one excuse no one has is forgetting the event is on — Meetup etc. integrate with your calendar and issue repeated reminders to attend and requests to cancel if you cannot.

So, my thanks to those who turned up and participated. I enjoyed it. Smaller numbers allowed for a more collaborative session. Ironically, topics discussed included punctuality, respect, commitment.

Please, don’t sign up to a meetup you don’t plan to attend. If you decide to cancel, release your place so someone else can have it. Otherwise, arrive on time.

Advent of Code 2017

2018-01-12, Comments

A big thanks to Eric Wastl for another great Advent of Code. Inspired by Peter Norvig, I’ve published my solutions as a Jupyter notebook.

Done!

Computer World

2017-09-30, Comments

The Hitchhiker's Guide to the Galaxy

Douglas Adams’ “Hitchhiker’s Guide to the Galaxy” tells the story of the two most powerful computers ever made. The first, Deep Thought, was designed to figure out the meaning of Life, the Universe and Everything. After 7,500,000 years of processing it came up with the concise but unedifying Ultimate Answer of 42. It then undertook the task of designing its successor, a computer sophisticated enough to calculate the Ultimate Question:

“… A computer which can calculate the Question to the Ultimate Answer, a computer of such infinite and subtle complexity that organic life itself shall form part of its operational matrix … Yes! I shall design this computer for you. And I shall name it also unto you. And it shall be called … The Earth.”

When I first heard this it seemed ridiculous. Now, almost 40 years on, I’ve realised it’s true.

It’s no longer correct to think of computers as discrete units. Computers have the property that when you connect two of them you get another computer, and so on. The network is the computer. The Apple in your hand, the Echo on your shelf, the chip in your shopping trolley — all combine to form a global connected device. And as Adams predicted, we ourselves form part of the operating system, constantly feeding data back in.

Planet Earth

Douglas Adams’ insight puts software development into perspective. True: we no longer print our product on silicon or ship it in boxes, and yes: we accept construction is not the right metaphor, but: nor is production. Professor Dave Snowden talks about entanglement — think of a system growing like brambles in a thicket. He emphasises what’s natural, evolutionary and human. Object oriented design lost out when it narrowed its focus. Remember, people are objects too. The world is our platform.

SwanseaCon 2017

2017-09-26, Comments

I’m just back from a packed two days at SwanseaCon and would like to thank the organisers, speakers and participants for making such a welcoming and diverse conference happen right where I live.

For me, highlights included:

  • Professor Dave Snowden’s erudite and slide-free talk. It was a privilege to listen: although I may not quite have kept up, I will certainly follow up
  • Irina Tsyganok’s sensitive and inspirational presentation on pair-programming
  • Lawrence Weetman’s dishwashing demo — with three on-stage helpers, no less
  • Scott Fulton’s honest report on some personal agile life lessons

The venue wasn’t too shabby either.

The view from SwanseaCon

Having attended a fair few technical conferences, it felt refreshing and important to be part of something a little softer. Software development is about community, communication and culture, and SwanseaCon scored top marks on all three.

So, again, thanks!

Pay rise please

2017-07-25, Comments

In 1967 Georges Perec wrote a radio play, L’Augmentation, in which you, the protagonist, make some delicate decisions whilst negotiating a pay rise. Is your boss, Mr X, in a good mood? Did he have fish for lunch? Does one of his daughters have measles?

Flowchart

The story takes the form of a flow chart. It’s no coincidence it was written when computers were becoming a part of office and laboratory work, and when flow charts became a popular way to represent the algorithms they were programmed with. 1967 is also the year Perec joined Oulipo, a literary organisation whose members seek to create works based around mathematical constraints and structures. The loops and branches of Perec’s flow chart perfectly embody the frustrating routines of office politics.

Fifty years on, I’ve created a version to run on Alexa, which you can download from Amazon. It may not get me a pay rise, but I should qualify for a freebie Echo Dot.

Georges Perec

Follow me follow me

2017-07-04, Comments

Leap

There were 12 of us in the room, plus Jim the instructor.

We had moved the tables and chairs to one side. Jim asked us to stand in the space we’d created. He asked each of us to pick two other people in the room, without telling anyone who we’d chosen.

The object of this exercise, Jim said, is to move around until each person is at an equal distance from the two people they’ve chosen.

Jim appointed Jo project manager.

You’re the project manager, Jim said. Get your team organised.

Jo had no particular instructions but as a team we instinctively started moving, stepping and turning. I was tracking Mark and Jo — who I’d chosen before her appointment as PM. I imagined a straight line on the floor of the room between the two of them and equidistant from both and walked towards it, adjusting my trajectory as they too moved. Ruth and Paul must have been following me: as I moved they turned and moved too.

Quite quickly we slowed down, making fine adjustments, shuffling into a final position. Was everyone happy? Yes? We’d done it.

Good work Jo!

What had Jo done? She’d let us get on with it. We were a self-organising team. What had I done? I’d suppressed my mathematical instinct to think before moving — I’d leapt rather than looked.

Look

This is a great team training exercise. It encourages exploration and trust. Evidently it works, too — however people choose who to follow, there will be a solution which can be discovered using a simple and sensible strategy. It made me think of shoal behaviour in animals, where there is no leader and each individual follows the same small set of rules, but the apparently sophisticated resulting behaviour suggests the shoal has a mind of its own.

animation

I experimented with a computer simulation. The picture above is a static snapshot — click it to access the live page and explore different configurations. I plan to experiment with different shoaling strategies and to expose more controls. The source code is at github.com/wordaligned/followme.
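If you’d like the rule without the repo, here’s a minimal sketch of one strategy: everyone nudges towards the midpoint of their two targets, a point which is always equidistant from both. It isn’t the followme code itself, and the names and step size are made up.

import random

def step(positions, choices, rate=0.1):
    '''One round: each person moves a little towards the midpoint of their two targets.'''
    moved = {}
    for person, (a, b) in choices.items():
        x, y = positions[person]
        (ax, ay), (bx, by) = positions[a], positions[b]
        mx, my = (ax + bx) / 2, (ay + by) / 2
        moved[person] = (x + rate * (mx - x), y + rate * (my - y))
    return moved

people = list('ABCDEFGHIJKL')  # 12 of us in the room
positions = {p: (random.random(), random.random()) for p in people}
choices = {p: tuple(random.sample([q for q in people if q != p], 2)) for p in people}
for _ in range(200):
    positions = step(positions, choices)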

Unleash the test army

2017-05-29, Comments

Are the tests adequate?

Recently I described a solution to the problem of dividing a list into evenly sized chunks. It’s a simple enough problem with just two inputs: the list (or other sliceable container) xs and the number of chunks n. Nonetheless, there are traps to avoid and special cases to consider — what if n is larger than the list, for example? Must the chunks comprise contiguous elements from the original list?

The tests I came up with are straightforward and uninspiring. They were developed within the context of my own assumptions about the solution and the special cases I could imagine. They were written after the implementation — which is to say, development wasn’t driven by tests. They are whitebox tests, designed to cover the various paths through the code based on my insider knowledge.

Are these tests adequate? Certainly they don’t accurately represent the data which will hit the algorithm in practice. Can we be sure we haven’t missed anything? Would the tests still cover all paths if the implementation changed?

Property based testing

David R MacIver described another, complementary, approach at a talk I attended at ACCU 2016. In the talk abstract he characterises the (class of) tests I’d written as anecdotal — “let me tell you about this time I called a function … and then it returned this … and then it raised an exception … etc. etc.”

How about if the test suite instead describes the properties required of the system under test, and then conjures up inputs designed to see if these properties hold under stress? So, rather than our test suite being a limited set of input/output pairs, it becomes an executable specification validated by a robot army.
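Here’s what such an executable specification might look like for the chunking function, using Hypothesis, the property-based testing library David MacIver maintains. It’s a sketch, assuming the final chunk() from the slicing post is importable.

from hypothesis import given, strategies as st

@given(st.lists(st.integers()), st.integers(min_value=1, max_value=50))
def test_chunk_properties(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n                          # exactly n chunks come back
    assert [x for c in chunks for x in c] == xs      # nothing lost, nothing reordered
    sizes = [len(c) for c in chunks]
    assert max(sizes) - min(sizes) <= 1              # as evenly sized as possible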

China's Robot Army

Lazy sequences working hard

2017-05-16, Comments

I gave a talk @PyDiff this evening in the computer science department at Cardiff University.

Lazy Sequences working hard

Python has no problem handling large and even infinite streams of data. Just write lazy programs — code which defers data access until the last minute. This talk examines Python’s language and library support for such delaying tactics. There will be live coding, and we’ll draw parallels with similar features in other languages, in particular the Unix shell.
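To give a flavour of what “lazy” means here (in the spirit of the talk, not lifted from the notebook):

from itertools import count, islice

squares = (n * n for n in count(1))          # an infinite, lazy sequence
odd_squares = (s for s in squares if s % 2)  # still lazy: nothing computed yet
print(list(islice(odd_squares, 5)))          # [1, 9, 25, 49, 81]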

Being unsure where to pitch it, I started off easy and kept going until I’d lost everybody — including myself.

The room was well set up with a good quality projector and whiteboard, along with a desk to sit down when I wanted to write and run code and plenty of space to move around in otherwise. I did feel a bit like a jack-in-the-box by the end.

Me @PyDiff

I’d based the talk on a Jupyter notebook which I replayed with the ingenious RISE reveal.js extension written by Damian Avila. This worked well, since I got the pretty graphics along with the interactive coding. A static version of the slides is available here.

Thanks to everyone who came. Sorry I had to rush off after. If anyone would like to talk at Swansea, please let me know: you’d be most welcome.

Slicing a list evenly with Python

2017-05-14, Comments

Sliced Python

Here’s a problem I came up against recently.

The task was to chop a list into exactly n evenly sized chunks. To give a little more context, let’s suppose we want to divide a list of jobs equally between n workers, where n might be the number of CPU cores available.

We can build the result by repeatedly slicing the input:

def chunk(xs, n):
    '''Split the list, xs, into n chunks'''
    L = len(xs)
    assert 0 < n <= L
    s = L//n
    return [xs[p:p+s] for p in range(0, L, s)]

This looks promising:

>>> chunk('abcdefghi', 3)
['abc', 'def', 'ghi']

but if the size of the list is not an exact multiple of n, the result won’t contain exactly n chunks.

>>> chunk('abcde', 3)
['a', 'b', 'c', 'd', 'e']
>>> chunk('abcdefgh', 3)
['ab', 'cd', 'ef', 'gh']
>>> chunk('abcdefghij', 3)
['abc', 'def', 'ghi', 'j']

(By the way, I’m using strings rather than lists in the examples. The code works equally well for both types, and strings make it slightly easier to see what’s going on.)

One way to fix the problem is to group the final chunks together.

def chunk(xs, n):
    '''Split the list, xs, into n chunks'''
    L = len(xs)
    assert 0 < n <= L
    s, r = divmod(L, n)
    chunks = [xs[p:p+s] for p in range(0, L, s)]
    chunks[n-1:] = [xs[-r-s:]]
    return chunks

Now we have exactly n chunks, but they may not be evenly sized, since the last chunk gets padded with any surplus.

>>> chunk('abcde', 3)
['a', 'b', 'cde']
>>> chunk('abcdefgh', 3)
['ab', 'cd', 'efgh']
>>> chunk('abcdefghij', 3)
['abc', 'def', 'ghij']

What does “evenly sized” actually mean? Loosely speaking, we want the resulting chunks as closely sized as possible.

More precisely, if the result of dividing the length of the list L by the number of chunks n gives a size s with remainder r, then the function should return r chunks of size s+1 and n-r chunks of size s. There are choose(n, r) ways of doing this. Here’s a solution which puts the longer chunks to the front of the results.

def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    L = len(xs)
    assert 0 < n <= L
    s, r = divmod(L, n)
    t = s + 1
    return ([xs[p:p+t] for p in range(0, r*t, t)] +
            [xs[p:p+s] for p in range(r*t, L, s)])

Here’s a second implementation, this time using itertools. Chaining r copies of s+1 and n-r copies of s gives us the n chunk widths. Accumulating the widths gives us the list offsets for slicing — though note we need to prepend an initial 0. Now we can form a (this, next) pair of iterators over the offsets, and the result is the list of slices taken from the original list at successive (begin, end) offset pairs.

from itertools import accumulate, chain, repeat, tee

def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)
    return [xs[sl] for sl in map(slice, b, e)]

This version does something sensible in the case when the number of slices, n, exceeds the length of the list.

>>> chunk('ab', 5)
['a', 'b', '', '', '']

Finally, some tests.

def test_chunk():
    assert chunk('', 1) == ['']
    assert chunk('ab', 2) == ['a', 'b']
    assert chunk('abc', 2) == ['ab', 'c']
    
    xs = list(range(8))
    assert chunk(xs, 2) == [[0, 1, 2, 3], [4, 5, 6, 7]]
    assert chunk(xs, 3) == [[0, 1, 2], [3, 4, 5], [6, 7]]
    assert chunk(xs, 5) == [[0, 1], [2, 3], [4, 5], [6], [7]]
    
    rs = range(1000000)
    assert chunk(rs, 2) == [range(500000), range(500000, 1000000)]

Agile at a distance 👍

2017-04-14, Comments

We are here

I’m happy to be part of a team which supports remote working. This post collects a few notes on how agile practices fare when people may not be colocated. I don’t claim what’s written to be generally true; rather, it’s specific to me, my team, and how we work.

Remote team

We’re by no means entirely distributed. There are seven of us in the engineering team, all UK based. We have a dedicated office space and of the seven, four are office-based and three are remote workers. Office-based staff are free to work from home when it suits. I’m office-based but work from home around 40% of the time, for example. Remote workers typically visit the office every couple of weeks to attend the sprint ceremonies — review, retrospective, planning. The rest of the company is more distributed and mobile, comprising product and marketing, medical experts, operations engineers, sales and admin.

WFH

From bytes to strings in Python and back again

2017-03-24, Comments

Low level languages like C have little opinion about what goes in a string, which is simply a null-terminated sequence of bytes. Those bytes could be ASCII or UTF-8 encoded text, or they could be raw data — object code, for example. It’s quite possible and legal to have a C string with mixed content.

char const * mixed =
    "EURO SIGN "          // ASCII
    "UTF-8 \xE2\x82\xAC " // UTF-8 encoded EURO SIGN
    "Latin-9 \xA4";       // Latin-9 encoded EURO SIGN

This might seem indisciplined and risky but it can be useful. Environment variables are notionally text but actually C strings, for example, meaning they can hold whatever data you want. Similarly filenames and command line parameters are only loosely text.

A higher level language like Python makes a strict distinction between bytes and strings. Bytes objects contain raw data — a sequence of octets — whereas strings are Unicode sequences. Conversion between the two types is explicit: you encode a string to get bytes, specifying an encoding (which defaults to UTF-8); and you decode bytes to get a string. Clients of these functions should be aware that such conversions may fail, and should consider how failures are handled.
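A minimal round trip pins down the terminology. The euro sign is chosen simply because it reappears in the code listing at the end of this post.

text = 'EURO SIGN €'
data = text.encode('utf-8')                 # str -> bytes, encoding chosen explicitly
assert data == b'EURO SIGN \xe2\x82\xac'
assert data.decode('utf-8') == text         # bytes -> str recovers the original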

Simply put, a string in Python is a valid Unicode sequence. Real world text data may not be. Programmers need to take charge of reconciling any discrepancies.

We faced such problems recently at work. We’re in the business of extracting meaning from clinical narratives — text data stored on medical records systems in hospitals, for example. These documents may well have passed through a variety of systems. They may be unclear about their text encoding. They may not be encoded as they claim. So what? They can and do contain abbreviations, mispellings, jargon and colloquialisms. Refining the signal from such noise is our core business: if we can correctly interpret positional and temporal aspects of a sentence such as:

Previous fracture of left neck of femur

then we can surely deal with text which claims to be UTF-8 encoded but isn’t really.

Our application stack is server-based: a REST API to a Python application handles document ingest; lower down, a C++ engine does the actual document processing. The problem we faced was supporting a modern API capable of handling real world data.

It’s both undesirable and unnecessary to require clients to clean their text before submitting it. We want to make the ingest direct and idiomatic. Also, we shouldn’t penalise clients whose data is clean. Thus document upload is an HTTP POST request, and the document content is a JSON string — rather than, say, base64 encoded binary data. Our server, however, will be permissive about the contents of this string.

So far so good. Postel’s prescription advises:

Be liberal in what you accept, and conservative in what you send.

This would suggest accepting messy text data but presenting it in a cleaned up form. In our case, we do normalise the input data — a process which includes detecting and standardising date/time information, expanding abbreviations, fixing typos and so on — but this normalised form links back to a faithful copy of the original data. What gets presented to the user is their own text annotated with our findings. That is, we subscribe to a more primitive prescription than Postel’s:

Garbage in, garbage out

with the caveat that the garbage shouldn’t be damaged in transit.

Happily, there is a simple way to pass dodgy strings through Python. It’s used in the standard library to handle text data which isn’t guaranteed to be clean — those environment variables, command line parameters, and filenames for example.

The surrogateescape error handler smuggles non-decodable bytes into the (Unicode) Python string in such a way that the original bytes can be recovered on encode, as described in PEP 383:

On POSIX systems, Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.

This workaround is possible because Unicode surrogates are intended for use in pairs. Quoting the Unicode specification, they “have no interpretation on their own”. The lone trailing surrogate code — the half-a-pair — can only be the result of a surrogateescape error handler being invoked, and the original bytes can be recovered by using the same error handler on encode.

In conclusion, text data is handled differently in C++ and Python, posing a problem for layered applications. The surrogateescape error handler provides a standard and robust way of closing the gap.

Unicode Surrogate Pairs

Surrogates

Code Listing

>>> mixed = b"EURO SIGN \xE2\x82\xAC \xA4"
>>> mixed
b'EURO SIGN \xe2\x82\xac \xa4'
>>> mixed.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 14:
  invalid start byte
>>> help(mixed.decode)
Help on built-in function decode:

decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.

>>> mixed.decode(errors='surrogateescape')
'EURO SIGN € \udca4'
>>> s = mixed.decode(errors='surrogateescape')
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 12:
  surrogates not allowed
>>> s.encode(errors='surrogateescape')
b'EURO SIGN \xe2\x82\xac \xa4'