The list of experts who’ve cautioned against optimising computer programs too soon reads like a who’s who of computer science. Michael A Jackson famously reduced the advice to a couple of succinct rules:
The First and Second Rules of Program Optimisation
- Don’t do it.
- (For experts only!): Don’t do it yet.
The first rule guards us against writing complicated code in the belief it will run faster — it’s simple, readable code we want, and humans are notoriously bad at guessing where the bottlenecks in a program will be. The second rule notes that computers get faster and tools (such as compilers) improve, so a program which may stress a computer today will run comfortably on another computer in six months.
It’s the second of these rules I’m going to discuss. Of course I agree with it, but there are caveats.
Since computers get exponentially faster, linear optimisations to a program are a waste of effort in the long run. This is true, but …
- It’s not always simple to replace a computer with a more up-to-date version: you may be able to download the latest drivers, but you can’t download the latest processor.
- Users’ expectations increase in line with computers’ power: so in six months’ time you may find they want their computer to do rather more.
- In an agile development environment, you’ll probably want to demonstrate your product to a customer today, even if the final delivery is scheduled for some time next year.
- In a typical project, it’s preferable to settle on a hardware platform sooner rather than later.
Recently, the second and third of these caveats bit me. I can’t go into details about the project here — let’s say I was working on a dedicated server which acted a bit like a traffic junction. Several roads fed into the junction and a single road fed out of it. Traffic should flow through this junction smoothly in real time with no user intervention. And the server had to run reliably even when (here’s the specialised hardware requirement) subjected to a high level of vibration.
During preliminary development we used standard off-the-shelf machines running suitably stripped versions of Linux. We chose to divide the job into three core processes: one, written in C, addressed the drivers and platform directly; another, in C++, did the bulk of the heavy processing; and finally, a Python program performed the more fiddly logical operations. To use our road-junction analogy: the C++ program dealt with incoming traffic, the C program with outgoing traffic, and Python managed the junction itself.
We soon had something we could demonstrate to the customer. We explained the demonstration was merely a proof-of-concept, and in particular used unrepresentative hardware (please don’t shake it!). During the demo the server ran comfortably, in a steady state consuming about 10% of the available processor power, and of this percentage the C:C++:Python ratio was about 6:3:1. At peak times we saw processor use double to around 20% — no cause for concern.
On the strength of this success, we arranged a date for a second demonstration.
In parallel with this ongoing development the customer was working on the requirements and we were investigating a more suitable hardware platform (one which would work reliably when shaken, that is).
As I mentioned, the server’s job was to process a steady flow of traffic. A critical factor in performing this job is the rate of flow of this traffic — how many vehicles per second would the server need to handle? It turned out that the customer wanted to double the original specified rate. Could our prototype handle the new peak rate?
We found a supplier who could provide a server guaranteed to function over the specified vibration range — and with some room to spare. The only problem was that the processor speed was roughly half that of those we’d used in the original demonstration. We ordered a trial machine and waited.
It took a while for the trial machine to be delivered. It took a while for us to install a suitable flavour of Linux. And it took a while for us to learn we had a performance problem with our software on the new platform at the new traffic rate.
In the steady state the three processes were now typically consuming 50% of the available CPU — which I suppose we could have predicted. Less predictably, the ratio of C:C++:Python had changed; it was now about 3:4:3 (compared to the earlier 6:3:1). At peak times, the load reached 100%.
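The "could have predicted" part can be put in rough numbers. As a back-of-envelope sketch (the scaling model here is my assumption, not a measurement): if CPU load scales linearly with traffic rate and inversely with clock speed, then doubling the rate on a machine with half the clock speed should quadruple the load:

```python
def predicted_load(original_load, rate_factor, clock_factor):
    """Naive model: CPU load scales linearly with traffic rate
    and inversely with processor clock speed."""
    return original_load * rate_factor / clock_factor

# Original demo: ~10% steady, ~20% peak.
# New platform: double the traffic rate, roughly half the clock speed.
steady = predicted_load(10, rate_factor=2.0, clock_factor=0.5)
peak = predicted_load(20, rate_factor=2.0, clock_factor=0.5)
print(steady, peak)  # 40.0 80.0
```

The naive model predicts 40% steady and 80% peak; we measured 50% and 100%. The shortfall is consistent with a system behaving nonlinearly as it nears capacity.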
This new load on the machine was unacceptable. Running at 100% capacity caused the server function to degrade to the extent that we couldn’t demonstrate it; and the next customer demonstration was scheduled in three days’ time.
I don’t have a full explanation for the degradation. You only have to look at real-life traffic flow to understand that a system can behave very differently when running at high rates; and in particular, at maximum capacity, things can go wrong. I’m guessing that the junction got gridlocked.
This is a good example of why you shouldn’t optimise programs, or at least not yet. If we’d optimised the original demonstration on the original platform, we’d have concentrated on the C program — since it appeared to be half as hungry again as the Python and C++ programs combined. Now it was the C++ program we’d have to look at first. With three days to go, we had three options:
- reschedule the second demo
- run the demo on the original platform
- try and optimise the software so it would perform adequately on the new platform
Let me restate the first two rules of program optimisation: don’t do it, and (for experts only) don’t do it yet. I’ll add a third rule — a rule which applies to software development in general — don’t rush it.
The last thing we wanted to do was complicate the software just to speed it up. Given more time, couldn’t we source a faster machine?
Despite these rules of software optimisation we decided it was at least worth investigating the third option: we had enough time to implement some opportunistic optimisation (that is, to find the few places in the code which were slowing the server down the most, and which might be easy to optimise).
The thing to do, of course, was to record current performance figures, and then run a profiler to try and determine where the time was being spent. Actually, I ran two profilers: callgrind and Shark. Shark proved better for this real-time analysis since callgrind slowed the program down to the extent that it was no longer operating in real time and consequently spent much of its time in recovery mode. Both profilers, however, told much the same story: the program was processing a lot of data and it was spending a considerable amount of time in the C++ standard library simply walking through collections and comparing iterators.
The first part of this story was to be expected (there was lots of data to be processed). The second part was more surprising: the C++ standard library is designed to provide generic solutions with no drop in performance — a vector should be pretty much as efficient as an array, for example, and advancing an iterator should be as cheap as incrementing a pointer.
So how could we speed it up? Easy: let the compiler do it for you. GCC comes with a wide range of optimisation options. I tried -O3, which enables almost all of them, and to my surprise saw the program apparently run 5 times faster. This, to me, was an unprecedented speed-up. (We tried the same trick on the C code, with almost no measurable effect.)
Why were we not routinely using optimized code? Because, until now, we hadn’t needed to, and we preferred to have just one build variant; and because a debug build can flag problems and offer useful diagnostics, while an optimised build is less helpful. As the GCC manual says:
Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.
In our case, compilation times and code size weren’t an issue — it wasn’t a huge program since the tricky code was all written in Python. An extra build variant was a problem, as was any loss of ability to debug.
Python comes with a standard profiling library. It’s as easy to use as:
$ python -m cProfile -o prof.dump <program to profile>
You then load the prof.dump output file and analyse it using the pstats module. For example, to find which 10 functions took the most time:

>>> import pstats
>>> prof = pstats.Stats("prof.dump")
>>> prof.sort_stats("time").print_stats(10)
Armed with this evidence, I soon found a quick way to speed things up: the Python program was sticking debug labels onto traffic passing through the junction which the C program would then remove and analyse to confirm the junction was behaving correctly. It turned out that these debug labels — which were inessential to the server’s visible functionality — took a considerable amount of time to compute. By simply eliminating them, we got the Python program running at twice the speed.
Again, this was less than ideal. We weren’t happy to remove the debug labels. In the event of things going wrong, we no longer had all the information we needed. Of course, we could parametrise the labelling decision but this again would mean introducing a variant we didn’t particularly want to introduce.
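Had we parametrised it, the switch needn’t have been a build variant at all; a run-time flag would do. A minimal sketch, assuming a hypothetical make_label computation and an environment variable of my own invention (neither is our actual code):

```python
import os

# Hypothetical run-time switch: compute debug labels only when the
# JUNCTION_DEBUG_LABELS environment variable is set (name invented).
LABELS_ENABLED = os.environ.get("JUNCTION_DEBUG_LABELS") == "1"

def make_label(vehicle_id, route):
    """Stand-in for the (expensive) real label computation."""
    return "%s:%s" % (vehicle_id, "->".join(route))

def label_for(vehicle_id, route):
    # One cheap test per vehicle; the expensive computation is
    # skipped entirely when labelling is disabled.
    return make_label(vehicle_id, route) if LABELS_ENABLED else None
```

The cost of leaving the hook in is a single test per vehicle, but the objection stands: behaviour now differs between runs, which is itself a variant of sorts.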
Within a few hours, we’d got the server running at something closer to the desired capacity, and without really compromising the code. We still ended up selecting the second option — of running the demo on the original hardware — for reasons I won’t go into here. I don’t regard the time spent profiling the program as wasted: it’s always useful to watch and measure your code in action, even if you need to rein in the urge to act on the results. The third rule of program optimisation applies: don’t rush it.