Hunting down globals with nm

2008-04-08

It was an old library, in need of attention — we all knew that — but it worked well, and we saw no reason to change it. Until, that is, we wanted more than one of it. The problem being, it was riddled with globals. A typical file looked something like this:

Too many globals
#include <string.h>
#define MSG_BUF_SIZE 256

char const * g_libname = "TOO.MANY.GLOBALS";

void initialise(void)
{
    /* Function-local, but still a single copy shared by all callers. */
    static int s_initialised = 0;
    if (s_initialised == 0)
    {
        s_initialised = 1;
    }
}

char g_msg_buf[MSG_BUF_SIZE];

static void clear_message(void)
{
    memset(g_msg_buf, 0, sizeof(g_msg_buf));
}


In the snippet above, g_msg_buf has external linkage, and other files in the library accessed it freely. The local static int, s_initialised, was better contained, but it still stood in our way. How could we initialise two library instances?

Don’t worry, we’re not about to discuss the evils of globals and singletons. We all know what needs doing here: initialising the library should return clients a handle, and each client would use its returned handle for subsequent access to the library. Internally, the handle would be a pointer to a struct, the details of which would be private to the library, packaging its internal state.
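As a rough illustration of that handle-based design, here is a minimal sketch. The names library, library_create, library_set_message, library_message and library_destroy are hypothetical, not taken from the real library; the struct simply gathers the two pieces of state from the snippet above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BUF_SIZE 256

/* Hypothetical handle type. In a real library only a forward
   declaration would appear in the public header; the definition
   would stay private to the implementation. */
typedef struct library
{
    int initialised;            /* was: static int s_initialised */
    char msg_buf[MSG_BUF_SIZE]; /* was: char g_msg_buf[MSG_BUF_SIZE] */
} library;

/* Create an independent library instance, or return NULL on failure. */
library * library_create(void)
{
    library * lib = calloc(1, sizeof *lib);
    if (lib != NULL)
        lib->initialised = 1;
    return lib;
}

/* Copy a message into this instance's buffer, truncating if needed. */
void library_set_message(library * lib, char const * msg)
{
    assert(lib != NULL && msg != NULL);
    strncpy(lib->msg_buf, msg, MSG_BUF_SIZE - 1);
    lib->msg_buf[MSG_BUF_SIZE - 1] = '\0';
}

/* Read back this instance's current message. */
char const * library_message(library const * lib)
{
    assert(lib != NULL);
    return lib->msg_buf;
}

/* Release an instance; passing NULL is harmless. */
void library_destroy(library * lib)
{
    free(lib);
}
```

Two clients that each call library_create get separate buffers, so a message set through one handle is invisible through the other. Every global that survives the refactoring is a piece of state that cannot be packaged this way, which is why counting them matters.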

Sadly no refactoring IDE could cope with this job. Our immediate problem was simply sizing up the task. So we had to count up the s_initialised’s and g_msg_buf’s and so on. One obvious way of getting a number would be to browse the code and build a list of these globals. Indeed, this approach has some merit: we’re building familiarity with the code, code we’ll ultimately have to change. An exact answer isn’t really needed at this stage.

Shell hackers might attempt an instant estimate by combining grep, sort and uniq — if we’re confident that the s_ and g_ prefixes are consistently used in the library.

$ grep -Eioh "\b[sg]_[[:alnum:]_]*\b" too_many_globals.c | sort | uniq

Such text based approaches are better than nothing. We can review the result for false hits, inspect the code to see if we’ve missed anything obvious, adapt the pattern if required, and pipe the result to wc -l for a final count.

But the best route to an accurate answer is easier and quicker. The compiler has to know exactly what’s global, what’s local, what’s data and what’s missing, and that information gets put in the object code it generates. Since reading object code is tough, we’ll ask nm to do it for us. Here’s what I get if I compile the snippet above and inspect the output object file with nm. (What you get should be similar, but the details will depend on your platform.)

$ gcc -c too_many_globals.c && nm too_many_globals.o
00000018 t clear_message
00000000 D g_libname
00000100 C g_msg_buf
00000000 T initialise
         U memset
00000000 b s_initialised.0

The nm manual tells us how to interpret the output:

Nm displays the name list (symbol table) of each object file in the argument list … Each symbol name is preceded by its value (blanks if undefined) … this value is followed by one of the following characters, representing the symbol type: U (undefined), A (absolute), T (text section symbol), D (data section symbol), B (bss section symbol), C (common symbol) …. If the symbol is local (non-external), the symbol’s type is instead represented by the corresponding lowercase letter.

Nm works on object files, libraries (static and dynamic) and executables. You don’t have to be an expert on object code to correlate the nm output shown above with the source code. It’s telling us:

  • clear_message is a local function
  • g_libname is initialised global data: a pointer to a string constant
  • initialise is an external function
  • memset is undefined (it’s part of the standard C library)
  • g_msg_buf and s_initialised are the bad boys we’re hunting down

Once we’ve discovered nm we can pick out the globals accurately and swiftly. Running nm libtoo_many_globals.a outputs text which we can pipe through standard Unix tools as before to get exact metrics.
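For anyone who would rather not rely on a shell pipeline, the counting step can also be sketched in C. The function names below are hypothetical. The code assumes the three-column layout shown above (value, type letter, name) and treats b, B, c, C, d and D as mutable data; other platforms may use further letters, so take it as a starting point rather than a complete classifier. Note that it also counts initialised data such as g_libname, which can then be reviewed by hand.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the nm type letter for one line of output, or '\0' if the
   line is not a symbol entry (blank lines, "file.o:" headers). */
char nm_symbol_type(char const * line)
{
    char first[64], second[64], third[256];
    int fields;

    assert(line != NULL);
    fields = sscanf(line, "%63s %63s %255s", first, second, third);

    /* Defined symbol: "00000100 C g_msg_buf" */
    if (fields == 3 && strlen(second) == 1)
        return second[0];
    /* Undefined symbol has no value column: "         U memset" */
    if (fields == 2 && strlen(first) == 1)
        return first[0];
    return '\0';
}

/* Non-zero if the line names mutable data: bss, common or data. */
int nm_line_is_mutable_data(char const * line)
{
    char const type = nm_symbol_type(line);
    return type != '\0' && strchr("bBcCdD", type) != NULL;
}

/* Count the mutable data symbols in nm output read from a stream. */
int nm_count_mutable_data(FILE * in)
{
    char line[512];
    int count = 0;

    assert(in != NULL);
    while (fgets(line, sizeof line, in) != NULL)
        count += nm_line_is_mutable_data(line);
    return count;
}
```

Wrapped in a small main that passes stdin to nm_count_mutable_data, this could sit at the end of a pipeline fed by nm.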

The GNU version of nm has some bells and whistles — it can demangle C++ symbols, for example. Object code is platform dependent and the details of nm’s output will similarly vary across platforms, so I suggest you look at the manual, but most of the time nm OBJECTFILE is all you’ll need.

Global constants

Note that nm has nothing to say about the preprocessor definition, MSG_BUF_SIZE, which vanishes well before the object file gets written. Since MSG_BUF_SIZE can’t be changed at run time or even after compilation, it won’t stop us from safely using more than one library instance. Nm does tell us about g_libname: a pointer to a string constant has been placed in the data section of the object file. Strictly speaking the pointer itself is writable, since only the characters it points to are const-qualified, but provided nothing in the library reassigns it, multiple instances can safely share this data, just as they share MSG_BUF_SIZE.

Just because something can be done doesn’t make it good practice. I don’t think there’s enough information here to definitively rule against these two “safe” globals, but they certainly look suspect. At the very least, the scope of the library name string should be reduced. It would be better to review use of constant data throughout the library; by passing this data in, perhaps as an initialisation parameter, the library may become more flexible and easier to test.
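One way to reduce the scope of the library name, sketched under the assumption that clients only ever need to read it, is to make the string a file-local constant array and expose it through functions. The names s_libname, library_name and library_name_is are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* File-local and const-qualified: other translation units can
   neither name it nor reassign it, and it is no longer an
   external data symbol. */
static char const s_libname[] = "TOO.MANY.GLOBALS";

/* Hypothetical read-only accessor for clients of the library. */
char const * library_name(void)
{
    return s_libname;
}

/* Non-zero if a client-supplied name matches the library's own. */
int library_name_is(char const * candidate)
{
    assert(candidate != NULL);
    return strcmp(candidate, s_libname) == 0;
}
```

How nm reports such a symbol depends on the platform, but it should appear as a local symbol rather than as an external D entry.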