String literals and regular expressions

2006-09-02

According to the “Draft Technical Report on C++ Library Extensions” (more commonly known as TR1) regular expressions are making their way into the C++ standard library. Actually, Boost users have had a regular expression library for a while now. The library is well designed and easy to use but is let down by the limitations of string literals.

String Literals

Let’s go back to basics and examine a C++ string literal:

char const * s = "string literal";

Here, the string literal comprises the sequence of characters s, t, … l. The double-quotes " serve to delimit the contents of the string.

All’s fine until we need a double-quote inside the string:

Broken string literal
char const * s = "The "x" in C++0x will probably be 9";

This line of code gives a compilation error:

error: expected `,' or `;' before "x"

since the first internal double-quote closed the string. But how can we include a double-quote without closing the string?

Escape Sequences

Here’s how: the backslash, \, is treated as an escape character. That is to say, normal interpretation of the string is suspended for a while — in this case for a single character — allowing us to write:

Fixed string literal
char const * s = "The \"x\" in C++0x will probably be 9";

Here, the internal double-quotes have been escaped, so they don’t close the string literal but are in fact interpreted as double-quote characters within the string itself. Yes, it’s confusing.

Literal Backslashes

Now, if the backslash takes on a special meaning, how are we to insert a literal backslash into the string? Simple — we must escape that too:

String literal containing a single backslash
char const * s = "A backslash \\ starts an escape sequence";

Here, despite first appearances, the string contains just a single backslash character. We did say it was confusing! Which leads us on to …
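One way to convince yourself is to ask the compiler how many characters each literal really holds. The following is a minimal sketch; the names quote_length, backslash_length and sentence are made up for illustration.

```cpp
#include <cstring>
#include <string>

// strlen counts the characters the compiler actually stored,
// not the characters typed between the quotes in the source.
std::size_t const quote_length = std::strlen("\"");      // a single double-quote
std::size_t const backslash_length = std::strlen("\\");  // a single backslash

// Contains exactly one backslash, despite the two typed in the source.
std::string const sentence = "A backslash \\ starts an escape sequence";
```

Both lengths are 1, and sentence holds a single backslash at index 12.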

Regular Expressions

Put string literals aside for now. We’re going to talk about regular expressions (let’s call them regexes from now on). Regexes are used to find and match patterns in blocks of text. Like string literals, regexes are composed of sequences of characters, and, also like string literals, we need to escape the usual meaning of characters in regexes.

Once again, the backslash, \, is used as the escape sequence prefix.

Ruby embeds a powerful regex engine, so let’s use Ruby for our regex examples:

Some Ruby regex patterns
/w/
/w+/
/\w+/
/"\w*"/
/\\/

Notice here that the forward slash, /, is used as a delimiter and is not part of the body of the regex pattern, just as the double-quote, ", was not part of the body of our string literals.

What do these regex patterns mean?

  1. /w/ matches the character w.
  2. /w+/ matches a sequence of one or more adjacent w’s.
  3. /\w+/ matches one or more adjacent “word” characters.
  4. /"\w*"/ matches a double-quote delimited sequence of zero or more “word” characters.
  5. /\\/ matches a single backslash.

Did you notice that the backslash, \, gives the succeeding w a special meaning? Did you notice that the + has a special meaning within a regex (it means one or more)? To match a literal +, we’d need to escape it like this: /\+/. And did you notice that to match a literal backslash we must escape it? Good. But that was the easy bit!

Attempting to Match a C++ String Literal

Let’s suppose we want to use our regex pattern matching on some C++ code. Now, matching a C++ string literal is going to be tricky. A first attempt, /".*"/, just won’t do because the .* is greedy and will eat up everything until the final " in the text to be matched. So we might match too much:

char * s1 = "string", * s2 = "literal";
            ^---------match----------^

A non-greedy second attempt, /".*?"/, won’t do either since it gets confused by an escaped double-quote in a string literal. So we might match too little:

char const * s = "The \"x\" in C++0x will probably be 9";
                 ^match^
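Both failed attempts can be reproduced from C++. This is only a sketch: it assumes a C++11 standard library and uses std::regex rather than Ruby's engine, and first_match, greedy and lazy are hypothetical names. The two patterns need nothing more than escaped double-quotes to be written as string literals.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the text of the first match of `pattern` in `text`,
// or an empty string if nothing matches.
std::string first_match(std::string const & pattern, std::string const & text)
{
  std::regex const re(pattern);
  std::smatch m;
  return std::regex_search(text, m, re) ? m.str() : std::string();
}

// The two attempts from the text, written as C++ string literals.
std::string const greedy = "\".*\"";   // the regex ".*"
std::string const lazy   = "\".*?\"";  // the regex ".*?"
```

Applied to the first line of code above, greedy returns everything from the opening quote of "string" to the closing quote of "literal". Applied to the second, lazy stops at the first escaped double-quote.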

Correctly Matching a C++ String Literal

To properly match a C++ string literal we need to apply the following pattern: start with a double-quote; continue with a sequence of items, each of which is either a character other than the double-quote or the backslash, or an escape sequence; then finish with a double-quote.

Precisely what makes up a valid escape sequence is a little fiddly: there are octal and hexadecimal escapes, there are various whitespace characters, and there are Unicode values. We can, however, compose a pattern using a suitable short-cut as follows:

A regex to match a C++ string literal
/"([^"\\]|\\.)*"/

We can read this as: a string literal starts with a double quote, followed by any number of items which are:

  • either a single character which is neither a double-quote nor a backslash
  • or a backslash followed by any single character

and then finishes with a closing double-quote.

As you’ve probably spotted, we have to double up the backslashes in the regex pattern because the backslash is used as the escape character; i.e. a literal backslash is matched by the pattern \\.

Now let’s do it in C++

I’ll use the Boost implementation since the compilers I have available don’t support TR1 yet. We’re going to need to construct a boost::regex using a pattern represented by a string literal. Which is where the problems start. Of course we can’t write:

This regex won’t compile!
boost::regex const
    string_matcher(/"([^"\\]|\\.)*"/);

because we haven’t passed a string literal to the boost::regex constructor. In order to pass a string literal we’ll need to use double-quotes instead of forward-slashes and we’ll have to escape the internal double-quotes. Let’s try again:

This regex throws an exception!
boost::regex const
    string_matcher("\"([^\"\\]|\\.)*\"");

Oh dear — the error moves to run-time. We get an exception: Unmatched [ or [^. This is because the closing square bracket ] has been escaped by the time it gets to the regex engine. Unfortunately the \\’s in the string literal contract to just single backslashes. We need to redouble them.

This regex is just right!
boost::regex const
    string_matcher("\"([^\"\\\\]|\\\\.)*\"");

Here, each pair of backslashes has contracted to a single backslash by the time the regex engine sees it, which — believe it or not — is what’s required.
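It can help to look at the two literals the way the regex engine receives them, after the compiler has processed the escape sequences. A small sketch, with the illustrative names broken_pattern and working_pattern:

```cpp
#include <string>

// The literal that throws. After escape processing it holds the
// 13 characters  "([^"\]|\.)*"  so the ] is escaped and the [ never closes.
std::string const broken_pattern = "\"([^\"\\]|\\.)*\"";

// The literal that works. After escape processing it holds the
// 15 characters  "([^"\\]|\\.)*"  which is the regex we wanted.
std::string const working_pattern = "\"([^\"\\\\]|\\\\.)*\"";
```

Streaming either constant to std::cout prints the characters shown in its comment, which is a quick way to debug a pattern that has passed through a string literal.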

This string_matcher works, but as code it is rather more cryptic than communicative.

A complete C++ string literal matcher

Here’s a complete program for you to try.

A C++ string matcher
#include <boost/regex.hpp>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, char * argv[])
{
  try
  {
    boost::regex const
      string_matcher("\"([^\"\\\\]|\\\\.)*\"");
    std::string line;
    while (std::getline(std::cin, line))
    {
      if (boost::regex_match(line, string_matcher))
      {
        std::cout << line << " is a C++ string literal\n";
      }
    }
  }
  catch (std::exception const & exc)
  {
    std::cerr << "An error occurred: " << exc.what() << '\n';
  }
  catch (...)
  {
    std::cerr << "An error occurred\n";
  }
  return 0;
}
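Note that boost::regex_match succeeds only when the whole line is a string literal. To pull string literals out of a line of ordinary code we need to search instead. The sketch below is one way to do that. It assumes a C++11 standard library, with std::regex standing in for boost::regex, and find_string_literals is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every C++ string literal found in `line`.
// Assumes C++11; std::regex is used here in place of boost::regex.
std::vector<std::string> find_string_literals(std::string const & line)
{
  static std::regex const string_matcher("\"([^\"\\\\]|\\\\.)*\"");
  std::vector<std::string> found;
  // sregex_iterator visits each non-overlapping match in turn.
  for (std::sregex_iterator it(line.begin(), line.end(), string_matcher), end;
       it != end; ++it)
  {
    found.push_back(it->str());
  }
  return found;
}
```

Given the line declaring s1 and s2 from earlier, this returns the two literals "string" and "literal" separately, quotes included.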

Raw Strings in Python

Unlike Ruby, Python doesn’t include support for regexes in the language itself. Instead, regex support is provided by the standard library’s re module.

Python’s flexible string literals allow us to simplify the pattern, though. Here, we use a raw string, and we choose to delimit it with single-quotes so we don’t need to escape the internal double-quotes.

string_literal_pattern = r'"([^"\\]|\\.)*"'

This is nice. Basically, raw strings leave the backslashes unprocessed. Raw strings aren’t just restricted to regex patterns, though perhaps that’s their most common use.

Raw Strings in C++?

C++ doesn’t support raw strings (at least, it doesn’t support them yet, and I haven’t found them mentioned in TR1), but it does support wide strings, indicated by the L prefix.

wchar_t const * cpp_wide_string = L"this is a wide string";

Maybe if we switched the L for an R we could allow raw strings into C++? It would make regex patterns far more readable.
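As it turned out, C++11 adopted raw string literals using just such an R prefix, with the contents wrapped in parentheses inside the quotes. The sketch below assumes a C++11 compiler and uses std::regex in place of boost::regex. The names raw_pattern, escaped_pattern and is_string_literal are illustrative.

```cpp
#include <regex>
#include <string>

// Raw string literal (C++11): nothing between R"( and )" is escape-processed,
// so the pattern reads exactly as the regex engine sees it.
std::string const raw_pattern = R"("([^"\\]|\\.)*")";

// The same pattern as an ordinary string literal, for comparison.
std::string const escaped_pattern = "\"([^\"\\\\]|\\\\.)*\"";

// True when the whole of `text` is a single C++ string literal.
bool is_string_literal(std::string const & text)
{
  static std::regex const string_matcher(raw_pattern);
  return std::regex_match(text, string_matcher);
}
```

The two constants compare equal, but only the raw one can be read without counting backslashes.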

Verbatim Strings in C++?

Alternatively …

I’ve never used C# but googling suggests raw strings are supported and rather nicely named “verbatim string literals”. C# uses the @ prefix to indicate that a string literal is a verbatim string. Now, @ isn’t even part of the C++ source character set, so maybe this too would be possible.

There’s no escape

The proliferation of backslashes when we combine regexes and string literals is unfortunate. It could be worse. What if the backslash key had fallen off our keyboard? Remarkably (and, as far as I know, uniquely) C++ caters for this situation. A number of source characters can be written as “trigraphs”: sequences of three characters starting ??. The backslash is one such character: it can be written as ??/.

regex using trigraphs
boost::regex const
    string_matcher("??/"([^??/"??/??/??/??/]|??/??/??/??/.)*??/"");

For completeness, we could also lose the |, [ and ] keys.

regex using even more trigraphs
boost::regex const
    string_matcher("??/"(??(^??/"??/??/??/??/??)??!??/??/??/??/.)*??/"");

The string literal used to initialise string_matcher is valid, but the regex wouldn’t match it properly. I’ll leave the fix as an exercise for the reader.
