String literals and regular expressions
According to the “Draft Technical Report on C++ Library Extensions” (more commonly known as TR1), regular expressions are making their way into the C++ standard library. Boost users have had a regular expression library for a while now. The library is well designed and easy to use, but it is let down by the limitations of string literals.
String Literals
Let’s go back to basics and examine a C++ string literal:
char const * s = "string literal";
Here, the string literal comprises the sequence of characters s, t, … l. The double-quotes " serve to delimit the contents of the string.
All’s fine until we need a double-quote inside the string:
char const * s = "The "x" in C++0x will probably be 9";
This line of code gives a compilation error:
error: expected `,' or `;' before "x"
since the first internal double-quote closed the string. But how can we include a double-quote without closing the string?
Escape Sequences
Here’s how: the backslash, \, is treated as an escape character. That is to say, normal interpretation of the string is suspended for a while (in this case for a single character), allowing us to write:
char const * s = "The \"x\" in C++0x will probably be 9";
Here, the internal double-quotes have been escaped, so they don’t close the string literal but are in fact interpreted as double-quote characters within the string itself. Yes, it’s confusing.
Literal Backslashes
Now, if the backslash takes on a special meaning, how are we to insert a literal backslash into the string? Simple — we must escape that too:
char const * s = "A backslash \\ starts an escape sequence";
Here, despite first appearances, the string contains just a single backslash character. We did say it was confusing! Which leads us on to …
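To convince ourselves, we can count the characters the compiler actually stores. What follows is only a minimal sketch: count_backslashes is a hypothetical helper invented for this illustration, not part of any library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: count the backslash characters actually stored
// in a string, i.e. after the compiler has processed escape sequences.
std::size_t count_backslashes(std::string const & s)
{
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), '\\'));
}
```

Passing it the literal "A backslash \\ starts an escape sequence" gives a count of one: two backslashes in the source code, a single backslash in the string.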
Regular Expressions
Put string literals aside for now. We’re going to talk about regular expressions (let’s call them regexes from now on). Regexes are used to find and match patterns in blocks of text. Like string literals, regexes are composed of sequences of characters, and, also like string literals, we need to escape the usual meaning of characters in regexes.
Once again, the backslash, \, is used as the escape sequence prefix.
Ruby embeds a powerful regex engine, so let’s use Ruby for our regex examples:
/w/
/w+/
/\w+/
/"\w*"/
/\\/
Notice here that the forward slash, /, is used as a delimiter and is not part of the body of the regex pattern, just as the double-quote, ", was not part of the body of our string literals.
What do these regex patterns mean?
- /w/ matches the character w.
- /w+/ matches a sequence of one or more adjacent w’s.
- /\w+/ matches one or more adjacent “word” characters.
- /"\w*"/ matches a double-quote delimited sequence of zero or more “word” characters.
- /\\/ matches a single backslash.
Did you notice that the backslash, \, gives the succeeding w a special meaning? Did you notice that the + has a special meaning within a regex (it means one or more)? To match a literal +, we’d need to escape it like this: /\+/. And did you notice that to match a literal backslash we must escape it? Good, but that was the easy bit!
Attempting to Match a C++ String Literal
Let’s suppose we want to use our regex pattern matching on some C++ code. Now, matching a C++ string literal is going to be tricky. A first attempt, /".*"/, just won’t do because the .* is greedy and will eat up everything until the final " in the text to be matched. So we might match too much:
char * s1 = "string", * s2 = "literal";
            ^---------match----------^
A non-greedy second attempt, /".*?"/, won’t do either since it gets confused by an escaped double-quote in a string literal. So we might match too little:
char const * s = "The \"x\" in C++0x will probably be 9";
                 ^match^
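Both failures are easy to reproduce. The sketch below rests on an assumption: it uses std::regex, the standardised descendant of the TR1 regex library, with its default ECMAScript grammar, rather than the Ruby engine used above. The helper first_match is hypothetical and exists only for this illustration. These two patterns contain no backslashes, so only the double-quotes need escaping when the patterns are written as C++ string literals.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex (C++11) stands in for the Ruby regex engine.
// Hypothetical helper: return the first substring of `text` matched by
// `pattern`, or an empty string when nothing matches.
std::string first_match(std::string const & text, std::string const & pattern)
{
    std::regex const re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

// The two flawed patterns, written as C++ string literals.
std::string const greedy_pattern = "\".*\"";      // ".*"
std::string const non_greedy_pattern = "\".*?\""; // ".*?"
```

Applied to the first line of code above, greedy_pattern matches everything from the opening quote of "string" to the closing quote of "literal". Applied to the second, non_greedy_pattern stops at the first escaped quote and matches just "The \" (seven characters).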
Correctly Matching a C++ String Literal
To properly match a C++ string literal we need to apply the following pattern: start with a double-quote; continue with a sequence of items, each of which is either a character other than a double-quote or a backslash, or an escape sequence; then finish with a double-quote.
Precisely what makes up a valid escape sequence is a little fiddly: there are octal and hexadecimal escapes, there are various whitespace characters, and there are Unicode values. We can, however, compose a pattern using a suitable short-cut as follows:
/"([^"\\]|\\.)*"/
We can read this as: a string literal starts with a double quote, followed by any number of items which are:
- either a character that is neither a double-quote nor a backslash
- or a backslash followed by any single character
and then finishes with a closing double-quote.
As you’ve probably spotted, we have to double up the backslashes in the regex pattern because the backslash is used as the escape character; i.e. a literal backslash is matched by the pattern \\.
Now let’s do it in C++
I’ll use the Boost implementation since the compilers I have available don’t support TR1 yet. We’re going to need to construct a boost::regex using a pattern represented by a string literal. Which is where the problems start. Of course we can’t write:
boost::regex const string_matcher(/"([^"\\]|\\.)*"/);
because we haven’t passed a string literal to the boost::regex constructor. In order to pass a string literal we’ll need to use double-quotes instead of forward-slashes and we’ll have to escape the internal double-quotes. Let’s try again:
boost::regex const string_matcher("\"([^\"\\]|\\.)*\"");
Oh dear: the error moves to run-time. We get an exception: Unmatched [ or [^. This is because the closing square bracket ] has been escaped by the time it gets to the regex engine. Unfortunately the \\’s in the string literal contract to just single backslashes. We need to redouble them.
boost::regex const string_matcher("\"([^\"\\\\]|\\\\.)*\"");
Here, each pair of backslashes has contracted to a single backslash by the time the regex engine sees it, which — believe it or not — is what’s required.
This string_matcher works, but as code it is rather more cryptic than communicative.
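The difference between the two attempts can also be checked mechanically. This is a sketch under a stated assumption: it uses std::regex (the standardised descendant of the TR1 library) in place of boost::regex, so the wording of the error will differ from the Boost message quoted above. The name compiles and the two variable names are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex (C++11) stands in for boost::regex here.
// Hypothetical helper: report whether the regex engine accepts `pattern`.
bool compiles(std::string const & pattern)
{
    try {
        std::regex const re(pattern);
        return true;
    } catch (std::regex_error const &) {
        return false;
    }
}

// The trailing comments show what the regex engine receives once the
// compiler has processed the escape sequences in each string literal.
std::string const too_few_backslashes = "\"([^\"\\]|\\.)*\"";     // "([^"\]|\.)*"
std::string const redoubled           = "\"([^\"\\\\]|\\\\.)*\""; // "([^"\\]|\\.)*"
```

Here compiles(too_few_backslashes) reports failure, because the engine sees an escaped ] and never finds the end of the character class, while compiles(redoubled) succeeds.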
A complete C++ string literal matcher
Here’s a complete program for you to try.
#include <boost/regex.hpp>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, char * argv[])
{
    try
    {
        boost::regex const string_matcher("\"([^\"\\\\]|\\\\.)*\"");
        std::string line;
        while (std::getline(std::cin, line))
        {
            if (boost::regex_match(line, string_matcher))
            {
                std::cout << line << " is a C++ string literal\n";
            }
        }
    }
    catch (std::exception & exc)
    {
        std::cerr << "An error occurred: " << exc.what();
    }
    catch (...)
    {
        std::cerr << "An error occurred\n";
    }
    return 0;
}
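Note that regex_match only succeeds when the whole line is a string literal. To pick literals out of ordinary lines of code, a search-based variant may be more useful. The following is a sketch rather than a finished tool: it assumes std::regex in place of boost::regex, the function name find_string_literals is hypothetical, and it knows nothing about comments or character literals.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex (C++11) stands in for boost::regex here.
// Hypothetical helper: collect every substring of `line` that looks
// like a C++ string literal, in order of appearance.
std::vector<std::string> find_string_literals(std::string const & line)
{
    static std::regex const string_matcher("\"([^\"\\\\]|\\\\.)*\"");
    std::vector<std::string> found;
    for (std::sregex_iterator it(line.begin(), line.end(), string_matcher), end;
         it != end; ++it)
    {
        found.push_back(it->str());
    }
    return found;
}
```

Given the earlier line declaring s1 and s2, this finds "string" and "literal" as two separate matches, and given the line with the escaped quotes it finds the whole literal as a single match.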
Raw Strings in Python
Unlike Ruby, Python doesn’t include support for regexes in the language itself. Instead, regex support is provided by the standard library’s re module.
Python’s flexible string literals allow us to simplify the pattern, though. Here, we use a raw string, and we chose to delimit it with single-quotes so we don’t need to escape the internal double-quotes.
string_literal_pattern = r'"([^"\\]|\\.)*"'
This is nice. Basically, raw strings leave the backslashes unprocessed. Raw strings aren’t just restricted to regex patterns, though perhaps that’s their most common use.
Raw Strings in C++?
C++ doesn’t support raw strings (at least, it doesn’t support them yet, and I haven’t found them mentioned in TR1), but it does support wide strings, indicated by the L prefix.
wchar_t const * cpp_wide_string = L"this is a wide string";
Maybe if we switched the L for an R we could allow raw strings into C++? It would make regex patterns far more readable.
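With hindsight, something very like this did happen: C++11 introduced raw string literals with an R prefix, where the content sits between R"( and )". The comparison below is a minimal sketch assuming a C++11 compiler; the variable names are illustrative only.

```cpp
#include <string>

// Escaped form: every backslash and double-quote in the pattern needs
// its own backslash in the string literal.
std::string const escaped_pattern = "\"([^\"\\\\]|\\\\.)*\"";

// C++11 raw string literal: R"( opens, )" closes, and nothing in
// between is treated as an escape sequence.
std::string const raw_pattern = R"("([^"\\]|\\.)*")";
```

The two strings hold identical characters, and the raw form is the regex pattern exactly as we wrote it between the forward slashes in Ruby.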
Verbatim Strings in C++?
Alternatively …
I’ve never used C# but googling suggests raw strings are supported and rather nicely named “verbatim string literals”. C# uses the @ prefix to indicate that a string literal is a verbatim string. Now, @ isn’t even part of the C++ source character set, so maybe this too would be possible.
There’s no escape
The proliferation of backslashes when we combine regexes and string literals is unfortunate. It could be worse. What if the backslash key had fallen off our keyboard? Remarkably (and, as far as I know, uniquely) C++ caters for this situation. A number of source characters can be written as “trigraphs”: sequences of three characters starting ??. The backslash is one such character: it can be written as ??/.
boost::regex const string_matcher("??/"([^??/"??/??/??/??/]|??/??/??/??/.)*??/"");
For completeness, we could also lose the |, [ and ] keys.
boost::regex const string_matcher("??/"(??(^??/"??/??/??/??/??)??!??/??/??/??/.)*??/"");
The string literal used to initialise string_matcher is valid, but the regex wouldn’t match it properly. I’ll leave the fix as an exercise for the reader.
Feedback
- I wrote this article using Markdown, which uses the backslash as an escape character. Apologies for any errors.
- You could have used Markback, which uses backspace as an escape character...
- Thanks for the warning Could be. I'll steer clear of Markback.