String Localization and Regular Expressions

This chapter starts with a discussion of localization, which allows you to write software that can be adapted to different regions around the world. A properly localized application displays numbers, dates, currencies, and so on in the appropriate format according to the rules of a specific country or region.

The second part of this chapter introduces the regular expressions library, which makes it easy to perform pattern matching on strings. It allows you to search for substrings matching a given pattern, but also to validate, parse, and transform strings. Regular expressions are powerful. I recommend that you use them, as they are less error prone than manually writing your own string processing code.

When you’re learning how to program in C or C++, it’s useful to think of a character as equivalent to a byte and to treat all characters as members of the American Standard Code for Information Interchange (ASCII) character set. ASCII is a 7-bit set usually stored in an 8-bit char type. In reality, experienced C++ programmers recognize that successful programs are used throughout the world. Even if you don’t initially write your program with international audiences in mind, you shouldn’t prevent yourself from localizing, or making the software locale aware, at a later date.

The problem with viewing a character as a byte is that not all languages, or character sets, can be fully represented in 8 bits, or 1 byte. C++ has a built-in type called wchar_t that holds a wide character. Languages with non-ASCII (US) characters, such as Japanese and Arabic, can be represented in C++ with wchar_t. However, the C++ standard does not define the size for wchar_t. Some compilers use 16 bits, while others use 32 bits. Most of the time, it matches the size of the native Unicode character type on the underlying operating system. To write cross-platform code, it is not safe to assume that wchar_t is of a particular size.

If there is any chance that your program will be used in a non-Western character set context (hint: there is!), you should use wide characters from the beginning. When working with wchar_t, string and character literals are prefixed with the letter L to indicate that a wide-character encoding should be used. For example, to initialize a wchar_t character to the letter m, you write it like this:

wchar_t myWideCharacter { L'm' };

There are wide-character versions of most of your favorite types and classes. The wide string class is wstring. The “prefix letter w” pattern applies to streams as well. Wide-character file output streams are handled with wofstream, and input is handled with wifstream. The joy of pronouncing these class names (woof-stream? whiff-stream?) is reason enough to make your programs locale aware! Streams are discussed in detail in Chapter 13, “Demystifying C++ I/O.”

There are also wide-versions of cout, cin, cerr, and clog available, called wcout, wcin, wcerr, and wclog. Using them is no different than using the non-wide versions:

wcout << L"I am a wide-character string literal." << endl;

print() and println() don’t support wchar_t string literals, but they do support UTF-8 string literals, discussed later in this chapter. On the other hand, format() does support wide-character strings:

wcout << format(L"myWideCharacter is {}", myWideCharacter) << endl;

Wide characters are a great step forward because they increase the amount of space available to define a single character. The next step is to figure out how that space is used. In wide character sets, just like in ASCII, characters are represented by numbers, now called code points. The only difference is that each number does not fit in 8 bits. The map of characters to code points is quite a bit larger because it handles many different character sets in addition to the characters that English-speaking programmers are familiar with.

The Universal Character Set (UCS), defined by the International Standard ISO 10646, and Unicode are both standardized sets of characters. They both identify characters by an unambiguous name and a code point. The same characters with the same numbers exist in both standards. At the time of this writing, the latest version of Unicode was version 15, which defines 149,186 characters. Both UCS and Unicode have specific encodings that you can use to represent specific code points. This is important: a code point is just a number; an encoding specifies how to represent that number as one or more bytes. UTF-8, for example, is a Unicode encoding in which each character is encoded using one to four 8-bit bytes. UTF-16 encodes Unicode characters as one or two 16-bit values, and UTF-32 encodes each Unicode character as exactly one 32-bit value.
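To make the distinction between a code point and its encoding concrete, here is a minimal sketch of a UTF-8 encoder. The function name encodeUtf8() is hypothetical and not part of the Standard Library; the sketch only illustrates how one number becomes one to four bytes and is not meant as a production-quality converter.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF,
// because those are not valid Unicode scalar values.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) { return out; }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the code point 3C0 (π) becomes the two bytes CF 80, while the letter A stays a single byte.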

Different applications can use different encodings. Unfortunately, as mentioned earlier in this chapter, the C++ standard does not specify a size for wide characters (wchar_t). On Windows it is 16 bits, while on other platforms it could be 32 bits. You need to be aware of this when using wide characters for character encoding in cross-platform code. To help solve this issue, there are other character types: char8_t, char16_t, and char32_t. The following list gives an overview of the available character types:

  • char: Stores at least 8 bits (exactly 8 on virtually all current platforms). This type can be used to store ASCII characters or as a basic building block for storing UTF-8 encoded Unicode characters, where one Unicode character is encoded with up to four chars.
  • charx_t: Stores at least x bits where x can be 8, 16, or 32. This type can be used as the basic building block for UTF-x encoded Unicode characters, encoding one Unicode character with up to four char8_ts, up to two char16_ts, or one char32_t.
  • wchar_t: Stores a wide character of a compiler-specific size and encoding.

The benefit of using the charx_t types instead of wchar_t is that the standard guarantees minimum sizes for the charx_t types, independent of the compiler. There is no minimum size guaranteed for wchar_t.

String literals can have a string prefix to turn them into a specific type. The complete set of supported string prefixes is as follows:

  • u8: A char8_t string literal with UTF-8 encoding
  • u: A char16_t string literal with UTF-16 encoding
  • U: A char32_t string literal with UTF-32 encoding
  • L: A wchar_t string literal with a compiler-dependent encoding

All of these string literals can be combined with the raw string literal prefix, R, discussed in Chapter 2, “Working with Strings and String Views.” Here are some examples:

const char8_t* s1 { u8R"(Raw UTF-8 string literal)" };
const wchar_t* s2 { LR"(Raw wide string literal)" };
const char16_t* s3 { uR"(Raw UTF-16 string literal)" };
const char32_t* s4 { UR"(Raw UTF-32 string literal)" };

You can insert specific Unicode code points in non-raw string literals using several different escape sequences. The following table gives an overview of your options. The last column shows the encoding of the superscript two, ², character.

ESCAPE SEQUENCE | DESCRIPTION | EXAMPLE: ²
\nnn | 1 to 3 octal digits | \262
\o{n…} | Arbitrary number of octal digits | \o{262}
\xn… | Arbitrary number of hexadecimal digits | \xB2 or \x00B2
\x{n…} | Arbitrary number of hexadecimal digits | \x{B2} or \x{00B2}
\unnnn | 4 hexadecimal digits | \u00B2
\u{n…} | Arbitrary number of hexadecimal digits | \u{B2} or \u{00B2}
\Unnnnnnnn | 8 hexadecimal digits | \U000000B2
\N{name} | Universal character name | \N{SUPERSCRIPT TWO}

The \o{n…}, \x{n…}, and \u{n…} notations introduced with C++23 are useful to avoid problems when the next character in a string literal happens to be a valid octal or hexadecimal digit. For the \N{name} notation, the name must be the official Unicode name of the character, which you can look up in any Unicode character reference.

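Here are some more examples representing the formula π r². The π character has code point 3C0, and the superscript two character has code point B2. Keep in mind that the octal and hexadecimal escapes specify the value of a single code unit, not a code point. The value 3C0 does not fit in an 8-bit char8_t, so the first example uses a UTF-32 literal, in which every code point is exactly one code unit.

const char32_t* formula1 { U"\x3C0 r\xB2" };
const char8_t* formula2 { u8"\u03C0 r\u00B2" };
const char8_t* formula3 { u8"\N{GREEK SMALL LETTER PI} r\N{SUPERSCRIPT TWO}" };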

Besides string literals, character literals can also have a prefix to turn them into specific types. The prefixes u8, u, U, and L are supported, for example: u'a', U'a', L'a', and u8'a'.

In addition to the std::string class, there is also support for wstring, u8string, u16string, and u32string. They are defined as follows:

  • using string = basic_string<char>;
  • using wstring = basic_string<wchar_t>;
  • using u8string = basic_string<char8_t>;
  • using u16string = basic_string<char16_t>;
  • using u32string = basic_string<char32_t>;

Similarly, the Standard Library provides std::string_view, wstring_view, u8string_view, u16string_view, and u32string_view, all based on basic_string_view.

Multibyte strings are strings with characters composed of one or more bytes using a locale-dependent encoding. Locales are discussed later in this chapter. A multibyte string could use Unicode encoding, or any other kind of encoding such as Shift-JIS, EUC-JP, and so on. Conversion functions are available to convert between multibyte strings and char8_t, char16_t, or char32_t sequences in both directions: mbrtoc8() and c8rtomb(), mbrtoc16() and c16rtomb(), and mbrtoc32() and c32rtomb().

Unfortunately, the support for char8_t, char16_t, and char32_t doesn’t go much further. There are some conversion classes available (see later in this chapter), but, for example, there is nothing like a version of cout, cin, println(), format(), and so on, that supports these character types; this makes it difficult to print such strings to a console or to read them from user input. If you want to do more with such strings, you need to resort to third-party libraries. International Components for Unicode (ICU) is one well-known library that provides Unicode and globalization support for your applications. (See icu-project.org.)

C++23 improves things slightly. It allows a u8 UTF-8 string literal to initialize an array of type const char or const unsigned char, and functions like std::format() and print() do support const char[]. For example, the following initializes a const char[] array with a UTF-8 string literal and then prints it using println(). If your environment is set up to handle Japanese characters, then the output is “Hello world” in Japanese.

const char hello[] { u8"こんにちは世界" };
println("{}", hello);

If you use char8_t[] instead of char[] as follows, you get a compilation error because println() doesn’t understand the char8_t type.

const char8_t hello[] { u8"こんにちは世界" };
println("{}", hello); // Error: doesn't compile!

A critical aspect of localization is that you should never put any native-language string literals in your source code, except maybe for debug strings targeted at the developer. In Microsoft Windows applications, this is accomplished by putting all strings for an application in STRINGTABLE resources. Most other platforms offer similar capabilities. If you need to translate your application to another language, translating those resources should be all you need to do, without requiring any source changes. There are tools available that will help you with this translation process.

To make your source code localizable, you should not compose sentences out of string literals, even if the individual literals can be localized. Here is an example:

unsigned n { 5 };
wstring filename { L"file1.txt" };
wcout << n << L" bytes read from " << filename << endl;

This statement cannot be localized to, for example, German because it requires a reordering of the words. The German translation is as follows:

wcout << n << L" Bytes aus " << filename << L" gelesen" << endl;

To make sure you can properly localize such strings, you could implement it as follows:

vprint_unicode(loadResource(IDS_TRANSFERRED), make_format_args(n, filename));

IDS_TRANSFERRED is the name of an entry in a string resource table. For the English version, IDS_TRANSFERRED could be defined as “{0} bytes read from {1}”, while the German version of the resource could be defined as “{0} Bytes aus {1} gelesen”. The loadResource() function loads the string resource with the given name, and vprint_unicode() (see Chapter 2) substitutes {0} with the value of n and {1} with the value of filename.

Character sets are only one of the differences in data representation between countries. Even countries that use similar character sets, such as Great Britain and the United States, still differ in how they represent certain data, such as dates and monetary values.

The standard C++ mechanism that groups specific data about a particular set of cultural parameters is called a locale. An individual component of a locale, such as date format, time format, number format, and so on, is called a facet. An example of a locale is US English. An example of a facet is the format used to display a date. Several built-in facets are common to all locales. C++ also provides a way to customize or add facets.

There are third-party libraries available that make it easier to work with locales. One example is boost.locale (boost.org), which is able to use ICU as its backend, supporting collations and conversions, converting strings to uppercase (instead of converting character by character to uppercase), and so on.

When using I/O streams, data is formatted according to a particular locale. Locales are objects that can be attached to a stream, and they are defined in <locale>. Locale names are implementation specific. The POSIX standard is to separate a language and an area into two-letter sections with an optional encoding. For example, the locale for the English language as spoken in the United States is en_US, while the locale for the English language as spoken in Great Britain is en_GB. The locale for Japanese spoken in Japan with Japanese Industrial Standard encoding is ja_JP.jis.

Locale names on Windows can have two formats. The preferred format is similar to the POSIX format but uses a dash instead of an underscore. The second, old format looks as follows where everything between square brackets is optional:

lang[_country_region[.code_page]]

The following table shows some examples of the POSIX, preferred Windows, and old Windows locale formats:

LANGUAGE | POSIX | WINDOWS | WINDOWS OLD
US English | en_US | en-US | English_United States
Great Britain English | en_GB | en-GB | English_Great Britain

Most operating systems have a mechanism to determine the locale as defined by the user. In C++, you can pass an empty string to the std::locale constructor to create a locale from the user’s environment. Once this object is created, you can use it to query the locale, possibly making programmatic decisions based on it.

The std::locale::global() function can be used to replace the global C++ locale in your application with a given locale. The default constructor of std::locale returns a copy of this global locale. Keep in mind, though, that the C++ Standard Library objects that use locales, for example streams such as cout, store a copy of the global locale at construction time. Changing the global locale afterward does not impact objects that were already created before. If needed, you can use the imbue() member function on streams (see the next section) to change their locale after construction.

Here is an example outputting a number with the default locale, changing the global locale to US English and outputting the same number again:

#include <locale>
#include <print>
#include <sstream>
using namespace std;

void print()
{
    stringstream stream;
    stream << 32767;
    println("{}", stream.str());
}

int main()
{
    print();
    locale::global(locale { "en-US" }); // "en_US" for POSIX
    print();
}

The output is as follows:

32767
32,767
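That a stream copies the global locale at construction time can be demonstrated without relying on any locale being installed on the system. The sketch below uses a hypothetical custom facet, ApostropheGrouping, and a hypothetical function, writeAroundGlobalChange(). Note that it temporarily changes the global locale, so it is not suitable for code in which other threads create streams at the same time.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet: groups digits in threes, separated by an apostrophe.
class ApostropheGrouping : public std::numpunct<char>
{
protected:
    char do_thousands_sep() const override { return '\''; }
    std::string do_grouping() const override { return "\3"; }
};

// Returns what a stream created before, and a stream created after,
// a change of the global locale write for the same value.
std::pair<std::string, std::string> writeAroundGlobalChange(int value)
{
    // Start from a known state and remember the previously installed locale.
    const std::locale original { std::locale::global(std::locale::classic()) };

    std::ostringstream early;   // copies the classic locale
    std::locale::global(std::locale { std::locale::classic(), new ApostropheGrouping });
    std::ostringstream late;    // copies the locale with the custom facet

    early << value;             // unaffected by the later change
    late << value;

    std::locale::global(original);  // restore the previous global locale
    return { early.str(), late.str() };
}
```

The locale object takes ownership of the facet allocated with new, so no explicit delete is needed.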

The following code demonstrates how to use the user’s locale for a stream by calling the imbue() member function on the stream. The result is that everything that is sent to cout is formatted according to the formatting rules of the user’s environment:

cout.imbue(locale { "" });
cout << "User's locale: " << 32767 << endl;

This means that if your system locale is English United States and you output the number 32767, the number is displayed as 32,767; however, if your system locale is Dutch Belgium, the same number is displayed as 32.767.
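Named locales are not guaranteed to be installed on every system, which makes such output hard to reproduce. As a portable illustration, the following sketch imitates the punctuation described above for Dutch Belgium with a hypothetical custom facet, DotGrouping. It is an approximation for demonstration purposes, not a replacement for a real locale.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: '.' as thousands separator and ',' as decimal point.
class DotGrouping : public std::numpunct<char>
{
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

// Builds a locale that is the classic locale plus the custom facet.
std::locale makeDotGroupingLocale()
{
    return std::locale { std::locale::classic(), new DotGrouping };
}

// Formats a value with a stream that has the given locale imbued.
template <typename T>
std::string formatWith(const std::locale& loc, T value)
{
    std::ostringstream stream;
    stream.imbue(loc);
    stream << value;
    return stream.str();
}
```

With these helpers, formatWith(makeDotGroupingLocale(), 32767) produces 32.767, whereas the classic locale produces 32767.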

The default locale is the classic/neutral locale, and not the user’s locale. The classic locale uses ANSI C conventions and has the name C. The classic C locale is similar to US English, but there are slight differences. For example, numbers are handled without any punctuation.

cout.imbue(locale { "C" });
cout << "C locale: " << 32767 << endl;

The output of this code is as follows:

C locale: 32767

The following code manually sets the US English locale, so the number 32767 is formatted with US English punctuation, independent of your system locale:

cout.imbue(locale { "en-US" }); // "en_US" for POSIX
cout << "en-US locale: " << 32767 << endl;

The output of this code is as follows:

en-US locale: 32,767

By default, std::print() and println() use the C locale. For example, the following prints 32767:

println("println(): {}", 32767);

You can specify the L format specifier, in which case the global locale is used.

println("println() using global locale: {:L}", 32767);

std::format() also supports locales by using the L format specifier and optionally accepts a locale as first argument. When the L format specifier is used and a locale is passed to format(), that locale is used for formatting. If the L format specifier is used without passing a locale to format(), the global locale is used. For example, the following prints 32,767 according to English formatting rules:

cout << format(locale { "en-US" }, "format() with en-US locale: {:L}", 32767);

A locale object allows you to query information about the locale. For example, the following program creates a locale matching the user’s environment. The name() member function is used to get a C++ string that describes the locale. Then, the find() member function is used on the string object to find a given substring, which returns string::npos when the given substring is not found. The code checks for the Windows name and the POSIX name. One of two messages is printed, depending on whether the locale appears to be US English.

locale loc { "" };
if (loc.name().find("en_US") == string::npos &&
    loc.name().find("en-US") == string::npos) {
    println("Welcome non-US English speaker!");
} else {
    println("Welcome US English speaker!");
}
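The std::locale constructor that accepts a name throws std::runtime_error when the name is not valid, and on some systems this can also happen for the empty string if the environment refers to a locale that is not installed. A defensive variant of the previous check could look like the following sketch, in which userLocaleOrClassic() and looksLikeUsEnglish() are hypothetical helper names.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns the user's preferred locale, or the classic locale
// when the environment names a locale that cannot be constructed.
std::locale userLocaleOrClassic()
{
    try {
        return std::locale { "" };
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Checks for both the POSIX and the Windows spelling of the name.
bool looksLikeUsEnglish(const std::locale& loc)
{
    const std::string name { loc.name() };
    return name.find("en_US") != std::string::npos ||
           name.find("en-US") != std::string::npos;
}
```

Falling back to the classic locale is only one possible policy; an application could equally report the problem to the user.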

<locale> contains the following character classification functions: std::isspace(), isblank(), iscntrl(), isupper(), islower(), isalpha(), isdigit(), ispunct(), isxdigit(), isalnum(), isprint(), and isgraph(). They all accept two parameters: the character to classify and the locale to use for the classification. The exact meaning of the different character classes is discussed later in this chapter in the context of regular expressions. Here is an example of using isupper() with a French locale to verify whether a letter is uppercase or not:

println("É {}", isupper(L'É', locale{ "fr-FR" }));
println("é {}", isupper(L'é', locale{ "fr-FR" }));

The output is as follows:

É true
é false

<locale> also defines two character conversion functions: std::toupper() and tolower(). They accept two parameters: the character to convert and the locale to use for the conversion. Here is an example:

auto upper { toupper(L'é', locale { "fr-FR" }) }; // É
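To convert an entire string rather than a single character, you can use the ctype facet directly, which offers a range version of toupper(). The following is a minimal sketch; toUpperCopy() is a hypothetical name, and the default of the classic locale is an assumption made to keep the example portable. The conversion still works on one char at a time, so it cannot handle characters whose uppercase form has a different length.

```cpp
#include <locale>
#include <string>

// Returns an uppercase copy of the text according to the given locale.
std::string toUpperCopy(std::string text,
                        const std::locale& loc = std::locale::classic())
{
    const auto& facet { std::use_facet<std::ctype<char>>(loc) };
    // Range overload: converts [first, last) in place.
    facet.toupper(text.data(), text.data() + text.size());
    return text;
}
```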

You can use the std::use_facet() function template to obtain a particular facet for a particular locale. The template type argument specifies the facet to retrieve, while the function argument specifies the locale from which to retrieve the facet. For example, the following expression retrieves the standard monetary punctuation facet of the British English locale using the POSIX locale name:

use_facet<moneypunct<wchar_t>>(locale { "en_GB" })

Note that the innermost template type determines the character type to use. The result is an object that contains all the information you want to know about British monetary punctuation. The data available in the standard facets is defined in <locale>. The following table lists the facet categories defined by the standard. Consult a Standard Library reference (see Appendix B, “Annotated Bibliography”) for details about the individual facets.

FACET | DESCRIPTION
ctype | Character classification facets
codecvt | Conversion facets; see next section
collate | Comparing strings lexicographically
time_get | Parsing dates and times
time_put | Formatting dates and times
num_get | Parsing numeric values
num_put | Formatting numeric values
numpunct | Defines the formatting rules for numeric values
money_get | Parsing monetary values
money_put | Formatting monetary values
moneypunct | Defines the formatting rules for monetary values

The following code snippet brings together locales and facets by printing out the currency symbol in both US English and British English. Note that, depending on your environment, the British currency symbol may appear as a question mark, a box, or not at all. If your environment is set up to handle it, you may actually get the British pound symbol.

locale locUSEng { "en-US" }; // "en_US" for POSIX
locale locBritEng { "en-GB" }; // "en_GB" for POSIX
wstring dollars { use_facet<moneypunct<wchar_t>>(locUSEng).curr_symbol() };
wstring pounds { use_facet<moneypunct<wchar_t>>(locBritEng).curr_symbol() };
wcout << L"In the US, the currency symbol is " << dollars << endl;
wcout << L"In Great Britain, the currency symbol is " << pounds << endl;
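Some facets can be used without calling use_facet() yourself. A std::locale object has a function call operator that compares two strings using the locale's collate facet, so a locale can be passed directly as the comparator to a sorting algorithm. The following sketch uses a hypothetical helper, sortedWith().

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Returns the words sorted according to the collation rules of the locale.
std::vector<std::string> sortedWith(const std::locale& loc,
                                    std::vector<std::string> words)
{
    // std::locale::operator() compares two strings using the collate facet.
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

With the classic locale, strings are compared by character value, so all uppercase letters sort before all lowercase letters. Other locales may order the same words differently.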

The C++ standard provides the codecvt class template to help with converting between different character encodings. <locale> defines the following four encoding conversion classes:

CLASS | DESCRIPTION
codecvt<char,char,mbstate_t> | Identity conversion, that is, no conversion
codecvt<char16_t,char,mbstate_t>, codecvt<char16_t,char8_t,mbstate_t> | Conversion between UTF-16 and UTF-8
codecvt<char32_t,char,mbstate_t>, codecvt<char32_t,char8_t,mbstate_t> | Conversion between UTF-32 and UTF-8
codecvt<wchar_t,char,mbstate_t> | Conversion between wide (implementation-specific) and narrow character encodings

Unfortunately, these facets are rather complicated to use. As an example, the following code snippet converts a narrow string to a wide string:

auto& facet { use_facet<codecvt<wchar_t, char, mbstate_t>>(locale { }) };
string narrowString { "Hello" };
mbstate_t mb { };
wstring wideString(narrowString.size(), '\0');
const char* fromNext { nullptr };
wchar_t* toNext { nullptr };
facet.in(mb,
    narrowString.data(), narrowString.data() + narrowString.size(), fromNext,
    wideString.data(), wideString.data() + wideString.size(), toNext);
wideString.resize(toNext - wideString.data());
wcout << wideString << endl;
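The preceding snippet ignores the value returned by in(). A slightly more defensive version wraps the conversion in a function and checks that result. This is only a sketch: widen() is a hypothetical name, the default of the classic locale is an assumption, and throwing on failure is just one possible error-handling strategy.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts a narrow string to a wide string using the codecvt facet
// of the given locale. Throws if the conversion does not complete.
std::wstring widen(const std::string& narrow,
                   const std::locale& loc = std::locale::classic())
{
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const auto& facet { std::use_facet<Facet>(loc) };

    std::mbstate_t state { };
    std::wstring wide(narrow.size(), L'\0');  // never more wide than narrow units
    const char* fromNext { nullptr };
    wchar_t* toNext { nullptr };
    const auto result { facet.in(state,
        narrow.data(), narrow.data() + narrow.size(), fromNext,
        wide.data(), wide.data() + wide.size(), toNext) };
    if (result != std::codecvt_base::ok) {
        throw std::runtime_error { "narrow-to-wide conversion failed" };
    }
    // Shrink to the number of wide characters actually produced.
    wide.resize(static_cast<std::size_t>(toNext - wide.data()));
    return wide;
}
```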

Before C++17, the following three code conversion facets were defined in <codecvt>: codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16. These could be used with two convenience conversion interfaces: wstring_convert and wbuffer_convert. However, C++17 has deprecated those three conversion facets (the entirety of <codecvt>) and the two convenience interfaces, so they are not further discussed in this book. The C++ Standards Committee decided to deprecate this functionality because it does not handle errors very well. Ill-formed Unicode strings are a security risk, and in fact can be and have been used as an attack vector to compromise the security of systems. Also, the API is too obscure and too hard to understand. I recommend using third-party libraries, such as ICU, to work correctly with Unicode strings until the Standards Committee comes up with a suitable, safe, and easier-to-use replacement for the deprecated functionality.

Regular expressions, defined in <regex>, are a powerful string-related feature of the Standard Library. They support a special mini-language for string processing and might seem complicated at first, but once you get to know them, they make working with strings easier. Regular expressions can be used for several string operations:

  • Validation: Check if an input string is well formed. For example, is the input string a well-formed phone number?
  • Decision: Check what kind of string an input represents. For example, is the input string the name of a JPEG or a PNG file?
  • Parsing: Extract information from an input string. For example, extract the year, month, and day from a date.
  • Transformation: Search substrings and replace them with a new formatted substring. For example, search all occurrences of “C++23” and replace them with “C++”.
  • Iteration: Search all occurrences of a substring. For example, extract all phone numbers from an input string.
  • Tokenization: Split a string into substrings based on a set of delimiters. For example, split a string on whitespace, commas, periods, and so on, to extract the individual words.

Of course, you could write your own code to perform any of these operations on strings, but I recommend using the regular expressions functionality, because writing correct and safe code to process strings is tricky.

Before going into more detail on regular expressions, there is some important terminology you need to know. The following terms are used throughout the discussion:

  • Pattern: The actual regular expression is a pattern represented by a string.
  • Match: Determines whether there is a match between a given regular expression and all of the characters in a given sequence [first, last).
  • Search: Determines whether there is some substring within a given sequence [first, last) that matches a given regular expression.
  • Replace: Identifies substrings in a given sequence and replaces them with a corresponding new substring computed from another pattern, called a substitution pattern.
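As a first impression of how these three terms map onto the library, the following sketch uses std::regex_match(), std::regex_search(), and std::regex_replace() with a simple digit pattern. The function names are hypothetical, and the pattern syntax is explained later in this chapter.

```cpp
#include <regex>
#include <string>

// Match: the pattern must account for every character of the input.
bool isAllDigits(const std::string& text)
{
    static const std::regex digits { "[0-9]+" };
    return std::regex_match(text, digits);
}

// Search: it is enough that some substring matches the pattern.
bool containsDigits(const std::string& text)
{
    static const std::regex digits { "[0-9]+" };
    return std::regex_search(text, digits);
}

// Replace: every matching substring is replaced by the substitution pattern.
std::string maskDigits(const std::string& text)
{
    static const std::regex digit { "[0-9]" };
    return std::regex_replace(text, digit, "#");
}
```

For the input abc123, isAllDigits() returns false, containsDigits() returns true, and maskDigits() returns abc###.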

There are several different grammars for regular expressions. C++ includes support for the following grammars:

  • ECMAScript: The grammar based on the ECMAScript standard. ECMAScript is a scripting language standardized by ECMA-262. The core of JavaScript, ActionScript, JScript, and so on, all use the ECMAScript language standard.
  • basic: The basic POSIX grammar.
  • extended: The extended POSIX grammar.
  • awk: The grammar used by the POSIX awk utility.
  • grep: The grammar used by the POSIX grep utility.
  • egrep: The grammar used by the POSIX grep utility with the -E parameter.

If you already know any of these regular expression grammars, you can use it straightaway in C++ by instructing the regular expression library to use that specific syntax (syntax_option_type). The default grammar in C++ is ECMAScript, whose syntax is explained in detail in the following section. It is also the most powerful grammar. Explaining the other regular expression grammars falls outside the scope of this book.
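The grammar is selected with a flag passed as the second argument to the std::regex constructor, and it can be combined with other options such as icase. The following sketch uses a pattern that happens to be valid in both the ECMAScript and the extended POSIX grammar; the function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Uses the extended POSIX grammar and ignores case while matching.
bool matchesExtendedIgnoringCase(const std::string& text)
{
    static const std::regex pattern {
        "ab+(c|d)", std::regex::extended | std::regex::icase };
    return std::regex_match(text, pattern);
}
```

If no grammar flag is given, std::regex::ECMAScript is used.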

A regular expression pattern is a sequence of characters representing what you want to match. Any character in the regular expression matches itself, except for the following special characters:

^ $ \ . * + ? ( ) [ ] { } |

These special characters are explained throughout the following discussion. If you need to match one of these special characters, you need to escape it using the \ character, as in this example:

\[ or \. or \* or \\

The special characters ^ and $ are called anchors. The ^ character matches the position immediately following a line termination character, and $ matches the position of a line termination character. By default, ^ and $ also match the beginning and ending of a string, respectively, but this behavior can be disabled.

For example, ^test$ matches only the string test, and not strings that contain test somewhere in the line, such as 1test, test2, test abc, and so on.

The wildcard character . can be used to match any single character except a newline character. For example, the regular expression a.c will match abc and a5c, but will not match ab5c, ac, and so on.

The | character can be used to specify the “or” relationship. For example, a|b matches a or b.
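The anchor, wildcard, and alternation examples can be tried out with a small search helper. The following is a sketch only; containsMatch() is a hypothetical name, and constructing the regular expression on every call is done here for brevity, not for efficiency.

```cpp
#include <regex>
#include <string>

// Returns true if some substring of text matches the pattern.
bool containsMatch(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex { pattern });
}
```

With this helper, containsMatch("test", "^test$") is true while containsMatch("1test", "^test$") is false, because the anchors tie the pattern to the beginning and the end of the input. Similarly, a.c is found in a5c but not in ab5c or ac.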

Parentheses, (), are used to mark subexpressions, also called capture groups. Capture groups can be used for several purposes:

  • Capture groups can be used to identify individual subsequences of the original string; each marked subexpression (capture group) is returned in the result. For example, the regular expression (.)(ab|cd)(.) has three marked subexpressions. Performing a search operation with this regular expression on 1cd4 results in a match with four entries. The first entry is the entire match, 1cd4, followed by three entries for the three marked subexpressions. These three entries are 1, cd, and 4.
  • Capture groups can be used during matching for a purpose called back references (explained later).
  • Capture groups can be used to identify components during replace operations (explained later).
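The first use, retrieving the individual capture groups, can be sketched with std::smatch. The helper name searchEntries() is hypothetical; it simply copies every entry of the match result into a vector.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns all entries of the first match: entry 0 is the entire match,
// followed by one entry per capture group. Empty if nothing matches.
std::vector<std::string> searchEntries(const std::string& text,
                                       const std::string& pattern)
{
    std::vector<std::string> entries;
    std::smatch match;
    if (std::regex_search(text, match, std::regex { pattern })) {
        for (std::size_t i { 0 }; i < match.size(); ++i) {
            entries.push_back(match[i].str());
        }
    }
    return entries;
}
```

Calling searchEntries("1cd4", "(.)(ab|cd)(.)") yields the four entries 1cd4, 1, cd, and 4.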

Parts of a regular expression can be repeated by using one of four quantifiers:

  • * matches the preceding part zero or more times. For example, a*b matches b, ab, aab, aaaab, and so on.
  • + matches the preceding part one or more times. For example, a+b matches ab, aab, aaaab, and so on, but not b.
  • ? matches the preceding part zero or one time. For example, a?b matches b and ab, but nothing else.
  • {…} represents a bounded quantifier. b{n} matches b repeated exactly n times; b{n,} matches b repeated n times or more; and b{n,m} matches b repeated between n and m times inclusive. For example, b{3,4} matches bbb and bbbb but not b, bb, bbbbb, and so on.
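The quantifier examples from this list can be verified with std::regex_match(), which requires the entire input to match. The helper matchesWhole() is a hypothetical name used for this sketch.

```cpp
#include <regex>
#include <string>

// Returns true if the entire text matches the pattern.
bool matchesWhole(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex { pattern });
}
```

For example, matchesWhole("bbbb", "b{3,4}") is true, while matchesWhole("bbbbb", "b{3,4}") is false because five repetitions exceed the upper bound.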

These quantifiers are called greedy because they find the longest match while still matching the remainder of the regular expression. To make them non-greedy, a ? can be added behind the quantifier, as in *?, +?, ??, and {…}?. A non-greedy quantifier repeats its pattern as few times as possible while still matching the remainder of the regular expression.

For example, the following table shows the difference between a greedy and a non-greedy regular expression, and the resulting submatches when running them on the input sequence aaabbb:

REGULAR EXPRESSION | SUBMATCHES
Greedy: (a+)(ab)*(b+) | “aaa” “” “bbb”
Non-greedy: (a+?)(ab)*(b+) | “aa” “ab” “bb”
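The submatches in this table can be reproduced with a few lines of code. In the following sketch, captureGroups() is a hypothetical helper that returns only the capture groups, skipping the entry for the entire match.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the capture groups (without the entire match) when the
// whole text matches the pattern. A group that did not participate
// in the match is returned as an empty string.
std::vector<std::string> captureGroups(const std::string& text,
                                       const std::string& pattern)
{
    std::vector<std::string> groups;
    std::smatch match;
    if (std::regex_match(text, match, std::regex { pattern })) {
        for (std::size_t i { 1 }; i < match.size(); ++i) {
            groups.push_back(match[i].str());
        }
    }
    return groups;
}
```

In the non-greedy case, (a+?) first tries a single a, but then the rest of the pattern cannot match; it therefore extends to aa, after which (ab)* consumes ab and (b+) the remaining bb.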

Just as with mathematical formulas, it’s important to know the precedence of regular expression elements. Precedence is as follows:

  • Elements like b are the basic building blocks of a regular expression.
  • Quantifiers like +, *, ?, and {…} bind tightly to the element on the left; for example, b+.
  • Concatenation like ab+c binds after quantifiers.
  • Alternation like | binds last.

For example, the regular expression ab+c|d matches abc, abbc, abbbc, and so on, and also d. Parentheses can be used to change these precedence rules. For example, ab+(c|d) matches abc, abbc, abbbc, …, abd, abbd, abbbd, and so on. However, by using parentheses, you also mark it as a subexpression or capture group. It is possible to change the precedence rules without creating new capture groups by using (?:). For example, ab+(?:c|d) matches the same as the earlier ab+(c|d) but does not create an additional capture group.
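The difference between (…) and (?:…) can be observed through the mark_count() member function of std::regex, which returns the number of capture groups in the expression. The helper names in this sketch are hypothetical.

```cpp
#include <regex>
#include <string>

// Number of capture groups (marked subexpressions) in the pattern.
unsigned countCaptureGroups(const std::string& pattern)
{
    return std::regex { pattern }.mark_count();
}

// Returns true if the entire text matches the pattern.
bool matchesWhole(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex { pattern });
}
```

Here, countCaptureGroups("ab+(c|d)") is 1, while countCaptureGroups("ab+(?:c|d)") is 0, even though both patterns match exactly the same strings.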

Instead of writing (a|b|c|…|z), which is clumsy and introduces a capture group, a special syntax for specifying sets of characters or ranges of characters is available. In addition, a “not” form of the match is also available. A character set is specified between square brackets and allows you to write [c1c2…cn], which matches any of the characters c1, c2, …, or cn. For example, [abc] matches any character a, b, or c. If the first character is ^, it means “any but”:

  • ab[cde] matches abc, abd, and abe.
  • ab[^cde] matches abf, abp, and so on, but not abc, abd, and abe.

If you need to match the ^, [, or ] characters themselves, you need to escape them; for example, [\[\^\]] matches the characters [, ^, or ].

If you want to specify all letters, you could use a character set like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]; however, this is clumsy, and doing this several times is awkward, especially if you make a typo and omit one of the letters accidentally. There are two solutions to this.

One solution is to use the range specification in square brackets; this allows you to write [a-zA-Z], which recognizes all the letters in the range a to z and A to Z. If you need to match a hyphen, you need to escape it; for example, [a-zA-Z\-]+ matches any word including a hyphenated word.

Another solution is to use one of the character classes. These are used to denote specific types of characters and are represented as [:name:]. Which character classes are available depends on the locale, but the names listed in the following table are always recognized. The exact meaning of these character classes is also dependent on the locale. This table assumes the standard C locale:

CHARACTER CLASS NAME   DESCRIPTION
digit                  Digits, which are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
d                      Same as digit.
xdigit                 Digits (digit) and the following letters used in hexadecimal numbers: a, b, c, d, e, f, A, B, C, D, E, F.
alpha                  Alphabetic characters. For the C locale, these are all lowercase and uppercase letters.
alnum                  A combination of the alpha class and the digit class.
w                      Same as alnum.
lower                  Lowercase letters, if applicable to the locale.
upper                  Uppercase letters, if applicable to the locale.
blank                  Blank characters, which are whitespace characters used to separate words within a line of text. For the C locale, these are space and \t (tab).
space                  Whitespace characters. For the C locale, these are space, \t, \n, \r, \v, and \f.
s                      Same as space.
print                  Printable characters. These occupy a printing position (for example, on a display) and are the opposite of control characters (cntrl). Examples are lowercase letters, uppercase letters, digits, punctuation characters, and space characters.
cntrl                  Control characters. These are the opposite of printable characters (print) and don’t occupy a printing position, for example, on a display. Some examples for the C locale are \f, \n, and \r.
graph                  Characters with a graphical representation. These are all characters that are printable (print), except the space character ' '.
punct                  Punctuation characters. For the C locale, these are all graphical characters (graph) that are not alphanumeric (alnum). Some examples are !, #, @, }, and so on.

Character classes are used within character sets; for example, [[:alpha:]]* in English means the same as [a-zA-Z]*.

Because certain character classes, such as digits, are so common, there are shorthand patterns for them. For example, inside a character set, [:digit:] and [:d:] have the same meaning as the range 0-9. Some classes have an even shorter pattern using the escape notation. For example, \d means [[:digit:]]. Therefore, to recognize a sequence of one or more digits, you can write any of the following patterns:

  • [0-9]+
  • [[:digit:]]+
  • [[:d:]]+
  • \d+
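As a quick check that the four spellings above behave the same way, the following sketch runs each of them against an input string. The names digitPatterns and countMatchingPatterns() are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// The four equivalent ways of writing "one or more digits".
const std::vector<std::string> digitPatterns {
    "[0-9]+", "[[:digit:]]+", "[[:d:]]+", R"(\d+)" };

// Hypothetical helper: counts how many of the four patterns match all of `text`.
int countMatchingPatterns(const std::string& text)
{
    int count { 0 };
    for (const auto& pattern : digitPatterns) {
        const std::regex re { pattern };
        if (std::regex_match(text, re)) { ++count; }
    }
    return count;
}
```

For an input such as "2024", all four patterns are expected to match; for "20a4" or an empty string, none of them should.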

The following table lists the available escape notations for character classes:

ESCAPE NOTATION   EQUIVALENT TO
\d                [[:d:]]
\D                [^[:d:]]
\s                [[:s:]]
\S                [^[:s:]]
\w                [_[:w:]]
\W                [^_[:w:]]

Here are some examples:

  • Test[5-8] matches Test5, Test6, Test7, and Test8.
  • [[:lower:]] matches a, b, and so on, but not A, B, and so on.
  • [^[:lower:]] matches any character except lowercase letters like a, b, and so on.
  • [[:lower:]5-7] matches any lowercase letter like a, b, and so on, and the numbers 5, 6, and 7.
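The examples in this list can be tried out with a one-line wrapper around regex_match(). The function name matchesSet() is a hypothetical helper, and the sketch assumes the default C locale.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches the whole of `text`.
bool matchesSet(const std::string& text, const std::string& pattern)
{
    const std::regex re { pattern };
    return std::regex_match(text, re);
}
```

For instance, matchesSet("Test7", "Test[5-8]") holds while matchesSet("Test9", "Test[5-8]") does not, and matchesSet("6", "[[:lower:]5-7]") holds while matchesSet("8", "[[:lower:]5-7]") does not.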

A word boundary can mean the following:

  • The first character of a word, which is one of the word characters, while the preceding character is not a word character. A word character is a letter, digit, or an underscore. For the standard C locale, this is equal to [A-Za-z0-9_].
  • The end of a word, which is a non-word character, while the preceding character is a word character.
  • The beginning of the source string if the first character of the source string is one of the word characters. Matching the beginning of the source string is enabled by default, but you can disable it with regex_constants::match_not_bow, where bow stands for beginning-of-word.
  • The end of the source string if the last character of the source string is one of the word characters. Matching the end of the source string is enabled by default, but you can disable it with regex_constants::match_not_eow, where eow stands for end-of-word.

You can use \b to match a word boundary, and you can use \B to match anything except a word boundary.
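As an illustration of \b, the following sketch counts whole-word occurrences of a word in a text. The function countWholeWord() is a hypothetical helper; it assumes that the word contains only word characters, so that it can be inserted into the pattern without escaping.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts occurrences of `word` that are delimited by
// word boundaries. Assumes `word` contains only letters, digits, or '_'.
std::size_t countWholeWord(const std::string& text, const std::string& word)
{
    const std::regex re { R"(\b)" + word + R"(\b)" };
    const std::sregex_iterator begin { text.cbegin(), text.cend(), re };
    const std::sregex_iterator end {};
    return static_cast<std::size_t>(std::distance(begin, end));
}
```

In the text "cat concat cat, catalog", only the first and third words are surrounded by word boundaries on both sides, so searching for cat is expected to count two occurrences.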

A back reference allows you to reference a captured group inside the regular expression itself: \n refers to the n-th captured group, with n > 0. For example, the regular expression (\d+)-.*-\1 matches a string that has the following format:

  • One or more digits captured in a capture group (\d+)
  • Followed by a dash -
  • Followed by zero or more characters .*
  • Followed by another dash -
  • Followed by the same digits captured by the first capture group \1

This regular expression matches 123-abc-123, 1234-a-1234, and so on, but does not match 123-abc-1234, 123-abc-321, and so on.
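The back reference example can be verified with a short sketch. The function name hasMatchingIds() is hypothetical; the pattern is the one discussed above, written as a raw string literal.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` starts and ends with the same digits,
// separated by dashes, as described by (\d+)-.*-\1.
bool hasMatchingIds(const std::string& text)
{
    static const std::regex re { R"((\d+)-.*-\1)" };
    return std::regex_match(text, re);
}
```

This sketch uses regex_match(), so the whole input must have the described format; that is why 123-abc-1234 is rejected even though it contains 123-abc-123 as a substring.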

Regular expressions support positive lookahead (which uses ?=pattern) and negative lookahead (which uses ?!pattern). The characters following the lookahead must match (positive) or not match (negative) the lookahead pattern, but those characters are not yet consumed.

For example, the pattern a(?!b) contains a negative lookahead to match a letter a not followed by a b. The pattern a(?=b) contains a positive lookahead to match a letter a followed by a b, but b is not consumed so it does not become part of the match.
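To see that a lookahead does not consume characters, the following sketch returns the text of the first match. The function firstMatch() is a hypothetical helper; it returns an empty string when nothing matches, which is sufficient here because none of the example patterns can produce an empty match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matching
// `pattern`, or an empty string if there is no match.
std::string firstMatch(const std::string& text, const std::string& pattern)
{
    const std::regex re { pattern };
    std::smatch m;
    return std::regex_search(text, m, re) ? m[0].str() : std::string {};
}
```

Searching "ab" with a(?=b) yields just a, not ab. Because the b is still available after the lookahead, the pattern a(?=b)b matches the full ab.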

The following is a more realistic example. The regular expression matches an input sequence that consists of at least one lowercase letter, at least one uppercase letter, at least one punctuation character, and is at least eight characters long. Such a regular expression can, for example, be used to enforce that passwords satisfy certain criteria.

(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,}

In one of the exercises at the end of this chapter, you’ll experiment with this password-validation regular expression.
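As a starting point for such experiments, the pattern might be wrapped in a function along the following lines. The name isAcceptablePassword() is hypothetical, and the sketch assumes the default C locale for the character classes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: checks a candidate against the password pattern.
// Each lookahead verifies one requirement without consuming input;
// .{8,} then enforces the minimum length.
bool isAcceptablePassword(const std::string& candidate)
{
    static const std::regex re {
        "(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,}" };
    return std::regex_match(candidate, re);
}
```

The regular expression is constructed once as a static object, because compiling a regular expression is comparatively expensive.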

Regular Expressions and Raw String Literals


As seen in the preceding sections, regular expressions often use special characters that must be escaped in normal C++ string literals. For example, if you write \d in a regular expression, it matches any digit. However, because \ is a special character in C++, you need to escape it in a regular expression string literal as \\d; otherwise, your C++ compiler tries to interpret the \d. It gets more complicated if you want a regular expression to match a single backslash character, \. Because \ is a special character in the regular expression syntax itself, you need to escape it as \\. The \ character is also a special character in C++ string literals, so you need to escape it, resulting in \\\\.

You can use raw string literals to make complicated regular expressions easier to read in C++ source code. (Raw string literals are discussed in Chapter 2.) For example, take the following regular expression:

"( |\\n|\\r|\\\\)"

This regular expression matches spaces, newlines, carriage returns, and backslashes. It requires a lot of escape characters. Using raw string literals, this can be replaced with the following more readable regular expression:

R"(( |\n|\r|\\))"

The raw string literal starts with R"( and ends with )". Everything in between is the regular expression. Of course, you still need a double backslash at the end because the backslash needs to be escaped in the regular expression itself.
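The following sketch confirms that both spellings produce exactly the same pattern text. The names escapedPattern, rawPattern, and isSeparator() are hypothetical and serve only this illustration.

```cpp
#include <regex>
#include <string>

// The same regular expression, written as a normal and as a raw string literal.
const std::string escapedPattern { "( |\\n|\\r|\\\\)" };
const std::string rawPattern { R"(( |\n|\r|\\))" };

// Hypothetical helper: true if `text` is a single space, newline,
// carriage return, or backslash.
bool isSeparator(const std::string& text)
{
    static const std::regex re { rawPattern };
    return std::regex_match(text, re);
}
```

Comparing escapedPattern with rawPattern using == shows that the compiler stores identical characters for both; only the source code differs in readability.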

Writing correct regular expressions is not always trivial. For common patterns such as validating passwords, phone numbers, Social Security numbers, IP addresses, email addresses, credit card numbers, dates, and so on, you don’t have to. When you use your favorite Internet search engine and search for regular expressions online, you’ll find several websites with collections of predefined patterns, such as regexr.com, regex101.com, regextester.com, and many more. Quite a few of these sites allow you to test patterns online, so you can easily verify whether they are correct before using them in your code.

This concludes a brief description of the ECMAScript grammar. The following sections explain how to actually use regular expressions in C++ code.

Everything for the regular expression library is defined in <regex> and in the std namespace. The basic template types defined by the regular expression library are:

  • basic_regex: An object representing a specific regular expression.
  • match_results: A substring that matched a regular expression, including all the captured groups. It is a collection of sub_matches.
  • sub_match: An object containing a pair of iterators into the input sequence. These iterators represent a matched capture group. The pair is an iterator pointing to the first character of a matched capture group and an iterator pointing to one-past-the-last character of the matched capture group. It has an str() member function that returns the matched capture group as a string.

The library provides three key algorithms: regex_match(), regex_search(), and regex_replace(). All of these algorithms have different overloads that allow you to specify the source string as a string, a C-style string, or as a begin/end iterator pair. The iterators can be any of the following:

  • const char* or const wchar_t*
  • string::const_iterator or wstring::const_iterator

In fact, any iterator that behaves as a bidirectional iterator can be used. See Chapter 17, “Understanding Iterators and the Ranges Library,” for details on iterators.

The library also defines the following two regular expression iterators, which play an important role in finding all occurrences of a pattern in a source string:

  • regex_iterator: Iterates over all the occurrences of a pattern in a source string.
  • regex_token_iterator: Iterates over all the capture groups of all occurrences of a pattern in a source string.

To make the library easier to use, the standard defines a number of type aliases for the preceding templates:

using regex = basic_regex<char>;
using wregex = basic_regex<wchar_t>;
using csub_match = sub_match<const char*>;
using wcsub_match = sub_match<const wchar_t*>;
using ssub_match = sub_match<string::const_iterator>;
using wssub_match = sub_match<wstring::const_iterator>;
using cmatch = match_results<const char*>;
using wcmatch = match_results<const wchar_t*>;
using smatch = match_results<string::const_iterator>;
using wsmatch = match_results<wstring::const_iterator>;
using cregex_iterator = regex_iterator<const char*>;
using wcregex_iterator = regex_iterator<const wchar_t*>;
using sregex_iterator = regex_iterator<string::const_iterator>;
using wsregex_iterator = regex_iterator<wstring::const_iterator>;
using cregex_token_iterator = regex_token_iterator<const char*>;
using wcregex_token_iterator = regex_token_iterator<const wchar_t*>;
using sregex_token_iterator = regex_token_iterator<string::const_iterator>;
using wsregex_token_iterator = regex_token_iterator<wstring::const_iterator>;
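Which alias you need follows from the character type and from how the input sequence is supplied. The following sketch pairs wregex with wsmatch for a std::wstring, and regex with cmatch for a C-style string. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Wide-character input in a std::wstring: wregex together with wsmatch.
bool isFourDigitYear(const std::wstring& text)
{
    static const std::wregex re { LR"(\d{4})" };
    std::wsmatch m;
    return std::regex_match(text, m, re);
}

// C-style string input: regex together with cmatch.
// Returns the first run of digits, or an empty string if there is none.
std::string firstNumber(const char* text)
{
    static const std::regex re { R"(\d+)" };
    std::cmatch m;
    return std::regex_search(text, m, re) ? m[0].str() : std::string {};
}
```

Mixing the aliases, for example passing an smatch together with a C-style string, does not compile, because the iterator types of the input and of the match_results must agree.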

The following sections explain the regex_match(), regex_search(), and regex_replace() algorithms, and the regex_iterator and regex_token_iterator classes.

The regex_match() algorithm can be used to compare a given source string with a regular expression pattern. It returns true if the pattern matches the entire source string, and false otherwise. There are seven overloads of the regex_match() algorithm accepting different kinds of arguments. They all have the following form:

template<…>
bool regex_match(InputSequence[, MatchResults], RegEx[, Flags]);

The InputSequence can be represented as follows:

  • A start and end iterator into a source string
  • An std::string
  • A C-style string

The optional MatchResults parameter is a reference to a match_results and receives the match. If regex_match() returns false, you are only allowed to call match_results::empty() or match_results::size(); anything else is undefined. If regex_match() returns true, a match is found, and you can inspect the match_results object for what exactly got matched. This is explained with examples in the following subsections.

The RegEx parameter is the regular expression that needs to be matched. The optional Flags parameter specifies options for the matching algorithm. In most cases, you can keep the default. For more details, consult a Standard Library Reference.

The following program asks the user to enter a date in the format year/month/day, where year is four digits, month is a number between 1 and 12, and day is a number between 1 and 31. A regular expression together with the regex_match() algorithm is used to validate the user input. The details of the regular expression are explained after the code.

regex r { "\\d{4}/(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[1-2][0-9]|3[0-1])" };
while (true) {
    print("Enter a date (year/month/day) (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    if (regex_match(str, r)) { println(" Valid date."); }
    else { println(" Invalid date!"); }
}

The first line creates the regular expression. The expression consists of three parts separated by a forward slash (/) character: one part for year, one for month, and one for day. The following list explains these parts:

  • \d{4}: Matches any combination of four digits; for example, 1234, 2024, and so on.
  • (?:0?[1-9]|1[0-2]): This subpart of the regular expression is wrapped inside parentheses to make sure the precedence is correct. We don’t need a capture group, so (?:) is used. The inner expression consists of an alternation of two parts separated by the | character.
    • 0?[1-9]: Matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.
    • 1[0-2]: Matches 10, 11, or 12, and nothing else.
  • (?:0?[1-9]|[1-2][0-9]|3[0-1]): This subpart is also wrapped inside a non-capture group and consists of an alternation of three parts.
    • 0?[1-9]: Again matches any number from 1 to 9 with an optional 0 in front of it.
    • [1-2][0-9]: Matches any number between 10 and 29 inclusive and nothing else.
    • 3[0-1]: Matches 30 or 31 and nothing else.

The example then enters an infinite loop to ask the user to enter a date. Each date entered is given to the regex_match() algorithm. When regex_match() returns true, the user has entered a date that matches the date regular expression pattern.

This example can be extended by asking the regex_match() algorithm to return captured subexpressions in a results object. You first have to understand what a capture group does. By specifying a match_results object like smatch in a call to regex_match(), the elements of the match_results object are filled in when the regular expression matches the input string. To be able to extract these substrings, you must create capture groups using parentheses.

The first element, [0], in a match_results object contains the string that matched the entire pattern. When using regex_match() and a match is found, this is the entire source sequence. When using regex_search(), discussed in the next section, this can be a substring in the source sequence that matches the regular expression. Element [1] is the substring matched by the first capture group, [2] by the second capture group, and so on. To get a string representation of the ith capture group from a match_results object m, you can use m[i] as in the following code, or m[i].str().

The following code extracts the year, month, and day digits into three separate integer variables. The regular expression in the revised example has a few small changes. The first part matching the year is wrapped in a capture group, while the month and day parts are now also capture groups instead of non-capture groups. The call to regex_match() includes a smatch parameter, which receives the matched capture groups. Here is the adapted example:

regex r { "(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])" };
while (true) {
    print("Enter a date (year/month/day) (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    if (smatch m; regex_match(str, m, r)) {
        int year { stoi(m[1]) };
        int month { stoi(m[2]) };
        int day { stoi(m[3]) };
        println(" Valid date: Year={}, month={}, day={}", year, month, day);
    } else {
        println(" Invalid date!");
    }
}

In this example, there are four elements in the smatch results objects:

  • [0]: The string matching the full regular expression, which in this example is the full date
  • [1]: The year
  • [2]: The month
  • [3]: The day

When you execute this example, you can get the following output:

Enter a date (year/month/day) (q=quit): 2024/12/01
Valid date: Year=2024, month=12, day=1
Enter a date (year/month/day) (q=quit): 24/12/01
Invalid date!

The regex_match() algorithm discussed in the previous section returns true if the entire source string matches the regular expression and false otherwise. If you want to search for a matching substring, you need to use regex_search(). There are seven overloads of regex_search(), and they all have the following form:

template<…>
bool regex_search(InputSequence[, MatchResults], RegEx[, Flags]);

All overloads return true when a match is found somewhere in the input sequence and false otherwise. The parameters are similar to the parameters for regex_match().

Two overloads of regex_search() accept a begin and end iterator as the input sequence that you want to process. You might be tempted to use this version of regex_search() in a loop to find all occurrences of a pattern in a source string by manipulating these begin and end iterators for each regex_search() call. Never do this! It can cause problems when your regular expression uses anchors (^ or $), word boundaries, and so on. It can also cause an infinite loop due to empty matches. Use a regex_iterator or regex_token_iterator as explained later in this chapter to extract all occurrences of a pattern from a source string.

Never use regex_search() in a loop to find all occurrences of a pattern in a source string. Instead, use a regex_iterator or regex_token_iterator.
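The anchor problem can be made concrete with a small sketch that counts matches in two ways. Both function names are hypothetical. The first function deliberately shows the discouraged approach; the second relies on regex_iterator, which tells the matching algorithm that characters precede the current position.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// The discouraged approach: call regex_search() repeatedly on the rest of
// the input. Every call treats its begin iterator as the start of the text.
std::size_t countWithSearchLoop(const std::string& text, const std::regex& re)
{
    std::size_t count { 0 };
    std::string::const_iterator begin { text.cbegin() };
    std::smatch m;
    while (begin != text.cend() && std::regex_search(begin, text.cend(), m, re)) {
        ++count;
        if (m[0].first == m[0].second) { break; } // Guard against empty matches.
        begin = m[0].second;
    }
    return count;
}

// The recommended approach: let a regex_iterator walk over the matches.
std::size_t countWithIterator(const std::string& text, const std::regex& re)
{
    const std::sregex_iterator begin { text.cbegin(), text.cend(), re };
    const std::sregex_iterator end {};
    return static_cast<std::size_t>(std::distance(begin, end));
}
```

For the pattern ^a and the input aaa, the loop version should report three matches, because each search starts at what it believes is the beginning of the text. The iterator version should report a single match, which is the correct answer.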

The regex_search() algorithm can be used to extract a matching substring from an input sequence. For example, the following program extracts code comments from a string. The regular expression searches for a substring that starts with // followed by optional whitespace, \s*, followed by one or more characters captured in a capture group, (.+). This capture group captures only the comment substring. The smatch object m receives the search results. If successful, m[1] contains the comment that was found. You can check the m[1].first and m[1].second iterators to see where exactly the comment was found in the source string.

regex r { "//\\s*(.+)$" };
while (true) {
    print("Enter a string with optional code comments (q=quit):\n > ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    if (smatch m; regex_search(str, m, r)) {
        println(" Found comment '{}'", m[1].str());
    } else {
        println(" No comment found!");
    }
}

The output of this program can look as follows:

Enter a string with optional code comments (q=quit):
> std::string str; // Our source string
Found comment 'Our source string'
Enter a string with optional code comments (q=quit):
> int a; // A comment with // in the middle
Found comment 'A comment with // in the middle'
Enter a string with optional code comments (q=quit):
> std::vector values { 1, 2, 3 };
No comment found!

The match_results object also has a prefix() and suffix() member function, which return the string preceding or following the match, respectively.
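A minimal sketch of prefix() and suffix() follows. The SplitResult structure and the splitAroundFirstMatch() function are hypothetical names that are used only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: the text before, inside, and after the first match.
struct SplitResult
{
    bool found;
    std::string before;
    std::string match;
    std::string after;
};

// Hypothetical helper: splits `text` around the first match of `re`.
SplitResult splitAroundFirstMatch(const std::string& text, const std::regex& re)
{
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return { false, text, "", "" };
    }
    // prefix() and suffix() return sub_match objects; str() copies their text.
    return { true, m.prefix().str(), m[0].str(), m.suffix().str() };
}
```

For example, splitting "int a; // note" around the pattern //\s* gives "int a; " as the part before the match and "note" as the part after it.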

As explained in the previous section, you should never use regex_search() in a loop to extract all occurrences of a pattern from a source sequence. Instead, you should use a regex_iterator or regex_token_iterator. They work similarly to iterators for Standard Library containers.

The following example asks the user to enter a source string, extracts every word from the string, and prints all words between quotes. The regular expression in this case is [\w]+, which searches for one or more word-letters. This example uses std::string as a source, so it uses sregex_iterator for the iterators. A standard iterator loop is used, but in this case, the end iterator is done slightly differently from the end iterators of Standard Library containers. Normally, you specify an end iterator for a particular container, but for regex_iterator, there is only one “end” iterator. You get this end iterator by default constructing a regex_iterator.

The for loop creates a start iterator called iter, which accepts a begin and end iterator into the source string and a regular expression. The loop body is called for every match found, which is every word in this example. The sregex_iterator iterates over all the matches. By dereferencing a sregex_iterator, you get a smatch object. Accessing the first element of this smatch object, [0], gives you the matched substring:

regex reg { "[\\w]+" };
while (true) {
    print("Enter a string to split (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    const sregex_iterator end;
    for (sregex_iterator iter { cbegin(str), cend(str), reg };
         iter != end; ++iter) {
        println("\"{}\"", (*iter)[0].str());
    }
}

The output of this program can look as follows:

Enter a string to split (q=quit): This, is a test.
"This"
"is"
"a"
"test"

As this example demonstrates, even simple regular expressions can perform some powerful string operations!

Note that both regex_iterator, and regex_token_iterator discussed in the next section, internally store a pointer to the given regular expression. Hence, they both explicitly delete any constructors accepting rvalue reference regular expressions to prevent you from constructing them with temporary regex objects. For example, the following does not compile:

for (sregex_iterator iter { cbegin(str), cend(str), regex { "[\\w]+" } };
     iter != end; ++iter) {}

The previous section describes regex_iterator, which iterates through every match. On each iteration, you get a match_results object, which you can use to extract subexpressions for a match that are captured by capture groups.

A regex_token_iterator can be used to automatically iterate over all or selected capture groups across all matches. There are four constructors with the following format:

regex_token_iterator(BidirectionalIterator a,
                     BidirectionalIterator b,
                     const regex_type& re
                     [, SubMatches
                     [, Flags]]);

All of them require a begin and end iterator as input sequence, and a regular expression. The optional SubMatches parameter is used to specify which capture groups should be iterated over. SubMatches can be specified in four ways:

  • As a single integer representing the index of the capture group that you want to iterate over
  • As a vector with integers representing the indices of the capture groups that you want to iterate over
  • As an initializer_list with capture group indices
  • As a C-style array with capture group indices

When you omit SubMatches or when you specify a 0 for SubMatches, you get an iterator that iterates over all capture groups with index 0, which are the substrings matching the full regular expression. The optional Flags parameter specifies options for the matching algorithm. In most cases, you can keep the default. Consult a Standard Library Reference for more details.
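As a sketch of the initializer_list form of SubMatches, the following function collects the second and third capture groups of the date pattern from earlier in this chapter, with ^ and $ anchors added so that only complete dates match. The function name monthAndDay() is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the month and day of a year/month/day date,
// or an empty vector if `date` does not match the pattern.
std::vector<std::string> monthAndDay(const std::string& date)
{
    static const std::regex re {
        R"(^(\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$)" };
    // { 2, 3 } selects capture groups 2 (month) and 3 (day).
    const std::sregex_token_iterator first { date.cbegin(), date.cend(), re, { 2, 3 } };
    const std::sregex_token_iterator last {};
    return std::vector<std::string>(first, last);
}
```

The regular expression is a named static object and not a temporary, because the token iterator keeps a pointer to it.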

The earlier regex_iterator example can be rewritten using a regex_token_iterator as follows. Instead of using (*iter)[0].str() in the loop body, you simply use iter->str() because a token iterator with 0 (= default) submatch index automatically iterates over all capture groups with index 0. The output of this code is the same as the output generated by the earlier regex_iterator example.

regex reg { "[\\w]+" };
while (true) {
    print("Enter a string to split (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    const sregex_token_iterator end;
    for (sregex_token_iterator iter { cbegin(str), cend(str), reg };
         iter != end; ++iter) {
        println("\"{}\"", iter->str());
    }
}

The following example asks the user to enter a date and then uses a regex_token_iterator to iterate over the second and third capture groups (month and day), which are specified as a vector of integers. The regular expression used for dates is explained earlier in this chapter. The only difference is that ^ and $ anchors are added since we want to match the entire source sequence. Earlier, that was not necessary, because regex_match() automatically matches the entire input string.

regex reg { "^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$" };
while (true) {
    print("Enter a date (year/month/day) (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    vector indices { 2, 3 };
    const sregex_token_iterator end;
    for (sregex_token_iterator iter { cbegin(str), cend(str), reg, indices };
         iter != end; ++iter) {
        println("\"{}\"", iter->str());
    }
}

This code prints only the month and day of valid dates. Output generated by this example can look like this:

Enter a date (year/month/day) (q=quit): 2024/1/13
"1"
"13"
Enter a date (year/month/day) (q=quit): 2024/1/32
Enter a date (year/month/day) (q=quit): 2024/12/5
"12"
"5"

The regex_token_iterator can also be used to perform a field splitting or tokenization. It is a much safer and more flexible alternative compared to using the old, and not further discussed, strtok() function from C. Tokenization is enabled in the regex_token_iterator constructor by specifying -1 as the capture group index to iterate over. In tokenization mode, the iterator iterates over all substrings of the input sequence that do not match the regular expression. The following code demonstrates this by tokenizing a string on the delimiters , and ; with zero or more whitespace characters before or after a delimiter. The code demonstrates the tokenization in two ways: first by iterating over the tokens directly and then by creating a new vector containing all the tokens followed by printing the contents of the vector:

regex reg { R"(\s*[,;]\s*)" };
while (true) {
    print("Enter a string to split on ',' and ';' (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    // Iterate over the tokens.
    const sregex_token_iterator end;
    for (sregex_token_iterator iter { cbegin(str), cend(str), reg, -1 };
         iter != end; ++iter) {
        print("\"{}\", ", iter->str());
    }
    println("");
    // Store all tokens in a vector.
    vector<string> tokens {
        sregex_token_iterator { cbegin(str), cend(str), reg, -1 },
        sregex_token_iterator {} };
    // Print the contents of the tokens vector.
    println("{:n}", tokens);
}

The regular expression in this example is specified as a raw string literal and searches for patterns that match the following:

  • Zero or more whitespace characters
  • Followed by a , or ; character
  • Followed by zero or more whitespace characters

The output can be as follows:

Enter a string to split on ',' and ';' (q=quit): This is, a; test string.
"This is", "a", "test string.",
"This is", "a", "test string."

As you can see from this output, the string is split on , and ;. All whitespace characters around the , and ; are removed because the tokenization iterator iterates over all substrings that do not match the regular expression and because the regular expression matches , and ; with whitespace around them.

The regex_replace() algorithm requires a regular expression and a formatting string that is used to replace matching substrings. This formatting string can reference parts of the matched substrings by using the escape sequences in the following table.

ESCAPE SEQUENCE   REPLACED WITH
$n                The string matching the nth capture group; for example, $1 for the first capture group, $2 for the second, and so on. n must be greater than 0.
$&                The string matching the entire regular expression.
$`                The part of the input sequence that appears to the left of the substring matching the regular expression.
$'                The part of the input sequence that appears to the right of the substring matching the regular expression.
$$                A single dollar sign.
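The escape sequences in the table can be tried out with a thin wrapper around regex_replace(). The function name replaceAll() and the sample inputs are hypothetical and are chosen only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `text`
// according to the format string `format`.
std::string replaceAll(const std::string& text, const std::string& pattern,
                       const std::string& format)
{
    const std::regex re { pattern };
    return std::regex_replace(text, re, format);
}
```

For example, the format string "$2, $1" swaps two captured words, "$$$&" puts a dollar sign in front of the matched text, and "[$`|$']" replaces a match with the text to its left and right, separated by a vertical bar.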

There are six overloads of regex_replace(). The difference between them is in the type of parameters. Four of them have the following format:

template<…>
string regex_replace(InputSequence, RegEx, FormatString[, Flags]);

These four overloads return the resulting string after performing the replacement. Both the InputSequence and the FormatString can be an std::string or a C-style string. The RegEx parameter is the regular expression that needs to be matched. The optional Flags parameter specifies options for the replace algorithm.

Two overloads of regex_replace() have the following format:

OutputIterator regex_replace(OutputIterator,
                             BidirectionalIterator first,
                             BidirectionalIterator last,
                             RegEx, FormatString[, Flags]);

These two overloads write the resulting string to the given output iterator and return this output iterator. The input sequence is given as a begin and end iterator. The other parameters are identical to the other four overloads of regex_replace().

As a first example, take the following HTML source string:

<body><h1>Header</h1><p>Some text</p></body>

and the following regular expression:

<h1>(.*)</h1><p>(.*)</p>

The following table shows the different escape sequences and what they will be replaced with:

ESCAPE SEQUENCE   REPLACED WITH
$1                Header
$2                Some text
$&                <h1>Header</h1><p>Some text</p>
$`                <body>
$'                </body>

The following code demonstrates the use of regex_replace():

const string str { "<body><h1>Header</h1><p>Some text</p></body>" };
regex r { "<h1>(.*)</h1><p>(.*)</p>" };
const string replacement { "H1=$1 and P=$2" }; // See earlier table.
string result { regex_replace(str, r, replacement) };
println("Original string: '{}'", str);
println("New string : '{}'", result);

The output of this program is as follows:

Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string : '<body>H1=Header and P=Some text</body>'
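The output-iterator overloads of regex_replace() can be used in much the same way. The following sketch appends the result to a string through std::back_inserter; the function name replaceInto() is hypothetical.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: performs the replacement by writing the result
// through an output iterator instead of returning a new string directly.
std::string replaceInto(const std::string& input, const std::regex& re,
                        const std::string& format)
{
    std::string result;
    std::regex_replace(std::back_inserter(result),
                       input.cbegin(), input.cend(), re, format);
    return result;
}
```

This form is convenient when the output should go somewhere other than a fresh string, for example to the end of an existing buffer or to a stream through an ostream_iterator.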

The regex_replace() algorithm accepts a number of flags to change its behavior. The most important flags are given in the following table:

FLAG                DESCRIPTION
format_default      The default is to replace all occurrences of the pattern and to also copy everything to the output that does not match the pattern.
format_no_copy      Replaces all occurrences of the pattern but does not copy anything to the output that does not match the pattern.
format_first_only   Replaces only the first occurrence of the pattern.

The call to regex_replace() in the previous code snippet can be modified to use the format_no_copy flag:

string result { regex_replace(str, r, replacement,
    regex_constants::format_no_copy) };

The output now is as follows:

Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string : 'H1=Header and P=Some text'
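The format_first_only flag can be illustrated in a similar way. The following sketch replaces runs of digits with a # character; the function name maskDigits() and the sample input are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces runs of digits in `text` with '#',
// using the given formatting flags.
std::string maskDigits(const std::string& text,
    std::regex_constants::match_flag_type flags = std::regex_constants::format_default)
{
    static const std::regex re { R"(\d+)" };
    return std::regex_replace(text, re, "#", flags);
}
```

With the default flags, every run of digits in "a1b22c333" is replaced. With format_first_only, only the first run is replaced and the remainder of the input is copied unchanged. Flags can be combined with the | operator, for example format_first_only | format_no_copy.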

Another example using regex_replace() is to replace each word boundary in a string with a newline character so that the output contains only one word per line. The following code snippet demonstrates this without using any loops to process a given input string. The code first creates a regular expression that matches individual words. When a match is found with regex_replace(), it is substituted with $1\n where $1 is replaced with the matched word. Note also the use of the format_no_copy flag to prevent copying whitespace and other non-word characters from the source string to the output.

regex reg { "([\\w]+)" };
const string replacement { "$1\n" };
while (true) {
    print("Enter a string to split over multiple lines (q=quit): ");
    string str;
    if (!getline(cin, str) || str == "q") { break; }
    println("{}", regex_replace(str, reg, replacement,
        regex_constants::format_no_copy));
}

The output of this program can be as follows:

Enter a string to split over multiple lines (q=quit): This is a test.
This
is
a
test

This chapter gave you an appreciation for coding with localization in mind. As anyone who has been through a localization effort will tell you, adding support for a new language or locale is infinitely easier if you have planned ahead, for example, by using Unicode characters and being mindful of locales.

The second part of this chapter explained the regular expressions library. Once you know the syntax of regular expressions, it becomes much easier to work with strings. Regular expressions allow you to validate strings, search for substrings inside an input sequence, perform find-and-replace operations, and so on. It is highly recommended that you get to know regular expressions and start using them instead of writing your own string manipulation routines. They will make your life easier.

By solving the following exercises, you can practice the material discussed in this chapter. Solutions to all exercises are available with the code download on the book’s website at www.wiley.com/go/proc++6e. However, if you are stuck on an exercise, first reread the relevant parts of this chapter and try to find an answer yourself before looking at the solution on the website.

  1. Exercise 21-1: Use an appropriate facet to figure out the decimal separator for formatting numbers according to the user’s environment. Consult a Standard Library reference to learn about the exact member functions of your chosen facet.

  2. Exercise 21-2: Write an application that asks the user to enter a phone number as formatted in the United States. Here’s an example: 202-555-0108. Use a regular expression to validate the format of the phone number, that is, three digits, followed by a dash, three more digits, another dash, and a final four digits. If it’s a valid phone number, print out the three parts on separate lines. For example, for the earlier phone number, the result must be as follows:

    202
    555
    0108
  3. Exercise 21-3: Write an application that asks the user for a piece of source code that can span multiple lines and that can contain // style comments. To signal the end of the input, use a sentinel character, for example @. You can use std::getline() with '@' as the delimiter to read multiple lines of text from the standard input console. Finally, use a regular expression to remove comments from all lines of the code snippet. Make sure your code works properly on a snippet such as the following:

    string str; // A comment // Some more comments.
    str = "Hello"; // Hello.

    The result for this input must be as follows:

    string str;
    str = "Hello";
  4. Exercise 21-4: The section “Lookahead” earlier in this chapter mentioned a password-validation regular expression. Write a program to test this regular expression. Ask the user to enter a password and validate it. Once you’ve verified that the regular expression works, add one more validation rule to it: a password must also contain at least two digits.