14.11.1. Extended Regular Expression Examples

Review

The regular expression systems incorporated into most high-level programming languages are capable of more than matching or accepting input. In conjunction with the language's control statements, REs can take a more active role in extracting data embedded in the program's input. The following examples demonstrate two techniques: extracting data with numbered groups and an smatch object (an instance of match_results tailored for string data), and converting data in various common formats to a single "standard" format with the regex_replace function.

Data Access With RE Numbered Groups

C++ regular expressions group sub-expressions with parentheses, allowing operators to act on the group as a whole. The RE system also assigns each group a number based on the location of its opening parenthesis in the expression. Matcher objects, instances of match_results and its sub-types, access the matching character sequences by group number.

string entry;
getline(in, entry);

smatch m;

if (! regex_match(entry, m, regex( "([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})" ) ))
//if (! regex_match(entry, m, regex( R"(([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\d*\.\d{2}))" ) ))	// raw string
    continue;

string	type = m[1];
string	date = m[2];
string	to = m[3];
double	amount = stod(m[4]);

checkbook3.cpp: Data extraction with smatch. The second version of the checkbook program used an RE to validate each record in a checkbook register file. Once it validated the data, it used the same sequence of getline function calls to extract each field as did the original program. The final version uses the RE for validation but adds a matcher object, m, to the regex_match call, allowing it to extract each field with numbered groups. smatch is a match_results sub-type that manages string objects.

The regex_match function identifies sequences of characters in the input data matching the RE. Rather than discarding the matched characters, the function saves them in the matcher object, m. The overloaded indexing operator, [], enables the program to access the characters by their group number, making data extraction straightforward. The challenge is understanding how regex_match (and similar functions) number the matching groups.

(a)
"([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})"

(b)
0: ([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})
1: ([Dd]eposit|[1-9][0-9]*)
2: ([^:]+)
3: ([^:]*)
4: (\\d*\\.\\d{2})

(c)
for (int i = 0; i < m.size(); i++)
    cout << i << ": " << m[i] << endl;

(d)
0: deposit:July 7:-:300.00
1: deposit
2: July 7
3: -
4: 300.00

(e)
0: 416:July 8:Gas Company:15.85
1: 416
2: July 8
3: Gas Company
4: 15.85

Simple group numbers. The matching functions, regex_match, regex_search, and regex_replace, assign group numbers from left to right, one for each pair of parentheses, in the order in which the opening parentheses appear. The numbers count from 1 upwards, so (...)(...)(...) produces groups 1, 2, and 3. The sequence also implicitly produces a group 0: the text matched by the entire RE.

(a) The checkbook RE is simple in the sense that it doesn't have any nested groups.
(b) Group 0 always refers to the entire RE. The functions number the remaining groups from left to right as they are processed.
(c) Programs can use a for-loop to access the groups in order, though they typically begin with 1 rather than 0. Programmers can also insert temporary loops to help develop and debug programs using RE. Notice that the program drives the loop with the size function, which returns the number of groups, including group 0. The smatch object also has a length function, but it returns the number of characters in the entire match, so a loop driven by it runs past the last group and prints numerous empty strings.
(d) The for-loop output corresponding to the first checkbook record.
(e) The for-loop output corresponding to the second checkbook record.

Nested and non-capturing groups make identifying group numbers more challenging.

U.S. Phone Number Patterns

Matching Regular Expression

(123) 456-7890
123-456-7890
123 456 7890
123.456.7890
1234567890

"(?:\\()?((\\d){3})(?:\\))?[ -\\.]?((\\d){3})[ -\\.]?((\\d){4})"

R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))"	// raw string

Understanding group numbering. Phone numbers in the United States consist of ten digits, typically displayed in one of two formats, although other formats are occasionally seen. Imagine that a user enters a U.S. phone number into a program, which validates that it has the correct number of digits. Which format should the program accept? Regular expressions can validate multiple formats. The one solving this problem is more complex than the previous ones, demonstrating more grouping options.

To match an operator character literally in the input or target text, programs must escape it with a \, which removes its operator status. The \ is also the escape character in C++ string literals, so programs must escape it as well, leading to double escape sequences such as \\( and \\d. Alternatively, programs can specify an RE with a raw string literal, which needs only the single \. Adding "?:" to the parentheses, (?:r), creates a non-capturing group: it matches the sub-expression r as usual but neither saves the matched text nor assigns it a group number. The following table explores the phone number matching RE from left to right, describing the sub-expressions and their effect on group numbers.

Group	Matching Expression	Explanation
None	(?:\\()?	Matches zero or one occurrence of the opening parenthesis, `\\(`, without assigning it a group number.
1 and 2	((\\d){3})	Matches exactly three digit characters. Scanning left to right, the matching function encounters the outer parentheses first, assigning them group number 1. This group encompasses all three digits. It assigns the inner parentheses number 2. Because the quantifier repeats the inner group, it retains only the last of the three digits.
None	(?:\\))?	Matches zero or one occurrence of the closing parenthesis, `\\)`, without assigning it a group number.
None	[ -\\.]?	The sub-expression doesn't have any parentheses, so the matching function doesn't assign it a group number. The expression is intended to match zero or one space, dash, or period. As written, however, the dash between the space and `\\.` forms a character range running from the space to the period, so the class also accepts characters such as `+` and `*`. The form `[ .-]?` accepts only the three intended separators.
3 and 4	((\\d){3})	Matches exactly three digit characters. The function resumes group numbering from the last numbered group, numbering the outer parentheses 3 (all three digits) and the inner pair 4 (the last of the three digits).
None	[ -\\.]?	The expression has no parentheses, so the matching function doesn't assign it a group number. It matches the same optional separator as the earlier occurrence.
5 and 6	((\\d){4})	Matches exactly four digit characters. Resuming the group numbering, the function assigns the outer parentheses number 5 (all four digits) and the inner pair number 6 (the last of the four digits).

The sub-expressions extracting the groups of digits follow the same pattern, ((\\d){3}), with the last group only differing in the quantifier value. The outer parentheses collect the digits and determine the group number used to access them. The inner parentheses designate the element on which the quantifier operates, but they also establish a group number. The inner group only captures the last digit in the sequence, which is typically irrelevant. Programs can eliminate the extraneous group numbers by making the inner groups non-capturing:
((?:\\d){3}).
Because a quantifier applies to the single element preceding it, and \\d is a single element, programs can also drop the inner parentheses entirely: (\\d{3}).

RE And Format Conversions

The previous examples demonstrate how programs can use regular expressions to validate and extract data from input. However, the RE systems in many programming languages can also modify or reformat extracted data. To demonstrate this feature, imagine that, for consistency, a programmer wants to convert all entered phone numbers to a "standard" format. The regex_replace function accomplishes this task.

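#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()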
{
    ifstream in("phone.txt");

    // Original expressions with extraneous numbered groups
    string   re = "(?:\\()?((\\d){3})(?:\\))?[ -\\.]?((\\d){3})[ -\\.]?((\\d){4})";		// (a)
    //string   re = R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))";		// raw string version
    string   format = "($1) $3-$5";								// (b)

    // Modified expressions eliminating unnecessary numbered groups
    /*string re = "(?:\\()?((?:\\d){3})(?:\\))?[ -\\.]?((?:\\d){3})[ -\\.]?((?:\\d){4})";	// (c)
    //string   re = R"((?:\()?((?:\d){3})(?:\))?[ -\.]?((?:\d){3})[ -\.]?((?:\d){4}))";		// raw string version
    string   format = "($1) $2-$3";*/								// (d)

    if (!in.good())
    {
        cerr << "Unable to open \"phone.txt\"\n";
        exit(1);
    }

    string    phone;
    while (getline(in, phone))		// loop ends when no line remains, avoiding an extra pass at end of file
    {

        if (regex_match(phone, regex("^$|^#.*")))						// (e)
            continue;

        if (regex_match(phone, regex(re)))							// (f)
            cout << regex_replace(phone, regex(re), format) << endl;				// (g)
        else
            cerr << "Unsupported format: " << phone << endl;
    }

    return 0;
}

phone.cpp: Reformatting data with regex_replace.

(a) A regular expression matching U.S. phone numbers, including the extraneous numbered groups.
(b) A string specifying the output format. The parentheses, space, and dash appear verbatim in the output. $1, $3, and $5 refer to the first, third, and fifth numbered groups as described in the preceding table.
(c) A regular expression matching U.S. phone numbers, but eliminating unnecessary numbered groups.
(d) An output format reflecting the renumbered groups.
(e) An RE filtering out empty and comment lines. The beginning and end-of-line anchors, ^$, without any characters between them, match an empty line. The expression ^#.* matches any line that begins with the '#' character. The OR operator, |, forms the final RE, matching either sub-expression.
(f) The regex_match function returns true if the characters in the target string, phone, match the regular expression in re.
(g) This example is the text's first demonstration of the regex_replace function. It matches the text in the target, phone, with the regular expression in re. Unlike regex_match, it returns a copy of the target in which the matching text is replaced by the formatted values specified by the third argument, format. The target string itself is unchanged.

Although the program only prints the reformatted data, it could maintain it permanently (perhaps in a database) with a simple modification:

phone = regex_replace(phone, regex(re), format);

Regular Expression Example Downloadable Files

View	Download	Comments
checkbook3.cpp	checkbook3.cpp	A version of the checkbook program accessing data with numbered groups.
checkbook2.txt	checkbook2.txt	A checkbook register data file repeated from the previous section.
phone.cpp	phone.cpp	A program validating and reformatting U.S. phone numbers, demonstrating the `regex_replace` function.
phone.txt	phone.txt	An input file with various formatted U.S. phone numbers.
groups.cpp	groups.cpp	Demonstrates group numbers as illustrated in the video.