The regular expression systems incorporated into most high-level programming languages are capable of more than matching or accepting input. In conjunction with the language's controls, REs can take a more active role in extracting data embedded in the program's input. The following examples demonstrate data extraction with an smatch object (an instance of match_results tailored for string data) and numbered groups. They also show how programs can use the regex_replace function to convert data in various common formats to a single "standard" format.
C++ regular expressions group sub-expressions with parentheses, so that operators apply to the group as a whole. The RE system also assigns each group a number based on the position of its opening parenthesis in the expression. Match objects, instances of match_results and its specializations such as smatch, give access to the matching character sequences by group number.
string entry;
getline(in, entry);
smatch m;
if (! regex_match(entry, m, regex( "([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})" ) ))
//if (! regex_match(entry, m, regex( R"(([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\d*\.\d{2}))" ) ))	// raw string
    continue;
string type = m[1];
string date = m[2];
string to = m[3];
double amount = stod(m[4]);
The regex_match function tests whether the entire input string matches the RE. Rather than discarding the matched characters, the function saves them in the match object, m. The overloaded indexing operator, [], enables the program to access the characters by their group number, making data extraction straightforward. The challenge is understanding how regex_match (and similar functions) number the matching groups.

(a)
"([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})"

(b)
0: ([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})
1: ([Dd]eposit|[1-9][0-9]*)
2: ([^:]+)
3: ([^:]*)
4: (\\d*\\.\\d{2})

(c)
for (int i = 0; i < m.size(); i++)
    cout << i << ": " << m[i] << endl;

(d)
0: deposit:July 7:-:300.00
1: deposit
2: July 7
3: -
4: 300.00

(e)
0: 416:July 8:Gas Company:15.85
1: 416
2: July 8
3: Gas Company
4: 15.85
The sequence (...)(...)(...) produces groups 1, 2, and 3. The sequence also implicitly produces a group 0: the text matched by the entire RE.
U.S. Phone Number Patterns | Matching Regular Expression |
---|---|
(123) 456-7890, 123-456-7890, 123 456 7890, 123.456.7890, 1234567890 | "(?:\\()?((\\d){3})(?:\\))?[ -\\.]?((\\d){3})[ -\\.]?((\\d){4})" or, as a raw string, R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))" |
Programs must escape any operator character with the backslash, \, removing its operator status, to match it explicitly in the input or target text. The backslash also escapes special characters in C++ string literals, so programs must escape the backslash itself, leading to double escape sequences: \\. Alternatively, programs can use a raw string literal when specifying an RE. Adding "?:" after the opening parenthesis, (?:r), creates a non-capturing group that matches the sub-expression r without saving it: it consumes r without assigning it a group number. The following table explores the phone number matching RE from left to right, describing the sub-expressions and their effect on group numbers.
Group | Matching Expression | Explanation |
---|---|---|
None | (?:\\()? | Matches zero or one occurrence of an opening parenthesis, \\(, without assigning it a group number. |
1 and 2 | ((\\d){3}) | Matches exactly three occurrences of a digit character. Scanning left to right, the matching function encounters the outer parentheses first, assigning them group number 1. This group encompasses all three digits. It assigns the second, inner parentheses number 2, which, when displayed, is the last digit of the three. |
None | (?:\\))? | Matches zero or one occurrence of the closing parenthesis, \\), without assigning it a group number. |
None | [ -\\.]? | The sub-expression doesn't have any parentheses, so the matching function doesn't assign it a group number. The expression is intended to match zero or one space, dash, or period. Because the dash sits between the space and the escaped period, the class is read as the range from space through period, so it also accepts characters such as + and the comma. Listing the dash last, [ .-], limits the class to the three intended characters. |
3 and 4 | ((\\d){3}) | Matches three digit characters. The function resumes group numbering from the last numbered group, numbering the outer parentheses 3 (all three digits) and the inner pair 4 (the last of the three digits). |
None | [ -\\.]? | The expression has no parentheses, so the matching function doesn't assign it a group number. It matches zero or one separator character, as described for the same sub-expression above. |
5 and 6 | ((\\d){4}) | Matches four digits. Resuming the group numbering, the function assigns the outer parentheses number 5 (all four digits) and the inner pair number 6 (the last digit of four). |
The sub-expressions extracting the groups of digits follow the same pattern, ((\\d){3}), with the last group only differing in the quantifier value. The outer parentheses collect the digits and determine the group number used to access them. The inner parentheses delimit the element on which the quantifier operates, but they also establish a group number. The inner group only captures the last digit in the sequence, which is typically irrelevant. Programs can eliminate the extraneous group numbers by making the inner groups non-capturing: ((?:\\d){3}). Because a quantifier already applies to a single preceding element such as \\d, dropping the inner parentheses altogether, (\\d{3}), has the same effect.
The previous examples demonstrate how programs can use regular expressions to validate and extract data from input. However, the RE systems in many programming languages can also modify or reformat extracted data. To demonstrate this feature, imagine that, for consistency, a programmer wants to convert all entered phone numbers to a "standard" format. The regex_replace function accomplishes this task.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
    ifstream in("phone.txt");

    // Original expressions with extraneous numbered groups
    string re = "(?:\\()?((\\d){3})(?:\\))?[ -\\.]?((\\d){3})[ -\\.]?((\\d){4})";	// (a)
    //string re = R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))";	// raw string version
    string format = "($1) $3-$5";							// (b)

    // Modified expressions eliminating unnecessary numbered groups
    /*string re = "(?:\\()?((?:\\d){3})(?:\\))?[ -\\.]?((?:\\d){3})[ -\\.]?((?:\\d){4})";	// (c)
    //string re = R"((?:\()?((?:\d){3})(?:\))?[ -\.]?((?:\d){3})[ -\.]?((?:\d){4}))";	// raw string version
    string format = "($1) $2-$3";*/							// (d)

    if (!in.good())
    {
        cerr << "Unable to open \"phone.txt\"\n";
        exit(1);
    }

    string phone;
    while (getline(in, phone))				// stops when no further line can be read
    {
        if (regex_match(phone, regex("^$|^#.*")))	// (e) skip blank and comment lines
            continue;

        if (regex_match(phone, regex(re)))		// (f) validate the phone number
            cout << regex_replace(phone, regex(re), format) << endl;	// (g) reformat it
        else
            cerr << "Unsupported format: " << phone << endl;
    }

    return 0;
}
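In the line-skipping expression, the anchors ^ and $, without any characters between them, match an empty line. The expression ^#.* matches any line that begins with the '#' character. The OR operator, |, forms the final RE, matching either sub-expression, so the program skips blank lines and comment lines. The regex_replace function returns the reformatted text as a new string and leaves its argument unchanged, so a program that needs to keep the result can assign it:
phone = regex_replace(phone, regex(re), format);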
View | Download | Comments |
---|---|---|
checkbook3.cpp | checkbook3.cpp | A version of the checkbook program accessing data with number groups. |
checkbook2.txt | checkbook2.txt | A checkbook register data file repeated from the previous section. |
phone.cpp | phone.cpp | A program validating and reformatting U.S. phone numbers, demonstrating the regex_replace function. |
phone.txt | phone.txt | An input file with various formatted U.S. phone numbers. |