The regular expression systems incorporated into most high-level programming languages are capable of more than matching or accepting input. In conjunction with the language's control statements, REs can take a more active role in extracting data embedded in the program's input. The following examples demonstrate two techniques: extracting data with numbered groups and an smatch object (an instance of match_results tailored for string data), and converting data in various common formats to a single "standard" format with the regex_replace function.
C++ regular expressions group sub-expressions with parentheses, allowing operators to act on the group as a whole. The RE system also assigns each group a number based on the position of its opening parenthesis in the expression. Matching objects, instances of match_results specializations such as smatch, access the matching character sequences by group number.
string entry;
getline(in, entry);
smatch m;
if (!regex_match(entry, m, regex("([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})")))
//if (!regex_match(entry, m, regex(R"(([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\d*\.\d{2}))")))    // raw string version
    continue;
string type = m[1];
string date = m[2];
string to = m[3];
double amount = stod(m[4]);
The regex_match function tests whether the entire input string matches the RE. Rather than discarding the matched characters, the function saves them in the matching object, m. The overloaded indexing operator, [], allows the program to access the characters by their group number, making the data extraction straightforward. The challenge is understanding how regex_match (and similar functions) number the matching groups.

(a) The regular expression:
"([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})"

(b) The sub-expression belonging to each group number:
0: ([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\\d*\\.\\d{2})
1: ([Dd]eposit|[1-9][0-9]*)
2: ([^:]+)
3: ([^:]*)
4: (\\d*\\.\\d{2})

(c) A loop printing every group saved in m:
for (int i = 0; i < m.size(); i++)
    cout << i << ": " << m[i] << endl;

(d) Output for a deposit entry:
0: deposit:July 7:-:300.00
1: deposit
2: July 7
3: -
4: 300.00

(e) Output for a check entry:
0: 416:July 8:Gas Company:15.85
1: 416
2: July 8
3: Gas Company
4: 15.85

Scanning left to right, the sequence (...)(...)(...) produces groups 1, 2, and 3. The sequence also implicitly produces a group 0: the text matched by the entire RE.
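The extraction fragment above can be wrapped in a self-contained function. The following is a minimal sketch, assuming C++17 for std::optional; the Entry struct and the parse_entry name are hypothetical and are not taken from checkbook3.cpp.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical record type for this sketch; the field names follow the
// variables used in the fragment above.
struct Entry {
    std::string type;    // "deposit"/"Deposit" or a check number
    std::string date;
    std::string to;
    double amount;
};

// Returns the parsed entry, or std::nullopt when the line does not match.
std::optional<Entry> parse_entry(const std::string& line)
{
    // Compile the RE once and reuse it on every call.
    static const std::regex re(
        R"(([Dd]eposit|[1-9][0-9]*):([^:]+):([^:]*):(\d*\.\d{2}))");

    std::smatch m;
    if (!std::regex_match(line, m, re))
        return std::nullopt;

    // m[0] is the whole match; m[1] through m[4] are the numbered groups.
    return Entry{ m[1].str(), m[2].str(), m[3].str(), std::stod(m[4].str()) };
}
```

Calling parse_entry("416:July 8:Gas Company:15.85") yields an entry whose type is "416" and whose to field is "Gas Company"; a line that does not fit the RE yields an empty optional, which plays the role of the continue statement in the original loop.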
U.S. Phone Number Patterns | Matching Regular Expression |
---|---|
(123) 456-7890, 123-456-7890, 123 456 7890, 123.456.7890, 1234567890 | "(?:\\()?((\\d){3})(?:\\))?[ -\\.]?((\\d){3})[ -\\.]?((\\d){4})" (raw string version: R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))") |
To match an operator character explicitly in the input or target text, programs must escape it with a backslash, \, which removes its operator status. The backslash also introduces escape sequences in C++ string literals, so programs must escape the backslash itself, leading to double escape sequences: \\. Alternatively, programs can use a raw string literal when specifying an RE. Adding "?:" to the parentheses, (?:r), creates a non-capturing group that matches but ignores the sub-expression r: it consumes or skips r without assigning it a group number. The following table explores the phone number matching RE from left to right, describing the sub-expressions and their effect on group numbers.
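The two ideas in the paragraph above, double escapes versus raw strings and capturing versus non-capturing groups, can be checked with a short sketch. The names escaped, raw, and capture_count are hypothetical helpers introduced here for illustration; mark_count is the standard basic_regex member that reports the number of capturing groups.

```cpp
#include <regex>
#include <string>

// The same RE written twice: with doubled backslashes and as a raw string.
// Both spell the pattern \((\d{3})\), which matches a parenthesized area code.
const std::string escaped = "\\((\\d{3})\\)";
const std::string raw     = R"(\((\d{3})\))";

// Number of capturing groups in a pattern; group 0 is not counted.
unsigned capture_count(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}
```

The two strings compare equal, so the choice between them is a matter of readability. The pattern "(a)(b)" has two capturing groups, while "(?:a)(b)" has one, because the non-capturing group is skipped when groups are numbered.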
Group | Matching Expression | Explanation |
---|---|---|
None | (?:\\()? | Matches zero or one occurrence of the opening parenthesis, \\(, without assigning it a group number. |
1 and 2 | ((\\d){3}) | Matches exactly three digit characters. Scanning left to right, the matching function encounters the outer parentheses first and assigns them group number 1; this group encompasses all three digits. It assigns the inner parentheses number 2, which holds only the last of the three digits because each repetition replaces the previous one. |
None | (?:\\))? | Matches zero or one occurrence of the closing parenthesis, \\), without assigning it a group number. |
None | [ -\\.]? | The sub-expression has no parentheses, so the matching function doesn't assign it a group number. It matches zero or one space, dash, or period. As written, the dash between the space and the escaped period forms a character range, so the class also accepts the other punctuation characters that fall between them in ASCII; writing the class as [ .-] restricts it to the three intended separators. |
3 and 4 | ((\\d){3}) | Matches three digit characters. The function resumes group numbering from the last numbered group, numbering the outer parentheses 3 (all three digits) and the inner pair 4 (the last of the three digits). |
None | [ -\\.]? | The expression has no parentheses, so the matching function doesn't assign it a group number. It matches the same optional separator as the earlier occurrence. |
5 and 6 | ((\\d){4}) | Matches four digits. Resuming the group numbering, the function assigns the outer parentheses number 5 (all four digits) and the inner pair number 6 (the last of the four digits). |
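The group numbering described in the table can be observed directly by collecting every group from a match. This is a minimal sketch rather than part of phone.cpp; the function name phone_groups is hypothetical, and the RE is copied unchanged from the table.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns groups 0 through 6 for a phone number that matches the RE,
// or an empty vector when the text is not in a supported format.
std::vector<std::string> phone_groups(const std::string& phone)
{
    static const std::regex re(
        R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))");

    std::vector<std::string> groups;
    std::smatch m;
    if (std::regex_match(phone, m, re))
        for (std::size_t i = 0; i < m.size(); i++)
            groups.push_back(m[i].str());   // m[0] is the entire match
    return groups;
}
```

For the input "(123) 456-7890", the returned vector holds the whole number in position 0, then "123", "3", "456", "6", "7890", and "0". The odd-numbered groups carry the useful data, which is why the replacement format in the next example refers only to groups 1, 3, and 5.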
The previous examples demonstrate how programs can use regular expressions to validate and extract data from input. However, the RE systems in many programming languages can also modify or reformat extracted data. To demonstrate this feature, imagine that, for consistency, a programmer wants to convert all entered phone numbers to a "standard" format. The regex_replace function accomplishes this task.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
    ifstream in("phone.txt");
    string re = "(?:\\()?((\\d){3})(?:\\))?[ -\\.]?((\\d){3})[ -\\.]?((\\d){4})";	// (a)
    //string re = R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))";	// raw string version
    string format = "($1) $3-$5";						// (b)

    if (!in.good())
    {
        cerr << "Unable to open \"phone.txt\"\n";
        exit(1);
    }

    string phone;
    while (getline(in, phone))		// loop ends when no line remains to read
    {
        if (regex_match(phone, regex("^$|#.*")))				// (c) skip blank lines and comments
            continue;

        if (regex_match(phone, regex(re)))					// (d) validate the format
            cout << regex_replace(phone, regex(re), format) << endl;		// (e) reformat and print
        else
            cerr << "Unsupported format: " << phone << endl;
    }

    return 0;
}
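In statement (c), the sub-expression ^$ matches an empty line, and #.* matches any line that begins with the '#' character; because regex_match must match the entire line, it behaves like ^#.*. The OR operator, |, forms the final RE, matching either sub-expression.
To keep the reformatted number rather than print it, a program can assign the result of statement (e) back to the string:
phone = regex_replace(phone, regex(re), format);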
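The validate-then-reformat steps, (d) and (e), can also be packaged as a function that works on a single string instead of a file. The following is a sketch under stated assumptions: it requires C++17 for std::optional, the name normalize_phone is hypothetical, and the RE and format string are copied from the program above.

```cpp
#include <optional>
#include <regex>
#include <string>

// Returns the phone number rewritten as "(ddd) ddd-dddd", or std::nullopt
// when the input is not in one of the supported formats.
std::optional<std::string> normalize_phone(const std::string& phone)
{
    // Compiled once and shared by the validation and replacement steps.
    static const std::regex re(
        R"((?:\()?((\d){3})(?:\))?[ -\.]?((\d){3})[ -\.]?((\d){4}))");

    if (!std::regex_match(phone, re))       // step (d): validate
        return std::nullopt;

    // Step (e): $1, $3, and $5 refer to groups 1, 3, and 5 of the match.
    return std::regex_replace(phone, re, "($1) $3-$5");
}
```

With this sketch, normalize_phone("123.456.7890") and normalize_phone("1234567890") both produce "(123) 456-7890". Compiling the regex object once, rather than constructing regex(re) in each statement, avoids repeating the compilation on every line of input.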
View | Download | Comments |
---|---|---|
checkbook3.cpp | checkbook3.cpp | A version of the checkbook program accessing data with number groups. |
checkbook2.txt | checkbook2.txt | A checkbook register data file repeated from the previous section. |
phone.cpp | phone.cpp | A program validating and reformatting U.S. phone numbers, demonstrating the regex_replace function. |
phone.txt | phone.txt | An input file with various formatted U.S. phone numbers. |