14.11. Bulletproof Code (3): Intro To Regular Expressions

Review

A regular expression (RE, RegEx, or regex) is a sequence of characters forming a metalanguage describing a pattern. Programs attempt to match strings of characters, often program input, to the RE-specified pattern, accepting matching and rejecting non-matching strings. Some characters in the RE, called atoms, must exactly match a corresponding character in the input. For example, for the input to match and be accepted, a d at a given location in the input must correspond to a d in the RE. REs also specify many meta-characters - characters or character-sequences representing classes or ranges of characters. For example, \d represents any digit character - any character in the range 0 through 9 matches it. Some classes have multiple representations because REs were developed over time in different contexts, so the class [:digit:], written inside a bracket expression as [[:digit:]], and \d are equivalent specifications matching a digit.

REs also utilize numerous operators, which, like the arithmetic expressions used throughout the text, operate recursively. So, if r1 and r2 are two arbitrary REs, and 🙂 is an operator, then the expressions formed by applying 🙂 to them, such as r1 🙂 r2, are also regular expressions.

The C++ regex class translates a regular expression into a software machine that compares input strings to the pattern, providing the "more robust and general solution" necessary to create robust, bulletproof data entry code. The following figures present frequently used regular expression components, which the text demonstrates in this and subsequent sections with a sequence of increasingly complex examples.

Operators And Meta-characters

Mathematics and various programming languages evolved similar but not identical RE notations. Currently, programming libraries support many of those notations, resulting in multiple ways of expressing the same feature and creating a few conflicts. The following figures excerpt some basic programming meta-characters and operators. Please see ECMAScript syntax for a complete list and more details.

Character   Matches                                                   Comments
x           The single character x verbatim                           A character or atom that must appear explicitly in the input
.           Any one character except a newline                        The period or dot character
\n          The ASCII line feed or newline character                  A line terminator
\r          The ASCII carriage return character                       A line terminator
\d          Digit character                                           0 1 2 3 4 5 6 7 8 9
\D          Not a digit character                                     Any character other than 0 1 2 3 4 5 6 7 8 9
\s          A whitespace character                                    A space, tab, newline, or other whitespace character
\S          Not a whitespace character                                Any non-whitespace character
\w          Word character                                            A letter, digit, or underscore
\W          Not a word character                                      Any character that is not a letter, digit, or underscore
\c          An escape sequence treating c as a "regular" character    Programs must escape the RE operators ^ $ \ . * + ? ( ) [ ] { } | to match them as characters
[set]       One character in the set                                  [xyz] matches an x, a y, or a z
[range]     One character in the range                                [a-z] matches one character in the range 'a' to 'z'
[^set]      Any character not in the set                              [^xyz] matches any character except an x, a y, or a z
[^range]    Any character not in the range                            [^a-z] matches any character except those in the range 'a' to 'z'
Regular expression meta-characters. Much of REs' expressive power derives from meta-characters matching specific kinds of characters, while rejecting characters of different types. Programmers specify an RE as a string of characters enclosed by double quotation marks, suggesting awareness of two concepts is crucial:

 

Class Matches
[:alnum:]    Alpha-numerical character: a-z, A-Z, and 0-9
[:alpha:]    Alphabetic character: a-z and A-Z
[:blank:]    Blank character
[:cntrl:]    Control character
[:digit:]    Decimal digit character: 0-9. The same as \d
[:graph:]    Character with graphical representation
[:lower:]    Lowercase letter
[:print:]    Printable character
[:punct:]    Punctuation mark character
[:space:]    Whitespace character
[:upper:]    Uppercase letter
[:xdigit:]   Hexadecimal digit character: a-f, A-F, and 0-9
[:d:]        Decimal digit character - the same as \d
[:w:]        Word character - the same as \w
[:s:]        Whitespace character - the same as \s
RE character classes. Character classes name character categories, allowing programmers to specify matching patterns with the class name. A class name appears inside a bracket expression, so [[:digit:]] matches one digit. Compare the character classes to the CCType Library Functions. The table is adapted from cplusplus.com Character classes.

 

Operator Meaning
xyz        A sequence meaning x followed by y followed by z
r1|r2      r1 or r2
r*         Zero or more occurrences of r
r+         One or more occurrences of r
r?         Zero or one occurrence of r
r{n}       Exactly n occurrences of r
r{n,}      n or more occurrences of r
r{m,n}     At least m occurrences of r but not more than n
(r)        Groups the sub-expressions in r, allowing other operators to act on the group. Also allows programs to access matching groups by number (demonstrated in a subsequent example)
(?:r)      Similar to the above, it groups the sub-expressions in r. However, it does not form a numbered group - it matches input characters but does not save them for later access
^r         Anchors r to the beginning of the line
r$         Anchors r to the end of the line
RE operators. The operators act on one or two REs, r, r1, and r2.

The <regex> Library Classes And Functions

The C++ regular expression library, specified in the <regex> header file, is complex and extensive. The following abridged descriptions focus on the basic and most frequently used features. Nevertheless, the covered features are sufficient for validating and formatting a wide variety of program inputs.

The RE classes and functions typically utilize templates and inheritance to achieve a high degree of generality, but the resulting specifications are potentially obscure and confusing. The text presents string-based REs, reducing their generality but also simplifying the presentation to a level more appropriate for an introduction. Please see <regex> for more details.
Class           Description
basic_regex     A template class specifying REs for various character types
regex           A convenience alias for basic_regex specialized for characters: typedef basic_regex<char> regex;
match_results   A template class saving the parts of the input that match an RE
smatch          A convenience alias for match_results specialized for string input
RE classes. Instances of these classes are essential arguments to the RE functions described in the following figures. They build the software machines that process program input, identifying and allowing programs to access the matches.

 

Function Description
regex(const char* re)
A regex initialization constructor that builds an RE machine that compares input strings to a pattern specified by re.
bool regex_match(const string& t, const regex& re);
Compares the target string, t, to the regular expression re. The function returns true if the entire target matches the RE; returns false otherwise.
bool regex_match(const string& t, smatch& m, const regex& re);
Similar to the previous version but stores matching elements in argument m.
string regex_replace(const string& t, const regex& re, const string& format);
Copies the target sequence, t, replacing RE matches with characters from format. format consists of characters copied verbatim and replacement operators: $n, where n is a group number, 0 < n < 100.
bool regex_search(const string& t, const regex& re);
Searches the target string, t, for sub-strings matching the regular expression re. The function returns true if the target contains a match; returns false otherwise.
bool regex_search(const string& t, smatch& m, const regex& re);
Similar to the previous version but stores matching elements in argument m.
The fundamental RE functions. These few RE functions perform the bulk of the regular expression tasks in C++ programs.

Regular Expression Examples

	. . .
string	line;
getline(in, line);

//if (! regex_match(line, regex(".+:.+:.+")))		// version 1 (incomplete)
//if (! regex_match(line, regex("[^:]+:[^:]+:[^:]+")))	// version 2
if (! regex_match(line, regex("[^:]+(:[^:]+){2}")))	// version 3
    continue;

istringstream input(line);
	. . . 
s-rolodex3.cpp: Validating program input with an RE. Three versions of an if-statement replacing the validation test (b) in the previous version. Each version's RE subsumes the test for empty lines and comments. This version also uses an istringstream to read the fields from each validated line.
Version 1
The dot, ., matches any non-line-terminator character. The + operator requires the previous RE, the dot, to appear at least once, but allows more repetitions. The colon, :, is an atom or character that must appear in the input or target string. This RE matches three colon-separated fields while rejecting records with fewer fields. However, it incorrectly matches input with four or more fields. The + operator is "greedy," consuming as much input as possible while still allowing the whole RE to match. Because the dot matches a colon as readily as any other character, a .+ sub-expression can absorb an extra colon and the characters around it, so a line such as a:b:c:d still matches.
Version 2
The sub-expression, [^:], matches any character or atom that is not a colon. The + operator repeats the pattern one or more times. So, [^:]+ matches a sequence of one or more characters that are not the colon. Unlike version 1, this RE stops accepting or matching input when it encounters the third colon, rejecting input with too many fields.
Version 3
The first sub-expression, [^:]+, is identical to the first field in version 2. The second is more complex but also more flexible. Decomposing the second sub-expression into two parts makes it easier to understand. The first part, (:[^:]+), requires a sequence consisting of a single colon followed by one or more characters that are not colons. The parentheses group the sequence for the second part, {2}, which requires exactly two occurrences of the sequence. Programmers can change the number of fields in each record just by changing the repetition count.

 

	. . .
string	entry;
getline(in, entry);

if (! regex_match(entry, regex("([Dd]eposit|[1-9][0-9]*):[^:]+:[^:]+:\\d*\\.\\d{2}")))
    continue;

istringstream input(entry);
	. . .
checkbook2.cpp: Implementing alternate sub-expressions. Each row or record in the checkbook register file consists of four fields. Deposit records consist of the word "Deposit," a date, a dash or hyphen, and an amount. Check records consist of a check number, a date, a recipient, and an amount. The previous version of the program only detected empty lines or comments, but failed to detect malformed records or records with an incorrect number of fields. The illustrated RE corrects those deficiencies.
([Dd]eposit|[1-9][0-9]*)
The | operator, read as "or," selects one of two adjacent RE sub-expressions:
  • The previous checkbook program allowed the word "deposit" to begin with a lower or upper-case letter. The RE continues the practice with the sub-expression [Dd]eposit. Adjacent characters or atoms inside square brackets form an implied "or" operation; more complex REs require the explicit | operator. A sequence of characters or atoms outside the brackets matches the individual characters in the specified order.
  • The sub-expression [1-9][0-9]* requires one character in the range '1' to '9' followed by zero or more characters in the range '0' to '9' - a number with one or more digits that doesn't begin with zero.
The parentheses group the two sub-expressions, so that one or the other must appear in the input before the field-separating colon.
[^:]+
Matches one or more characters except colons.
\\d*\\.\\d{2}
\d* matches zero or more digits. The dot operator matches any non-line-termination character, but the RE escapes it, so \. matches a single period. Finally, \d{2} matches two digits exactly. Embedding the sub-expression in a string requires the program to escape each \ character, resulting in the illustrated \\ sequences.
As in the previous example, the colons in the RE must match colons in the input data.

Regular Expression Example Downloadable Files

View             Download         Comments
s-rolodex3.cpp   s-rolodex3.cpp   A more robust version of the Rolodex program using a regular expression to validate each line of input data.
rolodex3.txt     rolodex3.txt     An input file with blank lines, comments, and correct and malformed data.
checkbook2.cpp   checkbook2.cpp   A version of the checkbook program that validates input with a regular expression.
checkbook2.txt   checkbook2.txt   A checkbook register data file containing blank lines, comments, and valid and malformed data.