14.11. Bulletproof Code (3): Intro To Regular Expressions


Review

A regular expression (RE, RegEx, or regex) is a sequence of characters forming a metalanguage describing a pattern. Programs attempt to match strings of characters, often program input, to the RE-specified pattern, accepting matching and rejecting non-matching strings. Some characters in the RE, called atoms, must exactly match a corresponding character in the input. For example, for the input to match and be accepted, a d at a given location in the input must correspond to a d in the RE. REs also specify many meta-characters - characters or character-sequences representing classes or ranges of characters. For example, \d represents any digit character - any character in the range 0 through 9 matches it. Some classes have multiple representations because REs were developed over time in different contexts, so [:digit:] and \d are equivalent specifications matching a digit.

REs also utilize numerous operators, which, like the arithmetic expressions used throughout the text, operate recursively. So, if r₁ and r₂ are two arbitrary REs, and 🙂 is an operator, then:

r₁🙂
r₁🙂r₂

are also regular expressions. The C++ regex class translates a regular expression into a software machine that compares input strings to the pattern, providing the "more robust and general solution" necessary to create robust, bulletproof data entry code. The following figures present frequently used regular expression components, which the text demonstrates in this and subsequent sections with a sequence of increasingly complex examples.

Operators And Meta-characters

Mathematics and various programming languages evolved similar but not identical RE notations. Currently, programming libraries support many of those notations, resulting in multiple ways of expressing the same feature and creating a few conflicts. The following figures excerpt some basic programming meta-characters and operators. Please see ECMAScript syntax for a complete list and more details.

Character	Matches	Comments
x	The single character x verbatim	A character or atom that must appear explicitly in the input
`.`	Any one character except a newline	The period or dot character
`\n`	The ASCII line feed or newline character	A line terminator; the dot does not match it
`\r`	The ASCII carriage return character	Precedes the line feed in Windows-style line endings
`\d`	Digit character	`0 1 2 3 4 5 6 7 8 9`
`\D`	Not a digit character	Any character that is not a digit: `0 1 2 3 4 5 6 7 8 9`
`\s`	A whitespace character	A space, tab, or line-terminator character
`\S`	Not a whitespace character	Any non-whitespace character
`\w`	A word character	A letter, a digit, or an underscore
`\W`	Not a word character	Any character other than a letter, a digit, or an underscore
`\c`	An escape sequence treating c as a "regular" character	Programs must escape the RE operators `^ $ \ . * + ? ( ) [ ] { } \|` to match them as characters
`[set]`	One character in the set	`[xyz]` matches an x, a y, or a z
`[range]`	One character in the range	`[a-z]` matches one character in the range 'a' to 'z'
`[^set]`	Any character not in the set	`[^xyz]` matches any character except an x, a y, or a z
`[^range]`	Any character not in the range	`[^a-z]` matches any character except those in the range 'a' to 'z'

Regular expression meta-characters. Much of REs' expressive power derives from meta-characters matching specific kinds of characters, while rejecting characters of different types. Programmers specify an RE as a string of characters enclosed by double quotation marks, which makes awareness of two concepts crucial:

REs treat the space or blank character like any other. Therefore, programmers can't add space around any RE component to improve readability.
REs use many of the same escape sequences C++ uses to embed special characters in strings. Therefore, to form an RE using the backslash character, programmers must escape it. For example, to put the digit character, \d, in a string, a program must escape the backslash: "\\d". Similarly, to match a question mark, programs must escape it to suppress its operator behavior: \?, which becomes "\\?" in a string. Matching a backslash character is especially troublesome: it has a special meaning in REs, so they must escape it as \\, and it has a special meaning in strings, so they must escape it, making the overall sequence "\\\\".

Class	Matches
`[:alnum:]`	Alpha-numerical character: `a-z`, `A-Z`, and `0-9`
`[:alpha:]`	Alphabetic character: `a-z` and `A-Z`
`[:blank:]`	Blank character
`[:cntrl:]`	Control character
`[:digit:]`	Decimal digit character: `0-9`. The same as `\d`
`[:graph:]`	Character with graphical representation
`[:lower:]`	Lowercase letter
`[:print:]`	Printable character
`[:punct:]`	Punctuation mark character
`[:space:]`	Whitespace character
`[:upper:]`	Uppercase letter
`[:xdigit:]`	Hexadecimal digit character: `a-f`, `A-F`, and `0-9`
`[:d:]`	Decimal digit character - the same as `\d`
`[:w:]`	Word character - the same as `\w`
`[:s:]`	Whitespace character - the same as `\s`

RE character classes. Character classes name character categories, allowing programmers to specify matching patterns with the class name. Programs write a class name inside a bracket expression, so `[[:digit:]]` matches one digit. Compare the character classes to the CCType Library Functions. The table is adapted from cplusplus.com Character classes.

Operator	Meaning
xyz	A sequence meaning x followed by y followed by z; x, y, and z may be atoms or REs
`r₁\|r₂`	`r₁` or `r₂`
`r*`	Zero or more occurrences of `r`
`r+`	One or more occurrences of `r`
`r?`	Zero or one occurrence of `r`
`r{n}`	Exactly n occurrences of `r`
`r{n,}`	n or more occurrences of `r`
`r{m,n}`	At least m occurrences of `r` but not more than n
`(r)`	Groups the sub-expressions in `r`, allowing other operators to act on the group. Also allows programs to access matching groups by number (demonstrated in a subsequent example)
`(?:r)`	Similar to the above, it groups the sub-expressions in `r`. However, it does not form a numbered group - it matches input characters without saving them for later access
`^r`	Anchors `r` to the beginning of the line
`r$`	Anchors `r` to the end of the line

RE operators. The operators act on one or two REs, r, r₁, and r₂. Programmers refer to the repetition operators `*`, `+`, `?`, and the three `{ }` forms collectively as quantifiers.

The `<regex>` Library Classes And Functions

The C++ regular expression library, specified in the <regex> header file, is complex and extensive. The following abridged descriptions focus on the basic and most frequently used features. Nevertheless, the covered features are sufficient for validating and formatting a wide variety of program inputs.

The RE classes and functions typically utilize templates and inheritance to achieve a high degree of generality, but the resulting specifications can be obscure and confusing. The text presents string-based REs, reducing their generality but also simplifying the presentation to a level more appropriate for an introduction. Please see <regex> for more details.

Class	Description
`basic_regex`	A template class specifying RE for various data types
`regex`	A convenience alias for the `basic_regex` class tailored for characters: `typedef basic_regex<char> regex;`
`match_results`	A template class that stores the parts of the input matching an RE and its numbered groups
`smatch`	A convenience alias for the `match_results` class tailored for `string` input

RE classes. Instances of these classes are essential arguments to the RE functions described in the following figures. They build the software machines that process program input, identifying and allowing programs to access the matches.

Function	Description
regex(const char* re)	A `regex` initialization constructor that builds an RE machine that compares input strings to a pattern specified by `re`.
bool regex_match(string& t, regex& re);	Compares the target string, `t`, to the regular expression `re`. The function returns true if the target matches the RE; returns false otherwise.
bool regex_match(string& t, smatch& m, regex& re);	Similar to the previous version but stores matching elements in argument `m`.
string regex_replace(string& t, regex& re, string& format);	Copies the target sequence, `t`, replacing RE matches with characters from `format`. `format` consists of characters copied verbatim and replacement operators: `$n`, where n is a match number, `0 < n < 100`.
bool regex_search(string& t, regex& re);	Searches the target string, `t`, for sub-strings matching the regular expression `re`. The function returns true if the target contains a match; returns false otherwise.
bool regex_search(string& t, smatch& m, regex& re);	Similar to the previous version but stores matching elements in argument `m`.

The fundamental RE functions. These few RE functions perform the bulk of the regular expression tasks in C++ programs.

Regular Expression Examples


	. . .
string	line;
getline(in, line);

//if (! regex_match(line, regex( ".+:.+:.+" ) ))		// version 1 (incomplete)
//if (! regex_match(line, regex( "[^:]+:[^:]+:[^:]+" ) ))	// version 2
if (! regex_match(line, regex( "[^:]+(:[^:]+){2}" ) ))		// version 3
    continue;

istringstream input(line);
	. . .

s-rolodex3.cpp: Validating program input with an RE. The program reads the data file one line at a time and validates each line with a regular expression. The three if-statements replace the validation test (b) in the previous version. Each RE subsumes the test for empty lines and comments. Once the program validates the input pattern, it constructs an istringstream object with the full validated line of data and rereads each field from it with the three-argument getline function. The colons in each RE correspond to the required colons in the input data.

Version 1: The dot, ., matches any non-line-terminator character. The + operator requires the previous RE, the dot, to appear at least once, but allows more repetitions. The colon, :, is an atom or character that must appear in the input or target string. This RE matches three colon-separated fields while rejecting records with fewer fields. However, it incorrectly matches input with four or more fields. The + quantifier is "greedy," consuming as much input as possible while still allowing the overall RE to match, and the dot matches a colon as readily as any other character. So, given a line with extra fields, one of the .+ sub-expressions absorbs the surplus colons and the characters around them, and the match succeeds.
Version 2: The sub-expression, [^:], matches any character or atom that is not a colon. The + operator repeats the pattern one or more times. So, [^:]+ matches a sequence of one or more characters, stopping when it reaches a colon. Unlike version 1, this RE stops accepting or matching input when it encounters the third colon, rejecting input with too many fields.
Version 3: Recognizing that each line of the Rolodex input represents an instance of the fence post problem, with the fields corresponding to the posts and the colons to the spans, helps explain the third RE. The first sub-expression, [^:]+, is identical to version 2 and corresponds to the first fence post. The second is more complex but also more flexible. Decomposing it makes it easier to understand. The parentheses bound the first part, (:[^:]+), forming a group consisting of a sequence: a single colon followed by one or more non-colon characters. The sequence corresponds to one fence span and post. The second part, {2}, is a quantifier operating on the group, requiring exactly two occurrences of the sequence in the input. Programmers can change the number of fields in each record just by changing the quantifier.

	. . .
string	entry;
getline(in, entry);

if (! regex_match(entry, regex("([Dd]eposit|[1-9][0-9]*):[^:]+:[^:]+:\\d*\\.\\d{2}")))
    continue;

istringstream input(entry);
	. . .

checkbook2.cpp: Implementing alternate sub-expressions. The RE-based test illustrated here replaces the if-statement in the previous example. As before, each row or record in the checkbook register file consists of four fields. Deposits consist of the word "Deposit," a date, a dash or hyphen, and an amount. Check records consist of a check number, a date, a recipient, and an amount. The previous version of the program only detected empty lines or comments, but failed to detect malformed fields or an incorrect number of fields. The illustrated RE corrects those deficiencies.

([Dd]eposit|[1-9][0-9]*)

The | operator, read as "or," selects one of two adjacent RE sub-expressions:

The previous checkbook program allowed the word "deposit" to begin with a lower or upper-case letter. The RE continues the practice with the sub-expression [Dd]eposit. Adjacent characters or atoms inside square brackets form an implied "or" operation; more complex REs require the explicit | operator. A sequence of characters or atoms outside the brackets matches the individual characters in the specified order.
The sub-expression [1-9][0-9]* requires one character in the range '1' to '9' followed by zero or more characters in the range '0' to '9' - a number with one or more digits that doesn't begin with zero.

The parentheses group the two sub-expressions, so that one or the other must appear in the input before the field-separating colon.

[^:]+

Matches one or more characters except colons.

\\d*\\.\\d{2}

\d* matches zero or more digits. The dot operator matches any non-line-termination character, but the RE escapes it, so \. matches a single period. Finally, \d{2} matches two digits exactly. Embedding the sub-expression in a string requires the program to escape each \ character, resulting in the illustrated \\ sequences.

The colons and decimal point in the RE must match corresponding characters in the input data.

Regular Expression Example Downloadable Files

View	Download	Comments
s-rolodex3.cpp	s-rolodex3.cpp	A more robust version of the Rolodex program using a regular expression to validate each line of input data.
rolodex3.txt	rolodex3.txt	An input file with blank lines, comments, and correct and malformed data.
checkbook2.cpp	checkbook2.cpp	A version of the checkbook program that validates input with a regular expression.
checkbook2.txt	checkbook2.txt	A checkbook register data file containing blank lines, comments, and valid and malformed data.
