A regular expression (RE, RegEx, or regex) is a sequence of characters forming a metalanguage describing a pattern. Programs attempt to match strings of characters, often program input, to the RE-specified pattern, accepting matching and rejecting non-matching strings. Some characters in the RE, called atoms, must exactly match a corresponding character in the input. For example, for the input to match and be accepted, a d at a given location in the input must correspond to a d in the RE. REs also specify many meta-characters: characters or character sequences representing classes or ranges of characters. For example, \d represents any digit character, so any character in the range 0 through 9 matches it. Some classes have multiple representations because REs were developed over time in different contexts, so [:digit:] and \d are equivalent specifications matching a digit.
REs also use numerous operators, which, like the arithmetic expressions used throughout the text, operate recursively. So, if r1 and r2 are two arbitrary REs, then the expressions formed by applying an operator to them, such as r1r2, r1|r2, or r1*, are also regular expressions. The C++ regex class translates a regular expression into a software machine that compares input strings to the pattern, providing the "more robust and general solution" necessary to create robust, bulletproof data entry code. The following figures present frequently used regular expression components, which the text demonstrates in this and subsequent sections with a sequence of increasingly complex examples.
Mathematics and various programming languages evolved similar but not identical RE notations. Currently, programming libraries support many of those notations, resulting in multiple ways of expressing the same feature and creating a few conflicts. The following figures excerpt some basic programming meta-characters and operators. Please see ECMAScript syntax for a complete list and more details.
Character | Matches | Comments |
---|---|---|
x | The single character x verbatim | A character or atom that must appear explicitly in the input |
. | Any one character except a newline | The period or dot character |
\n | The ASCII line feed or newline character | Escape sequence for character code 10 (LF) |
\r | The ASCII carriage return character | Escape sequence for character code 13 (CR) |
\d | Digit character | 0 1 2 3 4 5 6 7 8 9 |
\D | Not a digit character | Any character that is not a digit: 0 1 2 3 4 5 6 7 8 9 |
\s | A whitespace character | A space, tab, newline, or other blank character |
\S | Not a whitespace character | Any non-whitespace character |
\w | A word character | A letter, a digit, or an underscore |
\W | Not a word character | Any character that \w does not match |
\c | An escape sequence treating c as a "regular" character | Programs must escape the RE operators ^ $ \ . * + ? ( ) [ ] { } | to match them as characters |
[set] | One character in the set | [xyz] matches an x, a y, or a z |
[range] | One character in the range | [a-z] matches one character in the range 'a' to 'z' |
[^set] | Any character not in the set | [^xyz] matches any character except an x, a y, or a z |
[^range] | Any character not in the range | [^a-z] matches any character except those in the range 'a' to 'z' |
Class | Matches |
---|---|
[:alnum:] | Alpha-numerical character: a-z, A-Z, and 0-9 |
[:alpha:] | Alphabetic character: a-z and A-Z |
[:blank:] | Blank character |
[:cntrl:] | Control character |
[:digit:] | Decimal digit character: 0-9. The same as \d |
[:graph:] | Character with graphical representation |
[:lower:] | Lowercase letter |
[:print:] | Printable character |
[:punct:] | Punctuation mark character |
[:space:] | Whitespace character |
[:upper:] | Uppercase letter |
[:xdigit:] | Hexadecimal digit character: a-f, A-F, and 0-9 |
[:d:] | Decimal digit character - the same as \d |
[:w:] | Word character - the same as \w |
[:s:] | Whitespace character - the same as \s |
Operator | Meaning |
---|---|
xyz | A sequence meaning x followed by y followed by z |
r1|r2 | r1 or r2 |
r* | Zero or more occurrences of r |
r+ | One or more occurrences of r |
r? | Zero or one occurrence of r |
r{n} | Exactly n occurrences of r |
r{n,} | n or more occurrences of r |
r{m,n} | At least m occurrences of r but not more than n |
(r) | Groups the sub-expressions in r, allowing other operators to act on the group. Also allows programs to access matching groups by number (demonstrated in a subsequent example) |
(?:r) | Similar to the above, it groups the sub-expressions in r. However, it does not form a numbered group: it matches input characters but does not save them for later access |
^r | Anchors r to the beginning of the line |
r$ | Anchors r to the end of the line |
The C++ regular expression library, specified in the <regex> header file, is complex and extensive. The following abridged descriptions focus on the basic and most frequently used features. Nevertheless, the covered features are sufficient for validating and formatting a wide variety of program inputs.
Class | Description |
---|---|
basic_regex | A template class specifying RE for various data types |
regex | A convenience alias for the basic_regex class tailored for characters: typedef basic_regex<char> regex; |
match_results | A template class that stores the parts of the input (the sub-matches) matching an RE and its numbered groups |
smatch | A convenience alias for the match_results class tailored for string input |
Function | Description |
---|---|
regex(const char* re) | A regex constructor that builds an RE machine that compares input strings to the pattern specified by re. |
bool regex_match(const string& t, const regex& re); | Compares the entire target string, t, to the regular expression re. The function returns true if the whole target matches the RE; returns false otherwise. |
bool regex_match(const string& t, smatch& m, const regex& re); | Similar to the previous version but stores matching elements in argument m. |
string regex_replace(const string& t, const regex& re, const string& format); | Copies the target sequence, t, replacing RE matches with characters from format. format consists of characters copied verbatim and replacement operators: $n, where n is a match number, 0 < n < 100. |
bool regex_search(const string& t, const regex& re); | Searches the target string, t, for sub-strings matching the regular expression re. The function returns true if the target contains a match; returns false otherwise. |
bool regex_search(const string& t, smatch& m, const regex& re); | Similar to the previous version but stores matching elements in argument m. |
    . . .
    string line;
    getline(in, line);
    //if (! regex_match(line, regex(".+:.+:.+")))             // version 1 (incomplete)
    //if (! regex_match(line, regex("[^:]+:[^:]+:[^:]+")))    // version 2
    if (! regex_match(line, regex("[^:]+(:[^:]+){2}")))       // version 3
        continue;
    istringstream input(line);
    . . .
Version 1: The dot, ., matches any non-line-terminator character. The + operator requires the previous RE, the dot, to appear at least once, but allows more repetitions. The colon, :, is an atom or character that must appear in the input or target string. This RE matches three colon-separated fields while rejecting records with fewer fields. However, it incorrectly matches input with four or more fields. The + operator is "greedy," consuming as much input as possible while still matching the RE, and the dot matches a colon as readily as any other character. The first two .+ sub-expressions must stop to match the explicit colons in the RE, but the third .+ expression matches all remaining input, including any colon and following characters.

Version 2: The negated set, [^:], matches any character or atom that is not a colon. The + operator repeats the pattern one or more times. So, [^:]+ matches a sequence of one or more characters that are not the colon. Unlike version 1, this RE stops accepting or matching input when it encounters the third colon, rejecting input with too many fields.

Version 3: This RE consists of two sub-expressions. The first, [^:]+, is identical to the first field of version 2. The second is more complex but also more flexible. Decomposing the second sub-expression into two parts makes it easier to understand. The first part, (:[^:]+), requires a sequence consisting of a single colon followed by one or more characters that are not colons. The parentheses group the sequence for the second part, {2}, which requires exactly two occurrences of the sequence. Programmers can change the number of fields in each record just by changing the repetition count.
    . . .
    string entry;
    getline(in, entry);
    if (! regex_match(entry, regex("([Dd]eposit|[1-9][0-9]*):[^:]+:[^:]+:\\d*\\.\\d{2}")))
        continue;
    istringstream input(entry);
    . . .
View | Download | Comments |
---|---|---|
s-rolodex3.cpp | s-rolodex3.cpp | A more robust version of the Rolodex program using a regular expression to validate each line of input data. |
rolodex3.txt | rolodex3.txt | An input file with blank lines, comments, and correct and malformed data. |
checkbook2.cpp | checkbook2.cpp | A version of the checkbook program that validates input with a regular expression. |
checkbook2.txt | checkbook2.txt | A checkbook register data file containing blank lines, comments, and valid and malformed data. |