The comma-separated value (CSV) format is a textual representation of structured data arranged or imagined as a table consisting of rows (records) and columns (fields). Programs like spreadsheets or databases often use the CSV format to import and export data, making it a useful format for porting data between two otherwise incompatible systems. The CSV format represents each row as a line in a text file separated into distinct fields by commas. If this represented the full extent of the CSV format, the complexity of parsing the files would be on par with parsing the Rolodex files demonstrated previously.
Fields sometimes contain a comma as part of the data. For example, some systems save a person's name as `LastName, FirstName` without meaning to separate the names into different fields. The CSV format solves the problem by surrounding the field with quotation marks: `"LastName, FirstName"`. The quotation marks are not part of the data, and the processing program must filter them out. Less commonly, saved data contains an embedded quotation mark that is not part of a pair surrounding a field: `part1"part2`. The CSV format encodes an embedded quotation mark as two, `part1""part2`, which the program must render as a single mark on output. Finally, two adjacent commas represent a valid but empty field. (Less common still, and supported by few systems, is a line separator embedded in a field.) The textbook uses the problem of parsing a CSV file and extracting the delimited data as the final demonstration of regular expressions.
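The two quoting rules above can be sketched for a single field. This is a minimal illustration, not the textbook's program: `decodeField` is a hypothetical helper, and it assumes the field has already been separated from its neighbors.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the textbook's program): converts one raw
// CSV field into its data value using the conventions described above.
std::string decodeField(std::string field)
{
    // Surrounding quotation marks delimit the field; they are not data.
    if (field.size() >= 2 && field.front() == '"' && field.back() == '"')
        field = field.substr(1, field.size() - 2);

    // An embedded quotation mark is stored doubled; render it as one.
    return std::regex_replace(field, std::regex("\"\""), "\"");
}
```

Calling `decodeField` on the quoted name field yields the name without its quotation marks, and calling it on `part1""part2` yields the text with a single embedded mark.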
File Input Code

    // Headers for the whole program, including parse()
    #include <cstdlib>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <regex>
    #include <sstream>
    #include <string>
    using namespace std;

    void parse(string input);       // defined below

    int main()
    {
        ifstream in("csv.txt");
        if (!in.good())
        {
            cerr << "Unable to open \"csv.txt\"\n";
            exit(1);
        }

        while (!in.eof())
        {
            string line;
            getline(in, line);      // read one record
            parse(line);            // extract and print its fields
        }

        return 0;
    }

Program Test Data

    # Correct without quotation marks
    W12345678,Cranston Snort,cs@mail.weber.edu
    # Correct with embedded comma within quotation marks
    W12345678,"Snort, Cranston",cs@mail.weber.edu
    # Correct with embedded double quotation marks
    W12345678,Cranston""Snort,cs@mail.weber.edu
    # Correct with two adjacent commas
    W12345678,Cranston Snort,cs@mail.weber.edu,,Room 222
    # Incorrect with unmatched quotation mark
    W12345678,"Snort, Cranston.,cs@mail.weber.edu
    "W12345678","Snort, Cranston,cs@mail.weber.edu
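The input loop tests `eof()` before each read. When the file ends with a newline, that produces one extra `getline` call that fails and hands an empty string to `parse`. A common alternative makes the read itself the loop condition. The sketch below shows the idiom on an in-memory string so it can be tried without a file; `splitLines` is a hypothetical name, and the same loop works on an `ifstream`.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text into lines. The loop condition is the read
// itself, so the body runs only after getline has succeeded.
std::vector<std::string> splitLines(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;

    while (std::getline(in, line))
        lines.push_back(line);

    return lines;
}
```

With this form, a trailing newline does not produce a phantom empty record, while a genuinely empty line in the middle of the input is still delivered.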
The previous examples each solved a stated problem with a single match or replace function call. The CSV solution takes a different approach: it iteratively searches left to right for specific sub-patterns and, when it finds one, removes the match from the input, shortening the input string. The characters matching a sub-pattern are temporarily stored in an smatch object until the program extracts them and appends them to an ostringstream object. Accumulating the output in a string stream allows the program to forgo output if it detects an error at any time during the parsing process. The program repeats the parsing process until the input string becomes empty.

    void parse(string input)
    {
        smatch m;
        ostringstream sout;

        if (regex_match(input, regex("^$|^#.*$")))                          // (a)
            return;

        while (input.length() > 0)
        {
            if (regex_match(input, regex("[^\"]*\"[^\"]*")))                // (b)
            {
                cerr << "Unbalanced \"" << endl;
                return;
            }
            else if (regex_search(input, m, regex("^(?:\"([^\"]+)\",)")))   // (c)
                sout << left << setw(20) << m[1];
            else if (regex_search(input, m, regex("^,")))                   // (d)
                sout << left << setw(20) << "";
            else if (regex_search(input, m, regex("(?:([^,]+),?)")))        // (e)
                sout << left << setw(20)
                     << regex_replace(string(m[1]), regex("(\"\")"), "\""); // (f)

            input = m.suffix().str();                                       // (g)
        }

        cout << sout.str() << endl;                                         // (h)
    }
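(a) `^$` matches empty lines, while `^#.*$` matches comment lines; the function skips both.
(b) `[^\"]* \" [^\"]*` (spaces added for readability) means anything except a quotation mark, a quotation mark, anything except a quotation mark. Because regex_match must match the whole string, the pattern succeeds only when the remaining input contains exactly one quotation mark, which is therefore unbalanced.
(c) `^(?: \"([^\"]+)\", )` creates a non-capturing group consisting of a quotation mark, one or more characters that are not a quotation mark, a closing quotation mark, and a comma. The inner parentheses capture the text between the quotation marks as m[1].
(e) `(?: ([^,]+),? )` creates a non-capturing group consisting of one or more characters excluding commas, followed by zero or one comma. The inner parentheses capture the field, without the comma, as m[1].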
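The sub-patterns can be tried in isolation to see which one claims the front of a given input. The sketch below is only an illustration: `classify` is a hypothetical helper that applies the same escaped patterns in the same order as `parse` and returns the letter of the first branch that matches.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports which branch of parse() would handle the
// start of `input`, using the same patterns and the same test order.
char classify(const std::string& input)
{
    using std::regex;

    if (std::regex_match(input, regex("[^\"]*\"[^\"]*")))       return 'b';
    if (std::regex_search(input, regex("^(?:\"([^\"]+)\",)")))  return 'c';
    if (std::regex_search(input, regex("^,")))                  return 'd';
    if (std::regex_search(input, regex("(?:([^,]+),?)")))       return 'e';

    return '?';     // reached only for an empty string
}
```

One consequence is worth noticing: pattern (c) requires a comma after the closing quotation mark, so a quoted field at the very end of a line, such as `"Room 222"`, is classified as (e) rather than (c).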
The same function written with raw string literals, which remove the need to escape the quotation marks inside the patterns:

    void parse(string input)
    {
        smatch m;
        ostringstream sout;

        if (regex_match(input, regex("^$|^#.*$")))
            return;

        while (input.length() > 0)
        {
            if (regex_match(input, regex(R"([^"]*"[^"]*)")))
            {
                cerr << R"(Unbalanced ")" << endl;
                return;
            }
            else if (regex_search(input, m, regex(R"#(^(?:"([^"]+)",))#")))
                sout << left << setw(20) << m[1];
            else if (regex_search(input, m, regex("^,")))
                sout << left << setw(20) << "";
            else if (regex_search(input, m, regex(R"#((?:([^,]+),?))#")))
                sout << left << setw(20)
                     << regex_replace(string(m[1]), regex(R"#((""))#"), R"#(")#");

            input = m.suffix().str();
        }

        cout << sout.str() << endl;
    }
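A raw string literal and its escaped counterpart produce the same characters, which can be checked directly. The sketch below pairs two of the patterns from `parse`. The function names are hypothetical and exist only to make the comparison testable.

```cpp
#include <string>

// Escaped and raw spellings of the "exactly one quotation mark" pattern.
std::string unbalancedEscaped() { return "[^\"]*\"[^\"]*"; }
std::string unbalancedRaw()     { return R"([^"]*"[^"]*)"; }

// Escaped and raw spellings of the quoted-field pattern. The pattern text
// contains the sequence )" which would end a literal that uses the default
// delimiters, so the raw version adds the custom delimiter #.
std::string quotedEscaped() { return "^(?:\"([^\"]+)\",)"; }
std::string quotedRaw()     { return R"#(^(?:"([^"]+)",))#"; }
```

Comparing `unbalancedEscaped()` with `unbalancedRaw()`, and `quotedEscaped()` with `quotedRaw()`, shows that each pair holds identical text, so both versions of `parse` hand the same patterns to the regex library.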
View | Download | Comments |
---|---|---|
csv.cpp | csv.cpp | A CSV solution based on regular expressions |
csv-raw.cpp | csv-raw.cpp | An RE CSV solution using raw string literals |
csv.txt | csv.txt | A CSV test data file |