The comma-separated value (CSV) format is a textual representation of structured data arranged or imagined as a table consisting of rows (records) and columns (fields). Programs such as spreadsheets or databases often use the CSV format to import and export data, making it a useful format for transferring data between two otherwise incompatible systems. The CSV format represents each row as a line in a text file separated into distinct fields by commas. If this represented the full extent of the CSV format, the complexity of parsing the files would be on par with parsing the Rolodex files demonstrated previously.
Fields sometimes contain a comma as part of the data. For example, some systems save a person's name as LastName, FirstName
without meaning to separate the names into different fields. The CSV format solves the problem by surrounding the field with quotation marks: "LastName, FirstName"
. The quotation marks are not part of the data, and the processing program must filter them out. Less common, saved data might contain an embedded quotation mark that is not part of a pair surrounding a field: part1"part2
. The CSV format encodes an embedded quotation mark as two: part1""part2
, which the program must render as a single mark on output. Finally, two adjacent commas represent a valid but empty field. (Less common, and supported by few systems, is including a line-separator in a field.) The textbook uses the problem of parsing a CSV file and extracting the delimited data as the final demonstration of regular expressions.
File Input Code | Program Test Data |
---|---|
int main() { ifstream in("csv.txt"); if (!in.good()) { cerr << "Unable to open \"csv.txt\"\n"; exit(1); } while (!in.eof()) { string line; getline(in, line); parse(line); } return 0; } |
# Correct without quotation marks W12345678,Cranston Snort,cs@mail.weber.edu # Correct with a comma embedded within quotation marks W12345678,"Snort, Cranston",cs@mail.weber.edu W12345678,Cranston Snort,cs@mail.weber.edu,"a,b" # Correct with embedded double quotation marks W12345678,Cranston""Snort,cs@mail.weber.edu # Correct with two adjacent commas W12345678,Cranston Snort,cs@mail.weber.edu,,Room 222 # Incorrect with an unmatched quotation mark W12345678,"Snort, Cranston.,cs@mail.weber.edu "W12345678","Snort, Cranston,cs@mail.weber.edu |
The lines in a general CSV file may have an unknown number of fields, and the number may vary from one line to the next. So, where the previous examples managed to solve a stated problem with a single match or replace function call, the CSV solution takes a different approach: It iteratively searches each line left to right for specific sub-patterns, removing them from the input when found, shortening the input string. The characters matching sub-patters are temporarily stored in an smatch object until the program extracts them, accumulating its output in an ostringstream object. Accumulating its output in a string stream object allows the program to forgo output if it detects an error at any time during the parsing process. The program loops the parsing process until the input string becomes empty.
void parse(string input) { if (regex_match(input, regex( "^$|^#.*$" ))) // (a) return; smatch m; ostringstream sout; while (input.length() > 0) { if (regex_match(input, regex( "[^\"]*\"[^\"]*" ))) // (b) { cerr << "Unbalanced \"" << endl; return; } if (regex_search(input, m, regex( "(^(?:\"([^\"]+)\")|([^,\\n]*))(?:[,\\n]?)" ))) // (c) //if (regex_search(input, m, regex( R"#((^(?:"([^"]+)")|([^,\n]*))(?:[,\n]?))#" ))) // raw string sout << left << setw(20) << regex_replace(string(m[1]), regex( "(\"\")" ), "\"" ); // replaces "" with " input = m.suffix().str(); // (d) } cout << sout.str() << endl; // (e) }
^$
matches empty lines while #.*$
matches comments.[^\"]* \" [^\"]*
means anything except a quotation mark - a quotation mark - anything except a quotation mark.(r1|r2)r3
r1 = ^(?:\"([^\"]+)\") |
Matches text surrounded by quotation marks, including commas, and discards the surrounding marks. |
r2 = ([^,\\n]*) |
Matches fields, treating adjacent commas as an empty field. The search ends with a comma or the end of the input. |
r3 = (?:[,\\n]?) |
Discards the commas separating fields and the newlines at the end of the input. |
Programs reading CSV-encoded files will likely process the fields beyond printing them. Imagine, for example, a user moving data between spreadsheet or database systems by exporting the data from one system to a CSV file. To import the data, the second system must not only extract the fields but must also process them into its own internal format. A straightforward solution saves the fields in a data structure as it extracts them. Since the number of fields varies from file to file and line to line, a general solution requires a structure capable of storing an indeterminate number of fields. Consequently, the following parse function saves the fields in a vector.
typedef vector<string> field_list; field_list parse(string input) { field_list fields; if (regex_match(input, regex( "^$|#.*$" ))) return fields; smatch m; while (input.length() > 0) { if (regex_match(input, regex( "[^\"]*\"[^\"]*" ))) { fields.clear(); return fields; } /*if (regex_match(input, regex( "[^\"]*\"[^\"]*" ))) throw "Unbalanced \"";*/ if (regex_search(input, m, regex( "(^(?:\"([^\"]+)\")|([^,\\n]*))(?:[,\\n]?)" ))) fields.push_back(regex_replace(string(m[1]), regex( "(\"\")" ), "\"" )); input = m.suffix().str(); } return fields; }
View | Download | Comments |
---|---|---|
csv.cpp | csv.cpp | A CSV solution based on regular expressions |
csv_vector.cpp | csv_vector.cpp | A CSV solution saving extracted CSV fields in a vector |
csv.txt | csv.txt | A CSV test data file |