14.11.2. Comma-Separated Values

Review

The comma-separated value (CSV) format is a textual representation of structured data arranged or imagined as a table consisting of rows (records) and columns (fields). Programs like spreadsheets or databases often use the CSV format to import and export data, making it a useful format for porting data between two otherwise incompatible systems. The CSV format represents each row as a line in a text file separated into distinct fields by commas. If this represented the full extent of the CSV format, the complexity of parsing the files would be on par with parsing the Rolodex files demonstrated previously.

Fields sometimes contain a comma as part of the data. For example, some systems save a person's name as LastName, FirstName without meaning to separate the names into different fields. The CSV format solves the problem by surrounding the field with quotation marks: "LastName, FirstName". The quotation marks are not part of the data, and the processing program must filter them out. Less common, saved data might contain an embedded quotation mark that is not part of a pair surrounding a field: part1"part2. The CSV format encodes an embedded quotation mark as two: part1""part2, which the program must render as a single mark on output. Finally, two adjacent commas represent a valid but empty field. (Less common, and supported by few systems, is including a line-separator in a field.) The textbook uses the problem of parsing a CSV file and extracting the delimited data as the final demonstration of regular expressions.

File Input Code Program Test Data
int main()
{
    ifstream in("csv.txt");

    if (!in.good())
    {
        cerr << "Unable to open \"cvs.txt\"\n";
        exit(1);
    }

    while (!in.eof())
    {
        string line;
        getline(in, line);
        parse(line);
    }

    return 0;
}
# Correct without quotation marks
W12345678,Cranston Snort,cs@mail.weber.edu

# Correct with comma embedded within quotation marks
W12345678,"Snort, Cranston",cs@mail.weber.edu
W12345678,Cranston Snort,cs@mail.weber.edu,"a,b"

# Correct with embedded double quotation marks
W12345678,Cranston""Snort,cs@mail.weber.edu

# Correct with two adjacent commas
W12345678,Cranston Snort,cs@mail.weber.edu,,Room 222

# Incorrect with unmatched quotation mark
W12345678,"Snort, Cranston.,cs@mail.weber.edu
"W12345678","Snort, Cranston,cs@mail.weber.edu
CSV driver and data. The driver opens, validates, and reads the data file, passing each line to the parse function. Although the while-loop executes once after reading the last line, the parse function skips it and all empty lines.

The previous examples managed to solve a stated problem with a single match or replace function call. The CSV solution takes a different approach: it iteratively searches left to right for specific sub-patterns, removing them from the input when found, shorting in input string. The characters matching sub-patters are temporarily stored in an smatch object until the program extracts them, accumulating its output in an ostringstream object. Accumulating its output in a string stream object allows the program to forgo output if it detects an error at any time during the parsing process. The program loops the parsing process until the input string becomes empty.

void parse(string input)
{
    smatch         m;
    ostringstream  sout;

    if (regex_match(input, regex( "^$|^#.*$" )))							// (a)
        return;

    while (input.length() > 0)
    {
        if (regex_match(input, regex( "[^\"]*\"[^\"]*" )))						// (b)
        {
            cerr << "Unbalanced \"" << endl;
            return;
        }

        if (regex_search(input, m, regex( "(^(?:\"([^\"]+)\")|([^,\\r\\n]*))(?:[,(?:\\r\\n)]?)" )))	// (c)
        //if (regex_search(input, m, regex( R"#((^(?:"([^"]+)")|([^,\r\n]*))(?:[,(?:\r\n)]?))#" )))	// raw string
            sout << left << setw(20) <<
                regex_replace(string(m[1]), regex( "(\"\")" ), "\"" );    // replaces "" with "

        input = m.suffix().str();									// (d)
    }

    cout << sout.str() << endl;										// (e)
}
csv.cpp: Extracting fields from a CSV-encoded string with RE. The RE CSV solution utilizes the regex_search function, which scans a string from left to right, stopping when it locates a substring matching the specified RE. The search function saves the matching text in the matcher object, m, and returns true; it returns false if it doesn't find a matching substring. The parse function saves intermediate output in an istringstream object, which it sends to the console if it doesn't detect any errors while parsing the input string.
  1. Filters out blank and comment lines: ^$ matches empty lines while #.*$ matches comments.
  2. Detects unbalanced quotation marks and aborts processing the current line: [^\"]*  \"  [^\"]* means anything except a quotation mark - a quotation mark - anything except a quotation mark.
  3. To better understand the RE, decompose it into three sub-expressions: (r1|r2)r3
    • r1 = ^(?:\"([^\"]+)\")        Text surrounded by quotation marks, including commas
    • r2 = ([^,\\r\\n]*)               Comma separated fields; treats adjacent commas as empty
    • r3 = (?:[,(?:\\r\\n)]?)      Discards commas, carriage returns, and newlines
    The regex_replace function converts adjacent quotation marks into a single mark on output.
  4. Removes and saves the suffix or last part of the most recent match, effectively removing the matched field and reducing the string's length.
  5. The str function returns the output accumulated in the istringstream object.

CSV Regular Expression Example Downloadable Files

ViewDownloadComments
csv.cpp csv.cpp A CSV solution based on regular expressions
csv.txt csv.txt A CSV test data file