14.11.2. Comma-Separated Values

Review

The comma-separated value (CSV) format is a textual representation of structured data arranged or imagined as a table consisting of rows (records) and columns (fields). Programs like spreadsheets or databases often use the CSV format to import and export data, making it a useful format for porting data between two otherwise incompatible systems. The CSV format represents each row as a line in a text file separated into distinct fields by commas. If this represented the full extent of the CSV format, the complexity of parsing the files would be on par with parsing the Rolodex files demonstrated previously.

Fields sometimes contain a comma as part of the data. For example, some systems save a person's name as LastName, FirstName without meaning to separate the names into different fields. The CSV format solves the problem by surrounding the field with quotation marks: "LastName, FirstName". The quotation marks are not part of the data, and the processing program must filter them out. Less common, saved data might contain an embedded quotation mark that is not part of a pair surrounding a field: part1"part2. The CSV format encodes an embedded quotation mark as two: part1""part2, which the program must render as a single mark on output. Finally, two adjacent commas represent a valid but empty field. (Less common, and supported by few systems, is including a line-separator in a field.) The textbook uses the problem of parsing a CSV file and extracting the delimited data as the final demonstration of regular expressions.

File Input Code Program Test Data
int main()
{
    ifstream in("csv.txt");

    if (!in.good())
    {
        cerr << "Unable to open \"cvs.txt\"\n";
        exit(1);
    }

    while (!in.eof())
    {
        string line;
        getline(in, line);
        parse(line);
    }

    return 0;
}
# Correct without quotation marks
W12345678,Cranston Snort,cs@mail.weber.edu

# Correct with embedded comma within quotation marks
W12345678,"Snort, Cranston",cs@mail.weber.edu

# Correct with embedded double quotation marks
W12345678,Cranston""Snort,cs@mail.weber.edu

# Correct with two adjacent commas
W12345678,Cranston Snort,cs@mail.weber.edu,,Room 222

# Incorrect with unmatched quotation mark
W12345678,"Snort, Cranston.,cs@mail.weber.edu
"W12345678","Snort, Cranston,cs@mail.weber.edu
CSV driver and data. The driver opens, validates, and reads the data file, passing each line to the parse function. Although the while-loop executes once after reading the last line, the parse function skips it and all empty lines.

The previous examples managed to solve a stated problem with a single match or replace function call. The CSV solution takes a different approach: it iteratively searches left to right for specific sub-patterns, removing them from the input when found, shorting in input string. The characters matching sub-patters are temporarily stored in an smatch object until the program extracts them, accumulating its output in an ostringstream object. Accumulating its output in a string stream object allows the program to forgo output if it detects an error at any time during the parsing process. The program loops the parsing process until the input string becomes empty.

void parse(string input)
{
    smatch         m;
    ostringstream  sout;

    if (regex_match(input, regex( "^$|^#.*$" )))						// (a)
        return;

    while (input.length() > 0)
    {
        if (regex_match(input, regex( "[^\"]*\"[^\"]*" )))					// (b)
        {
            cerr << "Unbalanced \"" << endl;
            return;
        }

        else if (regex_search(input, m, regex( "^(?:\"([^\"]+)\",)" )))				// (c)
            sout << left << setw(20) << m[1];

        else if (regex_search(input, m, regex( "^," )))						// (d)
            sout << left << setw(20) << "";

        else if (regex_search(input, m, regex( "(?:([^,]+),?)" )))				// (e)
            sout << left << setw(20) << regex_replace(string(m[1]), regex( "(\"\")" ), "\"" );	// (f)

        input = m.suffix().str();								// (g)
    }

    cout << sout.str() << endl;									// (h)
}
csv.cpp: Extracting fields from a CSV-encoded string with RE. The RE CSV solution utilizes the regex_search function, scans a string from left to right, stopping when it locates a substring matching the specified RE. The search function saves the matching text in the match object, m, and returns true; it returns false otherwise. The parse function saves intermediate output in an istringstream object, which it sends to the console if it doesn't detect any errors while parsing the input string.
  1. Filters out blank and comment lines: ^$ matches empty lines while #.*$ matches comments.
  2. Detects unbalanced quotation marks and aborts processing the current line: [^\"]*  \"  [^\"]* means anything except a quotation mark - a quotation mark - anything except a quotation mark.
  3. Detects and processes text surrounded by quotation marks, including commas. ^(?:  \"([^\"]+)\",  ) creates a non-capturing group consisting of a quotation mark, one or more characters that are not a quotation mark, followed by a quotation mark.
  4. Processes adjacent commas as an empty field. The previous loop iteration consumed the first comma; this operation consumes the second and adds an empty field to the string stream.
  5. The case processing general comma-separated fields. The RE retains adjacent quotation marks, processed by the following regex_replace. (?:  ([^,]+),?  ) creates a non-capturing group consisting of one or more characters excluding commas, followed by zero or one comma.
  6. Detects two adjacent quotation marks in the general case output and converts them to one.
  7. Removes and saves the suffix or last part of the most recent match, effectively removing the matched field and reducing the string's length.
  8. The str function returns the output accumulated in the istringstream object.

 

void parse(string input)
{
    smatch        m;
    ostringstream sout;

    if (regex_match(input, regex( "^$|^#.*$" )))
        return;

    while (input.length() > 0)
    {
        if (regex_match(input, regex( R"([^"]*"[^"]*)" )))
        {
            cerr << R"(Unbalanced ")" << endl;
            return;
        }

        else if (regex_search(input, m, regex( R"#(^(?:"([^"]+)",))#" )))
            sout << left << setw(20) << m[1];

        else if (regex_search(input, m, regex( "^," )))
            sout << left << setw(20) << "";

        else if (regex_search(input, m, regex( R"#((?:([^,]+),?))#" )))
            sout << left << setw(20) << regex_replace(string(m[1]), regex( R"#((""))#" ), R"#(")#" );

        input = m.suffix().str();
    }

    cout << sout.str() << endl;
}
cvs-raw.cpp. An RE-based solution for the CSV problem utilizing raw string literals.

CSV Regular Expression Example Downloadable Files

ViewDownloadComments
csv.cpp csv.cpp A CSV solution based on regular expressions
csv-ras.cpp csv-raw.cpp An RE CSV solution using raw string literals
csv.txt csv.txt A CSV test data file