14.11.2. Comma-Separated Values

Time: 00:05:16 | Download: Large, Large (CC), Small | Streaming, Streaming (CC) | Slides: PDF, PPTX

Review

The comma-separated value (CSV) format is a textual representation of structured data arranged or imagined as a table consisting of rows (records) and columns (fields). Programs such as spreadsheets or databases often use the CSV format to import and export data, making it a useful format for transferring data between two otherwise incompatible systems. The CSV format represents each row as a line in a text file separated into distinct fields by commas. If this represented the full extent of the CSV format, the complexity of parsing the files would be on par with parsing the Rolodex files demonstrated previously.

Fields sometimes contain a comma as part of the data. For example, some systems save a person's name as LastName, FirstName without meaning to separate the names into different fields. The CSV format solves the problem by surrounding the field with quotation marks: "LastName, FirstName". The quotation marks are not part of the data, and the processing program must filter them out. Less common, saved data might contain an embedded quotation mark that is not part of a pair surrounding a field: part1"part2. The CSV format encodes an embedded quotation mark as two: part1""part2, which the program must render as a single mark on output. Finally, two adjacent commas represent a valid but empty field. (Less common, and supported by few systems, is including a line-separator in a field.) The textbook uses the problem of parsing a CSV file and extracting the delimited data as the final demonstration of regular expressions.

File Input Code	Program Test Data
int main() { ifstream in("csv.txt"); if (!in.good()) { cerr << "Unable to open \"csv.txt\"\n"; exit(1); } while (!in.eof()) { string line; getline(in, line); parse(line); } return 0; }	# Correct without quotation marks W12345678,Cranston Snort,cs@mail.weber.edu # Correct with a comma embedded within quotation marks W12345678,"Snort, Cranston",cs@mail.weber.edu W12345678,Cranston Snort,cs@mail.weber.edu,"a,b" # Correct with embedded double quotation marks W12345678,Cranston""Snort,cs@mail.weber.edu # Correct with two adjacent commas W12345678,Cranston Snort,cs@mail.weber.edu,,Room 222 # Incorrect with an unmatched quotation mark W12345678,"Snort, Cranston.,cs@mail.weber.edu "W12345678","Snort, Cranston,cs@mail.weber.edu

File Input Code

Program Test Data

int main()
{
    ifstream in("csv.txt");

    if (!in.good())
    {
        cerr << "Unable to open \"csv.txt\"\n";
        exit(1);
    }

    while (!in.eof())
    {
        string line;
        getline(in, line);
        parse(line);
    }

    return 0;
}

# Correct without quotation marks
W12345678,Cranston Snort,cs@mail.weber.edu

# Correct with a comma embedded within quotation marks
W12345678,"Snort, Cranston",cs@mail.weber.edu
W12345678,Cranston Snort,cs@mail.weber.edu,"a,b"

# Correct with embedded double quotation marks
W12345678,Cranston""Snort,cs@mail.weber.edu

# Correct with two adjacent commas
W12345678,Cranston Snort,cs@mail.weber.edu,,Room 222

# Incorrect with an unmatched quotation mark
W12345678,"Snort, Cranston.,cs@mail.weber.edu
"W12345678","Snort, Cranston,cs@mail.weber.edu

CSV driver and data. The driver opens, validates, and reads the data file, passing each line to the parse function. Significantly, the driver opens the file in text mode, ensuring that the getline function maps carriage return newline sequences in Windows files to a single newline character. Although the while-loop executes once after reading the last line, the parse function skips it and all empty lines.

The lines in a general CSV file may have an unknown number of fields, and the number may vary from one line to the next. So, where the previous examples managed to solve a stated problem with a single match or replace function call, the CSV solution takes a different approach: It iteratively searches each line left to right for specific sub-patterns, removing them from the input when found, shortening the input string. The characters matching sub-patters are temporarily stored in an smatch object until the program extracts them, accumulating its output in an ostringstream object. Accumulating its output in a string stream object allows the program to forgo output if it detects an error at any time during the parsing process. The program loops the parsing process until the input string becomes empty.

void parse(string input)
{
    if (regex_match(input, regex( "^$|^#.*$" )))							// (a)
        return;

    smatch         m;
    ostringstream  sout;

    while (input.length() > 0)
    {
        if (regex_match(input, regex( "[^\"]*\"[^\"]*" )))						// (b)
        {
            cerr << "Unbalanced \"" << endl;
            return;
        }

        if (regex_search(input, m, regex( "(^(?:\"([^\"]+)\")|([^,\\n]*))(?:[,\\n]?)" )))		// (c)
        //if (regex_search(input, m, regex( R"#((^(?:"([^"]+)")|([^,\n]*))(?:[,\n]?))#" )))		// raw string
            sout << left << setw(20) <<
                regex_replace(string(m[1]), regex( "(\"\")" ), "\"" );    // replaces "" with "

        input = m.suffix().str();									// (d)
    }

    cout << sout.str() << endl;										// (e)
}

csv.cpp: Extracting fields from a CSV-encoded string with RE. The RE CSV solution utilizes the regex_search function, which scans a string from left to right, stopping when it locates a substring matching the specified RE. The search function saves the matching text in the matcher object, m, and returns true; it returns false if it doesn't find a matching substring. The parse function saves intermediate output in an istringstream object, which it sends to the console if it doesn't detect any errors while parsing the input string.

Filters out blank and comment lines: ^$ matches empty lines while #.*$ matches comments.
Detects unbalanced quotation marks and aborts processing the current line: [^\"]* \" [^\"]* means anything except a quotation mark - a quotation mark - anything except a quotation mark.

To better understand the RE, decompose it into three sub-expressions: (r₁|r₂)r₃

r₁ = `^(?:\"([^\"]+)\")`	Matches text surrounded by quotation marks, including commas, and discards the surrounding marks.
r₂ = `([^,\\n]*)`	Matches fields, treating adjacent commas as an empty field. The search ends with a comma or the end of the input.
r₃ = `(?:[,\\n]?)`	Discards the commas separating fields and the newlines at the end of the input.

The regex_replace function converts adjacent quotation marks into a single mark on output.

Removes and saves the suffix or last part of the most recent match, effectively removing the matched field and reducing the string's length.
The str function returns the output accumulated in the istringstream object.

Programs reading CSV-encoded files will likely process the fields beyond printing them. Imagine, for example, a user moving data between spreadsheet or database systems by exporting the data from one system to a CSV file. To import the data, the second system must not only extract the fields but must also process them into its own internal format. A straightforward solution saves the fields in a data structure as it extracts them. Since the number of fields varies from file to file and line to line, a general solution requires a structure capable of storing an indeterminate number of fields. Consequently, the following parse function saves the fields in a vector.

typedef vector<string> field_list;

field_list parse(string input)
{
    field_list fields;

    if (regex_match(input, regex( "^$|#.*$" )))
        return fields;

    smatch        m;

    while (input.length() > 0)
    {
        if (regex_match(input, regex( "[^\"]*\"[^\"]*" )))
        {
            fields.clear();
            return fields;
        }
        /*if (regex_match(input, regex( "[^\"]*\"[^\"]*" )))
            throw "Unbalanced \"";*/

        if (regex_search(input, m, regex( "(^(?:\"([^\"]+)\")|([^,\\n]*))(?:[,\\n]?)" )))
            fields.push_back(regex_replace(string(m[1]), regex( "(\"\")" ), "\"" ));
                
        input = m.suffix().str();
    }

    return fields;
}

csv_vector.cpp: Extracting and returning CSV fields. The parse function creates a vector and fills it with extracted CSV fields, then returns it to the client. Although it's not necessary, the example creates a program-specific datatype, field_list with the typedef operator. This common practice simplifies how programs use complex datatypes like templates. The example also demonstrates two alternative approaches to handling errors. The first clears or empties the vector before returning. The second throws an exception. The complete program, linked below, illustrates a client calling the function.

CSV Regular Expression Example Downloadable Files

View	Download	Comments
csv.cpp	csv.cpp	A CSV solution based on regular expressions
csv_vector.cpp	csv_vector.cpp	A CSV solution saving extracted CSV fields in a `vector`
csv.txt	csv.txt	A CSV test data file