14.4. Processing Files

Review

To process [verb 1, 2.b] something means "to subject [it] to or handle [it] through an established usually routine set of procedures." "Processing files" implies processing the data in them, subjecting it to a set of procedures - algorithmically manipulating it. Programmers must consider two aspects of file processing when writing programs: the order in which the program reads or writes the file and the units or "chunks" the program reads or writes.

I/O Units and Access Order

Data consists of one or more bytes that programs group into logical "chunks," which the text calls "I/O units" to distinguish them from the physical "chunks" or blocks transferred between the program and the computer's hardware. Programs can process files in two orders: sequentially or randomly. Sequential processing accesses a file from beginning to end, with each read or write operation advancing the position pointer by the I/O unit size. The term "random" in "random access" doesn't mean that programs access data by chance but that they can access it in any order, often in response to an outside request. The synonym "direct" suggests that a program can go directly to any data element using an index and one of the "seek" functions. Programmers synthesize a third order, keyed or indexed access, by extending random access. The underlying problem determines the I/O unit's size and the file's access order. In turn, the access order determines which stream classes programs use, while the I/O unit determines which I/O functions they call.
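The difference between the two access orders can be sketched with a stream of fixed-size I/O units. This is only an illustration under stated assumptions: the function names read_sequential and read_unit and the 4-byte UNIT_SIZE are hypothetical, and the functions take an istream reference so they accept either an ifstream opened in binary mode or, for experimenting without a file, an istringstream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>      // istringstream, handy for trying the functions without a file
#include <string>
#include <vector>
using namespace std;

const int UNIT_SIZE = 4;    // hypothetical I/O unit size in bytes

// Sequential access: read every unit from the beginning to the end.
// Each successful read advances the position pointer by UNIT_SIZE.
vector<string> read_sequential(istream& in)
{
    vector<string> units;
    char unit[UNIT_SIZE];

    while (in.read(unit, UNIT_SIZE))
        units.push_back(string(unit, UNIT_SIZE));

    return units;
}

// Random (direct) access: seek to the byte offset of unit number
// "index" and read only that unit.
string read_unit(istream& in, int index)
{
    assert(index >= 0);
    char unit[UNIT_SIZE];

    in.clear();             // reset flags left set by an earlier failed read
    in.seekg(static_cast<streamoff>(index) * UNIT_SIZE, ios::beg);
    if (!in.read(unit, UNIT_SIZE))
        return "";          // no complete unit at that position

    return string(unit, UNIT_SIZE);
}
```

With a stream holding the 16 bytes AAAABBBBCCCCDDDD, read_sequential visits all four units in order, while read_unit(in, 2) goes directly to CCCC without reading the units before it.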

I/O Unit     Access Order
             Sequential               Random / Direct    Keyed / Indexed
Character    Text & Binary [1]
Line         Text
Block [2]    Binary (infrequently)    Binary             Binary
Buffer       Text & Binary
File processing overview. The table summarizes the processing order and data transfer sizes C++ supports for text and binary files. Subsequent sections provide greater detail and more extensive programming examples.
Sequential
Access begins with the position pointer at zero and processes the data from the beginning of the file to the end, automatically advancing the position pointer by the I/O unit size. Programs perform sequential file access with instances of the ifstream and ofstream classes.
Random / Direct
C++ programs can implement random read-only file access with an ifstream object, but they typically mix reading and writing operations performed with an fstream object.
Keyed / Indexed
Keyed sequential access (KSAM), also known as indexed sequential access (ISAM), extends random access. It requires at least two files: a data file consisting of fixed-length records and a file consisting of corresponding keys and indexes. The Indexed Sequential Access Method section presented later in the chapter illustrates this process.
Character
Data is read from or written to a file one byte or one character at a time. This unit allows the program to examine and process individual characters or bytes. Programs can route the characters to different files as they process them.
Line
Data is read from or written to a file one line at a time as determined by the line separator. This unit is flexible, allowing programs to process complete lines or examine the individual characters in the line. Programs can use line-oriented processing to make robust, bullet-proof code by reading a line and verifying that it matches (often with a regular expression) an acceptable pattern.
Block [2]
The program reads or writes fixed-size blocks of data. The block size corresponds to a struct, class, database record, or, in special cases, the physical filebuf size. Databases frequently use block I/O operations, implemented with fstream objects, that allow them to read a block, update it, and write it back to the same physical location in secondary memory.
Buffer
Buffer I/O is uncommon and only helpful in limited situations. Each stream object has a streambuf by aggregation. The rdbuf function gets and sets the buffer, allowing rapid, if limited, file processing.
  1. Some I/O functions, eof for example, may not work with binary data.
  2. As used here, the term "block" describes a unit of memory whose size corresponds to a logical grouping of related data defined in a program. For example, a program creating Student objects may read and write those objects as Student-sized blocks. In contrast, the previous use of "block" described a "chunk" of memory whose size was a product of the operating system and hardware.
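The read, update, and write-back cycle described for block I/O can be sketched as follows. The Student fields, the function names, and the record layout are all assumptions made for the illustration. The functions take an iostream reference, so they accept an fstream opened with ios::in | ios::out | ios::binary or a stringstream for experimenting. Copying a struct's raw bytes this way is only reasonable for simple structs without pointers or string members.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>      // stringstream, handy for trying the functions without a file
using namespace std;

// Hypothetical fixed-size record; the fields are assumptions for this sketch.
struct Student
{
    int    id;
    char   name[20];
    double gpa;
};

// Write one Student-sized block at record position "index".
void write_student(iostream& file, int index, const Student& s)
{
    assert(index >= 0);
    file.clear();
    file.seekp(static_cast<streamoff>(index * sizeof(Student)), ios::beg);
    file.write(reinterpret_cast<const char*>(&s), sizeof(Student));
}

// Read the Student-sized block stored at record position "index".
bool read_student(iostream& file, int index, Student& s)
{
    assert(index >= 0);
    file.clear();
    file.seekg(static_cast<streamoff>(index * sizeof(Student)), ios::beg);
    return static_cast<bool>(file.read(reinterpret_cast<char*>(&s), sizeof(Student)));
}

// Read a block, update it, and write it back to the same location.
bool update_gpa(iostream& file, int index, double gpa)
{
    Student s;
    if (!read_student(file, index, s))
        return false;               // no such record

    s.gpa = gpa;
    write_student(file, index, s);
    return static_cast<bool>(file); // false if the write failed
}
```

Because every record has the same size, the record's index multiplied by sizeof(Student) gives its byte offset, which is what lets the program return to the same physical location for the write.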

File Processing Control

Programs reading files sequentially often use loops to iterate through their contents, interleaving reading and data processing operations. The operating system manages the computer hardware, including access to secondary memory, making it further responsible for detecting and signaling when a program reaches the end of a file. The stream classes present a consistent interface, independent of the hardware and the operating system, providing programs with many ways to detect the signal.

The eof Function

Each stream object maintains a small set of state flags (eofbit, failbit, and badbit, along with the "no error" value goodbit) recording the associated file's current state or condition. One of those flags, eofbit, saves the stream's "end of file" status. A read operation sets the bit to "1" when it tries to read past the end of the file; reading the file's last character does not, by itself, set the bit. Otherwise, the bit remains "0." The eof or end of file function returns the current value of the eofbit, providing loop control for processing a file's contents.
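A short experiment shows when the eofbit changes. The helper name eof_after is hypothetical, and the sketch reads from an istringstream standing in for a four-character file; an ifstream behaves the same way for this purpose.

```cpp
#include <cassert>
#include <istream>
#include <sstream>      // istringstream stands in for a small file
using namespace std;

// Perform "reads" single-character read operations, then report
// the value of the eofbit as returned by the eof function.
bool eof_after(istream& in, int reads)
{
    assert(reads >= 0);
    char c;

    for (int i = 0; i < reads; i++)
        in.get(c);      // the result is ignored on purpose; only the flag matters here

    return in.eof();
}
```

For a stream holding abcd, eof_after returns false after four reads, even though all the data has been consumed, and true only after a fifth read has failed.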

(a) Incorrect

ifstream file(file_name);

while (! file.eof())
{
	// read file
	// process data
}

(b) Correct

ifstream file(file_name);

// initial file read
while (! file.eof())
{
	// process the data
	// subsequent file reads
}
File read patterns with the eof function. Imagine using a word processor: it can edit short or long files. It can remove or add text to the file, changing its size. Generalizing this observation suggests that programs reading files should operate independently of the file's size. The eof function allows programmers to form loops that read and process the data in a file regardless of its size. However, the function doesn't test the file's status but returns the current value of the eofbit, which the previous read operation sets.
  1. The obvious solution fails in most cases because the read operation extracting the last data unit is successful, so it doesn't set the eofbit to 1. The next read operation fails to extract data, so it sets the bit, indicating that the stream is at the end. However, the following process operation doesn't have data to process.
  2. This version changes the operation order in the loop's body. The read operation at the bottom of the loop sets the eofbit when it fails to extract data at the end of the file, so the eof function ends the loop before the process operation runs again. However, this version requires a read operation before entering the loop.
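The two patterns can be compared by counting how many times each one runs its "process" step. This is a sketch with hypothetical function names; counting stands in for whatever processing a real program performs, and the functions accept any istream, including an ifstream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>      // istringstream, handy for trying the functions without a file
using namespace std;

// Pattern (a): test, read, process. Processes one unit too many.
int count_test_first(istream& in)
{
    assert(in.good());      // guard: on an already-failed stream eof() stays false and the loop never ends
    int count = 0;
    char c;

    while (!in.eof())
    {
        in.get(c);          // read: fails on the final pass
        count++;            // process: runs anyway
    }
    return count;
}

// Pattern (b): initial read, then test, process, read.
int count_read_first(istream& in)
{
    assert(in.good());
    int count = 0;
    char c;

    in.get(c);              // initial read
    while (!in.eof())
    {
        count++;            // process the data
        in.get(c);          // subsequent read
    }
    return count;
}
```

For a stream holding abcd, pattern (a) counts five characters while pattern (b) counts four. Both loops assume the stream opened successfully, which is why the sketch checks the stream before looping.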
The next figure demonstrates some simple read operations; more complex examples follow in subsequent sections.
(a)

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    ifstream in("data.txt");
    char c;

    while (!in.eof())
    {
        in >> c;
        // in.get(c);	// alternate
        cout << '|' << c << '|' << endl;
    }

    return 0;
}

Output (a):
|a|
|b|
|c|
|d|
|d|

(b)

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    ifstream in("data.txt");

    while (!in.eof())
    {
        int c = in.get();
        cout << '|' << (char)c
            << '|' << endl;
    }

    return 0;
}

Output (b):
|a|
|b|
|c|
|d|
| |

(c)

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    ifstream in("data2.txt");
    char line[100];

    while (!in.eof())
    {
        in.getline(line, 100);
        cout << '|' << line
            << '|' << endl;
    }

    return 0;
}

Output (c):
|See the quick|
|red fox jump|
|over the lazy|
|brown dog.|
The problem with the eof function. The first two programs demonstrate the eof function's problematic behavior when reading single characters, while the third demonstrates better behavior when reading lines of text. The programs echo the input to the console, surrounding it with '|' characters to demonstrate the eof function's behavior. The test data for programs (a) and (b) consists of a file with four characters on one line: abcd. The input for program (c) is identical to its output without the '|' characters.
  1. The extractor (>>) reads a character from the file, leaving the eofbit unset (i.e., 0). It only sets the eofbit after failing to read a character on the fifth iteration, leaving the previously read character, "d," in the variable c. C++ defines several overloaded versions of the get function. Replacing the extractor with in.get(c); produces the same output.
  2. The version of the get function without arguments returns each character read from the file as an integer, which the program must cast to a char for output. As with the previous program, the get function doesn't set the eofbit when it successfully reads a character. When it fails to read a character on the fifth iteration, it sets the eofbit and returns EOF, which is unprintable. The following figure demonstrates a better way to use the get function and the EOF it returns.
  3. The getline function reads a full line from the file, discarding the newline character at the end. Unlike the other input operations, it sets the eofbit while reading the final line: the test data doesn't end with a newline, so getline reaches the end of the file while searching for one, preventing a fifth loop iteration.
Caution

The POSIX standard defines a line as ending with a newline, so text files created on POSIX systems typically end with a newline that can look like a trailing blank line to a program. The vi text editor and its derivatives (e.g., vim and gvim) transparently add the newline, while nano adds and displays the newline but won't remove it. A trailing newline in test data created on a POSIX system changes the illustrated output. Subsequent, more robust examples demonstrate how to deal with blank lines.
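The effect of a trailing newline on the loop from program (c) can be observed without creating files. The sketch below uses hypothetical function names and collects the lines in a vector instead of printing them. The second function is one possible alternative, not necessarily the approach the later examples take: it lets each read control the loop so that a failed read contributes nothing.

```cpp
#include <cassert>
#include <istream>
#include <sstream>      // istringstream, handy for trying the functions without a file
#include <string>
#include <vector>
using namespace std;

// The pattern from program (c): test eof before each getline call.
vector<string> lines_eof_loop(istream& in)
{
    assert(in.good());      // guard: on an already-failed stream eof() stays false and the loop never ends
    vector<string> lines;
    char line[100];

    while (!in.eof())
    {
        in.getline(line, 100);
        lines.push_back(line);
    }
    return lines;
}

// An alternative: each read drives the loop, so a failed read adds no line.
vector<string> lines_read_loop(istream& in)
{
    vector<string> lines;
    string line;

    while (getline(in, line))
        lines.push_back(line);

    return lines;
}
```

Given two lines without a final newline, both functions return two lines. When the data ends with a newline, lines_eof_loop returns a third, empty line, while lines_read_loop still returns two.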

The EOF Constant

#include <iostream>
#include <fstream>
#include <cstdio>	// defines the EOF constant
using namespace std;

int main()
{
    ifstream in("data.txt");
    int c;
    
    while ((c = in.get()) != EOF) 
        cout << '|' << (char)c << '|' << endl;

    return 0;
}
Fixing the character read problem. When the get function reads a character, it doesn't "know" that it's the last character until it tries to read the next character and discovers there isn't one. Merging the read operation with the loop test solves the problem. In program (b) above, the input operation follows the test; here, in contrast, the input operation is part of the expression driving the loop, which tests the get function's return value rather than testing the eofbit with the eof function. Each pair of parentheses in the loop header plays a different role: the innermost pair calls the get function, the grouping parentheses force the function call and assignment to run before the comparison, storing the returned character in variable c, and the outermost pair encloses the loop's test. The loop runs while the returned character is not equal to EOF. With the four-character test data, the loop in this program runs four times without producing extraneous output.
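The program above declares c as an int rather than a char, and a small sketch suggests why that choice matters. The function name count_bytes is hypothetical. The assertion inside the loop records the property the pattern relies on: a successful get call returns a non-negative value, so no byte read from the stream can be mistaken for EOF.

```cpp
#include <cassert>
#include <climits>      // UCHAR_MAX
#include <cstdio>       // EOF
#include <istream>
#include <sstream>      // istringstream, handy for trying the function without a file
#include <string>
using namespace std;

// Count the bytes in a stream by merging the read with the loop test.
int count_bytes(istream& in)
{
    int count = 0;
    int c;      // int, not char: it must hold every byte value and EOF

    while ((c = in.get()) != EOF)
    {
        // A successful read yields a value from 0 to UCHAR_MAX,
        // so it never compares equal to EOF.
        assert(c >= 0 && c <= UCHAR_MAX);
        count++;
    }
    return count;
}
```

Even when the data contains the byte 0xFF, which would become -1 if stored in a signed char, the function counts every byte because the comparison happens on the int value.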

operator( )

The C++ operator summary lists three distinct operators implemented with parentheses in the second group of operators. The stream classes overload the third, the type-casting operator, to form a conversion operator: explicit operator bool() const (since C++11). It returns true while the stream is usable, that is, while neither failbit nor badbit is set. The following figure presents a function prototype and code fragments demonstrating how programs use the conversion operator to read a file.

(a)
istream& get(char& c);

(b)
ifstream input(file_name);

(c)
do {
	...
} while (input);

(d)
char c;
while (input.get(c))
	. . . 

Using the input stream conversion operator.
  1. The prototype for the Figure 3(a) version of the overloaded get function. For this demonstration, the significant feature is the function's return type: a stream reference. (The ifstream class inherits the function from istream, so the reference refers to the ifstream object that called it.)
  2. The program defines the input file stream object appearing in the following code fragments.
  3. The while test automatically calls the conversion operator directly on the stream object, creating a Boolean-valued expression driving the loop. The expression is true while the stream remains usable and becomes false after a read operation fails, as happens when the program tries to read past the end of the file.
  4. The get function reads one character from the input file, stores it in the variable c, and returns a reference to input. The while test then automatically calls the conversion operator on the returned reference.
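Fragment (d) can be expanded into a complete function. The sketch below uses hypothetical function names; the second function assumes, as is the case for the standard extractor, that operator>> also returns a reference to the stream, so the same loop form works for formatted input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>      // istringstream, handy for trying the functions without a file
#include <string>
using namespace std;

// Collect every character; the stream returned by get drives the loop.
string read_chars(istream& input)
{
    string text;
    char c;

    while (input.get(c))    // get returns the stream; the while test converts it to bool
        text += c;

    assert(!input);         // the loop ends only when the conversion yields false
    return text;
}

// The same loop form with the extractor, which also returns the stream.
int sum_ints(istream& input)
{
    int sum = 0;
    int value;

    while (input >> value)
        sum += value;

    return sum;
}
```

Because the read and the test are the same expression, the loop body runs only after a successful read, so neither function needs a separate initial read or a call to the eof function.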