14.5. Processing Files

To process something means "to subject [it] to or handle [it] through an established usually routine set of procedures." When we say that we are going to "process a file," we more accurately mean that we are going to "process the data in the file." So, when we process a file, we subject the data it contains to a set of procedures (i.e., we manipulate the data algorithmically), store the results of a set of procedures or algorithmic operations in a file, or do some combination of both. When we prepare to process a file, there are two independent concepts that we must consider. The first, called the access technique, is how we locate the position in the file where the read or write will take place. The second is how we read or write the file: which function(s) we use, which in turn determines how much data each I/O operation transfers between the file and the program.

Access Techniques

Systems broadly provide at least two primary ways for programmers to access files. We can also synthesize a third technique using fundamental file processing operations. The problem the program solves dictates the access technique we follow. The access technique then dictates which is the best stream class to use.

Sequential: The sequential access technique is relatively simple: it begins with the position pointer at zero and processes the data from the beginning of the file to the end. File streams treat data as a stream of bytes as it moves between the program and a file. Even simple data, like an int, typically consists of multiple bytes. So, the number of bytes processed by each operation depends on the data type. Each read or write operation advances the position pointer by the number of bytes read or written, respectively. So, if the program reads or writes n bytes, the stream automatically advances the position pointer by n: position pointer += n;. Instances of the ifstream and ofstream classes perform sequential access.
Random/Direct: Random access, also known as direct access, may be implemented with any stream class but is generally implemented with instances of fstream. In this context, "random" means that a program can access the data in any order, not just sequentially. And "direct" means that a program can access a specific data item using an index or record number - very much like an array. Direct access uses block I/O operations (introduced below) and adds a family of overloaded "seek" functions. See Random/Direct Access later in this chapter for details, illustrations, and an example.
Keyed/Indexed: Keyed sequential access (KSAM), also known as indexed sequential access (ISAM), is the most complex file processing technique. It requires at least two files. The first is a data file consisting of a sequence of "records." A record is a large collection of related data (e.g., a structure or other object). The second file is an index or key file consisting of much smaller records, each of which maps a key to the index of a record in the data file. In an authentic application, the data file is usually too large to fit in main memory, which makes it difficult and expensive (in time) to reorganize. So, the program appends new records at the end of the data file. In contrast, the index file is smaller and easier to organize to facilitate fast searches. A program searches for a specific key in the index file. If it finds a matching key, it uses the associated index to access the corresponding record in the data file. See Keyed/Indexed Sequential Access later in this chapter for more details, illustrations, and an example.

File Process Control

Programs, especially those using sequential access, often use loops to process files. When the program writes data to a file, some condition related to the data source signals the program when there is no more data to write. When the program reads information from a file, the file itself (or, more accurately, the operating system) must signal the program when it has read all the data. In the last section, we learned that each stream object maintains a set of four one-bit flags that indicate the stream's current state or condition. One of those flags, eofbit, signals when the position pointer reaches the end of the file. C++ streams provide two ways for a program to detect the end of a file.

eof

The eof or end of file function returns true after the position pointer reaches the end of file. It's easy to base loop statements that read and process the file contents on this function.

ifstream file(file_name);
		.
		.
		.
// read data from file
while (! file.eof())
{
	// process the data
	// read data from file
}

Using the eof function to control a loop reading a file. It is often the case that a file contains an unknown amount of data. The eof function allows programmers to form loops that read and process the data in a file until all of the data is read and processed. The loop continues until the last read operation moves the position pointer to the end of the file. This example continues the practice introduced in the previous section of using file_name to represent the different ways of denoting a file's name.

Unfortunately, the behavior of the eof function is not as straightforward as we might expect. The function does not actually test the file to see if there is more data to read. Instead, it returns the current value of the eofbit, which is set by the last read function called. So, the value eof returns depends on the outcome of a different, previous function call. This unexpected behavior is usually only a problem when reading single characters from a file - functions reading more complex data detect and set the eofbit as a part of the read operation.

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    ifstream in("data.txt");
    char c;

    while (!in.eof())
    {
        in >> c;
        //in.get(c);
        cout << '|' << c << '|' << endl;
    }

    return 0;
}

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    ifstream in("data.txt");

    while (!in.eof())
    {
        int c = in.get();
        cout << '|' << (char)c << '|' << endl;
    }

    return 0;
}

Output of program (a), the first program above:

|a|
|b|
|c|
|d|
|d|

Output of program (b), the second program above:

|a|
|b|
|c|
|d|
| |

The problem with the eof function. Both programs demonstrate the behavior of the eof function used in conjunction with three different read operations. The test data consists of a file with four characters on one line: abcd. The while-loops in both programs loop five times - one time too many. The output that each program produces is displayed below the programs. The '|' character is included as part of the output to make the space in the output of program (b) "visible."

(a) The first program shows two different read functions, in >> c and in.get(c) (the latter is commented out). The program exhibits the same flawed behavior regardless of which read function is used: the eofbit is set by the read function only after the loop begins its fifth and final iteration. During the final iteration of the loop, the read function attempts to read data, fails, and then sets the eofbit. The "processing" code (represented by the output operation) then uses the character left in c by the previous iteration of the loop.
(b) The second version of the program behaves just like the first. The only difference is that the read function used in this program, an overloaded version of the get function, returns the character as an int, which must be cast to a char for output. The last read, which takes place during the final iteration, after the eof function is called, returns EOF, which produces an unprintable character when cast.

#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    ifstream in("data.txt");
    int c;
    
    while ((c = in.get()) != EOF) 
        cout << '|' << (char)c << '|' << endl;

    return 0;
}

Fixing the character read problem. Imagine a program reading a file one character at a time. When it reads a character, it doesn't "know" that it's the last character until it tries to read the next character and discovers there isn't one. We can solve this problem by rewriting 2(b) above. The version in this figure embeds the read operation and the test inside the while-loop control. The three pairs of parentheses are all different. The outer pair is part of the while-loop syntax. The innermost pair, get(), forms an empty argument list for the get function. And the middle pair, surrounding c = in.get(), are grouping parentheses: they force the assignment to take place before the != comparison. The code calls the get function, which returns a character, and stores the character in variable c. Next, the code tests the value stored in c; while it's not equal to EOF, a symbolic constant for the end-of-file marker (often -1), the loop continues. When get returns EOF, the loop ends without processing it. So, the loop only runs four times, and doesn't produce the extraneous output produced by 2(a) and 2(b).

operator bool

The overloaded operator bool function is a conversion operator that is used in conjunction with input streams and some of their member functions to detect when a read has failed, most commonly because the end of the file was reached. It's important to understand that programmers do not explicitly call the conversion function. Like a constructor, the conversion function is called automatically when an input stream is used in a context that requires a Boolean value, such as the test of a loop or an if-statement. The following code fragments demonstrate how to use the function.

(a) ifstream input(file_name);
(b) istream& get(char& c);
(c) do { ... } while (input);
(d) char c; while (input.get(c)) . . .

Calling operator bool.

(a) The statement defines an input file stream object that illustrates the conversion operator in (c) and (d).
(b) The prototype for one overloaded version of the get function, illustrating that the function returns a reference to the stream. (ifstream inherits get from istream, so the declared return type is istream&.)
(c) The while test automatically calls the conversion operator, which creates a Boolean-valued expression that drives the loop. The value of the expression is true if no operation on the stream has failed, or false after a failure, such as a read attempt at the end of the file or any other error condition.
(d) The get function reads one character from the input file, stores it in the variable c, and returns a reference to input, which automatically calls the conversion operator. The conversion operator creates a Boolean value that drives the loop.

I/O Operations

Once we have selected an access technique that matches the problem we are trying to solve, the next step is to determine the best way to read or write the file. Different I/O operations allow the program to read or write different amounts of data with each I/O operation. Like access techniques, these I/O operations must also match the problem that the program solves.

Three ways of reading or writing data are generally supported, with a fourth, very specialized way also provided. The three common techniques of processing files match the I/O operations to the natural data boundaries - that is, they read or write the number of bytes that form a complete data item: one byte for a character and, typically, four bytes for an integer, eight bytes for a double, etc. The fourth way accesses data along hardware-oriented boundaries, which dramatically limits further data processing.

Character

Data is read from or written to a file one byte or one character at a time.

Appropriate for both textual and binary data, but some functions (e.g., eof) may not work with binary data
Generally only appropriate for sequential access
Flexible: each character read from a file may be individually processed; processing that produces individual bytes or characters may be written as they are produced; or reading and writing (to and from different files) may be mixed
With caution, >> and << can be used to read and write data respectively

Line

Data is read from or written to a file one line at a time.

Only appropriate for textual data; variables are usually a string or a C-string
Generally only appropriate for sequential access
Flexible: each line read from a file may be individually processed; processing that produces distinct lines may be written as they are produced; or reading and writing (to and from different files) may be mixed
With caution, >> and << can be used to read and write data respectively

Block

Data is read from or written to a file one block at a time.

The size of each block
- is fixed (i.e., the size does not change)
- may correspond to the size of a struct, class, database record, or, in a special case, the physical filebuf
Typically used for binary data
Most often used with direct or keyed/indexed access, but can be used with sequential access in specific situations

Buffer

Each stream object has a streambuf object by aggregation (Figure 2). The rdbuf function gets and sets the buffer, which allows rapidly processing data in limited situations.

Most appropriate for sequential access
Most appropriate for binary data

The example that follows, mycopy.cpp, demonstrates some simple file processing operations.