To process [verb 1, 2.b] something means "to subject [it] to or handle [it] through an established usually routine set of procedures." "Processing files" implies processing the data in them, subjecting it to a set of procedures - algorithmically manipulating it. Programmers must consider two aspects of file processing when writing programs: the order in which the program reads or writes the file and the units or "chunks" the program reads or writes.
I/O Units and Access Order
Data consists of one or more bytes that programs group into logical "chunks," which the text calls "I/O units" to distinguish them from the physical "chunks" or blocks transferred between the program and the computer's hardware. Programs can process files in two orders: sequentially or randomly. Sequential processing accesses a file from beginning to end, with each read or write operation advancing the position pointer by the I/O unit size. The term "random" in "random access" doesn't mean that programs access data by chance; it means they can access the data in any order, often in response to an outside request. The synonym "direct" suggests that a program can go directly to any data element using an index and one of the "seek" functions. Programmers synthesize a third order, keyed or indexed access, by extending random access. The underlying problem determines the I/O unit's size and the file's access order. The access order determines which stream classes programs use, while the I/O unit determines the I/O functions.
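| I/O Unit  | Sequential access     | Random / Direct access | Keyed / Indexed access |
|-----------|-----------------------|------------------------|------------------------|
| Character | Text & Binary [1]     |                        |                        |
| Line      | Text                  |                        |                        |
| Block [2] | Binary (infrequently) | Binary                 | Binary                 |
| Buffer    | Text & Binary         |                        |                        |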
File processing overview.
The table summarizes the processing order and data transfer sizes C++ supports for text and binary files. Subsequent sections provide greater detail and more extensive programming examples.
Sequential
Access begins with the position pointer at zero and processes the data from the beginning of the file to the end, automatically advancing the position pointer by the I/O unit size. Programs perform sequential file access with instances of the ifstream and ofstream classes.
Random / Direct
C++ programs can implement random read-only file access with an ifstream object, but they typically mix reading and writing operations performed with an fstream object.
Keyed / Indexed
Keyed sequential access (KSAM), also known as indexed sequential access (ISAM), extends random access. It requires at least two files: a data file consisting of fixed-length records and a file consisting of corresponding keys and indexes. The Indexed Sequential Access Method section presented later in the chapter illustrates this process.
Character
Data is read from or written to a file one byte or one character at a time. This unit allows the program to examine and process individual characters or bytes. Programs can write the characters that are processed into different files as they are processed.
Line
Data is read from or written to a file one line at a time as determined by the line separator. This unit is flexible, allowing programs to process complete lines or examine the individual characters in the line. Programs can use line-oriented processing to make robust, bullet-proof code by reading a line and verifying that it matches (often with a regular expression) an acceptable pattern.
Block [2]
The program reads or writes fixed-size blocks of data. The block size corresponds to a struct, class, database record, or, in special cases, the physical filebuf size. Databases frequently use block I/O operations, implemented with fstream objects, that allow them to read a block, update it, and write it back to the same physical location in secondary memory.
Buffer
Buffer I/O is uncommon and only helpful in limited situations. Each stream object has a streambuf by aggregation. The rdbuf function gets and sets the buffer, allowing rapid, if limited, file processing.
[1] The int get() and istream& get(char&) functions return non-negative values. Programs algorithmically process the returned value for display as a negative number.
[2] As used here, the term "block" describes a unit of memory whose size corresponds to a logical grouping of related data defined in a program. For example, a program creating Student objects may read and write those objects as Student-sized blocks. In contrast, the previous use of "block" described a "chunk" of memory whose size was a product of the operating system and hardware.
File Processing Control
Imagine using a word processor: it can edit short or long files. It can remove or add text to the file, changing its size. Generalizing this observation suggests that programs reading files should operate independently of the file's size. Programs reading files sequentially often use loops to iterate through their contents, interleaving reading and data processing operations. The operating system manages the computer hardware, including access to secondary memory, making it further responsible for detecting and signaling when a program reaches the end of a file. The stream classes present a consistent interface, (mostly) independent of the hardware and the operating system, providing programs with many ways to detect the signal.
The eof Function
Each stream object has a set of four one-bit flags maintaining the associated file's current state or condition. One of those flags, the end-of-file or eofbit, saves the stream's reading status. The end-of-file function, eof, returns this saved value. However, it's the read operations, not the eof function, that set the eofbit flag, and they only set it when they attempt and fail to read data. This sequence explains why there is a lag between a program reading the last of the data and the eof function detecting the stream's end.
(a) Incorrect
ifstream file(file_name);
while (! file.eof())        // 1
{
    // read file            // 2
    // process data         // 3
}
(b) Correct
ifstream file(file_name);
// initial file read        // 1
while (! file.eof())        // 2
{
    // process data         // 3
    // read file            // 4
}
File processing with the eof function.
Knowing that the eof function returns the value saved in the eofbit and where the program sets or changes the saved value is essential for correctly using the eof function.
(a) Imagine that file_name exists but is empty. As the program enters the loop at step 1, no read operations have occurred, so the eofbit is unset or logically false. The eof function returns false, but the negation operator "flips" it to true, driving the loop forward. The read operation at step 2 fails to read data and sets the eofbit to 1 or logically true. The processing operations at step 3 attempt to work with non-existent data, making their behavior problematic. Even when file_name contains valid data, the final loop iteration behaves as described. This version fails because the read and processing operations follow the eof test.
(b) This version adds a read operation and reverses the operation order in the loop's body. Again, imagine that the file exists but is empty. The read operation at step 1 fails to read data and sets the eofbit to true. The eof function call at step 2 returns true, and the negation operator "flips" it to false, terminating the loop before it begins. If the file contains valid data, the program enters the loop, processing the data at step 3. The loop body ends with a read operation at step 4. If the read succeeds, it leaves the eofbit unchanged, and the test at step 2 continues the loop. If the read fails, it sets the eofbit, causing the test at step 2 to end the loop. This version succeeds because the eof test always follows a read: steps 1 and 2, or steps 4 and 2.
(a)
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    ifstream in("data.txt");
    char c;
    while (!in.eof())
    {
        in >> c;
        // in.get(c);    // alternate
        cout << '|' << c << '|' << endl;
    }
    return 0;
}
Output (a):
|a|
|b|
|c|
|d|
|d|
(b)
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    ifstream in("data.txt");
    while (!in.eof())
    {
        int c = in.get();
        cout << '|' << (char)c
             << '|' << endl;
    }
    return 0;
}
Output (b):
|a|
|b|
|c|
|d|
| |
(c)
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    ifstream in("data.txt");
    char c;
    do
    {
        in.get(c);
        cout << '|' << c << '|' << endl;
    }
    while (!in.eof());
    return 0;
}
Output (c):
|a|
|b|
|c|
|d|
|d|
The behavior of the eof function: three incorrect examples.
The example programs demonstrate the eof function's problematic behavior. The programs "process" the file data by echoing it to the console, surrounding it with '|' characters to illustrate the processing. The test data consists of a file with a single line of four characters: abcd. The first four iterations of each loop successfully read a character, leaving the eofbit unset (i.e., 0). The programs iterate one time too many, demonstrating the lag between the read operation setting the eofbit and the program detecting the change with the eof function.
(a) The extractor and the get function exhibit the same behavior.
(b) This version of the overloaded get function returns each character read from the file as an integer, which the program must cast to a char for output. When it fails to read a character on the fifth iteration, it returns the EOF constant, which has no printable character.
(c) The while-loop performs its test at the top, suggesting that switching to a test at the bottom of the loop might solve the problem. Although the do-while-loop moves its test to the bottom, it doesn't solve the problem because it doesn't correct the lag between setting and testing the eofbit.
Caution
The POSIX standard requires a newline at the end of text files, meaning they end with a blank line. The vi text editor and its derivatives (e.g., vim and gvim) transparently add the newline, while nano adds and displays the newline but won't remove it. The blank line in test data created on a POSIX system changes the illustrated output. Subsequent, more robust examples demonstrate how to deal with blank lines.
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    ifstream in("data2.txt");
    char line[100];
    while (!in.eof())
    {
        in.getline(line, 100);
        cout << '|' << line << '|' << endl;
    }
    return 0;
}
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
    ifstream in("data2.txt");
    string line;
    while (!in.eof())
    {
        getline(in, line);
        cout << '|' << line << '|' << endl;
    }
    return 0;
}
Input
See the quick
red fox jump
over the lazy
brown dog.
Output
|See the quick|
|red fox jump|
|over the lazy|
|brown dog.|
Line input and the eof function.
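The getline function reads a full line from the file, discarding the newline character at the end. Unlike the character input operations, it keeps reading until it finds a newline, so when the final line ends without one, getline reaches the file's end while reading that line and sets the eofbit, preventing an extraneous loop iteration. If the file does end with a newline, as the Caution describes, the loop runs one extra time and prints an empty line.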
The EOF Constant
#include <iostream>
#include <fstream>
#include <cstdio>       // defines EOF
using namespace std;
int main()
{
    ifstream in("data.txt");
    int c;
    while ((c = in.get()) != EOF)
        cout << '|' << (char)c << '|' << endl;
    return 0;
}
Fixing the character read problem.
Figure 3(b) suggests that an overloaded version of the get function returns a character (as an integer) or the EOF symbolic constant. The 3(b) example still fails because the program separates the read operation from the eof test driving the loop. Merging the read operation with the loop test solves the problem. This example nests the read operation within the loop test, forming an expression that tests the get function's return value rather than the eofbit. Each pair of parentheses plays a different role:
The outer pair is part of the while-loop syntax.
The middle pair, enclosing c = in.get(), are grouping parentheses.
The inner-most pair, get(), form an empty argument list for the get function.
The grouping parentheses force the get function call and assignment operation to run first, storing the returned character in variable c for processing later, and forcing the read operation to precede the test. The loop runs while the returned character is not equal to EOF, running four times without producing extraneous output.
operator bool
Programs often need to convert one data type to another, and C++ gives programmers two methods for implementing the automatic conversion. They use conversion constructors to convert a type to an instance of a class they "own." Ownership implies that a programmer can edit and incorporate a class into programs. However, the C++ compiler defines the fundamental types, and programmers cannot modify them. So, to convert an instance of their class to a fundamental type, they use a conversion operator. From the perspective of a programmer-owned class, "think of constructors pulling data into an object and operators pushing it out" (Constructor vs. Operator). This example demonstrates a conversion operator, defined by the stream classes, that converts a stream object into a Boolean value.
(a)
explicit operator bool() const;
(b)
istream& get(char& c);
(c)
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    ifstream in("data.txt");
    char c;
    while (in.get(c))
        cout << '|' << c << '|' << endl;
    return 0;
}
(d)
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    ifstream in("data.txt");
    char c;
    do {
        in.get(c);
        cout << '|' << c << '|' << endl;
    } while (in);
    return 0;
}
Using the stream operator bool.
In addition to the eofbit, stream objects also define two other error flags: failbit and badbit (goodbit is the fourth flag). A failing read or write operation sets the failbit. Specifically, on input, attempting to read past the end of a file sets this flag along with the eofbit. The badbit indicates the stream is corrupt and likely unable to continue processing. So, the stream's overloaded bool operator detects the end of the file indirectly: it returns the equivalent of !fail(), which is true only when neither the failbit nor the badbit is set.
(a) The prototype for operator bool as specified in the ios class. The "explicit" keyword prevents further implicit conversions (e.g., to other integral types). The program calls this operator automatically to "convert" an input stream to a Boolean value.
(b) Previous examples - Figure 3 (a) and (c) - using this overloaded version of get ignored the function's return value. However, the prototype shows that it returns a reference to the stream, which is crucial for understanding the example programs.
(c) The get function reads one character from the input file, stores it in the variable c, and returns a reference to the stream. The while test automatically calls the conversion operator on that reference, creating a Boolean-valued expression driving the loop. The operator returns false if either failbit or badbit is set; otherwise it returns true. This version works because the test follows the read.
(d) The while test calls the conversion operator directly on the stream object. However, this example suffers the same lag problem demonstrated in Figure 3 because the processing occurs between the read and the test.