14.2. File I/O Prerequisites

Although the three concepts presented in this section are independent of one another, they each play a role in how a file opened and accessed with a stream object behaves. In that sense, these prerequisite concepts lay a foundation upon which the remainder of the chapter depends.

File System Organization

From the formal definition of a file presented at the beginning of the chapter, we also know that every file has a name and that we can access the file's contents by its name. But there are many ways to express the name of a file. Modern general-purpose operating systems organize files and directories (also known as folders) into a hierarchy or tree. Unlike arborists, computer scientists organize their trees with the root at the top and the leaves at the bottom. Any directory in the tree may contain many subdirectories or files, but every name in a directory must be unique.

[Figure: (a) a Windows file system with the root, \, at the top and sub-directories below; (b) a POSIX file system with the root, /, at the top and sub-directories below.]
Hierarchical or tree-structured file system. It is convenient to view the files on a computer as a hierarchy or tree with the root at the top. Each box in the figure represents a directory or folder. The top directory is called the root and denoted with a back-slash, \, on Windows computers and a forward-slash, /, on POSIX-compliant systems. A sub-directory may have any number of files and sub-directories.
  (a) A fragment of the Windows file system
  (b) Part of a file system as it might appear on a Unix, Linux, or macOS system (collectively known as POSIX systems)

A running program has a location or position in the file system tree, which is known as its current working directory (cwd). When the operating system runs a program, it sets the program's current working directory to one of the directories in the computer's file system. Most operating systems have two ways of naming or referring to a file in a program:

Absolute or Full
The absolute or full name of a file defines a path from the root of the file system down to the specific file. This name is independent of the current working directory and is always the same. The file name begins with the root directory name and each directory and file name is separated with the file name separator symbol, which is the same character used to name the root directory.
Relative
Relative names are so named because they are relative to the program's current working directory. Relative names may be just the file name or may include some directory names as well.
(a) Windows: \Users\dab\My Music\Shilo.mp3
    POSIX:   /home/dab/Music/Shilo.mp3
(b) Windows: My Music\Shilo.mp3
    POSIX:   Music/Shilo.mp3
(c) Windows: ..\dab\Shilo.mp3
    POSIX:   ../dab/Shilo.mp3
(d) Windows: Shilo.mp3
    POSIX:   Shilo.mp3
File path name examples. Each of these examples assumes that there is a file named "Shilo.mp3" located in the music directory of each file system illustrated in Figure 1. The examples in each category apply to a Windows system (top) and a Unix/Linux/macOS system (bottom).
  (a) Full or absolute path name. Windows uses the back-slash \ character as the name of root and as the file separator character. The other operating systems use the slash or forward-slash / character as root and the path separator.
  (b) A relative path name that assumes that "dab" is the current working directory.
  (c) A relative path name that assumes that the current working directory is either "dab" or "klb."
  (d) A relative path name that assumes that "My Music" (Windows) or "Music" (POSIX) is the current working directory.
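Two special directory names are often used with relative path names:

. (dot)
A single period names the current directory itself.
.. (dot-dot)
Two periods name the parent directory, the directory one level closer to the root.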

Whenever we open a file, we must choose to use either a full or a relative pathname. If we choose to use a relative name, then the file's location is determined by the running program's current working directory when the file is accessed. See Path for an interesting history of file system pathnames and additional detail.
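The following is a minimal sketch of how a relative name is combined with the current working directory. It assumes a C++17 compiler and uses the standard <filesystem> library; the helper functions resolve and kind are hypothetical names introduced only for this illustration.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: interpret a path name the way it is interpreted when
// a program opens a file. An absolute name is returned unchanged; a relative
// name is appended to the program's current working directory.
fs::path resolve(const fs::path& name)
{
    if (name.is_absolute())
        return name;
    return fs::current_path() / name;
}

// Hypothetical helper: classify a path name as "absolute" or "relative".
std::string kind(const fs::path& name)
{
    return name.is_absolute() ? "absolute" : "relative";
}
```

On a POSIX system, calling resolve("Music/Shilo.mp3") from a program whose current working directory is /home/dab would produce /home/dab/Music/Shilo.mp3. Running the same program from a different directory produces a different result, which is why relative names depend on where the program runs.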

Flags and Bitwise Operators

The C++ input/output system uses bit-vectors, integer-like values in which each bit is significant, to control various aspects of how stream objects behave: how output is formatted, how input is interpreted, or how a file is opened. Each bit in a bit-vector is called a flag and represents a specific behavior or formatting feature. A flag set to 1 indicates that a feature is switched on or is active, while a flag set to 0 indicates that the feature is switched off or is inactive. The I/O system also defines pseudo data types to represent bit-vectors and bit-masks: fmtflags for formatting flags and openmode for the flags that control how a file is opened. (The C++ standard leaves the underlying representation of these types to the implementation; the examples in this section show them as 32-bit values.) Bit-masks are constant values that represent bit patterns, or, in this context, individual flags. The individual bits or flags in a fmtflags variable may be set (set to 1) and unset (set to 0) with I/O system functions or directly with bitwise operations on the variable.

Bitwise-OR
a b a | b
0 0 0
0 1 1
1 0 1
1 1 1
  1100
| 1001
------
  1101
An image depicting bit-masks as a grate through which each bit must pass. Slots in the grate are formed by 0's or 1's. For bitwise OR, 0's represent open slots in the grate that allow the bits to pass through unmodified, while 1's always output a 1 regardless of the input value.
(a)(b)
Bitwise-AND
a b a & b
0 0 0
0 1 0
1 0 0
1 1 1
  1100
& 1001
------
  1000
An image depicting bit-masks as a grate through which each bit must pass. Slots in the grate are formed by 0's or 1's. For bitwise AND, 1's represent open slots in the grate that allow the bits to pass through unmodified, while 0's switch bits off, always outputting a 0 regardless of the input.
(c)(d)
Bitwise operations are used to switch the bits or flags in a bit-vector on and off.
  (a) The bitwise-OR truth table: both operands must be 0 to produce a 0; any other combination produces a 1.
  (b) The bitwise-OR operator, |, is used to switch on some bits in a bit-vector (data in the diagram), activating I/O behaviors and formatting options.
  (c) The bitwise-AND truth table: both operands must be a 1 to produce a 1; any other combination produces a 0.
  (d) The bitwise-AND operator, &, is used to mask out some bits in a bit-vector (data in the diagram) to identify which bits are set.
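The two worked examples in the figure can be checked with a few lines of code. The sketch below assumes a C++14 compiler (for binary literals); the function names switch_on, switch_off, mask_out, and is_set are hypothetical helpers written for this illustration and are not part of the I/O library.

```cpp
#include <cstdint>

// Switch on every bit of 'data' that is 1 in 'mask'; other bits are unchanged.
constexpr std::uint32_t switch_on(std::uint32_t data, std::uint32_t mask)
{
    return data | mask;
}

// Switch off every bit of 'data' that is 1 in 'mask'; other bits are unchanged.
constexpr std::uint32_t switch_off(std::uint32_t data, std::uint32_t mask)
{
    return data & ~mask;
}

// Keep only the bits of 'data' that are 1 in 'mask'; all others become 0.
constexpr std::uint32_t mask_out(std::uint32_t data, std::uint32_t mask)
{
    return data & mask;
}

// Report whether any bit selected by 'mask' is set in 'data'.
constexpr bool is_set(std::uint32_t data, std::uint32_t mask)
{
    return (data & mask) != 0;
}

// The bitwise-OR and bitwise-AND examples from the figure, checked at compile time.
static_assert(switch_on(0b1100u, 0b1001u) == 0b1101u, "bitwise-OR example");
static_assert(mask_out(0b1100u, 0b1001u) == 0b1000u, "bitwise-AND example");
```

Note that switching a bit off combines the bitwise-AND with the complement operator, ~, which inverts the mask so that the selected bits become the closed slots of the grate.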

The I/O system provides several named bit-masks to make it easier to work with the I/O flags. The constants are typically accessed through the ios class (even though they are defined in its base class, ios_base). The scope resolution operator, ::, ties the class name ios on the left to the name of a bit-mask or symbolic constant on the right. Four file I/O bit-masks illustrate how bit-masks are used:

     Bit-Mask                                                      Purpose
(a)  ios::in     = 0x01 = 00000000000000000000000000000001         Open the file for input / reading
     ios::out    = 0x02 = 00000000000000000000000000000010         Open the file for output / writing
     ios::app    = 0x08 = 00000000000000000000000000001000         Append new data at the end of the file
     ios::binary = 0x20 = 00000000000000000000000000100000         Treat the file contents as binary data
(b)  ios::openmode modes = ios::in | ios::out | ios::app | ios::binary;   Combine all of the behaviors
(c)  modes       = 0x2B = 00000000000000000000000000101011
Creating bit-vectors from bit-masks. The numeric values shown are illustrative; the C++ standard leaves the actual values to the implementation.
  (a) The value, in hexadecimal and binary, of four I/O bit-masks; binary files are described below
  (b) Combining the individual bit-masks with the bitwise-OR to create a bit-vector with multiple bits set to 1; ios::openmode is the I/O system's type for file-opening flags
  (c) The value of the bit-vector; each 1-bit represents a stream behavior that is activated or enabled
fstream file(file_name, modes);
if (modes & ios::binary)
{
	// process the file as binary
}

if (modes & ios::app)
{
	// append new data at the end of the file
}
(a)(b)
Using bit-vectors.
  (a) The modes bit-vector (from the figure above) may be used to create a file stream object and open a file with the four behaviors described above. The fstream class and opening files are described in the next section.
  (b) The library code that opens and processes files may contain code similar to the if-statements illustrated here. Each if-statement tests to see if a behavior is active and then conditionally executes the appropriate code if it is. The bitwise AND operation switches off all of the bits except the one corresponding to the bit-mask being tested. The magnitude of the test expression is not important; all that is important is that the expression is not 0. If ios::binary had been left out of modes, then the expression modes & ios::binary would produce a 0 and the code for performing binary I/O would be skipped.
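As a small, hedged illustration of the same test, the sketch below reports which of the four behaviors are active in a bit-vector. The function describe is a hypothetical helper, not a library function. It compares the masked result with the mask itself rather than with 0, which gives the same answer for a single-bit mask and does not depend on how an implementation represents ios::openmode.

```cpp
#include <ios>
#include <string>

// Hypothetical helper: list the names of the open-mode flags set in 'modes'.
std::string describe(std::ios::openmode modes)
{
    std::string names;

    // Append 'name' when the bit selected by 'mask' is set in 'modes'.
    auto add = [&](std::ios::openmode mask, const char* name) {
        if ((modes & mask) == mask) {
            if (!names.empty())
                names += ' ';
            names += name;
        }
    };

    add(std::ios::in, "in");
    add(std::ios::out, "out");
    add(std::ios::app, "app");
    add(std::ios::binary, "binary");
    return names;
}
```

For example, describe(std::ios::in | std::ios::binary) returns the string "in binary", because only those two bits survive their respective masks.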

Bit-masks are used during the construction of stream objects, with the open function, with the setiosflags and resetiosflags manipulators, and with the setf and unsetf functions. A list of bit-mask constants may be found in the I/O Summary at the end of the chapter. The complete set of bitwise operators was presented in the Chapter 3 Supplemental section.
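The sketch below shows two of these uses side by side: the setf and unsetf member functions, and the setiosflags and resetiosflags manipulators. It formats a number in hexadecimal with an in-memory string stream so that the result is easy to inspect. The function names to_hex and to_hex_upper are hypothetical and exist only for this example.

```cpp
#include <iomanip>
#include <ios>
#include <sstream>
#include <string>

// Format 'value' in hexadecimal, with or without the base prefix, by
// setting and unsetting formatting flags with stream member functions.
std::string to_hex(int value, bool show_base)
{
    std::ostringstream out;
    // The two-argument setf first clears the bits named by the second
    // argument (the number-base group) and then sets the hex bit.
    out.setf(std::ios::hex, std::ios::basefield);
    if (show_base)
        out.setf(std::ios::showbase);
    else
        out.unsetf(std::ios::showbase);
    out << value;
    return out.str();
}

// The same idea with manipulators: several bit-masks are combined with
// the bitwise-OR and switched on in a single step.
std::string to_hex_upper(int value)
{
    std::ostringstream out;
    out << std::resetiosflags(std::ios::basefield)
        << std::setiosflags(std::ios::hex | std::ios::showbase | std::ios::uppercase)
        << value;
    return out.str();
}
```

With these definitions, to_hex(43, true) yields "0x2b", to_hex(43, false) yields "2b", and to_hex_upper(43) yields "0X2B".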

Text and Binary Files

The formal definition also explains that "Data files may be numeric, alphabetic, alphanumeric, or binary." Numeric data is just a special case of binary data, and together, alphabetic and alphanumeric data form a more general class called textual data. So, for the remainder of this chapter, we focus our attention on two broad data classifications: textual and binary. We won't deal with textual and binary data combinations, but when a combination of data occurs in practice, the program typically treats it as a binary file.

The modern use of Unicode and wide (2-byte) characters to represent textual data has somewhat blurred the distinction between text and binary files. Nevertheless, the distinction remains important, and to help us understand the difference between text and binary files, we'll restrict our discussion to the older single-byte representation of textual data based on the ASCII encoding scheme.

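The ASCII encoding scheme represents characters as the lower 7-bits of an 8-bit byte or character; the highest bit is always 0. This encoding can represent 128 characters with numeric values that range from 0 to 127. The lowest 32 characters or values (0 to 31) are control characters. Two control characters play an essential role in text files:

Line feed (LF)
Written as the escape sequence \n in C++ and encoded as the value 10.
Carriage return (CR)
Written as the escape sequence \r in C++ and encoded as the value 13.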

These characters are used in text files to mark the end of one line of text and the beginning of the next line (the characters are called line separators or line terminators). The C and C++ programming languages originated on Unix systems. Unix uses a single line feed character, which it calls a newline, to separate the lines in a text file. The other POSIX systems (Linux and macOS) also adopt this convention. Windows, in contrast, uses a two-character sequence, \r\n, to separate the lines in a text file. (Classic Mac OS, before incorporating a Unix kernel, used a single \r as the line separator.)

It's difficult to port (move from one system to another) a program when the systems utilize different characters for the line separator. The C programming language added the distinction between text and binary files to help solve the problem created by having different line separator characters. When a program opens a file in text mode on a Windows system, the \r\n sequence is mapped or converted into a single \n character when read from a file, and the single \n character is converted into a \r\n sequence when written to a file. POSIX systems don't need any mapping, and text mode does not affect file I/O.

Having the system automatically convert between \r\n and \n greatly simplifies the task of porting programs (at the source code level) from one operating system to another. But it also introduces another problem. Some files contain numeric rather than textual data. You are undoubtedly familiar with some of these numeric files: JPG, GIF, MP3, EXE, etc. Remember that text files consist of characters that are encoded as short, 1-byte integers; specifically, \n and \r are encoded as the values 10 and 13, respectively. A binary file may contain bytes with these same values that do not represent line separators. Altering these values in a truly binary file will corrupt the data, resulting in an unviewable image or unplayable audio or video file. We must open binary files in binary mode to prevent the character mapping that will lead to data corruption.

ifstream in("filename", ios::binary);
ifstream in;
in.open("filename", ios::binary);
(a)(b)
Opening a file in binary mode. Note that files that are not explicitly opened in binary mode are opened in text mode by default.
  (a) Opening the file with a stream constructor
  (b) Opening the file after constructing the stream object
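One way to observe the effect of the two modes is to write the same characters to a file in each mode and then count the bytes actually stored. The following is a sketch under stated assumptions: write_file and stored_bytes are hypothetical helpers, and the file names passed to them are whatever scratch files we choose.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: write 'text' to the file named 'path', in binary
// mode when 'binary' is true and in the default text mode otherwise.
void write_file(const std::string& path, const std::string& text, bool binary)
{
    std::ios::openmode modes = std::ios::out | std::ios::trunc;
    if (binary)
        modes |= std::ios::binary;   // switch on the binary flag
    std::ofstream out(path, modes);
    out << text;
}   // the stream's destructor flushes and closes the file

// Hypothetical helper: count the bytes stored in the file by reading it
// in binary mode, so that no line-separator mapping takes place.
std::size_t stored_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::size_t count = 0;
    char c;
    while (in.get(c))
        ++count;
    return count;
}
```

Writing the four characters "a\nb\n" in binary mode stores exactly four bytes on every system. Writing them in text mode also stores four bytes on a POSIX system, but on Windows we would expect six, because each \n is written as the two-character sequence \r\n.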

It was suggested at the end of the last section that there are three common ways of accessing a file's contents. It's now appropriate to revisit these three ways in the context of text versus binary files:

  1. bytes or characters: appropriate for both text and binary files
  2. lines: only appropriate for text files
  3. records or blocks: both text and binary but most appropriate for binary files
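The three access styles can be sketched with three short functions, each of which works on any input stream, including a file opened with ifstream. This is only an illustration: the Record structure and the count_bytes, count_lines, and count_records functions are hypothetical names, and the record layout is an assumption made for the example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // only needed to try the functions on in-memory data
#include <string>

// A hypothetical fixed-size record, used only for this illustration.
struct Record
{
    int  id;
    char code[4];
};

// 1. Bytes or characters: read one character at a time.
std::size_t count_bytes(std::istream& in)
{
    std::size_t count = 0;
    char c;
    while (in.get(c))
        ++count;
    return count;
}

// 2. Lines: read up to the next line separator (text files only).
std::size_t count_lines(std::istream& in)
{
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line))
        ++count;
    return count;
}

// 3. Records or blocks: read a fixed number of bytes at a time.
std::size_t count_records(std::istream& in)
{
    std::size_t count = 0;
    Record r;
    while (in.read(reinterpret_cast<char*>(&r), sizeof r))
        ++count;
    return count;
}
```

Each loop ends when its read operation fails, which happens at the end of the data. A stream that holds the text "ab\ncd\n" therefore contains six characters and two lines.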