14.2. File I/O Prerequisites

General-purpose operating systems (OS) provide a host environment for running programs. A significant component of the host environment is a file system consisting of files and directories stored in persistent secondary memory. Two variations between systems affect how C++ programs use the file system. The first is how the system organizes and names its files. The second is how the system separates the lines in a line-oriented text file, which ultimately determines whether the OS distinguishes between text and binary files. Understanding these differences is a prerequisite for writing effective programs that use file I/O.

File System Organization

From the formal definition of a file presented at the beginning of the chapter, we know that files have names, which programs use to access their contents. Most contemporary operating systems use one of two organizational conventions.

(a) A Windows file system with the root, \, at the top, and sub-directories below.
(b) A POSIX file system with the root, /, at the top, and sub-directories below.
Hierarchical or tree-structured file system. It is convenient to view computer file systems as a hierarchy or tree. Each box in the figure represents a directory or folder. Unlike arborists, computer scientists organize their trees with the root at the top and the leaves at the bottom. The tree's root is denoted with a back-slash, \, on Windows systems and a forward-slash, /, on POSIX-compliant systems. A sub-directory may have any number of files and sub-directories, but every name within a directory must be unique. Standard directories in both organizations have the same or similar names based on their common use.
  (a) A fragment of the Windows file system
  (b) Part of a file system as it might appear on a Unix, Linux, or macOS system (collectively known as POSIX systems)
While executing, a program has a location or position in the file system tree, known as its current working directory (CWD). When the operating system starts a program, it assigns its current working directory to one of the directories in the computer's file system.
A file system tree. The root has three subdirectories: etc, home, and tmp. home has two subdirectories: dilbert and alice. dilbert has three sub-directories: Documents, Music, and Downloads. The picture illustrates a single path from Music downwards: Neil Diamond, Classics, shilo.mp3.
Absolute or Full
An absolute or full pathname always begins with the root and defines a unique path from the root down to a file or directory.
  • The path from the root (a) to the mp3 file (d) is a unique, absolute path. The file's name is independent of a program's CWD: regardless of the program's CWD, the absolute pathname always refers to the same file.
  • Absolute pathnames on Windows systems may include a drive letter.
Relative
Relative pathnames are interpreted relative to the program's current working directory. A relative pathname may be just the file name or may include some directory names as well.
  • The path from Music (b) to the mp3 file only works if dilbert is the program's CWD.
  • The pathname syntax (illustrated in the following figure) supports a CWD below a fork in the tree - the pathname can ascend the tree one or more levels and descend a different path to a file or directory - e.g., (c) to (d).
  • A file name (d), without preceding directory structure, is also a valid pathname.
Absolute vs. relative pathnames. Windows and POSIX systems allow absolute and relative pathnames for system utilities and application programs. In the illustration, blue boxes represent directories and the orange one is a file. Files are always tree leaves.
(a) \Users\dilbert\Music\Neil Diamond\Classics\Shilo.mp3
    /home/dilbert/Music/Neil Diamond/Classics/Shilo.mp3
(b) Music\Neil Diamond\Classics\Shilo.mp3
    Music/Neil Diamond/Classics/Shilo.mp3
(c) ..\dilbert\Music\Neil Diamond\Classics\Shilo.mp3
    ../dilbert/Music/Neil Diamond/Classics/Shilo.mp3
(d) Shilo.mp3
    Shilo.mp3
The path separator and pathname examples. The character separating directory and file names is called the path separator. Unix and similar POSIX-compliant systems use the forward-slash character, /, as the separator, while Windows traditionally uses the back-slash, \. Although Windows still reports pathnames with the back-slash character, it also accepts the forward-slash for input. Each example pair illustrates Windows (top) and POSIX (bottom) pathnames.
  (a) Full or absolute pathname. Windows uses the \ character as the root's name and as the path separator. POSIX systems use the /.
  (b) A relative pathname that assumes that dilbert is the current working directory.
  (c) A relative pathname that ascends one level with .. and then descends through dilbert, so it assumes that the current working directory is a sibling of dilbert, such as alice. (From Documents or Downloads, the equivalent pathname omits the dilbert component: ../Music/Neil Diamond/Classics/Shilo.mp3.)
  (d) A relative pathname that assumes that Classics is the current working directory.
Two special directory names are often used with relative pathnames: a single period, ., names the current directory, and two periods, .., name its parent directory.
Whenever a program opens a file, the programmer must choose either an absolute or a relative pathname. If the program uses a relative name, the file's location is determined by the program's CWD at the moment it opens the file. See Path for an interesting history of file system pathnames and additional detail.

The Line Separator: Text and Binary Files

The formal definition of a file presented in the previous section asserts that "Data files may be numeric, alphabetic, alphanumeric, or binary." Numeric data is just a special case of binary data, and together, alphabetic and alphanumeric data form a more general class called textual data. These generalizations allow us to simplify one aspect of file I/O by focusing on two file types: textual and binary. We won't deal with files that combine textual and binary data, but when such a combination occurs in practice, the program typically treats the file as binary. The distinction between binary and textual data centers on how a file system marks the separation between the lines in a line-oriented text file.

(a) POSIX
See the quick\n
red fox jump\n
over the lazy\n
brown dog.\n
(b) Windows
See the quick\r\n
red fox jump\r\n
over the lazy\r\n
brown dog.\r\n
Line separator characters. POSIX systems separate the lines of a line-oriented text file with a single linefeed or newline character, \n. Windows instead uses a two-character sequence, \r\n: a carriage return followed by a newline. Researchers originally developed the C and C++ programming languages on Unix systems using a single newline as the line separator. (Classic Mac OS, before Apple adopted a Unix-based kernel, used a single \r as the line separator.)

Programs that process files by searching for the line separator are difficult to port between systems that use different line-separator conventions. To alleviate the problem, C and C++ programs running on Windows systems map the \r\n characters to a single \n when reading text files, and perform the reverse mapping, \n to \r\n, when writing them. Unfortunately, the mapping causes another problem: the bytes in a binary file may have any one of 256 values, including the values corresponding to the ASCII newline and carriage return characters. If a file contains binary data (e.g., an image or audio file), discarding or inserting a byte corrupts the data. To circumvent this problem, C and C++ allow programmers to specify a file's mode, text or binary, when opening it. POSIX systems don't distinguish between text and binary files and don't perform the mapping.

(a) ifstream in("filename", ios::binary);
(b) ifstream in;
    in.open("filename", ios::binary);
Opening a file in binary mode. Programmers choose how to open a file, in binary or text mode, with the ios::binary flag. Files not explicitly opened in binary mode are opened in text mode by default.
  (a) Opening the file with a stream constructor
  (b) Opening the file after constructing the stream object

It was suggested at the end of the last section that there are three common ways of accessing a file's contents. It's now appropriate to revisit these three ways in the context of text versus binary files:

  1. bytes or characters: appropriate for both text and binary files
  2. lines: only appropriate for text files
  3. records or blocks: both text and binary but most appropriate for binary files