14.2. File I/O Prerequisites


General-purpose operating systems (OS) provide a host environment for running programs. A significant component of the host environment is a file system consisting of files and directories stored in persistent secondary memory. Two variations between systems affect how C++ programs use the file system. The first is how the system organizes and names the files. The second is how they separate the lines in a line-oriented text file, which ultimately determines if the OS distinguishes between text and binary files. Understanding these differences is a prerequisite for writing effective programs utilizing file I/O.

File System Organization

From the formal definition of a file presented at the beginning of the chapter, we know that files have names, which programs use to access their contents. Most contemporary operating systems use one of two organizational conventions.

Figures: a Windows file system with the root, \, at the top and sub-directories below; a POSIX file system with the root, /, at the top and sub-directories below.

**Hierarchical or tree-structured file system**. It is convenient to view computer file systems as a hierarchy or tree. Each box in the figure represents a directory or folder. Unlike arborists, computer scientists organize their trees with the root at the top and the leaves at the bottom. The tree's root is denoted with a back-slash, `\`, on Windows systems and a forward-slash, `/`, on POSIX-compliant systems. A sub-directory may have any number of files and sub-directories, but every name within a directory must be unique. Standard directories in both organizations have the same or similar names based on their common use.

A fragment of the Windows file system

Part of a file system as it might appear on a Unix, Linux, or macOS system (collectively known as POSIX systems)

While executing, a program has a location or position in the file system tree, known as its *current working directory* (CWD). When the operating system starts a program, it assigns its current working directory to one of the directories in the computer's file system.
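
A program can ask the operating system for its current working directory. The following is a minimal sketch, assuming a C++17 compiler (for the `std::filesystem` library); the function name `working_directory` is only illustrative.

```cpp
#include <filesystem>
#include <string>

// Returns the program's current working directory as an absolute pathname.
// std::filesystem requires C++17 or later.
std::string working_directory()
{
    return std::filesystem::current_path().string();
}
```

Printing the returned string at the start of a program is a quick way to discover where relative pathnames will be resolved, which often differs between running from an IDE and running from a command line.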

Figure: A file system tree. The root has three subdirectories: etc, home, and tmp. home has two subdirectories: dilbert and alice. dilbert has three sub-directories: Documents, Music, and Downloads. The picture illustrates a single path from Music downwards: Neil Diamond, Classics, shilo.mp3.

**Absolute vs. relative pathnames**. Windows and POSIX systems allow absolute and relative pathnames for system utilities and application programs. In the illustration, blue boxes represent directories and the orange one is a file. Files are always tree leaves.

(a)	\Users\dilbert\Music\Neil Diamond\Classics\Shilo.mp3
	/home/dilbert/Music/Neil Diamond/Classics/Shilo.mp3
(b)	Music\Neil Diamond\Classics\Shilo.mp3
	Music/Neil Diamond/Classics/Shilo.mp3
(c)	..\dilbert\Music\Neil Diamond\Classics\Shilo.mp3
	../dilbert/Music/Neil Diamond/Classics/Shilo.mp3
(d)	Shilo.mp3
	Shilo.mp3

The path separator and pathname examples. The character separating directory and file names is called the path separator. Unix and similar POSIX-compliant systems use the forward-slash character, /, as the separator, while Windows generally uses the back-slash, \. However, while Windows still uses the back-slash character to report pathnames, it does accept the forward-slash for input, making it easier to move programs between different systems. Each example pair illustrates Windows (top) and POSIX (bottom) pathnames.

(a) Full or absolute pathname. POSIX systems use the / character as the root's name and as the path separator. Windows uses \ for both. Recall that \ is the escape character in C++ string literals, so programs must write each back-slash in a pathname as \\.
(b) A relative pathname that assumes that dilbert is the current working directory.
(c) A relative pathname that assumes that the current working directory is a sibling of dilbert, such as alice. (From Documents or Downloads, the pathname would instead begin with ..\Music or ../Music.)
(d) A relative pathname that assumes that Classics is the current working directory.
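
The escaping rule for back-slashes is easiest to see in source code. The sketch below shows three ways a program might spell the relative pathname from example (b); the variable names are illustrative only, and the raw string literal assumes C++11 or later.

```cpp
#include <string>

// Three spellings of the same relative pathname in C++ source code.

// Each \\ in the literal produces one back-slash in the string.
const std::string escaped = "Music\\Neil Diamond\\Classics\\Shilo.mp3";

// A raw string literal (C++11) turns off escape processing entirely.
const std::string raw = R"(Music\Neil Diamond\Classics\Shilo.mp3)";

// Forward-slashes need no escaping and are accepted by POSIX and Windows.
const std::string forward = "Music/Neil Diamond/Classics/Shilo.mp3";
```

The first two strings are identical at run time. The third differs only in its separators, which is why many programmers prefer it for code that must run on both kinds of systems.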

Two special directory names are often used with relative pathnames:

.. represents the parent of or one level up from the current working directory
. represents the current working directory

Whenever a program opens a file, the programmer must choose to use either an absolute or a relative pathname. If the program uses a relative name, the file's location is determined by the program's CWD at the time it opens the file. See Path for an interesting history of file system pathnames and additional detail.
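
To make the role of the CWD concrete, the following sketch combines a working directory with a relative pathname and simplifies the result. It assumes C++17 and uses a hypothetical helper named `resolve`. The resolution is purely lexical (it never consults the disk), so it only approximates what an operating system does when symbolic links are involved.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Joins a working directory and a relative pathname, then removes the
// "." and "name/.." elements. Nothing on disk is examined.
// The result always uses forward-slashes as the separator.
std::string resolve(const fs::path& cwd, const fs::path& relative)
{
    return (cwd / relative).lexically_normal().generic_string();
}
```

For example, `resolve("/home/alice", "../dilbert/Music/Shilo.mp3")` yields `/home/dilbert/Music/Shilo.mp3`: the `..` element cancels `alice`, and the rest of the relative name is followed downward from `/home`.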

The Line Terminator: Text and Binary Files

The formal definition of a file presented in the previous section asserts that "Data files may be numeric, alphabetic, alphanumeric, or binary." Numeric data is just a special case of binary data, and together, alphabetic and alphanumeric data form a more general class called textual data. These generalizations allow us to simplify one aspect of file I/O by focusing on two file types: textual and binary. We won't deal with textual and binary data combinations, but when a combination of data occurs in practice, the program typically treats it as a binary file. The distinction between binary and textual data centers around how a file system marks the separation between the lines in a line-oriented text file.

POSIX	Windows
See the quick\n red fox jump\n over the lazy\n brown dog.\n	See the quick\r\n red fox jump\r\n over the lazy\r\n brown dog.\r\n
(a)	(b)

Line terminator characters. POSIX systems terminate the lines of a line-oriented text file with a single linefeed or newline character, \n. Windows instead uses a two-character sequence, \r\n: a carriage return followed by a newline. Researchers originally developed the C and C++ programming languages on Unix systems using a single newline line terminator. (Classic Mac OS, before incorporating a Unix kernel, used a single \r as the line terminator.)

Programs that process files by searching for the line terminator are difficult to port between systems utilizing different line-terminator conventions. To alleviate the problem, C and C++ programs running on Windows systems map the \r\n characters to a single \n when reading text files, and perform the reverse mapping, \n to \r\n, when writing them. Unfortunately, the mapping causes another problem: the bytes in a binary file may have any one of 256 values, including values corresponding to the ASCII newline and carriage return characters. If a file contains binary data (e.g., an image or audio file), discarding or inserting a byte corrupts the data. To circumvent this problem, C and C++ allow programmers to specify a file's mode, text or binary, when opening it. POSIX systems don't distinguish between text and binary files and don't perform the mapping.
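
One way to observe the mapping is to write the same characters in each mode and count the bytes that reach the file. The following is a minimal sketch; the helper name `stored_size` and any file names passed to it are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Writes `text` to the file `name` using the given open mode, then returns
// the number of bytes stored in the file. The file is read back in binary
// mode so that no mapping hides the bytes that were actually written.
std::size_t stored_size(const std::string& name, const std::string& text,
                        std::ios::openmode mode)
{
    {
        std::ofstream out(name, mode | std::ios::out | std::ios::trunc);
        out << text;
    }   // the output stream closes here, flushing the data to the file

    std::ifstream in(name, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return bytes.size();
}
```

Writing the eight characters `"one\ntwo\n"` with `std::ios::binary` stores eight bytes on every system. Writing them with plain `std::ios::out` (text mode) also stores eight bytes on POSIX systems, but a Windows system is expected to store ten because each \n becomes \r\n.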

(a)	ifstream in("filename", ios::binary);
(b)	ifstream in;
	in.open("filename", ios::binary);

Opening a file in binary mode. Programmers choose how to open a file, in binary or text mode, with the ios::binary flag. Files not explicitly opened in binary mode are opened in text mode by default.

(a) Opening the file with a stream constructor
(b) Opening the file after constructing the stream object

It was suggested at the end of the last section that there are three common ways of accessing a file's contents. It's now appropriate to revisit these three ways in the context of text versus binary files:

bytes or characters: appropriate for both text and binary files
lines: only appropriate for text files
records or blocks: both text and binary but most appropriate for binary files
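
The three access styles listed above can be sketched with the standard stream functions `get`, `getline`, and `read`. The function names below are illustrative only, and error handling is omitted for brevity.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Byte-by-byte access: counts every byte in the file.
std::size_t count_bytes(const std::string& name)
{
    std::ifstream in(name, std::ios::binary);
    std::size_t count = 0;
    char c;
    while (in.get(c))
        ++count;
    return count;
}

// Line-by-line access: returns the lines without their terminators.
std::vector<std::string> read_lines(const std::string& name)
{
    std::ifstream in(name);     // text mode by default
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Block access: reads up to `size` bytes in a single operation.
std::vector<char> read_block(const std::string& name, std::size_t size)
{
    std::ifstream in(name, std::ios::binary);
    std::vector<char> block(size);
    in.read(block.data(), static_cast<std::streamsize>(size));
    block.resize(static_cast<std::size_t>(in.gcount()));    // keep only what was read
    return block;
}
```

Note that only `read_lines` opens its file in text mode. The other two functions open the file in binary mode so that every byte is delivered unchanged.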
