8.2. C-Strings

Review

C-Strings are character arrays with one additional feature: they mark the end of the string text with a special character called the null termination character. The null termination character or null terminator is the character equivalent of zero, which C++ denotes with the escape sequence '\0' (where the escaped character is the digit zero). In general, a special value that delimits data in a data structure is called a sentinel, so the null termination character is a specific example of a sentinel.

Using a null termination character to mark the end of textual data in a C-string implies that the maximum length of a C-string is one less than the size of the character array forming the C-string. C++ arrays, including those forming C-strings, are zero-indexed, so C-strings always begin at index location 0. The null terminator can appear anywhere in the array, partially filling it if the terminator is not the last array element. The C-string functions ignore all array elements following the null terminator.

The name of an array, without any trailing brackets, is the array's address. So, C++ often represents a C-string as a character pointer that points to an array. String constants or string literals, like "hello world", are also C-strings. When the compiler processes a string literal, it adds the null termination character at the end of the quoted characters, stores them in memory, and generates code based on the string's address. Note that the addresses of string literals and character arrays are constant, so programs cannot change them.

(a)
char s1[] = { 'H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '\0' };
String s1 is represented as a sequence of characters; the last character is the null terminator.

s1 is a C-string: a character array initialized with an aggregate initializer list (which must include the null terminator). The program can change the contents of the array (e.g., s1[0] = 'h';) but not the address represented by the name s1 (e.g., s1 = ...; is an error).

(b)
char s2[] = "Hello world";
char* s3 = s2;
String s2 is represented as a sequence of characters; the last character is the null terminator. s3 is a pointer that points to 'H', the first character in s2.

s2 is a C-string created by a character array initialized with a string literal. The compiler automatically adds the null terminator at the end of the literal and copies it to s2. s3 is a character pointer initialized to point to s2 (s2 is the name of an array without brackets, so it is an address). The program can change the contents of s2, but it cannot change the address represented by s2. The program can also change the contents of the array using s3 (e.g. s3[0] = 'h';), and it can also change s3 (e.g., s3 = s1;).

(c)
const char* s4 = "Hello world";
s4 is a pointer variable pointing to the beginning of the string literal "Hello world" represented by a sequence of characters. The compiler automatically adds the null terminator as the last character.

Modifying a string literal (such as "Hello world") has always been risky: the result is undefined behavior. Since the C++11 standard, pointers to literals must be const (some compilers still accept a non-const pointer with only a warning), which prevents the program from modifying the literal (e.g., s4[0] = 'h'; is a compile-time error). Programs may change the pointer itself: s4 = ...;.

(d)
char* s5 = new char[15] { 'H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' };
s5 is a pointer to the beginning of the array allocated with the new operator. The picture shows that the array is larger than the C-string, with unused elements following the null terminator.

C++ has always allowed programmers to create character arrays with the new operator, but until the C++11 standard, programmers could not use aggregate initializations with new. Notice that '\0' is not included in the initializer list; when the array size is part of the definition and is larger than the number of initializers, the compiler automatically adds the null termination character.

(e)
char* s6 = new char[15] { "Hello world" };
s6 is a pointer to the beginning of the array allocated with the new operator. The array is larger than the C-string, as shown by the unused elements following the null terminator.

Similar to (c), but the "const" keyword is not required. In (c), s4 points to a string literal (i.e., a string constant), but s6 points to memory allocated with the new operator and this memory is changeable.

(f)
char s7[15] = { 'E', 'x', 'a', 'm', 'p', 'l', 'e' };
String s7 is represented as a sequence of characters. The null terminator is in the middle of the array, so the elements after it are unused.

Compare to (a): when the size of the character array is part of the C-string definition and exceeds the number of initializers, the compiler fills the remaining elements with zeros, so the null termination character follows the text automatically. It's a compile-time error if the array size is less than the number of characters in the initializer list.

(g)
char s8[15] = "Example";
// or, equivalently (a program may contain only one definition of s8):
// char s8[15] = { "Example" };
String s8 is represented as a sequence of characters. The null terminator is in the middle of the array, so the elements after it are unused.

Similar to (b), but creates an array larger than the initializing string literal.

Defining and initializing C-strings. There are many ways of creating and initializing C-strings. Nevertheless, they all result in a similar organization in memory: a null-terminated sequence of characters stored at some address in memory. Programmers must take care to deallocate C-strings created with new (using the array form, delete[]) when they are no longer needed to avoid creating a memory leak.

(d), (e), (f), and (g) demonstrate that it is possible to have an array that is longer than the stored string. The null terminator marks the end of the data, and the C-string functions ignore the array elements following it. In these four examples, the initializer fills the unused elements with zeros; in general (e.g., after a shorter string overwrites a longer one), the elements following the terminator may hold leftover values. Each of these C-strings can hold a string 14 characters long plus the null termination character.

Take care when changing the contents of a C-string to not overflow it by adding characters beyond the end of the character array.

The previous examples notwithstanding, programmers can create empty character arrays that will later become C-strings. However, programmers shouldn't generally use the arrays as C-strings until they add the null terminator.

char s9[100];
char* s10 = new char[100];
Illustrates two character arrays as sequences of squares. The arrays are uninitialized, as shown by leaving squares empty. The illustration names the first array s9, while a rectangle, with an arrow pointing to the sequences of squares, represents s10.
Uninitialized character arrays. Programs can create character arrays as automatic variables on the stack and as dynamic variables on the heap. Without the null termination character, character arrays cannot act as C-strings. Passing an uninitialized character array to most C-string functions causes undefined behavior because the functions search for a terminator that may not exist. Two functions, getline (a member of input streams such as cin) and strcpy, can take a "raw" character array without a null terminator and make it into a C-string.

Text processing (i.e., manipulating strings) is a central task of many computer programs. For example, the compiler processing your programs reads a text file and somehow converts the text into a running program! We must do much more with strings beyond just defining and initializing them. The following sections explore how to print, read, and manipulate C-strings.

Caution

C++ arrays and C-strings are simple, fundamental data types. As such, programmers CANNOT use .length with C++ arrays or .length() with C-strings.