8.2.2.5. Searching In C-Strings

Time: 00:09:18 | Download: Large, Large (CC), Small | Streaming, Streaming (CC) | Slides (PDF)
Review

This section describes and demonstrates five C-string functions:

Programmers use these functions less often than those detailed previously. Consequently, you needn't memorize them; many instructors may not assign this section. Nevertheless, I've found them useful over the years and include them for completeness.
A picture of a C-string named target and storing the text 'HELLO WORLD'. Two character pointers, s1 and s2, point to characters in the string. s1 points to the left-most 'L' and s2 points to the right-most 'L'.
Searching for a character in a C-String. strchr searches for the character c in the target string, target. The function searches target from left-to-right - low index to high or the forward direction in English. Similarly, strrchr searches for a character in target but from right-to-left - high index to low or the reverse direction in English. (Think of the second "r" as standing for "reverse" or "right-to-left," as is most convenient for you and your written language.) The functions consider the null-termination character part of the target string, so programmers can use them to locate the string's end. Both functions return a character-pointer (i.e., a C-string) to the searched-for character if found or nullptr if not. The C-string library includes two overloaded versions of both functions. The constant versions prevent the client from modifying the target through the returned pointer.
Header File:
#define <cstring>
Prototypes:
char* strchr(char* target, int c);
const char* strchr(const char* target, int c);
char* strrchr(char* target, int c);
const char* strrchr(const char* target, int c);
See strchr and strrchr for more information.
const char* target = "HELLO WORLD";
const char* s1 = strchr(target, 'L');
const char* s2 = strrchr(target, 'L');

if (s1 != nullptr)
	cout << s1 << endl;
if (s2 != nullptr)
	cout << s2 << endl;
Output:
LLO WORLD
LD
A picture of a C-string named target and storing the text 'HELLO WORLD'. A character pointer (i.e., a C-string), s3, points to the character W, the beginning of the substring, in target.
Searching for a sub-string in a C-Strings. strstr searches the target string, target, left-to-right for the substring sub. It returns a character-pointer (i.e., a C-string) to the sub-string's beginning if found or nullptr if not. The C-string library has two overloaded sub-string searching functions. The second version prevents the client program from modifying the target through the returned pointer.
Header File:
#define <cstring>
Prototypes:
char* strstr(char* target, const char* sub);
const char* strstr(const char* target, const char* sub);
See strstr for more information.
const char* target = "HELLO WORLD";
const char* s3 = strstr(target, "WORLD");

if (s3 != nullptr)
	cout << s3 << endl;
Output:
WORLD

Parsing: strtok

Parsing (also known as tokenizing) a string is a common data processing task. Parsing assumes that some groups of characters within a string, called tokens, have meaning in some context and that other characters, called delimiters, separate the groups. Programmers parse or tokenize a string by scanning it, looking for the delimiters, and separating the individual tokens. Programmers can use the C-string library function strtok (for string tokenizer) to help them complete simple parsing tasks.

Header File:
#define <cstring>
Standard Prototype:
char* strtok(char* target, const char* delims);
Please see strtok for more information
Microsoft Prototype:
char* strtok_s(char* target, const char* delims, char** context);
Please see strtok_s for more information
Linux Prototype:
char* strtok_r(char* target, const char* delims, char** context);
Please see strtok_r for more information
  1. strtok(target, delims) 
  2. strtok(nullptr, delims)
(a)(b)
The strtok function. The target and delims parameters of the standard and Microsoft strtok function are the same: the parsing target and a set of one or more delimiters. While the delimiter-string may contain many delimiter characters, it only takes one to separate two tokens. Unlike most other C-string functions, strtok is destructive: it always alters the target string (illustrated in the next figure). Furthermore, the specific delimiting character is lost.
  1. strtok scans target from left-to-right. Whenever strtok finds a delimiter, it replaces it with a null-termination character and returns the extracted token. The function returns nullptr after extracting all the tokens.
  2. Programmers call strtok in two different ways. They begin parsing target with call i and continue parsing (extracting the remaining tokens) call ii in a loop. We can see the basic logic with pseudocode fragments:
    token = strtok(target, delims);
    while (token != nullptr)
    {
    	// process token
    	token = strtok(nullptr, delims);
    }
    token = strtok(target, delims);
    do
    {
    	// process token
    	token = strtok(nullptr, delims);
    } while (token != nullptr);
    for (token = strtok(target, delims; token != nullpter; token = strtok(nullptr, delims)))
    	// process token
    Programmers typically put the second strtok call in a loop, but they can write individual strtok calls when the problem calls for them. Furthermore, each strtok call can have a different delimiter set. Programmers can change delims between calls if the call is in a loop.
An illustration of a C-string parsed by strtok. The original C-string has five words separated by four different delimiters: 'See,the quick;red:fox' (note that the space is a delimiter). The first time the program calls strok, it replaces the comma with a null-termination character and returns a pointer to 'See'. The second call replaces the space with the null terminator and returns a pointer to 'the'. The third call replaces the semicolon and returns a pointer 'quick'. The next call replaces the colon and returns 'red'. The next to last call doesn't replace any characters in the target but returns a pointer to 'fox'. The last call returns nullptr.
Parsing illustrated. In this example, English words form the tokens, and the punctuation and space between the words are the delimiters. The example shows how the parsed string, target, changes during each call. strtok returns a pointer, represented by t, to each token. For this example:
char target[100] = "See,the quick;red:fox";
const char* delims = " ,:;";
The order of the delimiters is not significant. When the delimiters include a space, I like to put it at the beginning of the delimiter string because I think it is a little easier to see there.
  1. The original C-string before parsing begins.
  2. The initial function call, char* t = strtok(target, delims);, replaces the comma with a null-termination character and returns a pointer to See.
  3. The remaining calls, (c) - (f), are t = strtok(nullptr, delims);.
  4. Replaces the space with a null-terminator and returns a pointer to the.
  5. Replaces the semicolon with a null-terminator and returns a pointer to quick.
  6. Replaces the colon with a null-terminator and returns a pointer to red.
  7. Does not replace any characters in the original string but returns a pointer to fox.
  8. All tokens have been found, so strtok returns nullptr, ending the loop.
char* strtok(char* target, const char* delims)
{
	static char* context = nullptr;

	if (target != nullptr)
		context = target;

	. . .
}
Implementing strtok. Programs typically call strtok repeatedly until they have extracted all the tokens in a string. So, strtok must "remember" where to resume parsing between calls. It uses a local static pointer variable to save the current parsing position between function calls. The code fragment illustrates one possible way to implement this behavior.

Standard Examples

#include <iostream>
#include <cstring>
using namespace std;

int main()
{
	char target[100] = "See,the quick;red:fox";
	const char* delims = " ,:;";

	for (char* token = strtok(target, delims); token; token = strtok(nullptr, delims))
		cout << token << endl;

	//char* token = strtok(target, delims);		// for while and do-while versions

	/*while (token)
	{
		cout << token << endl;
		token = strtok(nullptr, delims);
	}*/

	/*do
	{
		cout << token << endl;
		token = strtok(nullptr, delims);
	} while (token);*/

	return 0;
}
Standard strtok function example. The three versions demonstrated here parallel the examples in Figure 3 but discard the explicit test for nullptr. Recall that C++ treats treats numeric values, including pointers, as alternate representations of Boolean values. So, the expression nullptr is equivalent to token != nullptr. Furthermore, we can't make target a const variable because that wouldn't allow strtok to replace the delimiters with the null-termination character.

Microsoft And Linux Example

The standard strtok function relies on a local static variable (a variable defined in the strtok function) to save the current parsing location within the target. So, if the program begins parsing a second string before finishing the first, the function loses the parse position in the first string. That is, programmers can't begin parsing string-A, switch to string-B, then return to string-A. The Microsoft and Linux versions eliminate this restriction by replacing the local static variable with the context argument.

#include <iostream>
#include <cstring>
using namespace std;

int main()
{
    const char* delims = " :";
    char target1[100] = "See the quick red fox";
    char target2[100] = "jump:over:the:lazy:brown:dog";
    char* context1 = nullptr;
    char* context2 = nullptr;

    char* token1 = strtok_s(target1, delims, &context1);
    char* token2 = strtok_s(target2, delims, &context2);

    while (token1 != nullptr || token2 != nullptr)
    {
        if (token1 != nullptr)
            cout << token1 << endl;
        token1 = strtok_s(nullptr, delims, &context1);

        if (token2 != nullptr)
            cout << token2 << endl;
        token2 = strtok_s(nullptr, delims, &context2);
    }

    return 0;
}
#include <iostream>
#include <cstring>
using namespace std;

int main()
{
    const char* delims = " :";
    char target1[100] = "See the quick red fox";
    char target2[100] = "jump:over:the:lazy:brown:dog";
    char* context1 = nullptr;
    char* context2 = nullptr;

    char* token1 = strtok_r(target1, delims, &context1);
    char* token2 = strtok_r(target2, delims, &context2);

    while (token1 != nullptr || token2 != nullptr)
    {
        if (token1 != nullptr)
            cout << token1 << endl;
        token1 = strtok_r(nullptr, delims, &context1);

        if (token2 != nullptr)
            cout << token2 << endl;
        token2 = strtok_r(nullptr, delims, &context2);
    }

    return 0;
}
MicrosoftLinux
strtok_s and strtok_r examples. The "_s" at the end of the Microsoft strtok_s function suggests that the function is "secure." The "_r" at the end of the Linux strtok_r describes the function as "reentrant." Both functions replace strtok's local variable with a third parameter. Figure 3 names the parameter context, but programmers are free to choose any appropriate name. context performs the same task as the local static variable by saving the current parsing location. However, defining context in the application program's scope allows the application to parse multiple strings simultaneously without conflict.

Aside from the slightly different function names, both programs are identical. context is a C-string - a character-pointer: char*. However, strtok_s and strtok_r modify it, so the program must pass it with an INOUT mechanism and use pass by pointer. So, the strtok_s and strtok_r prototypes presented in Figure 3 show the type as a pointer-to-a-pointer: char**, and the program finds the variable's address with the address-of operator (shown in red).