8.2.2.5. Searching In C-Strings

Time: 00:09:18 | Download: Large, Large (CC), Small | Streaming, Streaming (CC) | Slides: PDF, PPTX
Review

Searching in a target string to locate a specific character or substring is a task encountered in some text-processing programs. This section describes and demonstrates four C-string functions and two variants that provide various searching options.

A picture of a C-string named target and storing the text 'HELLO WORLD'. Two character pointers, s1 and s2, point to characters in the string. s1 points to the left-most 'L' and s2 points to the right-most 'L'.
Searching a C-String for a character. strchr searches for the character c in the target string, target. The function searches target from left-to-right (low index to high), or in the forward English reading direction. Similarly, strrchr searches for a character in target but from right-to-left (high index to low), or the reverse English reading direction. (Think of the second "r" as standing for "reverse" or "right-to-left," as is appropriate for your written language.) The functions consider the null-termination character part of the target string, allowing them to locate the string's end. Both functions return a character-pointer (i.e., a C-string) to the searched-for character if found, or nullptr if not. The C-string library includes two overloaded versions of both functions, with the constant versions preventing the client from modifying the target through the returned pointer.
Header File:
#define <cstring>
Prototypes:
char* strchr(char* target, int c);
const char* strchr(const char* target, int c);
char* strrchr(char* target, int c);
const char* strrchr(const char* target, int c);
See strchr and strrchr for more information.
const char* target = "HELLO WORLD";
const char* s1 = strchr(target, 'L');
const char* s2 = strrchr(target, 'L');

if (s1 != nullptr)
	cout << s1 << endl;
if (s2 != nullptr)
	cout << s2 << endl;
Output:
LLO WORLD
LD
A picture of a C-string named target and storing the text 'HELLO WORLD'. A character pointer (i.e., a C-string), s3, points to the character W, the beginning of the substring, in target.
Searching a C-String for a substring. strstr searches the target string, target, left-to-right (low index to high) for the substring sub. It returns a character-pointer (i.e., a C-string) to the substring's beginning if found, or nullptr if not. The C-string library provides two overloaded substring search functions. The second version prevents the client program from modifying the target through the returned pointer.
Header File:
#define <cstring>
Prototypes:
char* strstr(char* target, const char* sub);
const char* strstr(const char* target, const char* sub);
See strstr for more information.
const char* target = "HELLO WORLD";
const char* s3 = strstr(target, "WORLD");

if (s3 != nullptr)
	cout << s3 << endl;
Output:
WORLD

Parsing: strtok

Parsing (also known as tokenizing) a string is a common data processing task. Parsing assumes that some groups of characters within a string, called tokens, have meaning in some context and that other characters, called delimiters, separate the groups. Programmers parse or tokenize a string by scanning it, looking for the delimiters, and separating the individual tokens. Programmers can use the C-string library function strtok (for string tokenizer) to help them complete simple parsing tasks.

Header File:
#define <cstring>
Standard Prototype:
char* strtok(char* target, const char* delims);
Please see strtok for more information
Microsoft Prototype:
char* strtok_s(char* target, const char* delims, char** context);
Please see strtok_s for more information
Linux Prototype:
char* strtok_r(char* target, const char* delims, char** context);
Please see strtok_r for more information
  1. strtok(target, delims) 
  2. strtok(nullptr, delims)
(a)(b)
The strtok function. The strtok function parses or tokenizes the target string, dividing it into meaningful substrings called tokens. Delimiters separate adjacent tokens. The delimiter string may contain multiple delimiter characters in any order, but only one is necessary to separate tokens. Unlike most other C-string functions, strtok is destructive: it always alters the target string (illustrated in the following figure). Furthermore, the specific delimiting character is lost.

It's necessary for the strtok function to "remember" the current parsing location between function calls. The typical implementation accomplishes this task with a static variable (see Figure 5), preventing the function from simultaneously processing multiple strings. The Microsoft and Linux versions add a third parameter, context. Think of this double pointer as a pointer to a C-string, making it an INOUT parameter. (See Figure 7 for examples, and Returning Non-Local Data for a discussion of a similar problem.) The target, delims, and returned value are identical to the "typical" function.

  1. strtok scans the target from left-to-right. Whenever it finds a delimiter, it replaces it with a null-termination character and returns a pointer to the extracted token. It ignores multiple adjacent delimiters, and delimiters at the beginning or end of target. It returns nullptr after extracting all the tokens.
  2. Programmers call strtok in two different ways. They begin parsing target with call (i) and continue parsing (extracting the remaining tokens) with call (ii) in a loop. The following code fragments demonstrate the basic logic:
    token = strtok(target, delims);
    while (token != nullptr)
    {
    	// process token
    	token = strtok(nullptr, delims);
    }
    token = strtok(target, delims);
    do
    {
    	// process token
    	token = strtok(nullptr, delims);
    } while (token != nullptr);
    for (token = strtok(target, delims; token != nullpter; token = strtok(nullptr, delims)))
    	// process token
    Programs typically put the second strtok function call in a loop, but they can make a sequence of calls formed as individual statements when the problem requires them. For example, subsequent calls can specify different delimiter sets, which the program can implement with a sequence of calls.

 

An illustration of a C-string parsed by strtok. The original C-string has five words separated by four different delimiters: 'See,the quick;red:fox' (note that the space is a delimiter). The first time the program calls strtok, it replaces the comma with a null-termination character and returns a pointer to 'See'. The second call replaces the space with the null terminator and returns a pointer to 'the'. The third call replaces the semicolon and returns a pointer 'quick'. The next call replaces the colon and returns 'red'. The next-to-last call doesn't replace any characters in the target but returns a pointer to 'fox'. The last call returns nullptr.
Parsing with strtok illustrated. In this example, English words form the tokens, and the punctuation and space characters are the delimiters:
char target[100] = "See,the quick;red:fox";
const char* delims = " ,:;";
The order of the delimiter characters is not significant. When the delimiters include a space, I put it at the beginning of the delimiter string, where I find it easier to see. The example shows how the parsed string, target, changes with each call. strtok returns a pointer, represented by t, to each token.
  1. The original C-string before parsing begins.
  2. The initial function call, char* t = strtok(target, delims);, replaces the comma with a null-termination character and returns a pointer to See.
The remaining calls, (c) - (f), are made as t = strtok(nullptr, delims);.
  1. Replaces the space with a null-terminator and returns a pointer to the.
  2. Replaces the semicolon with a null-terminator and returns a pointer to quick.
  3. Replaces the colon with a null-terminator and returns a pointer to red.
  4. Does not replace any characters in the original string but returns a pointer to fox.
The function has found all the tokens, so it returns nullptr and ends the loop.

 

char* strtok(char* target, const char* delims)
{
	static char* context = nullptr;		// initialized once at program load

	if (target != nullptr)			// begin parsing a new string
		context = target;

	// parses string and sets context
	return token;
}
Implementing the strtok function. Programs typically call strtok repeatedly until they have extracted all the tokens in a string. So, strtok must "remember" where to resume parsing between calls. It uses a local static pointer variable to save the current parsing position between function calls. The code fragment illustrates one possible way to implement this behavior.

Standard Examples

#include <iostream>
#include <cstring>
using namespace std;

int main()
{
	char target[100] = "See,the quick;red:fox";
	const char* delims = " ,:;";

	for (char* token = strtok(target, delims); token; token = strtok(nullptr, delims))
		cout << token << endl;

	//char* token = strtok(target, delims);		// for while and do-while versions

	/*while (token)
	{
		cout << token << endl;
		token = strtok(nullptr, delims);
	}*/

	/*do
	{
		cout << token << endl;
		token = strtok(nullptr, delims);
	} while (token);*/

	return 0;
}
Standard strtok function example. These examples extend the Figure 3 examples, replacing "process token" with an output statement and eliminating the explicit test for nullptr. (C++ treats numeric values, including pointers, as alternate representations of Boolean values.) The program can change the delimiters between function calls, but they are held constant during each call. However, the target string is variable, allowing strtok to replace the delimiters with the null-termination character.

Microsoft And Linux Example

The standard strtok function relies on a local static variable (a variable defined in the strtok function) to save the current parsing location within the target. So, if the program begins parsing a second string before finishing the first, the function loses the parse position in the first string. That is, programmers can't begin parsing string-A, switch to parsing string-B, then return to parsing string-A. The Microsoft and Linux versions eliminate this restriction by replacing the local static variable with the context parameter, whose corresponding argument is defined in the client's scope.

#include <iostream>
#include <cstring>
using namespace std;

int main()
{
    const char* delims = " :";
    char target1[100] = "See the quick red fox";
    char target2[100] = "jump:over:the:lazy:brown:dog";
    char* context1 = nullptr;
    char* context2 = nullptr;

    char* token1 = strtok_s(target1, delims, &context1);
    char* token2 = strtok_s(target2, delims, &context2);

    while (token1 != nullptr || token2 != nullptr)
    {
        if (token1 != nullptr)
            cout << token1 << endl;
        token1 = strtok_s(nullptr, delims, &context1);

        if (token2 != nullptr)
            cout << token2 << endl;
        token2 = strtok_s(nullptr, delims, &context2);
    }

    return 0;
}
#include <iostream>
#include <cstring>
using namespace std;

int main()
{
    const char* delims = " :";
    char target1[100] = "See the quick red fox";
    char target2[100] = "jump:over:the:lazy:brown:dog";
    char* context1 = nullptr;
    char* context2 = nullptr;

    char* token1 = strtok_r(target1, delims, &context1);
    char* token2 = strtok_r(target2, delims, &context2);

    while (token1 != nullptr || token2 != nullptr)
    {
        if (token1 != nullptr)
            cout << token1 << endl;
        token1 = strtok_r(nullptr, delims, &context1);

        if (token2 != nullptr)
            cout << token2 << endl;
        token2 = strtok_r(nullptr, delims, &context2);
    }

    return 0;
}
MicrosoftLinux
strtok_s and strtok_r examples. The "_s" at the end of the Microsoft strtok_s function suggests that the function is "secure." The "_r" at the end of the Linux strtok_r describes the function as "reentrant." Both functions replace strtok's local static variable with a third parameter. Figure 3 names the parameter context, but programmers are free to choose any appropriate name. context performs the same task as the local static variable by saving the current parsing location. However, defining context in the application program's scope allows the application to parse multiple strings simultaneously without conflict.

Aside from the slightly different function names, both programs are identical. context is a C-string - a character-pointer: char*. However, strtok_s and strtok_r modify it, so the program must pass it with an INOUT mechanism, and they use pass by pointer. So, the strtok_s and strtok_r prototypes presented in Figure 3 show the type as a pointer-to-a-pointer: char**, and the program finds the variable's address with the address-of operator (shown in red).