8.7.3. Software Development: The Anagram Problem


Review

For-Range Loops
Character Classification and Conversion Functions
Portable Data Types - type names ending with _t
Zeroing array elements
ASCII Encoding

Our next problem illustrates a more extensive software development process. We state a problem, solve it generally (independent of a specific programming language), successively refine the solution, and implement several versions of the general solution as C++ functions and a complete program. The development process allows us to review functions, arrays, strings, and ASCII-encoded characters.

The Anagram Problem

"An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once." When determining if one string is an anagram of another, we ignore spaces, punctuation characters, and character cases (upper or lower). For example, the letters in the phrase, "See the quick red fox jump over the lazy brown dog," can be rearranged to form the rather bland anagram "abcddeeeeeefghhijklmnoooopqrrrsttuuvwxyz." Clever anagrams are more interesting and more challenging to create and validate. The second phrase of a clever anagram forms a valid word or statement that is often a humorous comment on the first phrase. For example, an anagram for "Dormitory" is "Dirty Room." Some anagram aficionados have way too much time on their hands, as is illustrated by the following clever anagram:

Phrase: To be or not to be: that is the question, whether its nobler in the mind to suffer the slings and arrows of outrageous fortune.
Anagram: In one of the Bard's best-thought-of tragedies, our insistent hero, Hamlet, queries on two fronts about how life turns rotten.

Our anagram problem is to design and implement a program that compares two strings and reports if the second is an anagram of the first.

Solving The Anagram Problem

Designing and implementing a program requires more detail than the initial problem statement provides. In a "real world" situation, we would verify the refined and expanded problem statement with the client before designing or implementing the program. We begin by refining and decomposing the problem into four steps or sub-problems:

1. Prompt the user to enter and read two strings from the console (a familiar operation)
2. Normalize each string to an easily-compared standard form:
- Remove all space and punctuation characters
- Convert all letters to a single case (either upper or lower)
3. To qualify as an anagram, both strings must contain the same number of each alphabetic letter. So, the program counts the number of occurrences of each letter (the number of a's, the number of b's, etc.) in each string
4. Compare all the counts; if all counts are equal, the second string is an anagram of the first; otherwise, it is not an anagram

Pseudocode

"Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm." Pseudocode can be anything from words written in a natural language to working code in a programming language. It is often a mixture of both. Pseudocode lets us focus on what we want to do without worrying too much about how we will do it. There isn't a single way of writing pseudocode, but it should be fairly intuitive and generally easy to understand. We'll use pseudocode to express the algorithms needed to solve the anagram problem. This approach allows us to refine our algorithms before spending the time to program them.

Normalizing The Input

It will be easier to compare our two candidate strings if we normalize them first. To normalize the strings means "to make [them] conform to or reduce [them] to a norm or standard" form (Merriam-Webster). As described previously, a convenient normalized or standard form eliminates spaces and punctuation and converts the remaining letters to the same case - we'll arbitrarily choose lowercase. As we need to normalize both input phrases, we should implement this step as a function to avoid duplicating code. We can make the solution a little easier to convert into a program by restating it in a more structured form with pseudocode.

define the variable phrase and initialize it to empty

for each character, c, in the input
{
	if c is an alphabetic letter
	{
		make c lowercase
		append c to phrase
	}
}

Figure 1. The anagram normalization algorithm. Create a new, empty phrase to hold the normal form of the input. Then, for each character in the input, if that character is alphabetic ('A'-'Z' or 'a'-'z'), convert it to a lowercase letter and append it to the end of the phrase. Skip non-alphabetic characters. After normalization, the two example strings become:

tobeornottobethatisthequestionwhetheritsnoblerinthemindtosuffertheslingsandarrowsofoutrageousfortune

inoneofthebardsbestthoughtoftragediesourinsistentherohamletqueriesontwofrontsabouthowlifeturnsrotten

Counting Letter Occurrences

We have removed the spaces and punctuation characters from the strings and have converted all the letters to the same (lower) case. Next, we need to develop two algorithms. The first algorithm counts the number of each character in both strings: the number of a's, b's, ..., to the z's. The second algorithm compares the counts for both strings; if the counts are the same, the second string is an anagram of the first.

That means that for each normalized string, we need 26 counters - one counter for each letter in the (English) alphabet. We can outline the algorithm as follows:

define and initialize 26 counters: a_count = 0, b_count = 0, ..., z_count = 0

for each letter, c, in phrase
{
	if (c == 'a')
		a_count++;
	else if (c == 'b')
		b_count++;
	. . . .
	else
		z_count++;
}

Figure 2. Algorithm for counting the number of occurrences of each letter. The algorithm outlined by the pseudocode illustrates the necessary operations to solve the problem but not necessarily the best way to do them. The long if-else ladder appearing here is cumbersome and error-prone. We'll develop a more efficient algorithm below that replaces it with more compact code. The algorithm presented in Figure 1 introduced the variable phrase.

Detecting An Anagram: Comparing The Letter Counts

Once the program counts each letter in both strings, it's ready to compare them to determine if one string is an anagram of the other. If the corresponding counts for each letter are the same for both normalized strings, then the two phrases form an anagram. But all the counts must be the same - it only takes one pair of counts that are not equal to detect a failure.

if (a_count1 == a_count2 && b_count1 == b_count2 && . . . && z_count1 == z_count2)
	cout << "The phrases form an anagram\n";
else
	cout << "The phrases DO NOT form an anagram\n";

Figure 3. Algorithm for detecting an anagram. The algorithm compares pairs of counter variables: *_count1 and *_count2. There are 26 pairs, one for each letter in the English alphabet, so there are 26 comparisons. If the values stored in every pair are the same, then the two strings form an anagram; otherwise, they do not. As in the previous algorithm, working with distinct counter variables is tedious and error-prone. We need a more compact solution that is easier to write.

As currently outlined, the counting (Figure 2) and detecting (Figure 3) algorithms require us to work with two sets of 26 separate counter variables. The algorithms force us to write a lot of code - a long if-else ladder in Figure 2 and a very long sequence of == tests in the if-statement of Figure 3. Both approaches are clumsy, tedious, and error-prone. We can improve both algorithms by replacing the separate counters with two arrays, using one array for each set of counters and corresponding elements in each array to hold the frequency counts for the same letter.

(a)

int count[26] {};

for each letter, c, in a normalized string
{
	if (c == 'a')
		count[0]++;
	else if (c == 'b')
		count[1]++;
	. . . .
	else
		count[25]++;
}

(b)

count1 and count2 are two integer arrays containing the letter frequency counts of two normalized strings.

bool anagram = true;

for (int i = 0; i < 26; i++)
	if (count1[i] != count2[i])
		anagram = false;

Figure 4. Intermediate array-based algorithms. The first versions of the counting and detection algorithms are closely bound to the problem; therefore, the text articulated them in problem terms. The intermediate versions introduce an array - an abstract feature not found in the "real world" problem statement but introduced to facilitate a programming solution. Although they are not the final versions of the algorithms, the intermediate algorithms bridge the initial and final versions by demonstrating how the program uses the array.

(a) The intermediate version of the array-based counting algorithm illustrates how to use an array to maintain the letter frequency counts for a normalized string. Although it is easier to define and initialize an array than to define and initialize 26 distinct counter variables, the algorithm still has the disadvantages of being long, cumbersome, and error-prone.
(b) The array-based anagram detection algorithm is much more compact. But as we'll see in the final version, we can still improve it.

Whereas algorithm (b) is nearly complete, algorithm (a) is still too long. To shorten (a), we need an easy way to map a letter to an array index. If an efficient mapping exists, we can collapse the if-else ladder in (a) to a single operation.

Mapping Characters To Numbers

In the itoa problem, we used ASCII codes to convert an integer into digits by noticing that the ASCII codes for the digits 0-9 are contiguous. The numeric value of a specific digit, d, plus the ASCII code for the digit 0, is the ASCII code for the digit: d + '0' = ASCII(d).

The ASCII codes for letters form two contiguous integer ranges: 'A' = 65 to 'Z' = 90, and 'a' = 97 to 'z' = 122 (see ASCII table). The normalizing algorithm of Figure 1 guarantees that our program only needs to deal with lowercase letters. So, we can use the second range to map ASCII-encoded letters to array indexes. C++ arrays are zero-indexed, so the valid index values for the counter arrays, each one storing 26 separate counts, range from 0 to 25. Our algorithm needs to map the letters 'a'-'z' into index values 0-25. Crucially, each letter must always map to the same index value.

(a)
f('a') = 0
f('b') = 1
. . .
f('z') = 25

(b)
f(c) = c - 'a'

(c)
'a' - 'a' = 97 - 97 = 0
'b' - 'a' = 98 - 97 = 1
'c' - 'a' = 99 - 97 = 2
. . .
'z' - 'a' = 122 - 97 = 25

Figure 5. Mapping letters to indexes. We begin developing a mapping algorithm by recalling that in the itoa problem, we converted an integer to an ASCII character by adding the ASCII code for the character '0' to the integer. To perform a conversion in the opposite direction - to convert a character to an integer - we subtract the ASCII code for the character 'a' from the character.

(a) The abstract mapping function illustrates what we need: a function that uniquely and consistently maps each lowercase letter to an integer in the range 0 to 25.
(b) A concrete mapping function where c is a variable that stores the character that we want to map to an index value; the result of c-'a' is an integer in the range of 0-25.
(c) An illustration of how (b) works: C++ automatically converts a character to its ASCII encoding when used in an arithmetic expression. The ASCII code for 'a' is 97, and the difference between any lowercase letter and 'a' is an integer in the range 0-25.

When we converted the input to a normal form, we arbitrarily chose to convert all letters to lower case; if we want to convert the letters to upper case, we only need to replace 'a' with 'A' in the algorithm.

From Algorithms To C++ Functions

Reading the initial strings and printing the final message are now familiar operations presented with little detail. Program input and output take place in main as illustrated below:

int main()
{
	string phrase1 = input(1);			// a
	string phrase2 = input(2);			// a

	if (phrase1.length() !=  phrase2.length())	// b
	{
		cout << "Phrases are NOT an anagram\n";
		exit(1);
	}

	if (is_anagram(phrase1, phrase2))		// c
		cout << "Phrases are an anagram\n";	// d
	else
		cout << "Phrases are NOT an anagram\n";	// d

	return 0;
}

Figure 6. Reading the input and printing the result. The input and is_anagram functions are part of the program; the example defines them in the following figures.

(a) The input function normalizes and returns the user's input. The function uses the arguments "1" and "2" to label the input prompt so users know which value they are entering.
(b) If the normalized strings are not the same length, they cannot form an anagram. This test is optional because the next if-statement will also detect the failure. However, testing the length of two strings is fast, saving the effort of counting the characters when the lengths are unequal and the strings can't form an anagram.
(c) The is_anagram function returns true if the two phrases form an anagram; otherwise, it returns false.
(d) The program output.

string input(int n)
{
	string	input;	 				// a

	cout << "Please enter phrase " << n << ": ";	// b
	getline(cin, input);				// c

	return normalize(input);			// d
}

Figure 7. Program input. The input function reads data from the console and calls the normalize function, which converts the data to a normalized or standard form.

(a) A local object to temporarily contain the input.
(b) A simple user prompt; the function argument, n, is used to label which input is currently taking place.
(c) The function uses the string version of getline for data input rather than the extractor operator because the input may contain spaces.
(d) The normalize function, presented in the next figure, converts the input to a normal form. input returns the string returned by the normalize function.

Our next step is fully converting the counting and detection algorithms into working C++ functions. The functions also allow us to gain more experience passing arrays to and receiving them from functions. We begin by writing the normalize function, which relies on two C++ API operations:

The isalpha function returns true if its argument is an alphabetic character or letter in the range 'A'-'Z' or 'a'-'z'. We use this function to skip or filter out all non-alphabetic characters in the input.
The tolower and toupper functions convert their arguments to lower and upper case letters, respectively. Non-letter characters and letters already in the correct case are returned unchanged. We arbitrarily chose lowercase in the previous algorithms, so we use tolower here.

string normalize(const string& s)			// a
{
	string normalized;				// b

	for (size_t i = 0; i < s.length(); i++)		// c
		if (isalpha(s[i]))			// d
			normalized += tolower(s[i]);	// e

	return normalized;
}

Figure 8. Converting a string to a normal form.

(a) The function normalizes the argument s. The function passes the string by reference for efficiency; it makes the parameter const as it does not need to be changed.
(b) Creates an instance of the string class, which the constructor initializes to be empty.
(c) Iterates or loops over each character in the string.
(d) isalpha returns true if its argument is a letter: A-Z or a-z. So, the if-statement discards all space and punctuation characters.
(e) All letters are converted to lowercase and appended to the end of normalized. If the argument passed to tolower is anything other than an upper case letter, it is returned unmodified. If you want to create a normalized form with upper-case letters, replace tolower with toupper.

isalpha and tolower require the <cctype> header file.

void count(const string& s, int* counts)		// a
{
	for (size_t i = 0; i < s.length(); i++)		// b
		counts[s[i] - 'a']++; 			// c
}

Figure 9. Counting letter occurrences. The letter-to-integer mapping operation presented in Figure 5, coupled with an array, allows us to collapse the long if-else ladder in Figure 4(a) to a single statement!

(a) The program defines and initializes the parameters, s and counts, in the is_anagram function below (Figure 10) and passes them into this function. s is one of the normalized user input strings. Each counts element is 0 before the loop runs. C++ always passes arrays, like counts, by-pointer.
(b) Iterates or loops over each character in the string.
(c) The expression s[i] - 'a' converts one letter (s[i]) from one of the input strings into an index into the counts array. Indexing into an integer array is an expression yielding an integer variable. The auto-increment operator, ++, increments the variable's contents. In my opinion, this operation is the most interesting part of the anagram program.

It's easy for us to think of arrays as data storage structures, but the count function demonstrates that they can also serve as active computational elements.

Our final task is implementing the anagram detection algorithm presented in Figure 4(b). We have already expressed most of the algorithm with C++ code, so little work remains to implement a working function. Nevertheless, we can still make small improvements, as demonstrated by the following implementations.

Version 1

bool is_anagram(const string& s1, const string& s2)
{
    int count1[26] {};				// a
    count(s1, count1);				// b

    int count2[26] {};				// a
    count(s2, count2);				// b

    bool anagram = true;			// c

    for (int i = 0; i < 26; i++)		// d
        anagram = anagram && (count1[i] == count2[i]);	// e

    return anagram;				// f
}

Version 2

bool is_anagram(const string& s1, const string& s2)
{
    int count1[26] {};				// a
    count(s1, count1);				// b

    int count2[26] {};				// a
    count(s2, count2);				// b

    for (int i = 0; i < 26; i++)		// d
        if (count1[i] != count2[i])		// g
            return false;			// h

    return true;				// i
}

Figure 10. C++ implementations of the detection algorithm. Version 1 is close to the Figure 4(b) intermediate version. However, it combines the if-statement and assignment operation into a single assignment statement. Version 2 is, in my opinion, more straightforward. The logic is similar to the intermediate algorithm but returns when it determines that the strings do not form an anagram, potentially saving some unnecessary comparisons.

(a) Arrays replace the individual counter variables in the original detection algorithm. Each array element is an accumulator, so the program must initialize them to 0.
(b) The count function counts the letter frequencies in the normalized strings.
(c) Defines and initializes a logical accumulator like the intermediate algorithm of Figure 4(b).
(d) Steps through each of the counter pairs.
(e) The == operator forms a Boolean-valued expression: true if the counts are equal or false otherwise. The statement performs a logical-AND between that expression and the value stored in anagram and assigns the result back to anagram. C++ has no &&= "operation with assignment" operator, so the statement writes the assignment out in full.
(f) Returns the value accumulated in anagram.
(g) Compares the counts like the Figure 4(b) intermediate algorithm.
(h) If the counts are equal, the loop continues, but if the counts are not equal, the function ends, signaling that the inputs are not an anagram.
(i) The function can only signal that the strings form an anagram after testing all counter pairs.

For-Range Loops

We first saw C++'s for-range loops in chapter 3 but have ignored them since. We neglected them not because they lack general usefulness but because they are not useful with the data types introduced before this chapter. For-range loops operate on data, like string objects, that are "iterable." Iterable data are objects that define the begin and end functions (see ipalnumber.cpp). For-range loops are especially useful when translating algorithms that use phrases like "for each" (Java calls this construct a for-each loop). The general for-range syntax is:

for (variable-definition : range)
{
	statements;
}

Figure 11. For-range loop syntax. For-range loops process each element stored in the range one at a time, one element per iteration. The for keyword, the parentheses, and the colon are a required part of the loop syntax.

variable-definition: Define a variable whose type is the same as the elements stored in the range. The loop stores the element it is currently processing in this variable.
range: Names an instance of an iterable class such as a string or a vector.

Two of the preceding algorithms, Figures 1 and 2, relied on a loop that read, in part, "for each letter, c, in ...." For-range loops translate that natural-language statement directly into C++ code, which allows us to rewrite the two functions developed from the algorithms as follows:

(a)

string normalize(const string& s)
{
	string normalized;

	for (char c : s)
		if (isalpha(c))
			normalized += tolower(c);

	return normalized;
}

(b)

void count(const string& s, int* counts)
{
	for (char c : s)
		counts[c - 'a']++;
}

Figure 12. For-range versions of two anagram algorithms. Replacing traditional for-loops with for-range loops isn't a significant change. Nevertheless, for-range loops often better match the natural language describing a solution, smoothing the translation from algorithm to working code.

(a) The normalization function of Figure 8, rewritten with a for-range loop.
(b) The counting function of Figure 9, rewritten with a for-range loop.

Downloadable Code

anagram.cpp (string class version)
canagram.cpp (C-string version)