Our next problem is an example of the more extensive software development process. We begin by stating a problem, solving it generally (independent of a specific programming language), successively refining the solution, and implementing several versions of the general solution as C++ functions and a complete program. The development process allows us to review functions, arrays, strings, and ASCII-encoded characters.
The Anagram Problem
"An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once." When determining if one string is an anagram of another, we ignore spaces, punctuation characters, and character case (upper or lower). For example, the letters in the phrase, "See the quick red fox jump over the lazy brown dog," can be rearranged to form the rather bland anagram "abcddeeeeeefghhijklmnoooopqrrrsttuuvwxyz." Clever anagrams are more interesting and more challenging to create and validate. The second phrase of a clever anagram forms a valid word or statement that is often a humorous comment on the first phrase. For example, an anagram for "Dormitory" is "Dirty Room." Some anagram aficionados have way too much time on their hands, as is illustrated by the following clever anagram:
Phrase
To be or not to be: that is the question, whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune.
Anagram
In one of the Bard's best-thought-of tragedies, our insistent hero, Hamlet, queries on two fronts about how life turns rotten.
Our anagram problem is to design and implement a program that compares two strings and reports if the second is an anagram of the first.
Solving The Anagram Problem
Designing and implementing a program requires more detail than the initial problem statement provides. In a "real world" situation, we would verify the refined and expanded problem statement with the client before designing or implementing the program. We begin by refining and decomposing the problem into four steps or sub-problems:
Prompt the user to enter and read two strings from the console (a familiar operation)
Normalize each string to an easily-compared standard form:
Remove all space and punctuation characters
Convert all letters to a single case (either upper or lower)
To qualify as an anagram, both strings must contain the same number of each alphabetic letter. So, the program counts the number of occurrences of each letter (the number of a's, the number of b's, etc.) in each string
Compare all the counts; if all counts are equal, the second string is an anagram of the first; otherwise, it is not an anagram
Pseudocode
"Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm." Pseudocode can be anything from words written in a natural language to working code in a programming language. It is often a mixture of both. Pseudocode lets us focus on what we want to do without worrying too much about how we will do it. There isn't a single way of writing pseudocode, but it should be fairly intuitive and generally easy to understand. We'll use pseudocode to express the algorithms needed to solve the anagram problem. This approach allows us to refine our algorithms before spending the time to program them.
Normalizing The Input
It will be easier to compare our two candidate strings if we normalize them first. To normalize the strings means "to make [them] conform to or reduce [them] to a norm or standard" form (Merriam-Webster). As described previously, a convenient normalized or standard form eliminates spaces and punctuation and converts the remaining letters to the same case - we'll arbitrarily choose lowercase. As we need to normalize both input phrases, we should implement this step as a function to avoid duplicating code. We can make the solution a little easier to convert into a program by restating it in a more structured form with pseudocode.
define the variable phrase and initialize it to empty
for each character, c, in the input
{
    if c is an alphabetic letter
    {
        make c lowercase
        append c to phrase
    }
}
The anagram normalization algorithm.
Create a new, empty phrase to convert the input into a normal form. Then, for each character in the input, if that character is alphabetic ('A'-'Z' or 'a'-'z'), convert it to a lowercase letter and append it to or add it at the end of the phrase. Skip or do nothing with non-alphabetic characters. After normalization, the two example strings become:
We have removed the spaces and punctuation characters from the strings and have converted all the letters to the same (lower) case. Next, we need to develop two algorithms. The first algorithm counts the number of each character in both strings: the number of a's, b's, ..., to the z's. The second algorithm compares the counts for both strings; if the counts are the same, the second string is an anagram of the first.
That means that for each normalized string, we need 26 counters - one counter for each letter in the (English) alphabet. We can outline the algorithm as follows:
define and initialize 26 counters: a_count = 0, b_count = 0, ..., z_count = 0
for each letter, c, in phrase
{
    if (c == 'a')
        a_count++;
    else if (c == 'b')
        b_count++;
    . . . .
    else
        z_count++;
}
Algorithm for counting the number of occurrences of each letter.
The pseudocode illustrates the operations needed to solve the problem but not necessarily the best way to perform them. The long if-else ladder appearing here is cumbersome and error-prone. We'll develop a more efficient algorithm below that replaces it with more compact code. The variable phrase is the normalized string produced by the algorithm in Figure 1.
Detecting An Anagram: Comparing The Letter Counts
Once the program counts each letter in both strings, it's ready to compare them to determine if one string is an anagram of the other. If the corresponding counts for each letter are the same for both normalized strings, then the two phrases form an anagram. But all the counts must be the same - it only takes one pair of counts that are not equal to detect a failure.
As currently outlined, the counting (Figure 2) and detecting (Figure 3) algorithms require us to work with two sets of 26 separate counter variables. The algorithms force us to write a lot of code - a long if-else ladder in Figure 2 and a very long sequence of == tests in the if-statement of Figure 3. Both approaches are clumsy, tedious, and error-prone. We can improve both algorithms by replacing the separate counters with two arrays, using one array for each set of counters and corresponding elements in each array to hold the frequency counts for the same letter.
(a)
int count[26] {}
for each letter, c, in a normalized string
{
    if (c == 'a')
        count[0]++;
    else if (c == 'b')
        count[1]++;
    . . . .
    else
        count[25]++;
}
(b)
count1 and count2 are two integer arrays
containing the letter frequency counts
of two normalized strings.
bool anagram = true;
for (int i = 0; i < 26; i++)
    if (count1[i] != count2[i])
        anagram = false;
Intermediate array-based algorithms. The first versions of the counting and detection algorithms are closely bound to the problem; therefore, the text articulated them in problem terms. The intermediate versions introduce an array - an abstract feature not found in the "real world" problem statement but introduced to facilitate a programming solution. Although they are not the final versions of the algorithms, the intermediate algorithms bridge the initial and final versions by demonstrating how the program uses the array.
The intermediate version of the array-based counting algorithm illustrates how to use an array to maintain the letter frequency counts for a normalized string. Although it is easier to define and initialize an array than to define and initialize 26 distinct counter variables, the algorithm still has the disadvantages of being long, cumbersome, and error-prone.
The array-based anagram detection algorithm is much more compact. But as we'll see in the final version, we can still improve it.
Where algorithm (b) is nearly complete, algorithm (a) is still too long. To shorten (a), we need an easy way to map a letter to an array index. If an efficient mapping exists, we can collapse the if-else ladder in (a) to a single operation.
Mapping Characters To Numbers
In the itoa problem, we used ASCII codes to convert an integer into digits by noticing that the ASCII codes for the digits 0-9 are contiguous. The numeric value of a specific digit, d, plus the ASCII code for the digit 0, is the ASCII code for the digit: d + '0' = ASCII(d).
The ASCII codes for letters form two contiguous integer ranges: 'A' = 65 to 'Z' = 90, and 'a' = 97 to 'z' = 122 (see ASCII table). The normalizing algorithm of Figure 1 guarantees that our program only needs to deal with lowercase letters. So, we can use the second range to map ASCII-encoded letters to array indexes. C++ arrays are zero-indexed, so the valid index values for the counter arrays, each one storing 26 separate counts, range from 0 to 25. Our algorithm needs to map the letters 'a'-'z' into index values 0-25. Crucially, each letter must always map to the same index value.
Mapping letters to indexes. We begin developing a mapping algorithm by recalling that in the itoa problem, we converted an integer to an ASCII character by adding the ASCII code for a character '0' to the integer. To perform a conversion in the opposite direction - to convert a character to an integer - we subtract the ASCII code for a character 'a' from the character.
(a) The abstract mapping function illustrates what we need: a function that uniquely and consistently maps each lowercase letter to an integer in the range 0 to 25.
(b) A concrete mapping function where c is a variable that stores the character that we want to map to an index value; the result of c-'a' is an integer in the range of 0-25.
(c) An illustration of how (b) works: C++ automatically converts a character to its ASCII encoding when used in an arithmetic expression. The ASCII code for 'a' is 97, and the difference between any lowercase letter and 'a' is an integer in the range 0-25.
When we converted the input to a normal form, we arbitrarily chose to convert all letters to lower case; if we want to convert the letters to upper case, we only need to replace 'a' with 'A' in the algorithm.
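The two mappings can be sketched as a pair of small functions. This is only an illustration: the names letter_index and digit_char are hypothetical (they do not appear in the text), and both functions assume an ASCII-compatible encoding in which the digits and the lowercase letters are contiguous.

```cpp
// Hypothetical helper: maps a lowercase letter 'a'-'z' to an index 0-25.
// Assumes the argument really is a lowercase letter (normalized input).
int letter_index(char c)
{
    // c and 'a' are promoted to int, so 'd' - 'a' is 100 - 97 = 3 in ASCII.
    return c - 'a';
}

// Hypothetical helper: the itoa-style mapping in the opposite direction,
// from a digit value 0-9 to its character '0'-'9'.
char digit_char(int d)
{
    return static_cast<char>(d + '0');
}
```

With these definitions, letter_index('a') is 0 and letter_index('z') is 25, so every lowercase letter lands on a valid index of a 26-element counter array.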
From Algorithms To C++ Functions
Reading the initial strings and printing the final message are now familiar operations presented with little detail. Program input and output take place in main as illustrated below:
Our next step is fully converting the counting and detection algorithms into working C++ functions. The functions also allow us to gain more experience passing arrays to and receiving them from functions. We begin by writing the normalize function, which relies on two C++ API operations:
The isalpha function returns a true (nonzero) value if its argument is an alphabetic character, that is, a letter in the range 'A'-'Z' or 'a'-'z'. We use this function to skip or filter out all non-alphabetic characters in the input.
The tolower and toupper functions convert their arguments to lower and upper case letters, respectively. Non-letter characters and letters already in the correct case are returned unchanged. We arbitrarily chose lowercase in the previous algorithms, so we use tolower here.
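A minimal sketch of how the normalize and count functions might look with traditional index-based for-loops follows. It is one possible rendering of the pseudocode rather than a definitive listing. The casts to unsigned char are an added precaution (the character-classification functions expect a value in that range), and count assumes that its string is already normalized and that the caller supplies a zero-initialized array of 26 ints.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Builds the normal form: letters only, all lowercase.
std::string normalize(const std::string& s)
{
    std::string phrase;                                   // starts empty
    for (std::size_t i = 0; i < s.length(); i++)
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isalpha(c))                              // skip non-letters
            phrase += static_cast<char>(std::tolower(c)); // append lowercase
    }
    return phrase;
}

// Tallies the letters of a normalized string (only 'a'-'z').
// Assumption: counts points to 26 ints, all initially 0.
void count(const std::string& s, int* counts)
{
    for (std::size_t i = 0; i < s.length(); i++)
        counts[s[i] - 'a']++;                             // letter -> index 0-25
}
```

For example, normalize("Dirty Room!") yields "dirtyroom", and counting "dormitory" leaves 2 in the elements for 'o' and 'r'.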
Our final task is implementing the anagram detection algorithm presented in Figure 4(b). We have already expressed most of the algorithm with C++ code, so little work remains to implement a working function. Nevertheless, we can still make small improvements, as demonstrated by the following implementations.
Version 1
bool is_anagram(const string& s1, const string& s2)
{
    int count1[26] {};                                  // a
    count(s1, count1);                                  // b
    int count2[26] {};                                  // a
    count(s2, count2);                                  // b
    bool anagram = true;                                // c
    for (int i = 0; i < 26; i++)                        // d
        anagram = anagram && count1[i] == count2[i];    // e
    return anagram;                                     // f
}
Version 2
bool is_anagram(const string& s1, const string& s2)
{
    int count1[26] {};                                  // a
    count(s1, count1);                                  // b
    int count2[26] {};                                  // a
    count(s2, count2);                                  // b
    for (int i = 0; i < 26; i++)                        // d
        if (count1[i] != count2[i])                     // g
            return false;                               // h
    return true;                                        // i
}
C++ implementations of the detection algorithm.
Version 1 is close to the Figure 4(b) intermediate version. However, it combines the if-statement and assignment operation into a single assignment statement. Version 2 is, in my opinion, more straightforward. The logic is similar to the intermediate algorithm but returns when it determines that the strings do not form an anagram, potentially saving some unnecessary comparisons.
(a) Arrays replace the individual counter variables in the original detection algorithm. Each array element is an accumulator, so the program must initialize them to 0.
(b) The count function counts the letter frequencies in the normalized strings.
(c) Defines and initializes a logical accumulator like the intermediate algorithm of Figure 4(b).
(d) Steps through each of the counter pairs.
(e) The == operator has higher precedence than &&, so the comparison forms a Boolean-valued expression: true if the counts are equal or false otherwise. The statement performs a logical-AND between that expression and the value stored in anagram and assigns the result back to anagram; once anagram becomes false, it stays false.
(f) Returns the value accumulated in anagram.
(g) Compares the counts like the Figure 4(b) intermediate algorithm.
(h) If the counts are equal, the loop continues, but if the counts are not equal, the function ends, signaling that the inputs are not an anagram.
(i) The function can only signal that the strings form an anagram after testing all counter pairs.
For-Range Loops
We first saw C++'s for-range loops in chapter 3 but have ignored them since. We neglected them not because they lack general usefulness but because they offer little with the data types introduced before this chapter.
For-range loops operate on data like string objects that are "iterable." Iterable data are objects that define the begin and end functions (see ipalnumber.cpp). For-range loops are especially useful when translating algorithms that use phrases like "for each" (Java calls this construct a for-each loop). The general for-range syntax is:
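In outline, a for-range loop takes the form for (declaration : iterable) statement, where the declaration names a variable that receives each element in turn. As a small illustration, the sketch below steps through a string one character at a time; the function count_spaces is a hypothetical helper introduced only for this example.

```cpp
#include <string>

// Hypothetical helper: counts the space characters in a string.
int count_spaces(const std::string& s)
{
    int spaces = 0;
    for (char c : s)        // reads as "for each character, c, in s"
        if (c == ' ')
            spaces++;
    return spaces;
}
```

The loop needs no index variable and no explicit test against the string's length; the string's begin and end functions supply the bounds.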
Two of the preceding algorithms, Figures 1 and 2, relied on a loop that read, in part, for each letter, c, in .... For-range loops translate that natural-language statement directly into C++ code, which allows us to rewrite the two functions developed from the algorithms as follows:
(a)
string normalize(const string& s)
{
    string normalized;                          // starts empty
    for (char c : s)                            // for each character, c, in s
        if (isalpha((unsigned char)c))          // skip spaces and punctuation
            normalized += (char)tolower((unsigned char)c);
    return normalized;
}
(b)
void count(const string& s, int* counts)
{
    // s must be normalized (only 'a'-'z'); counts must refer to 26 ints
    for (char c : s)
        counts[c - 'a']++;                      // map the letter to index 0-25
}
For-range versions of two anagram algorithms. Replacing traditional for-loops with for-range loops isn't a significant change. Nevertheless, for-range loops often better match the natural language describing a solution, smoothing the translation from algorithm to working code.
(a) The normalization function of Figure 8, rewritten with a for-range loop.
(b) The counting function of Figure 9, rewritten with a for-range loop.