Our next problem is an example of the more extensive software development process. We begin by stating a problem, solving it generally (independent of a specific programming language), successively refining the solution, and implementing several versions of the general solution as C++ functions and a complete program. The development process allows us to review functions, arrays, strings, and ASCII-encoded characters.
"An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once." When determining if one string is an anagram of another, we ignore spaces, punctuation characters, and character case (upper or lower). For example, the letters in the phrase, "See the quick red fox jump over the lazy brown dog," can be rearranged to form the rather bland anagram "abcddeeeeeefghhijklmnoooopqrrrsttuuvwxyz." Clever anagrams are more interesting and more challenging to create and validate. The second phrase of a clever anagram forms a valid word or statement that is often a humorous comment on the first phrase. For example, an anagram for "Dormitory" is "Dirty Room." Some anagram aficionados have way too much time on their hands, as is illustrated by the following clever anagram:
Our anagram problem is to design and implement a program that compares two strings and reports if the second is an anagram of the first.
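The "bland" anagram quoted earlier is simply the phrase's letters in alphabetical order. The following is a minimal sketch of how such a string could be produced. It assumes a hypothetical helper named sorted_letters, which is not part of the program developed in this section, and it uses std::sort, a technique the section does not otherwise rely on.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the anagram program): keeps only the
// letters of a phrase, converts them to lowercase, and sorts them.
std::string sorted_letters(const std::string& phrase)
{
    std::string letters;
    for (std::size_t i = 0; i < phrase.length(); i++)
    {
        // The <cctype> functions expect a value representable as unsigned char.
        unsigned char c = static_cast<unsigned char>(phrase[i]);
        if (std::isalpha(c))
            letters += static_cast<char>(std::tolower(c));
    }
    std::sort(letters.begin(), letters.end());   // alphabetical order
    return letters;
}
```

Two phrases that are anagrams of each other produce the same sorted string. For example, both "Dormitory" and "Dirty Room" yield "dimoorrty".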
Designing and implementing a program requires more detail than the initial problem statement provides. In a "real world" situation, we would verify the refined and expanded problem statement with the client before designing or implementing the program. We begin by refining and decomposing the problem into four steps or sub-problems:
"Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm." Pseudocode can be anything from words written in a natural language to working code in a programming language. It is often a mixture of both. Pseudocode lets us focus on what we want to do without worrying too much about how we will do it. There isn't a single way of writing pseudocode, but it should be fairly intuitive and generally easy to understand. We'll use pseudocode to express the algorithms needed to solve the anagram problem. This approach allows us to refine our algorithms before spending the time to program them.
It will be easier to compare our two candidate strings if we normalize them first. To normalize the strings means "to make [them] conform to or reduce [them] to a norm or standard" form (Merriam-Webster). As described previously, a convenient normalized or standard form eliminates spaces and punctuation and converts the remaining letters to the same case - we'll arbitrarily choose lowercase. As we need to normalize both input phrases, we should implement this step as a function to avoid duplicating code. We can make the solution a little easier to convert into a program by restating it in a more structured form with pseudocode.
    define the variable phrase and initialize it to empty
    for each character, c, in the input
    {
        if c is an alphabetic letter
        {
            make c lowercase
            append c to phrase
        }
    }
tobeornottobethatisthequestionwhetheritsnoblerinthemindtosuffertheslingsandarrowsofoutrageousfortune
inoneofthebardsbestthoughtoftragediesourinsistentherohamletqueriesontwofrontsabouthowlifeturnsrotten
We have removed the spaces and punctuation characters from the strings and have converted all the letters to the same (lower) case. Next, we need to develop two algorithms. The first algorithm counts the number of each character in both strings: the number of a's, b's, ..., to the z's. The second algorithm compares the counts for both strings; if the counts are the same, the second string is an anagram of the first.
That means that for each normalized string, we need 26 counters - one counter for each letter in the (English) alphabet. We can outline the algorithm as follows:
    define and initialize 26 counters: a_count = 0, b_count = 0, ..., z_count = 0
    for each letter, c, in phrase
    {
        if (c == 'a')
            a_count++;
        else if (c == 'b')
            b_count++;
        . . . .
        else
            z_count++;
    }
Once the program counts each letter in both strings, it's ready to compare them to determine if one string is an anagram of the other. If the corresponding counts for each letter are the same for both normalized strings, then the two phrases form an anagram. But all the counts must be the same - it only takes one pair of counts that are not equal to detect a failure.
    if (a_count1 == a_count2 && b_count1 == b_count2 && . . . && z_count1 == z_count2)
        cout << "The phrases form an anagram\n";
    else
        cout << "The phrases DO NOT form an anagram\n";
As currently outlined, the counting (Figure 2) and detecting (Figure 3) algorithms require us to work with two sets of 26 separate counter variables. The algorithms force us to write a lot of code - a long if-else ladder in Figure 2 and a very long sequence of == tests in the if-statement of Figure 3. Both approaches are clumsy, tedious, and error-prone. We can improve both algorithms by replacing the separate counters with two arrays, using one array for each set of counters and corresponding elements in each array to hold the frequency counts for the same letter.
(a)

    int count[26] {};
    for each letter, c, in a normalized string
    {
        if (c == 'a')
            count[0]++;
        else if (c == 'b')
            count[1]++;
        . . . .
        else
            count[25]++;
    }

(b)

    count1 and count2 are two integer arrays containing the letter
    frequency counts of two normalized strings.

    bool anagram = true;
    for (int i = 0; i < 26; i++)
        if (count1[i] != count2[i])
            anagram = false;
While algorithm (b) is nearly complete, algorithm (a) is still too long. To shorten (a), we need an easy way to map a letter to an array index. If an efficient mapping exists, we can collapse the if-else ladder in (a) to a single operation.
In the itoa problem, we used ASCII codes to convert an integer into digits by noticing that the ASCII codes for the digits 0-9 are contiguous. The numeric value of a specific digit, d, plus the ASCII code for the digit 0, is the ASCII code for the digit: d + '0' = ASCII(d).
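The digit relationship above can be sketched in a few lines. The function names digit_to_char and char_to_digit are hypothetical and chosen only for illustration; both assume their argument is a single decimal digit.

```cpp
// Hypothetical helper: converts a digit value 0-9 to its character,
// using d + '0' = ASCII(d). Assumes 0 <= d <= 9.
char digit_to_char(int d)
{
    return static_cast<char>(d + '0');
}

// Hypothetical inverse: converts a digit character '0'-'9' back to its
// numeric value by subtracting the code for '0'.
int char_to_digit(char c)
{
    return c - '0';
}
```

The second function runs the same arithmetic in reverse, which is the idea the letter-to-index mapping borrows.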
The ASCII codes for letters form two contiguous integer ranges: 'A' = 65 to 'Z' = 90, and 'a' = 97 to 'z' = 122 (see ASCII table). The normalizing algorithm of Figure 1 guarantees that our program only needs to deal with lowercase letters. So, we can use the second range to map ASCII-encoded letters to array indexes. C++ arrays are zero-indexed, so the valid index values for the counter arrays, each one storing 26 separate counts, range from 0 to 25. Our algorithm needs to map the letters 'a'-'z' into index values 0-25. Crucially, each letter must always map to the same index value.
(a)

    f('a') = 0
    f('b') = 1
    . . .
    f('z') = 25

(b)

    f(c) = c - 'a'

(c)

    'a' - 'a' = 97 - 97 = 0
    'b' - 'a' = 98 - 97 = 1
    'c' - 'a' = 99 - 97 = 2
    . . .
    'z' - 'a' = 122 - 97 = 25
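As a small illustration of the mapping, the sketch below wraps f(c) = c - 'a' in a function. The name letter_index and the range check that returns -1 are assumptions added for illustration; the program in this section does not need the check because normalization guarantees lowercase letters.

```cpp
// Hypothetical helper: maps 'a'..'z' to the index values 0..25.
// Returns -1 for any character outside that range (an added safeguard,
// not something the normalized input requires).
int letter_index(char c)
{
    if (c < 'a' || c > 'z')
        return -1;
    return c - 'a';
}
```

Because the function depends only on the character's code, the same letter always produces the same index, which is the property the counter arrays rely on.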
Reading the initial strings and printing the final message are now familiar operations presented with little detail. Program input and output take place in main as illustrated below:
    int main()
    {
        string phrase1 = input(1);                          // a
        string phrase2 = input(2);                          // a

        if (phrase1.length() != phrase2.length())           // b
        {
            cout << "Phrases are NOT an anagram\n";
            exit(1);
        }

        if (is_anagram(phrase1, phrase2))                   // c
            cout << "Phrases are an anagram\n";             // d
        else
            cout << "Phrases are NOT an anagram\n";         // d

        return 0;
    }
    string input(int n)
    {
        string input;                                       // a

        cout << "Please enter phrase " << n << ": ";        // b
        getline(cin, input);                                // c

        return normalize(input);                            // d
    }
Our next step is fully converting the counting and detection algorithms into working C++ functions. The functions also allow us to gain more experience passing arrays to and receiving them from functions. We begin by writing the normalize function, which relies on two C++ API operations:
    string normalize(const string& s)                       // a
    {
        string normalized;                                  // b

        for (size_t i = 0; i < s.length(); i++)             // c
            if (isalpha(s[i]))                              // d
                normalized += tolower(s[i]);                // e

        return normalized;
    }
    void count(const string& s, int* counts)                // a
    {
        for (size_t i = 0; i < s.length(); i++)             // b
            counts[s[i] - 'a']++;                           // c
    }
The is_anagram function below (Figure 10) defines the counter arrays and passes them into this function. s is one of the normalized user input strings. Each counts element is 0 before the loop runs. C++ always passes arrays, like counts, by-pointer.

Our final task is implementing the anagram detection algorithm presented in Figure 4(b). We have already expressed most of the algorithm with C++ code, so little work remains to implement a working function. Nevertheless, we can still make small improvements, as demonstrated by the following implementations.
Version 1

    bool is_anagram(const string& s1, const string& s2)
    {
        int count1[26] {};                                  // a
        count(s1, count1);                                  // b

        int count2[26] {};                                  // a
        count(s2, count2);                                  // b

        bool anagram = true;                                // c
        for (int i = 0; i < 26; i++)                        // d
            anagram = anagram && count1[i] == count2[i];    // e

        return anagram;                                     // f
    }

Version 2

    bool is_anagram(const string& s1, const string& s2)
    {
        int count1[26] {};                                  // a
        count(s1, count1);                                  // b

        int count2[26] {};                                  // a
        count(s2, count2);                                  // b

        for (int i = 0; i < 26; i++)                        // d
            if (count1[i] != count2[i])                     // g
                return false;                               // h

        return true;                                        // i
    }
We first saw C++'s for-range loops in chapter 3 but have ignored them since. We neglected them not because they lack general usefulness, but because they are not useful with the data types introduced before this chapter.
For-range loops operate on data like string objects that are "iterable." Iterable data are objects that define the begin and end functions (see ipalnumber.cpp). For-range loops are especially useful when translating algorithms that use phrases like "for each" (Java calls this construct a for-each loop). The general for-range syntax is:
for (variable-definition : range) { statements; }
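To make the role of begin and end more concrete, the sketch below writes the same loop twice: once with explicit iterators and once as a for-range loop. Roughly speaking, the compiler treats the second form as shorthand for the first. The function names are hypothetical and exist only for this comparison.

```cpp
#include <string>

// Counts occurrences of target with an explicit iterator loop built on
// the string's begin and end functions.
int count_with_iterators(const std::string& s, char target)
{
    int n = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
        if (*it == target)
            n++;
    return n;
}

// The same count written as a for-range loop; c takes the value of each
// character in s, from begin up to (but not including) end.
int count_with_for_range(const std::string& s, char target)
{
    int n = 0;
    for (char c : s)
        if (c == target)
            n++;
    return n;
}
```

Both functions visit the characters in the same order and return the same result; the for-range version simply hides the iterator bookkeeping.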
Two of the preceding algorithms, Figures 1 and 2, relied on a loop that read, in part, "for each letter, c, in ...". For-range loops translate that natural-language statement directly into C++ code, which allows us to rewrite the two functions developed from the algorithms as follows:
(a)

    string normalize(const string& s)
    {
        string normalized;

        for (char c : s)
            if (isalpha(c))
                normalized += tolower(c);

        return normalized;
    }

(b)

    void count(const string& s, int* counts)
    {
        for (char c : s)
            counts[c - 'a']++;
    }
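For reference, the pieces developed in this section can be assembled into one compact sketch. It combines the for-range versions of normalize and count with Version 2 of is_anagram. Two details are assumptions added here rather than part of the figures: names are qualified with std:: instead of relying on a using directive, and characters are cast to unsigned char before calling the <cctype> functions.

```cpp
#include <cctype>
#include <string>

// Keeps only the letters of s and converts them to lowercase.
std::string normalize(const std::string& s)
{
    std::string normalized;
    for (char c : s)
    {
        unsigned char uc = static_cast<unsigned char>(c);  // added safeguard
        if (std::isalpha(uc))
            normalized += static_cast<char>(std::tolower(uc));
    }
    return normalized;
}

// Tallies each lowercase letter of s in counts, which must hold 26
// elements that are all 0 on entry.
void count(const std::string& s, int* counts)
{
    for (char c : s)
        counts[c - 'a']++;
}

// Returns true if the two normalized strings have identical letter counts.
bool is_anagram(const std::string& s1, const std::string& s2)
{
    int count1[26] {};
    count(s1, count1);

    int count2[26] {};
    count(s2, count2);

    for (int i = 0; i < 26; i++)
        if (count1[i] != count2[i])
            return false;       // one mismatch is enough to fail

    return true;
}
```

Calling is_anagram(normalize("Dormitory"), normalize("Dirty Room")) returns true, while adding or removing a single letter from either phrase makes it return false.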
View | Download |
---|---|
anagram.cpp (string class version) | anagram.cpp (string class version) |
canagram.cpp (C-string version) | canagram.cpp (C-string version) |