8.7.3. Software Development: The Anagram Problem

anagram, software development

"An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once." The process ignores spaces, punctuation characters, and character cases (upper or lower). For example, the letters in the phrase, "See the quick red fox jump over the lazy brown dog," can be rearranged to form the rather bland anagram "abcddeeeeeefghhijklmnoooopqrrrsttuuvwxyz." Cleaver anagrams are more challenging, forming a valid word or statement that is often a humorous comment on the first phrase. For example, an anagram for "Dormitory" is "Dirty Room." Some anagram aficionados have way too much time on their hands, as the following clever anagram illustrates:

Phrase
To be or not to be: that is the question, whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune.
Anagram
In one of the Bard's best-thought-of tragedies, our insistent hero, Hamlet, queries on two fronts about how life turns rotten.

The current section develops a program that determines if one phrase is an anagram of another, focusing on developing and representing a solution before programming it. It decomposes the overall problem into smaller subproblems and develops algorithms to solve each of them. The final program demonstrates strings, arrays, and functions operating on them.

Solving The Anagram Problem

anagram solution, designing an anagram solution, abstract program solution, abstract anagram solutions

Designing and implementing a program to verify two phrases form an anagram requires detail beyond the initial definition or general problem statement. In an authentic situation, software developers often create multiple views of a solution using various notations. Some views express an overall solution, others focus on sub-problems, and each may target a different audience or serve a different purpose. Similarly, depending on a program's complexity, software developers may represent it at various levels of abstraction or degrees of specificity. Higher levels of abstraction allow developers to focus on what the program does, while lower levels focus on how it does it.

Structured Prose

structured prose, prose anagram solution, natural language

Most audiences can follow a solution stated in prose (1(a)), but the complexity of natural languages often makes these solutions ambiguous and subject to various interpretations. Structuring the prose, such as with a hierarchical list, increases precision and narrows the interpretations.

  1. Prompt the user to enter and read two strings from the console (a familiar operation)
  2. Normalize each string to an easily-compared standard form:
    • Remove all space and punctuation characters
    • Convert all letters to a single case (either upper or lower)
  3. To qualify as an anagram, both strings must contain the same number of each alphabetic letter: the program counts the number of occurrences of each letter (the number of a's, the number of b's, etc.) in each string
  4. Compare the character counts; the strings are an anagram if and only if all counts are equal
Structured prose anagram solution. A structured prose solution is suitable for a client or a subject-matter expert (SME) to understand and verify a solution's accuracy. It is also suitable for guiding the overall development and organization of a program. However, it is often too abstract and lacks sufficient detail for programmers to implement in a computer program. Developers often find a lower-level representation, one closer to a program or function, a useful step to transition from a solution to a program.

Pseudocode

pseudocode, algorithms, anagram algorithms
Pseudocode is a description of the steps in an algorithm using a mix of conventions of programming languages (like assignment operator, conditional operator, loop) with informal, usually self-explanatory, notation of actions and conditions. Although pseudocode shares features with regular programming languages, it is intended for human reading rather than machine control. Pseudocode typically omits details that are essential for machine implementation of the algorithm, meaning that pseudocode can only be verified by hand.

Pseudocode is an intermediate abstract representation that is generally more precise than prose while remaining high enough to retain a focus on logic without becoming mired in detail. As such, pseudocode is an ideal notation for developing and conveying algorithms. The following examples use pseudocode to express the algorithms for solving the anagram problem. This approach allows developers to refine algorithms before spending time programming them.

Normalizing The Input

anagram input, normalizing anagram input, anagram normalization algorithm

Comparing the two candidate strings that potentially form an anagram is easier if they are in the same format. The first algorithm normalizes the strings. To normalize the strings means "to make [them] conform to or reduce [them] to a norm or standard" form. For the anagram program, a convenient normalized form eliminates spaces and punctuation, and converts the remaining letters to the same case - the algorithm arbitrarily chooses lowercase. If implemented as a function, the algorithm can process both candidate strings.

define the variable phrase and initialize it to empty

for each character, c, in the input
{
	if c is an alphabetic letter
	{
		make c lowercase
		append c to phrase
	}
}
The anagram normalization algorithm. The algorithm creates a new, empty phrase to store the normalized input, retaining the original for later use. For each character in the input, if that character is alphabetic ('A'-'Z' or 'a'-'z'), the algorithm converts it to a lowercase letter and appends it to the new phrase. It omits non-alphabetic characters. After normalization, the two example strings become:
tobeornottobethatisthequestionwhetheritsnoblerinthemindtosuffertheslingsandarrowsofoutrageousfortune
inoneofthebardsbestthoughtoftragediesourinsistentherohamletqueriesontwofrontsabouthowlifeturnsrotten

Counting Letter Occurrences

counter letter occurrences, array-elements as counters, counting anagram characters

Once the previous algorithm has removed the spaces and punctuation characters from the candidate phrases and converted all the letters to the same (lower) case, the next step is counting the occurrences of each letter: the number of a's, b's, ..., and the z's. Consequently, to test for anagrams in English, the algorithm requires 26 counters - one counter for each letter in the (English) alphabet.

define and initialize 26 counters:
    a_count = 0, b_count = 0, ..., z_count = 0

for each letter, c, in a normalized string
{
	if (c == 'a')
		a_count++;
	else if (c == 'b')
		b_count++;
	. . . .
	else
		z_count++;
}
define an array: int count[26]
initialize each array element to 0

for each letter, c, in a normalized string
{
	if (c == 'a')
		count[0]++;
	else if (c == 'b')
		count[1]++;
	. . . .
	else
		count[25]++;
}
(a)(b)
Algorithm for counting letter occurrences. The algorithm illustrates the operations needed to count the occurrences of each letter in a normalized string. Both versions require 26 counters - accumulators that must be initialized to 0 - one for each letter in the English alphabet. The two versions are structurally the same, relying on an if-else ladder to determine which counter to increment. The ladder is long, cumbersome, and error-prone, but the second version lays the foundation for a more compact and robust solution.

Detecting An Anagram: Comparing The Letter Counts

detecting an anagram, if-else ladder, compound expression

If both candidate phrases have the same number of each letter, they form an anagram. Therefore, comparing pairs of letter counts is sufficient for assessing the candidates. All twenty-six counts must be equal to assert that the candidates form an anagram. However, a single differing count indicates that they are not an anagram.

if (a_count1 != a_count2)
    return false;
else if (b_count1 != b_count2)
    return false;
        . . .
else if(z_count1 != z_count2)
    return false;
else
    return true;
if (a_count1 == a_count2 && b_count1 == b_count2 &&
    . . . && z_count1 == z_count2)
	cout return true
else
	cout return false
for (int i = 0; i < 26; i++)
    if (count1[i] != count2[i])
        return false;
return true;
(a)(b)(c)
Algorithm accepting or rejecting an anagram. The anagram detection algorithm assumes two sets of counters: The (a) and (b) versions denote them as *_count1 and *_count2, respectively. The (c) version uses two arrays named count1 and count2, respectively. All three versions of the algorithm compare pairs of corresponding counters, one pair for each letter in the English alphabet. Stating each version as a function allows the algorithm to make a determination and end when it detects the first unequal pair.
  1. The first version continues using the if-else ladder and the discrete counters. This version has the advantage of using familiar constructs, making it easy to understand, but the ladder and the discrete counters also make it tedious and error-prone.
  2. The second version replaces the if-else ladder with a sequence of conjunctions: E1 && E2 && E3 .... Each sub-expression, En, must evaluate to true for the overall expression to be true. C++ evaluates the sub-expressions left-to-right, stopping when it determines the overall result. Each sub-expression is a test for equality: E1 == E2, which has a higher precedence than the logical-AND operator. Nevertheless, the complete compound expression is long, unwieldy, and error-prone.
  3. The array-based anagram detection algorithm is more compact, robust, and elegant.

Discrete counters clearly match features of the original problem, making them easy to understand. However, each algorithm that uses them suffers from large, awkward programming structures whose very size allows programmers to introduce trivial but difficult-to-find errors. Replacing the discrete counters with two arrays, one for each set of counters, allows programmers to replace the if-ladders and long compound expressions with compact loops. However, the array-based algorithm used to generate the initial counts, shown in Figure 3(b), still relies on an extensive if-else ladder. To take full advantage of the array-based solution requires an easy way to map a letter to an appropriate array index. An efficient mapping allows programmers to collapse the if-else ladder to a single operation.

Mapping Characters To Array Indexes

mapping characters to array indexes, ASCII encoding, converting letters to numbers

To implement the itoa function, the text used ASCII codes to convert an integer into digits by recognizing that the codes for the digits '0' through '9' are contiguous. The numeric value of a specific digit, d, added to the ASCII code for the digit '0,' is the ASCII code for the digit: d + '0' = ASCII(d). The ASCII codes for letters form two contiguous integer ranges: 'A' = 65 to 'Z' = 90, and 'a' = 97 to 'z' = 122. The normalizing algorithm (Figure 2) ensures that the counting algorithm only needs to handle lowercase letters.

f('a') = 0
f('b') = 1
. . .
f('z') = 25
f(c) = c - 'a'
'a' - 'a' = 97 - 97 = 0
'b' - 'a' = 98 - 97 = 1
'c' - 'a' = 99 - 97 = 2
	. . .
'z' - 'a' = 122 - 97 = 25
(a)(b)(c)
Mapping letters to indexes. C++ arrays are zero-indexed, so the valid index values for the counter arrays range from 0 to 25. The algorithm needs to uniquely and consistently map the letters 'a' through 'z' into the same index values 0 through 25. Whereas the itoa function converts an integer to an ASCII character by adding the ASCII code for the character '0' to the integer, the mapping function operates in the opposite direction - converting a character to an integer. Therefore, the mapping function subtracts the ASCII code for 'a' from the character.
  1. The abstract mapping function illustrates the necessary conversion, uniquely and consistently mapping each lowercase letter to an integer in the range 0 to 25.
  2. A concrete mapping function where c is a variable that stores the character mapped to an index value. The result of c-'a' is an integer in the range of 0-25. The mapping is consistent because 'a' is a constant value.
  3. An expanded illustration of the mapping. C++ automatically converts a character to its ASCII encoding when used in an arithmetic expression. The ASCII code for 'a' is 97, and the difference between any lowercase letter and 'a' is an integer in the range 0-25.
The normalization algorithm arbitrarily converts all letters to lowercase. If it instead normalizes them to uppercase, update the mapping function by replacing 'a' with 'A.'

From Algorithms To C++ Functions

anagram program, converting algorithms to C++, <cctype>, isalpha, tolower, string concatenation with assignment, string index operator, +=, []

Reading the initial strings and printing the final message are now familiar operations presented with little detail. Program input and output take place in main as illustrated in the following figure:

int main()
{
	string phrase1 = input(1);			// (a)
	string phrase2 = input(2);			// (a)

	if (phrase1.length() !=  phrase2.length())	// (b)
	{
		cout << "Phrases are NOT an anagram\n";
		exit(1);
	}

	if (is_anagram(phrase1, phrase2))		// (c)
		cout << "Phrases are an anagram\n";	// (d)
	else
		cout << "Phrases are NOT an anagram\n";	// (d)

	return 0;
}
main: A program accepting or rejecting proposed anagrams. The highlighted functions are part of the program, defined in subsequent figures.
  1. The input function normalizes and returns the user's input. The function uses the arguments "1" and "2" to label the input prompt.
  2. The normalized strings must be the same length to form an anagram. The test is optional, as the next if-statement will detect at least one unequal count, but testing the strings' lengths is fast, saving the effort of an unnecessary count.
  3. The is_anagram function returns true if the two phrases form an anagram; otherwise, it returns false.
  4. The program output.

 

string input(int n)
{
	string	input;	 				// (a)

	cout << "Please enter phrase " << n << ": ";	// (b)
	getline(cin, input);				// (c)

	return normalize(input);			// (d)
}
input: Entering program input. The input function reads data from the console and calls the normalize function (highlighted), converting the data to a normalized or standard form.
  1. A local object to temporarily contain the input.
  2. A simple user prompt labeled with the function argument.
  3. The function uses the string version of getline, reading past the spaces that would interrupt input.
  4. The normalize function converts the input to a normal form. The return operator returns input by value.

 

string normalize(const string& s)			// (a)
{
	string normalized;				// (b)

	for (size_t i = 0; i < s.length(); i++)		// (c)
		if (isalpha(s[i]))			// (d)
			normalized += tolower(s[i]);	// (e)

	return normalized;
}
normalize: Implementing the normalization algorithm. The normalization algorithm, Figure 2, focused on what the algorithm needed to accomplish, but not how to do it.
  1. The isalpha function returns true if its argument is an alphabetic character. The if-statement skips or filters out spaces and punctuation characters.
  2. The tolower function converts its argument to lowercase. It returns uppercase and non-letter characters unchanged.
  1. The function normalizes the argument s. The function passes the string by reference for efficiency.
  2. Creates an empty string object.
  3. Iterates or loops over each character in the string.
  4. Discards all space and punctuation characters.
  5. Converts all letters to lowercase and appends them to the end of the normalized string.
isalpha and tolower require the <cctype> header file.

 

void count(const string& s, int* counts)		// (a)
{
	for (size_t i = 0; i < s.length(); i++)		// (b)
		counts[s[i] - 'a']++; 			// (c)
}
:count: Implementing the letter count algorithm. The letter-counting function combines the initial array-based counting algorithm (Figure 3(b)) with the letter-mapping algorithm (Figure 5) to collapse the counting step into a single statement!
  1. The program defines and initializes the parameters in is_anagram. It passes s, one of the candidate phrases, by reference for efficiency. C++ always passes arrays, such as counts, by pointer, making them INOUT parameters.
  2. The for-loop iterates over each character in the string.
  3. The expression s[i] - 'a' converts a letter from a candidate phrase into an array index. The auto-increment operator, ++, increments the corresponding letter's count.

 

bool is_anagram(const string& s1, const string& s2)
{
    int count1[26] {};				// (a)
    count(s1, count1);				// (b)

    int count2[26] {};				// (a)
    count(s2, count2);				// (b)

    for (int i = 0; i < 26; i++)		// (c)
        if (count1[i] != count2[i])		// (d)
            return false;			// (e)

    return true;				// (f)
}
is_anagram: Implementing the anagram detection algorithm.
  1. Defines an array and initializes each element to 0.
  2. Counts the letter frequencies in the normalized string.
  3. Steps through each of the counter pairs.
  4. Compares the counts with algorithm Figure 4(c).
  5. Interrupts the loop if the counts are unequal.
  6. The function signals that the strings form an anagram if the loop completes without finding an unequal pair.

For-Range Loops

for-range loop examples

The text introduced for-range loops in Chapter 3, but has ignored them since. The neglect isn't due to a lack of usefulness but from the kind of data used throughout the first half of the text. For-range loops operate on data like string objects that are "iterable," meaning that they define the begin and end functions (see ipalnumber.cpp). For-range loops are especially useful when translating algorithms that use the phrase "for each."

for (variable-definition : range)
{
	statements;
}
For-range loop syntax. For-range loops iterate over each element in the range, one element per iteration. The red-colored characters are required parts of the loop syntax.
variable-definition
Define a variable whose type is the same as the elements stored in the range. The loop stores each element in the variable as it processes it.
range
Names an instance of an iterable class such as a string, vector, or array.

The text described two algorithms, normalization and letter counting, Figures 2 and 3, respectively, saying for each letter, c, in .... This description suggests that the implementations of these algorithms may be amenable to using for-range loops.

string normalize(const string& s)
{
	string normalized;

	for (char c : s)
		if (isalpha(c))
			normalized += tolower(c);

	return normalized;
}
void count(const string& s, int* counts)
{
	for (char c : s)
		counts[c - 'a']++;
}
 
 
 
 
 
For-range implementations of anagram algorithms. Although replacing traditional for-loops with for-range loops isn't a significant change, they are often a better match for the natural-language problem or algorithm descriptions.

Downloadable Code

ViewDownloadComments
anagram.cpp anagram.cpp string class version
canagram.cpp canagram.cpp C-string version