7.8.3. Using Arrays: The Chi-Square Statistic

array, chi-square, sigma, summation, Σ, expected frequency, observed frequency, degrees of freedom, categories, bins, formulas, converting formulas to C++, +=, χ²

At the end of the semester, quarter, or block, students can evaluate each course, expressing their opinions about its various aspects. The evaluations typically include Likert-scale questions with a statement and a ranked list of discrete responses (similar to multiple-choice questions). For example, "Did the instructor dress appropriately for class?" The evaluation device might give students four choices: (a) always, (b) most of the time, (c) seldom, and (d) never. Evaluations help institutions select and train qualified instructors and help instructors improve courses, but only if the data is valid. If students rush through the evaluation, making random selections, the evaluation data is meaningless. To simplify the example, we assume that 100 students respond, with 40 choosing (a), 30 choosing (b), 20 choosing (c), and 10 choosing (d).

The chi-square statistic (kī, like kite) is appropriate for evaluating discrete, categorical data by comparing observed and expected frequencies. It calculates a value that, when compared with a critical value from a chi-square distribution table, indicates with some level of confidence whether the difference between the observed and expected frequencies is significant. The example has one question with four categories, "bins," or possible answers. For this kind of test, the degrees of freedom used to look up the critical value is one less than the number of categories (3 in this example). If the responses are random, we expect students to select each choice about the same number of times. So, the expected frequency, f_e, is 100/4 or 25 per bin. If the observed frequencies, f_o, are close to the expected frequency, the responses to that question are not significant (they do not demonstrate any useful information). But even if the data are truly random, the observed and expected frequencies are unlikely to be exactly equal. The chi-square statistic attempts to distinguish between random and meaningful data with some degree of confidence.

\[f_e = {N \over k} \]

The expected frequency. If the events occur randomly with equal probability, then the expected frequency is the total number of responses or observations, N, divided by the number of categories or bins, k. For example, if we roll a fair die, the probability of each face showing is 1/6; if we roll it 100 times, the expected frequency of each possible value is 100/6, or about 16.7.

\[\chi^2 = \sum { (f_o - f_e)^2 \over f_e } = {1 \over f_e} \sum (f_o - f_e)^2 \]

The chi-square statistic. In the chi-square (χ²) formula, f_o is the observed frequency (the count of how many students chose a category), with one value for each bin or category; f_e, the expected frequency, is the same for all categories, which is why 1/f_e can be factored out of the sum. The Σ operator means to sum the results of the formula over every observed frequency.
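
As a worked example, substituting the evaluation data from the introduction into the formula (N = 100 and k = 4, so f_e = 25, with observed frequencies 40, 30, 20, and 10) gives:

\[\chi^2 = {(40-25)^2 + (30-25)^2 + (20-25)^2 + (10-25)^2 \over 25} = {225 + 25 + 25 + 225 \over 25} = {500 \over 25} = 20 \]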

The formulas might look intimidating, depending on your mathematical background. However, once you understand what the arcane symbols mean, you'll see that they translate directly to corresponding C++ features related to arrays. The formulas relate two frequencies, one expected and the other observed, and distinguish them with subscripts. The expected frequency is the same for each category: the total number of observations or participants (100) divided by the number of categories (4), or 25. The observed frequencies are the number of times a participant selected each category: 40, 30, 20, and 10. In this context, subscripts are a typographic device that C++ can't replicate, so it's customary to form variable names by "hoisting" the subscript up to "regular" text: f_e and f_o become fe and fo, respectively. fe is a single value, and fo is an array of size k, the number of categories or bins (4 in this example).

int k;
...
int N;
...
double fe = (double)N / k;

Programming the expected frequency. The expected frequency is the quotient of the number of responses, N, and the number of categories, k. The program defines these variables as integers, so a typecast is necessary to avoid the truncation error that integer division would cause.

double    sum = 0;
for (int i = 0; i < k; i++)
    sum += pow(fo[i] - fe, 2);

double chi2 = sum / fe;

Programming chi-square. Each category has a corresponding observed frequency, stored in the fo array, and forming one term in the calculation: (fo[i]-fe)². The Σ operator sums the terms to a single value. Chi-square is the quotient of the sum and the expected frequency: sum/fe.

Of all the symbols appearing in the chi-square formula (Figure 2), the Σ operator condenses the most meaning into the least amount of space. It implies iterating over the elements in the fo array, calculating a term during each iteration, and summing the terms. Consequently, programmers translate Σ to C++ with a for-loop and the addition with assignment operator: +=.
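
The chi-square formula appears in two equivalent forms: one divides each term by f_e inside the sum, and the other factors 1/f_e out and divides once. The following sketch, under the assumption that fe has already been calculated, shows how each form might translate to a loop. The function names chi2_termwise and chi2_factored are hypothetical.

```cpp
#include <cmath>

// Left-hand form of the formula: divide every term by fe inside the loop.
double chi2_termwise(const int* fo, int k, double fe)
{
    double sum = 0;
    for (int i = 0; i < k; i++)
        sum += std::pow(fo[i] - fe, 2) / fe;
    return sum;
}

// Right-hand form: sum the squared differences, then divide by fe once.
double chi2_factored(const int* fo, int k, double fe)
{
    double sum = 0;
    for (int i = 0; i < k; i++)
        sum += std::pow(fo[i] - fe, 2);
    return sum / fe;
}
```

Both functions return 20 for the example frequencies 40, 30, 20, and 10 with fe = 25. The factored form performs one division instead of k, which is the form the programs in this section use.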

Chi-square, Version 1

#include <iostream>
#include <cmath>
#include <cstdlib>							// for exit
using namespace std;

int main()
{
    cout << "Number of categories (0 < k <= 100): ";			// the number of categories, k
    int    k;
    cin >> k;

    if (k < 1 || k > 100)						// validate k to avoid an out-of-bounds error
    {
        cerr << "The number of categories must be 0 < k <= 100";
        exit(1);
    }

    int    fo[100];							// prepare for data input
    int    N = 0;

    for (int i = 0; i < k;)						// input the observed frequencies;
    {									// the loop doesn't increment i
        cout << "Frequency for category " << i+1 << ": ";		// in case the frequency is invalid
        int f;
        cin >> f;

        if (f < 0)							// validate each frequency
        {
            cerr << "Negative frequencies are not allowed";
            continue;
        }

        fo[i++] = f;							// store valid frequencies and increment i
        N += f;
    }

    double    fe = (double)N / k;					// calculate chi-square
    double    sum = 0;
    for (int i = 0; i < k; i++)
        sum += pow(fo[i] - fe, 2);

    cout << "Chi-square = " << sum / fe << endl;

    return 0;
}

The chi-square solution as a single function. The focus of the first version is on converting the formulas into a working C++ program that is not restricted to processing Likert-scale data. Consequently, it makes the observed frequency array, fo, large enough (100 elements) to handle many other problems.

When we run the program with four categories and the frequencies 40, 30, 20, and 10, the output is 20. To answer the ultimate question, "Are the observed frequencies significant?" we compare the chi-square statistic to a critical value from a chi-square distribution table, selected by the degrees of freedom and the desired significance level. If the calculated value exceeds the critical value in the table, the observed frequencies are significant and unlikely to be random.
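
The table lookup can also be sketched in code. The example below is an illustration under several assumptions: it uses the commonly published critical values for the 0.05 significance level, it covers only 1 through 8 degrees of freedom, and it takes the degrees of freedom to be one less than the number of categories. The names CRITICAL_05 and is_significant are hypothetical.

```cpp
#include <stdexcept>

// Critical values of the chi-square distribution at the 0.05 significance
// level for 1 through 8 degrees of freedom (standard published table values).
const double CRITICAL_05[] = { 3.841, 5.991, 7.815, 9.488,
                               11.070, 12.592, 14.067, 15.507 };

// Hypothetical helper: returns true if chi2 exceeds the 0.05 critical value
// for k categories, assuming k - 1 degrees of freedom.
bool is_significant(double chi2, int k)
{
    int df = k - 1;
    if (df < 1 || df > 8)                       // outside this small table
        throw std::out_of_range("degrees of freedom must be 1 through 8");

    return chi2 > CRITICAL_05[df - 1];          // the table starts at df = 1
}
```

For the example, is_significant(20, 4) compares 20 with 7.815, the critical value for 3 degrees of freedom, and returns true, suggesting the responses are unlikely to be random.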

Chi-square, Version 2

The previous solution is short enough to fit in a single function. Nevertheless, separating the code into a client-supplier architecture has two advantages. First, it makes the chi-square calculation (the supplier) easier to reuse in other programs, perhaps ones unrelated to Likert-scale questionnaires. Second, it allows developers to tailor the client to a specific problem and determine the data's source (console, file, etc.) and the statistic's ultimate use (output or further calculations).

likert.cpp (Client)

#include <iostream>
#include <cstdlib>						// for exit
using namespace std;

double	chi2(int* fo, int k, double N);			// defined in chi2.cpp

int main()
{
    cout << "Number of categories (0 < k <= 9): ";
    int    k = 0;
    cin >> k;

    if (k < 1 || k > 9)					// validate k to avoid an out-of-bounds error
    {
        cerr << "The number of categories is out of bounds";
        exit(1);
    }

    int    fo[9];
    int    N = 0;

    for (int i = 0; i < k;)				// i only advances after a valid frequency
    {
        cout << "Frequency for category " << i+1 << ": ";
        int f;
        cin >> f;

        if (f < 0)
        {
            cerr << "Negative frequencies are not allowed";
            continue;
        }

        fo[i++] = f;
        N += f;
    }

    cout << "Chi-square = " << chi2(fo, k, N) << endl;

    return 0;
}

chi2.cpp (Supplier)

#include <cmath>
using namespace std;

double chi2(int* fo, int k, double N)
{
    double fe =  N / k;

    double    sum = 0;
    for (int i = 0; i < k; i++)
        sum += pow(fo[i] - fe, 2);

    return sum / fe;
}

Client-supplier chi-square solution. The second version moves the chi-square calculation to a separate function, making it easier to reuse. It also renames the application or client to likert.cpp, suggesting a narrower use of the chi-square program. Likert-scale questionnaires typically have an odd number of categories, generally 5 or 7, but 9 on rare occasions. Accordingly, this version shrinks the fo array to a more realistic size. The explicit typecast in the previous example, (double)N, is no longer needed: the function call implicitly converts the client's integer argument N to the double parameter.
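
One possible variation on the supplier, offered only as a sketch, accepts a std::vector instead of an array and a count. This assumes the reader is free to use std::vector, which this section does not otherwise rely on. The function name chi2_vector is hypothetical, and returning 0 when there is no data is an assumption made here to avoid dividing by zero.

```cpp
#include <cmath>
#include <vector>

// Hypothetical variation of the supplier: the vector knows its own size,
// and the function totals the frequencies itself, so the client passes
// only the observed frequencies.
double chi2_vector(const std::vector<int>& fo)
{
    double N = 0;
    for (int f : fo)                    // total number of observations
        N += f;

    if (fo.empty() || N == 0)           // assumption: no data means chi-square is 0
        return 0.0;

    double fe = N / fo.size();          // expected frequency, N / k

    double sum = 0;
    for (int f : fo)                    // the sigma operator as a loop
        sum += std::pow(f - fe, 2);

    return sum / fe;
}
```

A client could call chi2_vector(std::vector<int>{ 40, 30, 20, 10 }) and get 20, the same result as the array-based versions. The trade-off is that the client no longer controls N, which the array-based supplier accepts as a parameter.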