Chi-square test of independence with examples

Chi-square test of Independence

Assumptions

  • The two variables should be measured at an ordinal or nominal level.
  • Each variable should consist of two or more categories. For example,
    • the variable socio-economic status: low, medium, and high,
    • the variable gender: male, female.

Step by Step Procedure for Chi-square test of independence

Suppose that a given population consisting of $N$ items is divided into $r$ mutually exclusive and exhaustive classes with respect to attribute $A$, say $A_1, A_2,\cdots,A_r$, and that the same population is divided into $c$ mutually exclusive and exhaustive classes with respect to attribute $B$, say $B_1, B_2, \cdots,B_c$. The resulting cross-classification of the $N$ items into $r$ rows and $c$ columns is called an $r\times c$ contingency table.

$$ \begin{array}{c|cccccc|c}
A \,/\, B & B_1 & B_2 & \cdots & B_j & \cdots & B_c & \text{Total} \\ \hline
A_1 & (A_1B_1) & (A_1B_2) & \cdots & (A_1B_j) & \cdots & (A_1B_c) & (A_1) \\
A_2 & (A_2B_1) & (A_2B_2) & \cdots & (A_2B_j) & \cdots & (A_2B_c) & (A_2) \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
A_i & (A_iB_1) & (A_iB_2) & \cdots & (A_iB_j) & \cdots & (A_iB_c) & (A_i) \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
A_r & (A_rB_1) & (A_rB_2) & \cdots & (A_rB_j) & \cdots & (A_rB_c) & (A_r) \\ \hline
\text{Total} & (B_1) & (B_2) & \cdots & (B_j) & \cdots & (B_c) & N
\end{array} $$

In the above table

  • $(A_iB_j)$ is the number of members possessing both attributes $A_i$ and $B_j$,

  • $(A_i)$ is the total frequency of the $i^{th}$ row, i.e., the number of members possessing attribute $A_i$, and

  • $(B_j)$ is the total frequency of $j^{th}$ column i.e., attribute $B_j$. And $N=\sum_{i=1}^r (A_i) =\sum_{j=1}^c (B_j)$.
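A contingency table of this kind is usually obtained by counting raw paired observations. The following is a minimal sketch of that counting step, using only the Python standard library. The function name `contingency_table` and the six sample pairs are hypothetical and serve only as an illustration.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-classify (a, b) observations into an r x c table of counts."""
    counts = Counter(pairs)
    row_levels = sorted({a for a, _ in pairs})   # classes A_1, ..., A_r
    col_levels = sorted({b for _, b in pairs})   # classes B_1, ..., B_c
    table = [[counts[(a, b)] for b in col_levels] for a in row_levels]
    row_totals = [sum(row) for row in table]         # (A_i)
    col_totals = [sum(col) for col in zip(*table)]   # (B_j)
    return row_levels, col_levels, table, row_totals, col_totals

# Hypothetical raw data: one (attribute A, attribute B) pair per item.
pairs = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"),
         ("A1", "B1"), ("A2", "B2"), ("A2", "B2")]

rows, cols, table, row_totals, col_totals = contingency_table(pairs)
n = sum(row_totals)   # equals sum(col_totals), the population size N
```

Both sets of marginal totals add up to the same $N$, which is the identity stated in the last bullet above.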

Step 1 The null and alternative hypotheses:

To test the independence of attributes, the null and alternative hypotheses can be set up as

$H_0$ : The two attributes $A$ and $B$ are independent.

$H_1$ : The two attributes $A$ and $B$ are not independent (they are dependent).

Step 2 Test statistic

In the above contingency table, $(A_iB_j)$, written $O_{ij}$ from here on, denotes the observed frequency of the cell for attributes $A_i$ and $B_j$.

Under the null hypothesis, i.e., attributes $A$ and $B$ are independent, the expected frequency is given by

$$ \begin{aligned} E_{ij}=\frac{(A_i)(B_j)}{N},\; i=1,2, \cdots, r; j=1,2,\cdots, c. \end{aligned} $$

This formula follows from the definition of independence. If two events $A$ and $B$ are independent, then we have

$$ \begin{aligned} P(A\cap B) &= P(A)\times P(B)\\ \implies \frac{n(A\cap B)}{N}&=\frac{n(A)}{N}\times \frac{n(B)}{N}\\ \implies n(A\cap B) &= \frac{n(A) \times n(B)}{N} \end{aligned} $$
where $n(A)$ is the number of elements favorable to $A$ out of $N$.
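The expected-frequency formula translates directly into code. The sketch below is one possible implementation using plain Python lists. The function name `expected_frequencies` and the 2 x 2 counts are illustrative assumptions, not data from this tutorial.

```python
def expected_frequencies(observed):
    """Return E_ij = (row total i) * (column total j) / N for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Illustrative observed counts (row totals 30 and 70, column totals 40 and 60).
observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)
```

The expected table always has the same row and column totals as the observed table, which is a quick sanity check on any hand calculation.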

The test statistic for testing $H_0$, which under $H_0$ approximately follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom, is

$$ \begin{aligned} \chi^2 &= \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\\ &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}}-N \;\sim\;\chi^2_{(r-1)(c-1)}, \end{aligned} $$

where $r$ is the number of rows and $c$ is the number of columns.
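The second form of the statistic is obtained by expanding the square and using the fact that the observed and the expected frequencies both sum to $N$:

$$ \begin{aligned} \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}} &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}} - 2\sum_{i=1}^r\sum_{j=1}^c O_{ij} + \sum_{i=1}^r\sum_{j=1}^c E_{ij}\\ &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}} - 2N + N\\ &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}} - N. \end{aligned} $$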

The calculated value of $\chi^2$ is denoted by $\chi^2_{obs}$.

Step 3 Specify the Level of Significance

Step 4 Critical value of Chi-square

The table value of $\chi^2$ for $(r-1)(c-1)$ degrees of freedom and at $\alpha$ level of significance is $\chi^2_t=\chi^2_{(r-1)(c-1),\alpha}$.

Step 5 Computation of Test Statistic

Compute the observed value of the test statistic from the data:

$$ \begin{aligned} \chi^2_{obs} &= \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}} \end{aligned} $$

Step 6 Decision (Traditional approach)

If $\chi^2_{obs}\leq \chi^2_t$, then fail to reject $H_0$ at $\alpha$ level of significance, i.e., the data are consistent with the two attributes being independent; otherwise reject $H_0$ at $\alpha$ level of significance.

OR

Step 6 Decision (p-value approach)

The p-value of the test is

$$ p = P(\chi^2_{(r-1)(c-1)}\geq\chi^2_{obs}) $$

If $p$-value of the test is less than $\alpha$, then reject the null hypothesis $H_0$ at $\alpha$ level of significance, otherwise fail to reject $H_0$ at $\alpha$ level of significance.
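Both decision rules need the upper tail of the chi-square distribution. In practice this usually comes from a printed table or a statistics package. The sketch below is a standard-library-only alternative, assuming integer degrees of freedom. It uses the known finite-sum closed forms for the tail probability and a simple bisection search for the critical value. The names `chi2_sf`, `chi2_critical` and `decide` are hypothetical helpers introduced here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df >= x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_i (x/2)^(i - 1/2) / Gamma(i + 1/2)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def chi2_critical(alpha, df, tol=1e-10):
    """Value x with P(chi-square_df >= x) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:   # widen the bracket until the tail is below alpha
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def decide(chi2_obs, df, alpha):
    """Apply both the traditional and the p-value decision rule."""
    critical = chi2_critical(alpha, df)
    p_value = chi2_sf(chi2_obs, df)
    return {
        "critical": critical,
        "p_value": p_value,
        "reject_by_critical_value": chi2_obs > critical,
        "reject_H0": p_value < alpha,
    }

# Illustrative inputs only; these numbers are not taken from the examples.
result = decide(chi2_obs=5.0, df=2, alpha=0.05)
```

The two rules agree apart from exact ties, because $\chi^2_{obs}$ exceeds the critical value exactly when the tail area beyond it is smaller than $\alpha$.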

Chi-square test of Independence Example 1

A researcher wishes to understand the relationship between two variables in a sample: gender and preferred mode of public transportation. There are two categories for gender (male, female) and two categories for mode of transportation (bus, train). Counting the observations in each combination of categories gives the table below:

Gender / Transportation  Bus  Train  Total
Male                      50     30     80
Female                    40     80    120
Total                     90    110    200

What is the value of the chi-square test statistic?

Solution

The observed data is

       Bus Train Sum
Male    50    30  80
Female  40    80 120
Sum     90   110 200

Number of rows $r=2$, number of columns $c=2$.

Step 1 The null and alternative hypotheses are as follows:

$H_0:$ The row variable (gender) and column variable (mode of transportation) are independent.

$H_1:$ The row variable (gender) and column variable (mode of transportation) are not independent (they are dependent).

Step 2 Test statistic

The test statistic for testing above hypothesis is

$$ \chi^2= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} $$

Step 3 Level of Significance

The level of significance is $\alpha =0.05$.

Step 4 Critical value of $\chi^2$

The level of significance is $\alpha =0.05$. Degrees of freedom $df=(r-1)(c-1)=(2-1)(2-1) =1$.

The critical value of $\chi^2$ for $df=1$ and $\alpha=0.05$ level of significance is $\chi^2_{0.05,1} =3.8415$.

Step 5 Computation of test Statistic

The expected frequency for $(i,j)^{th}$ cell is given by

$$ E_{ij} =\frac{i^{th}\text{ row total }\times j^{th}\text{ column total}}{N} $$

For example, $E_{11}$ is given by

$$ \begin{aligned} E_{11} & = \frac{1^{st}\text{ row total }\times 1^{st}\text{ column total}}{N}\\ &= \frac{80\times 90}{200}\\ &=36. \end{aligned} $$

Table of Expected Frequencies:

Gender  Bus  Train  Sum
Male     36     44   80
Female   54     66  120
Sum      90    110  200

The test statistic is

$$ \begin{aligned} \chi^2_{obs}&= \sum \sum \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\\ &=\frac{(50-36)^2}{36}+\frac{(30-44)^2}{44}+\frac{(40-54)^2}{54}+ \frac{(80-66)^2}{66}\\ &=\frac{196}{36}+\frac{196}{44}+\frac{196}{54}+\frac{196}{66}\\ &= 16.4983. \end{aligned} $$

Step 6 Decision (Traditional approach)

The test statistic $\chi^2_{obs} =16.4983$ exceeds the critical value $\chi^2_{0.05,1}=3.8415$ and therefore falls inside the critical region, so we $\textit{reject}$ the null hypothesis.

OR

Step 6 Decision ($p$-value approach)

The p-value is $P(\chi^2_{1}>16.4983) =0.00005$.

As the p-value $0.00005$ is $\textit{less than}$ the significance level of $\alpha = 0.05$, we $\textit{reject}$ the null hypothesis.

Interpretation

That is, the data provide evidence that the row variable (gender) and the column variable (mode of transportation) are not independent (they are dependent).
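The figures in this example can be reproduced with a few lines of standard-library Python. This is a sketch rather than a reference implementation. The only ingredient not stated in the example is the closed form $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which holds because a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

observed = [[50, 30],   # Male:   Bus, Train
            [40, 80]]   # Female: Bus, Train

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Per-cell contributions (O - E)^2 / E and their sum.
contributions = [[(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
                 for o_row, e_row in zip(observed, expected)]
chi2_obs = sum(sum(row) for row in contributions)

# With 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_obs / 2))

print(round(chi2_obs, 4), p_value < 0.05)
```

Printing `contributions` shows that the Male/Bus cell, with the smallest expected count, contributes the most to the statistic.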

Chi-square test of Independence Example 2

The National Sleep Foundation used a survey to determine whether hours of sleeping per night are independent of age (Newsweek, January 19, 2004). The following table shows the hours of sleep on weeknights for a sample of individuals age 49 and younger and for a sample of individuals age 50 and older.

Age / Hours of sleep  Less than 6  6 to 6.9  7 to 7.9  8 or more
49 or younger                  33        61        71         75
50 or older                    32        61        70         97

Conduct a test of independence to determine whether the hours of sleep on weeknights are independent of age. Use $\alpha = .05$.

Compute the value of the test statistic and the p-value.

Solution

The observed data is

              Fewer than 6 6 to 6.9 7 to 7.9 8 or more Sum
49 or younger           33       61       71        75 240
50 or older             32       61       70        97 260
Sum                     65      122      141       172 500

Number of rows $r=2$, number of columns $c=4$.

Step 1 The null and alternative hypotheses are as follows:

$H_0:$ The row variable (age) and column variable (hours of sleep on weeknights) are independent.

$H_1:$ The row variable (age) and column variable (hours of sleep on weeknights) are not independent (they are dependent).

Step 2 Test statistic

The test statistic for testing above hypothesis is

$$ \chi^2= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} $$

Step 3 Level of Significance

The level of significance is $\alpha =0.05$.

Step 4 Critical value of $\chi^2$

The level of significance is $\alpha =0.05$. Degrees of freedom $df=(r-1)(c-1)=(2-1)(4-1) =3$.

The critical value of $\chi^2$ for $df=3$ and $\alpha=0.05$ level of significance is $\chi^2_{0.05,3} =7.8147$.

Step 5 Computation of test Statistic

The expected frequency for $(i,j)^{th}$ cell is given by

$$ E_{ij} =\frac{i^{th}\text{ row total }\times j^{th}\text{ column total}}{N} $$

For example, $E_{11}$ is given by

$$ \begin{aligned} E_{11} & = \frac{1^{st}\text{ row total }\times 1^{st}\text{ column total}}{N}\\ &= \frac{240*65}{500}\\ &=31.2. \end{aligned} $$

Similarly one can find the other expected frequencies.

Table of Expected Frequencies:

Age            Fewer than 6  6 to 6.9  7 to 7.9  8 or more  Sum
49 or younger         31.20     58.56     67.68      82.56  240
50 or older           33.80     63.44     73.32      89.44  260
Sum                   65.00    122.00    141.00     172.00  500

The test statistic is

$$ \begin{aligned} \chi^2_{obs}&= \sum \sum \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}\\ &=\frac{(33-31.2)^2}{31.2}+\cdots + \frac{(97-89.44)^2}{89.44}\\ &= 2.0397. \end{aligned} $$
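The dots in the sum above stand for six further terms. A short sketch like the following, using plain Python with no external packages, evaluates all eight of them. It also checks the result against the equivalent form $\sum\sum O_{ij}^2/E_{ij} - N$ given in Step 2 of the procedure. The variable names are illustrative.

```python
observed = [[33, 61, 71, 75],   # 49 or younger
            [32, 61, 70, 97]]   # 50 or older

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pair every observed count with its expected count.
cells = [(o, e)
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]

direct = sum((o - e) ** 2 / e for o, e in cells)   # sum of (O - E)^2 / E
shortcut = sum(o * o / e for o, e in cells) - n    # sum of O^2 / E, minus N

print(round(direct, 4), round(shortcut, 4))
```

The two forms agree up to floating-point rounding. The shortcut form is convenient by hand because it avoids computing the differences $O_{ij}-E_{ij}$.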

Step 6 Decision (Traditional approach)

The test statistic $\chi^2_{obs} =2.0397$ is smaller than the critical value $\chi^2_{0.05,3}=7.8147$ and therefore falls outside the critical region, so we $\textit{fail to reject}$ the null hypothesis.

OR

Step 6 Decision ($p$-value approach)

The p-value is $P(\chi^2_{3}>2.0397) =0.56421$.

As the p-value $0.56421$ is $\textit{greater than}$ the significance level of $\alpha = 0.05$, we $\textit{fail to reject}$ the null hypothesis.

Interpretation

There is not enough evidence to conclude that the hours of sleep on weeknights depend on age; the data are consistent with hours of sleep being independent of age.

Chi-square test of Independence Example 3

Responses to a survey question are broken down according to employment status and the sample results are given below. At the 0.10 significance level, test the claim that response and employment status are independent.

Employment status  Yes  No  Undecided
Employment          30  15          5
Unemployment        20  25         10

Solution

The observed data is

             Yes  No Undecided Sum
Employment    30  15         5  50
Unemployment  20  25        10  55
Sum           50  40        15 105

Number of rows $r=2$, number of columns $c=3$.

Step 1 The null and alternative hypotheses are as follows:

$H_0:$ The row variable (employment status) and column variable (response) are independent.

$H_1:$ The row variable (employment status) and column variable (response) are not independent (they are dependent).

Step 2 Test statistic

The test statistic for testing above hypothesis is

$$ \chi^2= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} $$

Step 3 Level of Significance

The level of significance is $\alpha =0.1$.

Step 4 Critical value of $\chi^2$

The level of significance is $\alpha =0.1$. Degrees of freedom $df=(r-1)(c-1)=(2-1)(3-1) =2$.

The critical value of $\chi^2$ for $df=2$ and $\alpha=0.1$ level of significance is $\chi^2_{0.1,2} =4.6052$.

Step 5 Computation of test Statistic

The expected frequency for $(i,j)^{th}$ cell is given by

$$ E_{ij} =\frac{i^{th}\text{ row total }\times j^{th}\text{ column total}}{N} $$

For example, $E_{11}$ is given by

$$ \begin{aligned} E_{11} & = \frac{1^{st}\text{ row total }\times 1^{st}\text{ column total}}{N}\\ &= \frac{50*50}{105}\\ &=23.81. \end{aligned} $$

Similarly one can determine the other expected frequencies.

Table of Expected Frequencies:

Status             Yes        No  Undecided  Sum
Employment    23.80952  19.04762   7.142857   50
Unemployment  26.19048  20.95238   7.857143   55
Sum           50.00000  40.00000  15.000000  105

The test statistic is

$$ \begin{aligned} \chi^2_{obs}&= \sum \sum \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}\\ &=\frac{(30-23.81)^2}{23.81}+\cdots + \frac{(10-7.86)^2}{7.86}\\ &= 5.942. \end{aligned} $$

Step 6 Decision (Traditional approach)

The test statistic $\chi^2_{obs} =5.942$ exceeds the critical value $\chi^2_{0.1,2}=4.6052$ and therefore falls inside the critical region, so we $\textit{reject}$ the null hypothesis.

OR

Step 6 Decision ($p$-value approach)

The p-value is $P(\chi^2_{2}>5.942) =0.05125$.

As the p-value $0.05125$ is $\textit{less than}$ the significance level of $\alpha = 0.1$, we $\textit{reject}$ the null hypothesis.
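With 2 degrees of freedom, both the critical value and the p-value can be checked without a table. A chi-square variable with 2 degrees of freedom has upper-tail probability $P(\chi^2_2 > x) = e^{-x/2}$, so the critical value solves $e^{-x/2} = \alpha$. The sketch below applies this special case, which is valid only for $df = 2$, to the numbers of this example.

```python
import math

chi2_obs = 5.942   # observed statistic from Step 5
alpha = 0.10       # significance level from Step 3

# For df = 2 only: P(chi2_2 > x) = exp(-x / 2).
critical = -2 * math.log(alpha)     # solves exp(-x / 2) = alpha
p_value = math.exp(-chi2_obs / 2)

reject = p_value < alpha
print(round(critical, 4), round(p_value, 5), reject)
```

Note that the same p-value would not lead to rejection at $\alpha = 0.05$, so the conclusion of this example depends on the chosen significance level.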

Interpretation

We conclude that response and employment status are dependent.

Endnote

In this tutorial, you learned about the chi-square test of independence, the step-by-step procedure for applying it, and three worked examples.


Let me know in the comments if you have any questions on chi-square test of independence and your thought on this article.
