Chi-square test of Independence
Assumptions
- The two variables should be measured at an ordinal or nominal level.
- Each variable should consist of two or more categories. For example,
  - the variable socio-economic status: low, medium, and high;
  - the variable gender: male, female.
Step by Step Procedure for Chi-square test of independence
Suppose that a given population consisting of $N$ items is divided into $r$ mutually exclusive and exhaustive classes with respect to attribute $A$, say, $A_1, A_2,\cdots,A_r$, and the same population is divided into $c$ mutually exclusive and exhaustive classes with respect to attribute $B$, say, $B_1, B_2, \cdots,B_c$. Such an arrangement of $r$ rows and $c$ columns is called an $r\times c$ contingency table.
$A$ / $B$ | $B_1$ | $B_2$ | $\cdots$ | $B_j$ | $\cdots$ | $B_c$ | Total |
---|---|---|---|---|---|---|---|
$A_1$ | $(A_1B_1)$ | $(A_1B_2)$ | $\cdots$ | $(A_1B_j)$ | $\cdots$ | $(A_1B_c)$ | $(A_1)$ |
$A_2$ | $(A_2B_1)$ | $(A_2B_2)$ | $\cdots$ | $(A_2B_j)$ | $\cdots$ | $(A_2B_c)$ | $(A_2)$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\cdots$ | $\vdots$ | $\cdots$ | $\vdots$ | $\vdots$ |
$A_i$ | $(A_iB_1)$ | $(A_iB_2)$ | $\cdots$ | $(A_iB_j)$ | $\cdots$ | $(A_iB_c)$ | $(A_i)$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\cdots$ | $\vdots$ | $\cdots$ | $\vdots$ | $\vdots$ |
$A_r$ | $(A_rB_1)$ | $(A_rB_2)$ | $\cdots$ | $(A_rB_j)$ | $\cdots$ | $(A_rB_c)$ | $(A_r)$ |
Total | $(B_1)$ | $(B_2)$ | $\cdots$ | $(B_j)$ | $\cdots$ | $(B_c)$ | $N$ |
In the above table,
- $(A_iB_j)$ is the number of members possessing both attributes $A_i$ and $B_j$,
- $(A_i)$ is the total frequency of the $i^{th}$ row, i.e., of attribute $A_i$, and
- $(B_j)$ is the total frequency of the $j^{th}$ column, i.e., of attribute $B_j$.
Also, $N=\sum_{i=1}^r (A_i) =\sum_{j=1}^c (B_j)$.
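As a small illustration of this notation, the sketch below computes the row totals $(A_i)$, the column totals $(B_j)$ and $N$ for a table stored as a list of rows. The variable names are illustrative assumptions, and the counts are those of Example 1 later in this tutorial.

```python
# Observed counts: rows are the classes of A, columns are the classes of B.
# These are the counts from Example 1 (gender by mode of transportation).
observed = [
    [50, 30],
    [40, 80],
]

row_totals = [sum(row) for row in observed]        # (A_1), ..., (A_r)
col_totals = [sum(col) for col in zip(*observed)]  # (B_1), ..., (B_c)
n_total = sum(row_totals)                          # N

# N can be obtained from either margin.
assert n_total == sum(col_totals)

print(row_totals, col_totals, n_total)
```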
Step 1 The null and alternative hypotheses:
To test the independence of attributes, the hypotheses can be set up as
$H_0$ : The two attributes $A$ and $B$ are independent.
$H_1$ : The two attributes $A$ and $B$ are not independent (they are dependent).
Step 2 Test statistic
In the above contingency table, let $O_{ij}=(A_iB_j)$ denote the observed frequency of the cell for attributes $A_i$ and $B_j$.
Under the null hypothesis, i.e., attributes $A$ and $B$ are independent, the expected frequency is given by
$$ \begin{aligned} E_{ij}=\frac{(A_i)(B_j)}{N},\; i=1,2, \cdots, r; j=1,2,\cdots, c. \end{aligned} $$
If two events $A$ and $B$ are independent, then we have
$$ \begin{aligned} P(A\cap B) &= P(A)\times P(B)\\ \implies \frac{n(A\cap B)}{N}&=\frac{n(A)}{N}\times \frac{n(B)}{N}\\ \implies n(A\cap B) &= \frac{n(A) \times n(B)}{N} \end{aligned} $$
where $n(A)$ is the number of elements favorable to $A$ out of $N$. Applying this with $A=A_i$ and $B=B_j$ gives the expected frequency $E_{ij}$ stated above.
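The expected-frequency formula can be sketched in a few lines of Python. The function name `expected_frequencies` is an illustrative assumption, and the counts are again those of Example 1 in this tutorial.

```python
def expected_frequencies(observed):
    """Return E[i][j] = (row i total) * (column j total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n_total = sum(row_totals)
    return [[r * c / n_total for c in col_totals] for r in row_totals]

# Counts from Example 1 of this tutorial.
expected = expected_frequencies([[50, 30], [40, 80]])
print(expected)
```

Each row of the result sums to the corresponding observed row total, which is a quick sanity check on any expected table.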
Under the null hypothesis, the test statistic is
$$ \begin{aligned} \chi^2 &= \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\sim\chi^2_{(r-1)(c-1)}\\\nonumber & = \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}}-N\sim\chi^2_{(r-1)(c-1)}. \end{aligned} $$
where $r$ is the number of rows and $c$ is the number of columns.
The calculated value of $\chi^2$ is denoted by $\chi^2_{obs}$.
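The second form of the statistic follows by expanding the square and using the fact that the observed and the expected frequencies both sum to $N$ (for the expected frequencies, $\sum_i\sum_j (A_i)(B_j)/N = N\cdot N/N = N$):

$$ \begin{aligned} \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}} &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}} - 2\sum_{i=1}^r\sum_{j=1}^c O_{ij} + \sum_{i=1}^r\sum_{j=1}^c E_{ij}\\ &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}} - 2N + N\\ &= \sum_{i=1}^r\sum_{j=1}^c\frac{O_{ij}^2}{E_{ij}} - N. \end{aligned} $$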
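Step 3 Specify the Level of Significance
Choose the level of significance $\alpha$ for the test, for example $\alpha = 0.05$ or $\alpha = 0.10$ as in the examples below.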
Step 4 Critical value of Chi-square
The table (critical) value of $\chi^2$ for $(r-1)(c-1)$ degrees of freedom at the $\alpha$ level of significance is $\chi^2_t=\chi^2_{(r-1)(c-1),\alpha}$.
Step 5 Computation of Test Statistic
The observed value of the test statistic is computed from the data as
$$ \begin{aligned} \chi^2_{obs} &= \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}} \end{aligned} $$
Step 6 Decision (Traditional approach)
If $\chi^2_{obs}\leq \chi^2_t$, then fail to reject $H_0$ at the $\alpha$ level of significance, i.e., the data are consistent with the two attributes being independent; otherwise reject $H_0$ at the $\alpha$ level of significance.
OR
Step 6 Decision (p-value approach)
The p-value of the test is
$$ p = P(\chi^2_{(r-1)(c-1)}\geq\chi^2_{obs}) $$
If the $p$-value of the test is less than $\alpha$, then reject the null hypothesis $H_0$ at the $\alpha$ level of significance; otherwise fail to reject $H_0$ at the $\alpha$ level of significance.
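The two decision rules can be sketched with the Python standard library alone. This is a minimal sketch, not a reference implementation: the helper names `chi2_sf` and `chi2_critical` are assumptions made for illustration, the upper-tail probability uses the closed forms for 1 and 2 degrees of freedom plus a standard recurrence for larger integer degrees of freedom, and the critical value is found by bisection. The statistic, degrees of freedom and $\alpha$ are taken from Example 1 below.

```python
import math

def chi2_sf(x, df):
    """P(chi-square with df degrees of freedom >= x), for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for the two base cases.
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(half))   # df = 1
    else:
        k, q = 2, math.exp(-half)              # df = 2
    # Recurrence: going from k to k + 2 degrees of freedom adds one term.
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def chi2_critical(alpha, df, tol=1e-10):
    """Value x with chi2_sf(x, df) == alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:   # widen the bracket until it contains x
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Statistic, degrees of freedom and alpha taken from Example 1.
chi2_obs, df, alpha = 16.4983, 1, 0.05

critical = chi2_critical(alpha, df)
p_value = chi2_sf(chi2_obs, df)

reject_traditional = chi2_obs > critical   # Step 6, traditional approach
reject_p_value = p_value < alpha           # Step 6, p-value approach

print(round(critical, 4), p_value, reject_traditional, reject_p_value)
```

Both rules always lead to the same decision, because $\chi^2_{obs} > \chi^2_t$ exactly when the upper-tail probability of $\chi^2_{obs}$ is below $\alpha$. In practice a statistics library would normally be used instead, for example `scipy.stats.chi2.sf` and `scipy.stats.chi2.ppf` in SciPy (note that `ppf` takes the lower-tail probability $1-\alpha$).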
Chi-square test of Independence Example 1
A researcher wishes to understand the relationship between two variables in a sample: gender and preferred mode of public transportation. There are two categories for gender (male, female) and two categories for mode of transportation (bus, train). Counting the observations in each combination of categories gives the table below:
Gender / Transportation | Bus | Train | Total |
---|---|---|---|
Male | 50 | 30 | 80 |
Female | 40 | 80 | 120 |
Total | 90 | 110 | 200 |
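What is the value of the chi-square test statistic? Test whether gender and mode of transportation are independent, using $\alpha = 0.05$.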
Solution
The observed data is
Gender | Bus | Train | Sum |
---|---|---|---|
Male | 50 | 30 | 80 |
Female | 40 | 80 | 120 |
Sum | 90 | 110 | 200 |
Number of rows $r=2$, number of columns $c=2$.
Step 1 The null and alternative hypotheses are as follows:
$H_0:$ The row variable (gender) and column variable (mode of transportation) are independent.
$H_1:$ The row variable (gender) and column variable (mode of transportation) are not independent (they are dependent).
Step 2 Test statistic
The test statistic for testing the above hypothesis is
$$ \chi^2= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} $$
Step 3 Level of Significance
The level of significance is $\alpha =0.05$.
Step 4 Critical value of $\chi^2$
The degrees of freedom are $df=(r-1)(c-1)=(2-1)(2-1) =1$.
The critical value of $\chi^2$ for $df=1$ and $\alpha=0.05$ level of significance is $\chi^2_{0.05,1} =3.8415$.
Step 5 Computation of test Statistic
The expected frequency for the $(i,j)^{th}$ cell is given by
$$ E_{ij} =\frac{i^{th}\text{ row total }\times j^{th}\text{ column total}}{N} $$
For example, $E_{11}$ is given by
$$ \begin{aligned} E_{11} & = \frac{1^{st}\text{ row total }\times 1^{st}\text{ column total}}{N}\\ &= \frac{80\times 90}{200}\\ &=36. \end{aligned} $$
Similarly one can find the other expected frequencies.
Table of Expected Frequencies:
Gender | Bus | Train | Sum |
---|---|---|---|
Male | 36 | 44 | 80 |
Female | 54 | 66 | 120 |
Sum | 90 | 110 | 200 |
The test statistic is
$$ \begin{aligned} \chi^2_{obs}&= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\\ &=\frac{(50-36)^2}{36}+\cdots + \frac{(80-66)^2}{66}\\ &= 16.4983. \end{aligned} $$
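Writing out all four terms hidden by the ellipsis makes the arithmetic easy to check. In a $2\times 2$ table every cell deviates from its expected frequency by the same absolute amount, here 14:

$$ \begin{aligned} \chi^2_{obs} &= \frac{(50-36)^2}{36}+\frac{(30-44)^2}{44}+\frac{(40-54)^2}{54}+\frac{(80-66)^2}{66}\\ &= \frac{196}{36}+\frac{196}{44}+\frac{196}{54}+\frac{196}{66}\\ &\approx 5.44444+4.45455+3.62963+2.96970\\ &\approx 16.4983. \end{aligned} $$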
Step 6 Decision (Traditional approach)
The test statistic $\chi^2_{obs} =16.4983$ is greater than the critical value $\chi^2_{0.05,1}=3.8415$, so it falls $\textit{inside}$ the critical region and we $\textit{reject}$ the null hypothesis.
OR
Step 6 Decision ($p$-value approach)
The p-value is $P(\chi^2_{1}>16.4983) \approx 0.00005$.
As the p-value $0.00005$ is $\textit{less than}$ the significance level of $\alpha = 0.05$, we $\textit{reject}$ the null hypothesis.
Interpretation
That is, gender and mode of transportation are not independent (they are dependent).
Chi-square test of Independence Example 2
The National Sleep Foundation used a survey to determine whether hours of sleep per night are independent of age (Newsweek, January 19, 2004). The following table shows the hours of sleep on weeknights for a sample of individuals age 49 and younger and for a sample of individuals age 50 and older.
Age / Hours of sleep | Less than 6 | 6 to 6.9 | 7 to 7.9 | 8 or more |
---|---|---|---|---|
49 or younger | 33 | 61 | 71 | 75 |
50 or older | 32 | 61 | 70 | 97 |
Conduct a test of independence to determine whether the hours of sleep on weeknights are independent of age. Use $\alpha = .05$.
Compute the value of the test statistic and the p-value.
Solution
The observed data is
Age | Fewer than 6 | 6 to 6.9 | 7 to 7.9 | 8 or more | Sum |
---|---|---|---|---|---|
49 or younger | 33 | 61 | 71 | 75 | 240 |
50 or older | 32 | 61 | 70 | 97 | 260 |
Sum | 65 | 122 | 141 | 172 | 500 |
Number of rows $r=2$, number of columns $c=4$.
Step 1 The null and alternative hypotheses are as follows:
$H_0:$ The row variable (age) and column variable (hours of sleep on weeknights) are independent.
$H_1:$ The row variable (age) and column variable (hours of sleep on weeknights) are not independent (they are dependent).
Step 2 Test statistic
The test statistic for testing the above hypothesis is
$$ \chi^2= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} $$
Step 3 Level of Significance
The level of significance is $\alpha =0.05$.
Step 4 Critical value of $\chi^2$
The degrees of freedom are $df=(r-1)(c-1)=(2-1)(4-1) =3$.
The critical value of $\chi^2$ for $df=3$ and $\alpha=0.05$ level of significance is $\chi^2_{0.05,3} =7.8147$.
Step 5 Computation of test Statistic
The expected frequency for the $(i,j)^{th}$ cell is given by
$$ E_{ij} =\frac{i^{th}\text{ row total }\times j^{th}\text{ column total}}{N} $$
For example, $E_{11}$ is given by
$$ \begin{aligned} E_{11} & = \frac{1^{st}\text{ row total }\times 1^{st}\text{ column total}}{N}\\ &= \frac{240\times 65}{500}\\ &=31.2. \end{aligned} $$
Similarly one can find the other expected frequencies.
Table of Expected Frequencies:
Age | Fewer than 6 | 6 to 6.9 | 7 to 7.9 | 8 or more | Sum |
---|---|---|---|---|---|
49 or younger | 31.2 | 58.56 | 67.68 | 82.56 | 240 |
50 or older | 33.8 | 63.44 | 73.32 | 89.44 | 260 |
Sum | 65.0 | 122.00 | 141.00 | 172.00 | 500 |
The test statistic is
$$ \begin{aligned} \chi^2_{obs}&= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\\ &=\frac{(33-31.2)^2}{31.2}+\cdots + \frac{(97-89.44)^2}{89.44}\\ &= 2.0397. \end{aligned} $$
Step 6 Decision (Traditional approach)
The test statistic $\chi^2_{obs} =2.0397$ is less than the critical value $\chi^2_{0.05,3}=7.8147$, so it falls $\textit{outside}$ the critical region and we $\textit{fail to reject}$ the null hypothesis.
OR
Step 6 Decision ($p$-value approach)
The p-value is $P(\chi^2_{3}>2.0397) =0.56421$.
As the p-value $0.56421$ is $\textit{greater than}$ the significance level of $\alpha = 0.05$, we $\textit{fail to reject}$ the null hypothesis.
Interpretation
There is not sufficient evidence to conclude that hours of sleep on weeknights depend on age; the data are consistent with hours of sleep being independent of age.
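The whole calculation for this example can be checked with a short standard-library script. It is a sketch under stated assumptions: the variable names are illustrative, and the p-value line uses a closed form for the chi-square upper tail that is valid only for 3 degrees of freedom, as in this example.

```python
import math

observed = [
    [33, 61, 71, 75],   # 49 or younger
    [32, 61, 70, 97],   # 50 or older
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected frequencies under independence.
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_obs = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form upper-tail probability, valid only for df = 3.
half = chi2_obs / 2.0
p_value = (math.erfc(math.sqrt(half))
           + math.sqrt(2.0 * chi2_obs / math.pi) * math.exp(-half))

print(round(chi2_obs, 4), df, round(p_value, 4))
```

If SciPy is available, `scipy.stats.chi2_contingency(observed)` should report the same statistic, p-value, degrees of freedom and expected table; its continuity correction is applied only when there is 1 degree of freedom, so it does not affect this example.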
Chi-square test of Independence Example 3
Responses to a survey question are broken down according to employment status, and the sample results are given below. At the 0.10 significance level, test the claim that response and employment status are independent.
Employment status / Response | Yes | No | Undecided |
---|---|---|---|
Employment | 30 | 15 | 5 |
Unemployment | 20 | 25 | 10 |
Solution
The observed data is
Employment status | Yes | No | Undecided | Sum |
---|---|---|---|---|
Employment | 30 | 15 | 5 | 50 |
Unemployment | 20 | 25 | 10 | 55 |
Sum | 50 | 40 | 15 | 105 |
Number of rows $r=2$, number of columns $c=3$.
Step 1 The null and alternative hypotheses are as follows:
$H_0:$ The row variable (employment status) and column variable (response) are independent.
$H_1:$ The row variable (employment status) and column variable (response) are not independent (they are dependent).
Step 2 Test statistic
The test statistic for testing the above hypothesis is
$$ \chi^2= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} -E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} $$
Step 3 Level of Significance
The level of significance is $\alpha =0.1$.
Step 4 Critical value of $\chi^2$
The degrees of freedom are $df=(r-1)(c-1)=(2-1)(3-1) =2$.
The critical value of $\chi^2$ for $df=2$ and $\alpha=0.1$ level of significance is $\chi^2_{0.1,2} =4.6052$.
Step 5 Computation of test Statistic
The expected frequency for the $(i,j)^{th}$ cell is given by
$$ E_{ij} =\frac{i^{th}\text{ row total }\times j^{th}\text{ column total}}{N} $$
For example, $E_{11}$ is given by
$$ \begin{aligned} E_{11} & = \frac{1^{st}\text{ row total }\times 1^{st}\text{ column total}}{N}\\ &= \frac{50\times 50}{105}\\ &\approx 23.81. \end{aligned} $$
Similarly one can determine the other expected frequencies.
Table of Expected Frequencies:
Employment status | Yes | No | Undecided | Sum |
---|---|---|---|---|
Employment | 23.80952 | 19.04762 | 7.142857 | 50 |
Unemployment | 26.19048 | 20.95238 | 7.857143 | 55 |
Sum | 50.00000 | 40.00000 | 15.000000 | 105 |
The test statistic is
$$ \begin{aligned} \chi^2_{obs}&= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\\ &=\frac{(30-23.81)^2}{23.81}+\cdots + \frac{(10-7.86)^2}{7.86}\\ &= 5.942. \end{aligned} $$
Step 6 Decision (Traditional approach)
The test statistic $\chi^2_{obs} =5.942$ is greater than the critical value $\chi^2_{0.1,2}=4.6052$, so it falls $\textit{inside}$ the critical region and we $\textit{reject}$ the null hypothesis.
OR
Step 6 Decision ($p$-value approach)
The p-value is $P(\chi^2_{2}>5.942) =0.05125$.
As the p-value $0.05125$ is $\textit{less than}$ the significance level of $\alpha = 0.1$, we $\textit{reject}$ the null hypothesis.
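For 2 degrees of freedom the chi-square upper-tail probability has the simple closed form $P(\chi^2_2 > x) = e^{-x/2}$, so the p-value in this example can be checked by hand:

$$ \begin{aligned} p &= P(\chi^2_{2}>5.942)\\ &= e^{-5.942/2}\\ &= e^{-2.971}\\ &\approx 0.0513. \end{aligned} $$

Note that the same data would not lead to rejection at $\alpha = 0.05$, since $0.0513 > 0.05$; the conclusion here depends on the stated 0.10 significance level.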
Interpretation
We conclude that response and employment status are dependent.
Endnote
In this tutorial, you learned about the chi-square test of independence, the step-by-step procedure for applying it, and three solved examples.