# Hypergeometric distribution probabilities using R

In this tutorial, you will learn how to use the dhyper(), phyper(), qhyper() and rhyper() functions in R to compute individual probabilities, cumulative probabilities and quantiles, and to generate random samples, for the Hypergeometric distribution.

Before discussing the R functions for the Hypergeometric distribution, let us review what the Hypergeometric distribution is.

## Hypergeometric Distribution

A hypergeometric experiment is an experiment which satisfies each of the following conditions:

• The population or set to be sampled consists of $m+n$ objects, or elements (a finite population).
• Each object can be characterized as a "success" or a "failure"; the population contains $m$ successes and $n$ failures.
• A sample of $k$ individuals is drawn in such a way that each subset of size $k$ is equally likely to be chosen.
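
The three conditions above can be mimicked directly with sample(), which draws without replacement by default. The following is only a sketch under assumed values ($m=5$, $n=11$, $k=4$); the object names `population`, `one_draw`, `counts` and `rel_freq` and the seed are illustrative choices, not part of any R API.

```r
# Illustrative (assumed) values for the population and the sample size
m <- 5    # number of successes in the population
n <- 11   # number of failures in the population
k <- 4    # sample size

set.seed(1)  # arbitrary seed, used only to make the sketch repeatable

# Finite population of m + n objects, each labelled success or failure
population <- c(rep("success", m), rep("failure", n))

# Drawing without replacement makes every subset of size k equally likely
one_draw <- sample(population, size = k, replace = FALSE)
x <- sum(one_draw == "success")   # number of successes in this one sample

# Repeating the experiment approximates the distribution of the count
counts <- replicate(10000, sum(sample(population, k) == "success"))
rel_freq <- table(factor(counts, levels = 0:k)) / length(counts)
rel_freq
```

The relative frequencies in `rel_freq` should be close to the exact probabilities given by the formula that follows.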

Let $X$ denote the number of successes in the sample, written $X\sim H(m,n,k)$. Then the probability distribution of $X$ is

 \begin{aligned} P(X=x) &= \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}},\\ & \quad x=0,1,2,\cdots,k. \end{aligned}
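
As a quick check of the formula, the probability mass function can be written with R's choose() function and compared with the built-in dhyper(). This is a minimal sketch; `pmf_hyper` is a hypothetical helper defined here for illustration, and the values $m=5$, $n=11$, $k=4$ are assumed.

```r
# Hypothetical helper (not part of base R): the PMF written with choose()
pmf_hyper <- function(x, m, n, k) {
  choose(m, x) * choose(n, k - x) / choose(m + n, k)
}

# Assumed illustrative values: 5 successes, 11 failures, sample of 4
manual  <- pmf_hyper(0:4, m = 5, n = 11, k = 4)
builtin <- dhyper(0:4, m = 5, n = 11, k = 4)

# TRUE when the two computations agree up to floating-point tolerance
all.equal(manual, builtin)
```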

## Hypergeometric probabilities using dhyper() function in R

For a discrete probability distribution, the density is the probability of getting exactly the value $x$, i.e., $P(X=x)$.

The syntax to compute the probability at $x$ for the Hypergeometric distribution using R is

dhyper(x,m,n,k)

where

• x : the value(s) of the variable,
• m : the number of successes in the population,
• n : the number of failures in the population,
• k : the sample size selected from the population.

The dhyper() function gives the probability for given value(s) x, m, n and k.
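
Because dhyper() is vectorized over x, one call can return the whole distribution. The sketch below uses assumed values ($m=5$, $n=11$, $k=4$) and the illustrative name `probs`; it also shows that values outside the possible range get probability zero.

```r
# Assumed illustrative parameters
m <- 5; n <- 11; k <- 4

# One probability per value of x
probs <- dhyper(0:k, m, n, k)

# A probability mass function sums to 1 (up to rounding error)
total <- sum(probs)
total

# Values that cannot occur (negative, or larger than the sample size)
outside <- dhyper(c(-1, 5), m, n, k)
outside
```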

## Numerical Problem for Hypergeometric Distribution

To understand the four functions dhyper(), phyper(), qhyper() and rhyper(), let us take the following numerical problem.

### Hypergeometric Distribution Example

A company produces and ships 16 personal computers knowing that 5 of them have defective wiring. The company that purchased the computers is going to thoroughly test four of the computers. The purchasing company can detect the defective wiring.

(a) Find the probability that the purchasing company finds no defective computers.
(b) Plot the graph of the Hypergeometric probability distribution.
(c) What is the probability that the purchasing company will find at most one defective computer?
(d) What is the probability that the purchasing company will find at least 2 defective computers?
(e) What is the probability that the purchasing company will find 2 to 4 (inclusive) defective computers?
(f) Plot the graph of cumulative Hypergeometric probabilities.
(g) What is the value of $c$, if $P(X\leq c) \geq 0.90$?
(h) Simulate 100 Hypergeometric distributed random variables for the given problem.

### Example 1: How to use dhyper() function in R?

To find the probability that no defective computers are found, we need to use the dhyper() function.

Let $X$ denote the number of defective PCs in the sample. Consider a defective PC as a success. Then the random variable $X$ has a hypergeometric distribution with population size $m+n = 16$, number of successes in the population $m = 5$ (hence $n=11$) and sample size $k = 4$, i.e., $X\sim H(m = 5, n= 11, k = 4)$.

First let us define the given terms as

## number of success
m <- 5
## number of failures
n <- 11
## sample size
k <- 4

The probability mass function of $X$ is

 \begin{aligned} P(X=x) &= \frac{\binom{5}{x}\binom{11}{4-x}}{\binom{16}{4}},\\ & \quad x=0,1,2,\cdots,4 \end{aligned}

For part (a), we need to find the probability $P(X = 0)$.

First I will show you how to calculate this probability using manual calculation, then I will show you how to compute the same probability using dhyper() function in R.

(a) The probability that no defective computers are found is

 \begin{aligned} P(X = 0) & =\frac{\binom{5}{0}\binom{11}{4-0}}{\binom{16}{4}} \\ & = \frac{1\times 330}{1820}\\ & = 0.1813187 \end{aligned}

The above probability can be calculated using dhyper(0,5,11,4) function in R.

# Compute Hypergeometric probability
result1 <- dhyper(0,m,n,k)
result1
[1] 0.1813187
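
The manual calculation can also be reproduced step by step with choose(). This is a small sketch of the arithmetic; the names `num`, `denom` and `p0` are illustrative.

```r
# Numerator: ways to pick 0 of the 5 defective and 4 of the 11 good PCs
num <- choose(5, 0) * choose(11, 4)    # 1 * 330

# Denominator: ways to pick any 4 of the 16 PCs
denom <- choose(16, 4)                 # 1820

# Probability of finding no defective PC in the sample
p0 <- num / denom
p0
```

The value of `p0` agrees with the result of dhyper(0,5,11,4) shown above.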

### Example 2 Visualize Hypergeometric probability distribution

Using the dhyper() function, we can compute the Hypergeometric probabilities for all values of $x$ and tabulate them.

# assign values 0 to 4 to x
x <- 0:4
## Compute the Hypergeometric probabilities
px<-dhyper(x,m,n,k)
# make a table
H_table <- cbind(x,px)
# specify the column names
colnames(H_table) <- c("x", "P(X=x)")
H_table
     x      P(X=x)
[1,] 0 0.181318681
[2,] 1 0.453296703
[3,] 2 0.302197802
[4,] 3 0.060439560
[5,] 4 0.002747253

Using the kable() function from the knitr package, we can create the table in LaTeX, HTML, Markdown or reStructuredText format.

# to make table
library(knitr)
kable(H_table)
| x | P(X=x)    |
|---|-----------|
| 0 | 0.1813187 |
| 1 | 0.4532967 |
| 2 | 0.3021978 |
| 3 | 0.0604396 |
| 4 | 0.0027473 |

(b) Visualizing Hypergeometric Distribution with dhyper() function and plot() function in R:

The probability mass function of Hypergeometric distribution with given m, n, k can be visualized using dhyper() function in plot() function as follows:

# assign values 0 to 4 to x
x <- 0:4
## Plot the Hypergeometric probability dist
plot(x,px,type="h",xlim=c(0,5),ylim=c(0,max(px)),
lwd=10, col="darkred",ylab="P(X=x)")
title("PMF of Hypergeometric (m,n,k)")

## Hypergeometric cumulative probability using phyper() function in R

The syntax to compute the cumulative probability distribution function (CDF) for Hypergeometric distribution using R is

phyper(q,m,n,k)

where

• q : the value(s) of the variable,
• m : the number of successes in the population,
• n : the number of failures in the population,
• k : the sample size selected from the population.

This function returns the cumulative Hypergeometric probability $P(X\leq q)$ for the given value(s) of q (value of the variable x), m, n, and k.
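
Since a cumulative probability is a running sum of individual probabilities, phyper() can be cross-checked against cumsum() applied to dhyper(). This is a minimal sketch with assumed values ($m=5$, $n=11$, $k=4$); the names `cdf_builtin` and `cdf_manual` are illustrative.

```r
# Assumed illustrative parameters
m <- 5; n <- 11; k <- 4
q <- 0:k

# Cumulative probabilities P(X <= q) from the built-in function
cdf_builtin <- phyper(q, m, n, k)

# The same values as running sums of P(X = x)
cdf_manual <- cumsum(dhyper(q, m, n, k))

# TRUE when both agree up to floating-point tolerance
all.equal(cdf_builtin, cdf_manual)
```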

### Example 3: How to use phyper() function in R?

In the above example, for part (c), we need to find the probability $P(X\leq 1)$.

First I will show you how to calculate this probability using manual calculation, then I will show you how to compute the same probability using phyper() and dhyper() function in R.

(c) The probability that at most 1 defective computer is found is

 \begin{aligned} P(X\leq 1) &= P(X=0)+ P(X=1)\\ &= \frac{\binom{5}{0}\binom{11}{4-0}}{\binom{16}{4}}+\frac{\binom{5}{1}\binom{11}{4-1}}{\binom{16}{4}}\\ &= 0.1813187+0.4532967\\ &= 0.6346154 \end{aligned}

## Compute cumulative Hypergeometric probability
result2 <- phyper(1,m,n,k)
result2
[1] 0.6346154

The above probability can also be calculated using the dhyper() function and the sum() function as follows:

sum(dhyper(0:1,m,n,k))
[1] 0.6346154

### Example 4: How to use phyper() function in R?

In the above example, for part (d), we need to find the probability $P(X \geq 2)$.

Numerically, the probability that at least 2 defective computers are found can be calculated as

 \begin{aligned} P(X \geq 2) & =1-P(X\leq 1)\\ & = 1- (P(X=0)+P(X=1))\\ &= 1- \big(0.1813187+0.4532967\big)\\ & = 0.3653846\\ \end{aligned}

With the option lower.tail=FALSE, phyper(q,m,n,k) returns the upper-tail probability $P(X > q)$, that is, the probability that $X$ is strictly greater than q.

Since $X$ takes only integer values, $P(X \geq 2) = P(X > 1)$, so the above probability can be calculated using the phyper() function with q = 1 and the argument lower.tail=FALSE as

$P(X \geq 2) =$ phyper(1,m,n,k,lower.tail=FALSE)

or by using complementary event as

$P(X \geq 2) = 1- P(X\leq 1)$= 1- phyper(1,m,n,k)

# compute cumulative Hypergeometric probabilities
# with lower.tail False
phyper(1,m,n,k,lower.tail=FALSE)
[1] 0.3653846
1-phyper(1,m,n,k)
[1] 0.3653846
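
A common slip with lower.tail=FALSE is an off-by-one in q, because the upper tail excludes q itself. The sketch below, with assumed values ($m=5$, $n=11$, $k=4$) and illustrative names `at_least_2` and `more_than_2`, contrasts the two cases and cross-checks each against a sum of dhyper() values.

```r
# Assumed illustrative parameters
m <- 5; n <- 11; k <- 4

# q = 1 gives P(X > 1), which equals P(X >= 2) for an integer-valued X
at_least_2 <- phyper(1, m, n, k, lower.tail = FALSE)

# q = 2 gives P(X > 2), which equals P(X >= 3): a different event
more_than_2 <- phyper(2, m, n, k, lower.tail = FALSE)

# Cross-check both against sums of individual probabilities
all.equal(at_least_2, sum(dhyper(2:4, m, n, k)))
all.equal(more_than_2, sum(dhyper(3:4, m, n, k)))
```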

### Example 5: How to use phyper() function in R?

One can also use phyper() function to calculate the probability that the random variable $X$ is between two values.

(e) The probability that between 2 and 4 (inclusive) computers are defective is

 \begin{aligned} P(2 \leq X \leq 4) &= P(X=2)+P(X=3)+P(X=4)\\ &=\frac{\binom{5}{2}\binom{11}{4-2}}{\binom{16}{4}}+\frac{\binom{5}{3}\binom{11}{4-3}}{\binom{16}{4}}\\ &\quad +\frac{\binom{5}{4}\binom{11}{4-4}}{\binom{16}{4}}\\ &= 0.3021978+0.0604396+0.0027473\\ &= 0.3653846 \end{aligned}

The same probability can also be written as

 \begin{aligned} P(2 \leq X \leq 4) &= P(X\leq 4) -P(X\leq 1)\\ &= 1 - 0.6346154\\ &=0.3653846 \end{aligned}

The above probability can be calculated using phyper() function as follows:

result3 <- phyper(4,m,n,k)-phyper(1,m,n,k)
result3
[1] 0.3653846

The above probability can also be calculated using dhyper() function along with sum() function.

result4 <- sum(dhyper(2:4,m,n,k))
result4
[1] 0.3653846

The command dhyper(2:4,m,n,k) computes the Hypergeometric probabilities for $x=2$, $x=3$ and $x=4$; the sum() function then adds them, and the total is stored in result4.
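
The difference-of-CDFs idea generalizes to any integer interval. The following is a sketch only; `prob_between` is a hypothetical helper (not part of base R), and it assumes that a and b are integers with a not greater than b.

```r
# Hypothetical helper: P(a <= X <= b) for integer a and b
# Uses P(a <= X <= b) = P(X <= b) - P(X <= a - 1)
prob_between <- function(a, b, m, n, k) {
  phyper(b, m, n, k) - phyper(a - 1, m, n, k)
}

# Assumed illustrative values: m = 5, n = 11, k = 4
p_2_to_4 <- prob_between(2, 4, m = 5, n = 11, k = 4)
p_2_to_4

# Cross-check against the sum of individual probabilities
all.equal(p_2_to_4, sum(dhyper(2:4, 5, 11, 4)))
```

Note the `a - 1` in the lower term: subtracting phyper(a, ...) instead would drop $P(X=a)$ from the interval.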

### Example 6: Visualize the cumulative Hypergeometric probability distribution

# assign values 0 to 4 to x
x <- 0:4
## Compute the Hypergeometric probabilities
px <- dhyper(x,m,n,k)
## Compute the cumulative Hypergeometric probabilities
Fx <- phyper(x,m,n,k)
## make a table
H_table2 <- cbind(x,px,Fx)
## assign column names
colnames(H_table2) <- c("x", "P(X=x)","P(X<=x)")
# display result
H_table2
     x      P(X=x)   P(X<=x)
[1,] 0 0.181318681 0.1813187
[2,] 1 0.453296703 0.6346154
[3,] 2 0.302197802 0.9368132
[4,] 3 0.060439560 0.9972527
[5,] 4 0.002747253 1.0000000
kable(H_table2)
| x | P(X=x)    | P(X<=x)   |
|---|-----------|-----------|
| 0 | 0.1813187 | 0.1813187 |
| 1 | 0.4532967 | 0.6346154 |
| 2 | 0.3021978 | 0.9368132 |
| 3 | 0.0604396 | 0.9972527 |
| 4 | 0.0027473 | 1.0000000 |

The cumulative probability distribution of Hypergeometric distribution with given m, n and k can be visualized using plot() function with argument type="s" (step function) as follows:

# define values of X
x <- 0:4
# Plot the cumulative Hypergeometric dist
plot(x,Fx,type="s",lwd=2,col="darkred",
ylab=expression(P(X<=x)),
main="Distribution Function of H(m,n,k)")

## Hypergeometric Distribution Quantiles using qhyper() in R

The syntax to compute the quantiles of Hypergeometric distribution using R is

qhyper(p,m,n,k)

where

• p : the value(s) of the probabilities,
• m : the number of successes in the population,
• n : the number of failures in the population,
• k : the sample size selected from the population.

The function qhyper(p,m,n,k) gives the $100p^{th}$ percentile of the Hypergeometric distribution for the given values of p, m, n and k.

The $p^{th}$ quantile is the smallest value $x$ of the Hypergeometric random variable $X$ such that $P(X\leq x) \geq p$.

In this sense qhyper() is the inverse of the phyper() function: it is the inverse cumulative distribution function of the Hypergeometric distribution.
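
The definition of the quantile can be turned into a short search over the cumulative probabilities and compared with qhyper(). This is a sketch under assumed values ($m=5$, $n=11$, $k=4$); `q_manual` is a hypothetical helper written for illustration, not the algorithm R uses internally.

```r
# Assumed illustrative parameters
m <- 5; n <- 11; k <- 4

# Hypothetical helper: smallest x with P(X <= x) >= p
q_manual <- function(p, m, n, k) {
  x <- max(0, k - n):min(k, m)     # possible values of X
  cdf <- phyper(x, m, n, k)        # cumulative probabilities
  x[which(cdf >= p)[1]]            # first value reaching probability p
}

p <- c(0.10, 0.50, 0.90, 0.99)
manual  <- sapply(p, q_manual, m = m, n = n, k = k)
builtin <- qhyper(p, m, n, k)

# Side-by-side comparison of the two results
cbind(p, manual, builtin)
```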

### Example 7: How to use qhyper() function in R?

In part (g), we need to find the smallest value of $c$ such that $P(X\leq c) \geq 0.90$. That is, we need to find the $90^{th}$ percentile of the given Hypergeometric distribution.

# compute the quantile for Hypergeometric dist
qhyper(0.90,m,n,k)
[1] 2

From the above table of Hypergeometric probabilities and cumulative probabilities, $P(X\leq 1) = 0.6346154 < 0.90$ while $P(X\leq 2) = 0.9368132 \geq 0.90$, so the $90^{th}$ percentile is 2.

### Visualize the quantiles of Hypergeometric Distribution

The quantiles of Hypergeometric distribution with given p, m, n and k can be visualized using plot() function as follows:

p <- seq(0,1,by=0.02)
qx <- qhyper(p,m,n,k)
# Plot the quantiles of Hypergeometric dist
plot(p,qx,type="s",lwd=2,col="darkred",
ylab="quantiles",
main="Quantiles of H(m=5,n=11,k=4)")

## Simulating Hypergeometric random variable using rhyper() function in R

The general R function to generate random numbers from Hypergeometric distribution is rhyper(nn,m,n,k),

where,

• nn : the number of observations,
• m : the number of successes in the population,
• n : the number of failures in the population,
• k : the sample size selected from the population.

The function rhyper(nn,m,n,k) generates nn random numbers from the Hypergeometric distribution with parameters m, n and k.

### Example 8: How to use rhyper() function in R?

In part (h), we need to generate 100 random numbers from Hypergeometric distribution with $m = 5$, $n = 11$ and $k= 4$.

We can use rhyper() function to generate random numbers from Hypergeometric distribution.

## initialize number of observations to generate
nn <- 100
# Simulate 100 values From Hypergeometric dist
x_sim <- rhyper(nn,m,n,k)
# print values at console
x_sim  
  [1] 1 2 1 2 3 0 1 2 1 1 3 1 2 1 0 2 1 0 1 3 2 2 2 3 2 2 1 1 1 0 3 2 2 2 0 1 2
[38] 1 1 1 0 1 1 1 0 0 1 1 1 2 0 1 2 0 1 1 0 2 2 1 2 0 1 1 2 1 2 2 2 1 2 1 2 0
[75] 1 1 1 1 1 0 1 2 1 2 0 1 3 2 2 0 0 2 1 2 1 1 2 0 1 1

To get the frequency table of simulated hypergeometric random variables, we can use table() function in R.

## Print the frequency table
table(x_sim)
x_sim
0  1  2  3
18 44 32  6 
## Plot the simulated data
plot(table(x_sim),xlab="x",ylab="frequency",
lwd=10,col="red",
main="Simulated data from H(5,11,4) dist")
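
Simulated frequencies can be compared with the exact probabilities from dhyper(). The sketch below is illustrative only: it uses a larger sample than the text (10000 draws), an arbitrary seed, and assumed names `x_big`, `observed` and `expected`. Wrapping the data in factor() with fixed levels keeps values that never occur, such as $x=4$, whose probability is only about 0.0027.

```r
# Assumed illustrative parameters
m <- 5; n <- 11; k <- 4

set.seed(2024)                    # arbitrary seed for repeatability
x_big <- rhyper(10000, m, n, k)   # larger sample than in the text

# Relative frequencies, including values with zero observed count
observed <- as.vector(table(factor(x_big, levels = 0:k))) / length(x_big)

# Exact probabilities for comparison
expected <- dhyper(0:k, m, n, k)

round(cbind(x = 0:k, observed, expected), 4)
```

With more simulated values, the observed column should move closer to the expected column.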

If you call rhyper() again, R will generate a different set of random numbers from $H(m=5, n = 11, k =4)$.

# Simulate 100 values From Hypergeometric dist
x_sim_2 <- rhyper(nn,m,n,k)
# print values at console
x_sim_2
  [1] 1 1 1 3 1 2 2 1 1 0 2 1 0 3 2 0 1 3 1 1 2 1 1 1 1 3 0 0 0 2 1 2 2 2 1 2 2
[38] 2 3 1 1 1 0 1 2 1 1 0 1 2 2 1 1 1 0 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2
[75] 2 1 1 1 2 1 2 1 2 1 1 1 1 1 2 2 1 1 3 1 3 1 1 2 0 1

The frequency table of the simulated data from the Hypergeometric distribution is as follows:

## Print the frequency table
table(x_sim_2)
x_sim_2
0  1  2  3
10 57 26  7 
## Plot the simulated data
plot(table(x_sim_2),xlab="x",ylab="frequency",
lwd=10,col="red",
main="Simulated data from H(5,11,4) dist")

To reproduce the same set of random numbers in a simulation, one can use the set.seed() function.

# set seed for reproducibility
set.seed(1457)
# Simulate 100 values From Hypergeometric dist
x_sim_3 <- rhyper(nn,m,n,k)
# print values at console
x_sim_3
  [1] 2 1 1 1 1 0 2 1 2 1 2 1 1 0 1 1 1 1 0 2 0 1 1 2 1 1 2 2 1 1 2 0 0 1 0 2 2
[38] 1 1 1 2 1 1 0 1 1 1 2 1 1 1 1 3 1 3 0 1 0 2 2 1 1 1 1 1 2 3 2 2 1 2 0 1 1
[75] 1 2 1 3 1 0 1 0 1 0 2 1 2 2 0 2 1 1 2 1 0 1 1 2 1 1

The frequency table of x_sim_3 is as follows:

## Print the frequency table
table(x_sim_3)
x_sim_3
0  1  2  3
16 54 26  4 
## Plot the simulated data
plot(table(x_sim_3),xlab="x",ylab="frequency",
lwd=10,col="darkred",
main="Simulated data from H(5,11,4) dist")
# set the same seed again before the second simulation
set.seed(1457)
# Simulate 100 values From Hypergeometric dist
x_sim_4 <- rhyper(nn,m,n,k)
# print values at console
x_sim_4
  [1] 2 1 1 1 1 0 2 1 2 1 2 1 1 0 1 1 1 1 0 2 0 1 1 2 1 1 2 2 1 1 2 0 0 1 0 2 2
[38] 1 1 1 2 1 1 0 1 1 1 2 1 1 1 1 3 1 3 0 1 0 2 2 1 1 1 1 1 2 3 2 2 1 2 0 1 1
[75] 1 2 1 3 1 0 1 0 1 0 2 1 2 2 0 2 1 1 2 1 0 1 1 2 1 1

The frequency table of x_sim_4 is as follows:

## Print the frequency table
table(x_sim_4)
x_sim_4
0  1  2  3
16 54 26  4 
## Plot the simulated data
plot(table(x_sim_4),xlab="x",ylab="frequency",
lwd=10,col="darkred",
main="Simulated data from H(5,11,4) dist")

Since we used set.seed(1457) before both simulations, x_sim_3 and x_sim_4 are the same.
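
Rather than comparing the printed values by eye, the equality of two seeded simulations can be checked with identical(). This is a minimal sketch with assumed values ($m=5$, $n=11$, $k=4$); the names `a`, `b` and `same` are illustrative.

```r
# Assumed illustrative parameters
m <- 5; n <- 11; k <- 4

# Same seed before each call, so both calls start from the same state
set.seed(1457)
a <- rhyper(100, m, n, k)

set.seed(1457)
b <- rhyper(100, m, n, k)

# TRUE when the two vectors match element by element
same <- identical(a, b)
same
```

A further call to rhyper() without resetting the seed continues the random stream and will generally give different values.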

To learn more about other discrete and continuous probability distributions using R, go through the following tutorials:

Discrete Distributions Using R

Continuous Distributions Using R

## Endnote

In this tutorial, you learned how to compute the probabilities, cumulative probabilities and quantiles of the Hypergeometric distribution in R. You also learned how to simulate values from a Hypergeometric distribution using R.