Built-in Statistical Functions in R

In this tutorial, you will learn about some built-in statistical functions in R and how to use them.

Some of the commonly used built-in statistical functions in R are listed below:

| Function | Operation performed |
|---|---|
| sum(x) | Sum of elements of x |
| prod(x) | Product of elements of x |
| mean(x) | Mean of x |
| weighted.mean(x, w) | Weighted mean of x with weights w |
| median(x) | Median of x |
| quantile(x, probs=) | Quantiles of x |
| sd(x) | Standard deviation of x |
| var(x) | Variance of x |
| IQR(x) | Interquartile range of x |
| max(x) | Maximum of all the elements of x |
| min(x) | Minimum of all the elements of x |
| range(x) | Minimum and maximum of x |
| cov(x, y) | Covariance between x and y |
| cor(x, y) | Correlation between x and y |
| fivenum(x) | Five-number summary of x |
| cumsum(x) | Cumulative sum of elements of x |
| cumprod(x) | Cumulative product of elements of x |
| cummax(x) | Cumulative maximum of elements of x |
| cummin(x) | Cumulative minimum of elements of x |

Examples of Statistical Functions in R

Let us discuss how to use all the above built-in statistical functions in R with the help of examples.

sum and prod functions in R

sum() function in R

The sum() function returns the sum of all the values present in its arguments.

# create a vector x
x <- c(10, 12, 14, 16, 8, 9)
# compute the sum
sum(x)
[1] 69
# create a vector y that contains a missing value
y <- c(10, 12, 14, NA, 16, 8, 9)
# the sum is NA because y contains NA
sum(y)
[1] NA
# sum of elements of y excluding NA
sum(y, na.rm = TRUE)
[1] 69
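
Since sum() works on all the values in its arguments, it can also be given several vectors at once. The following is a small sketch of that behaviour; the vectors a and b are made up for illustration and simply split the elements of x into two parts.

```r
# two illustrative vectors (together they hold the same values as x above)
a <- c(10, 12, 14)
b <- c(16, 8, 9)

# sum() adds up every value in all of its arguments
total <- sum(a, b)              # same result as sum(c(a, b))

# na.rm = TRUE drops missing values from every argument before summing
total_na <- sum(a, c(b, NA), na.rm = TRUE)

print(total)
print(total_na)
```

Both calls should return 69, the same value as sum(x).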

prod() function in R

The prod() function returns the product of all the values present in its arguments.

# product of elements of x
prod(x)
[1] 1935360
# the product is NA because y contains NA
prod(y)
[1] NA
# product of elements of y excluding NA
prod(y, na.rm = TRUE)
[1] 1935360

Mean, weighted mean and median in R

Sample mean using R

Let $x_i, i=1,2, \cdots , n$ be $n$ observations on variable $X$. Then the sample mean of $X$ is denoted by $\overline{x}$ and is given by

$$\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

The mean(x) function computes the sample mean of x.

x <- c(10, 12, 14, 16, 8, 9)
## compute mean of x
mean(x)
[1] 11.5
y <- c(10, 12, 14, NA, 16, 8, 9)
mean(y)
[1] NA

Note that an NA value in y causes the result of mean(y) to be NA. To compute the mean after removing NA values, use the additional argument na.rm = TRUE.

# mean of y excluding NA
mean(y,na.rm = TRUE)
[1] 11.5
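
As a quick check of the formula above, the sample mean can be reproduced from sum() and length(). This is only an illustrative sketch; the names mean_by_hand and mean_no_na are made up for this example.

```r
x <- c(10, 12, 14, 16, 8, 9)
y <- c(10, 12, 14, NA, 16, 8, 9)

# sample mean from the definition: sum divided by the number of observations
mean_by_hand <- sum(x) / length(x)

# with na.rm = TRUE the divisor is the number of non-missing values
mean_no_na <- sum(y, na.rm = TRUE) / sum(!is.na(y))

# both comparisons should report TRUE
print(all.equal(mean_by_hand, mean(x)))
print(all.equal(mean_no_na, mean(y, na.rm = TRUE)))
```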

Weighted mean using R

Let $(x_i,w_i), i=1,2, \cdots , n$ be $n$ pairs of observations $x_i$ and weights $w_i$. Then the weighted mean of $X$ is denoted by $\overline{x}_w$ and is given by

$$\overline{x}_w=\frac{1}{\sum_{i=1}^n w_i}\sum_{i=1}^{n}w_ix_i$$

The weighted mean can be calculated using the weighted.mean() function in R.

## weighted mean of x with weight w
w <- c(1, 2, 3, 4, 2, 1)
weighted.mean(x, w)
[1] 12.69231
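
The weighted mean formula can be verified by hand in the same way. The sketch below reuses x and w from the example above; wm_by_hand is a name chosen only for this illustration.

```r
x <- c(10, 12, 14, 16, 8, 9)
w <- c(1, 2, 3, 4, 2, 1)

# weighted mean from the definition: sum(w * x) / sum(w), here 165 / 13
wm_by_hand <- sum(w * x) / sum(w)
print(all.equal(wm_by_hand, weighted.mean(x, w)))

# with equal weights the weighted mean reduces to the ordinary mean
equal_w <- rep(1, length(x))
print(all.equal(weighted.mean(x, equal_w), mean(x)))
```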

Median using R

The median is the middle value of the data after arranging the observations in ascending order of magnitude.

Let $x_1,x_2,\cdots, x_n$ be $n$ observations on variable $X$, arranged in ascending order of magnitude. Then the median of $X$ is denoted by $M$ and is defined as

$$ M= \left\{ \begin{array}{ll} \text{value of }\big(\frac{n+1}{2}\big)^{th}\text{ observation}, & \hbox{if $n$ is odd;} \\ \text{average of }\big(\frac{n}{2}\big)^{th}\text{ and }\big(\frac{n}{2}+1\big)^{th} \text{ observations}, & \hbox{if $n$ is even.} \end{array} \right. $$

The median() function computes the sample median.

# median of x
median(x)
[1] 11
# median of y excluding NA
median(y, na.rm = TRUE)
[1] 11
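
The odd/even rule in the definition can be written out directly. The helper median_by_hand() below is a hypothetical function written for this illustration, not part of R; it is a minimal sketch that assumes a numeric vector with at least one non-missing value.

```r
# hypothetical helper: median computed from the definition
median_by_hand <- function(v) {
  v <- sort(v)                  # sort() drops NA values by default
  n <- length(v)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]              # odd n: the middle observation
  } else {
    (v[n / 2] + v[n / 2 + 1]) / 2   # even n: average of the two middle observations
  }
}

x <- c(10, 12, 14, 16, 8, 9)
even_case <- median_by_hand(x)          # n = 6
odd_case  <- median_by_hand(c(x, 20))   # n = 7

print(even_case)   # 11
print(odd_case)    # 12
```

Both results should agree with median(x) and median(c(x, 20)).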

Quantiles using R

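The $p^{th}$ sample quantile, $0\leq p\leq 1$, of $n$ observations is, under the default method of quantile() (type = 7), the value at position $p(n-1)+1$ in the sorted data. When this position is not a whole number, R interpolates linearly between the two neighbouring order statistics.

By default, the quantile() function returns the minimum, the three quartiles and the maximum.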

data("trees")
# quantiles for Girth
quantile(trees$Girth)
   0%   25%   50%   75%  100% 
 8.30 11.05 12.90 15.25 20.60 

Specific quantiles can be computed using the additional argument probs.

Suppose you need to compute the $10^{th}$, $30^{th}$ and $80^{th}$ percentiles:

# 10th, 30th and 80th percentiles of Girth
quantile(trees$Girth, probs = c(0.10, 0.30, 0.80))
 10%  30%  80% 
10.5 11.2 16.3 
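
To see where these numbers come from, the position rule $p(n-1)+1$ can be applied by hand. The helper quantile_by_hand() below is a hypothetical function written for this sketch; it assumes a numeric vector without missing values and mimics only the default method (type = 7) of quantile().

```r
# hypothetical helper: quantile from the position p * (n - 1) + 1
quantile_by_hand <- function(v, p) {
  v  <- sort(v)
  n  <- length(v)
  h  <- p * (n - 1) + 1         # position in the sorted data, possibly fractional
  lo <- floor(h)
  hi <- ceiling(h)
  # linear interpolation between the two neighbouring order statistics
  v[lo] + (h - lo) * (v[hi] - v[lo])
}

probs    <- c(0.10, 0.25, 0.30, 0.80)
by_hand  <- sapply(probs, function(p) quantile_by_hand(trees$Girth, p))
built_in <- quantile(trees$Girth, probs = probs, names = FALSE)

print(by_hand)
print(all.equal(by_hand, built_in))
```

With 31 trees, the 10th percentile sits at position 0.10 * 30 + 1 = 4, so it is simply the 4th smallest Girth value, while the 25th percentile sits at position 8.5 and is the average of the 8th and 9th values.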

For more details about the quantile() function, check our tutorial on how to compute quantiles using R.

Variance and Standard deviation using R

Variance using R

Let $x_i, i=1,2, \cdots , n$ be $n$ observations on variable $X$. Then the sample variance of $X$ is denoted by $s_{x}^2$ and is given by

$$s_x^2 =\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i -\overline{x})^2=\dfrac{1}{n-1}\bigg(\sum_{i=1}^{n}x_i^2-\frac{\big(\sum_{i=1}^n x_i\big)^2}{n}\bigg)$$

where,

  • $\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$ is the sample mean.

The sample variance can be calculated using the var() function in R.

## compute variance of Girth
var(trees$Girth)
[1] 9.847914
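
Both forms of the variance formula above can be checked against var(). The following is a minimal sketch; the names var_def and var_short are made up for this illustration.

```r
g <- trees$Girth
n <- length(g)

# definitional form: squared deviations from the mean, divided by n - 1
var_def <- sum((g - mean(g))^2) / (n - 1)

# computational form: uses only the sum and the sum of squares
var_short <- (sum(g^2) - sum(g)^2 / n) / (n - 1)

print(all.equal(var_def, var(g)))
print(all.equal(var_short, var(g)))
```

Note that var() divides by n - 1, not n, so it returns the sample variance rather than the population variance.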

Standard deviation using R

The sample standard deviation of $X$ is defined as the positive square root of the sample variance $s_x^2$. The sample standard deviation of $X$ is given by

$$s_x =\sqrt{s_x^2}$$

The standard deviation can be calculated using the sd() function in R.

## compute standard deviation of Girth
sd(trees$Girth)
[1] 3.138139

Range and Interquartile Range using R

Range using R

In statistics, the range is defined as the distance between the largest and the smallest observations (max - min), i.e., $R = x_{max} - x_{min}$.

In R, the range() function returns a vector containing the minimum and maximum of all the given arguments, not their difference.

# compute range of Height (i.e.,  min and max)
range(trees$Height) 
[1] 63 87
# compute actual range
diff(range(trees$Height)) 
[1] 24

Interquartile range using R

The interquartile range (IQR) is given by $IQR = Q_3-Q_1$,

where

  • $Q_1$ is the first quartile
  • $Q_3$ is the third quartile.

The formula for $i^{th}$ quartile is

$Q_i =$ Value of $\bigg(\dfrac{i(n+1)}{4}\bigg)^{th}$ observation, $i=1,2,3$

where $n$ is the total number of observations.

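The interquartile range can be calculated using the IQR() function in R. Note that IQR() relies on quantile() with its default method (type = 7), so its result can differ from the value given by the textbook formula above.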

## compute inter quartile range of Volume
IQR(trees$Volume)
[1] 17.9
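
The textbook position $i(n+1)/4$ is not the rule that quantile() uses by default. As far as the R documentation describes it, the position $p(n+1)$ corresponds to type = 6, and IQR() accepts the same type argument. The sketch below compares the two; the variable names are chosen only for this illustration.

```r
v <- trees$Volume
n <- length(v)                   # 31 observations

# default method (type = 7): position p * (n - 1) + 1
iqr_default <- IQR(v)

# textbook-style method (type = 6): position p * (n + 1)
iqr_textbook <- IQR(v, type = 6)

# with n = 31 the textbook positions are whole numbers: 8 and 24
sorted_v <- sort(v)
iqr_positions <- sorted_v[3 * (n + 1) / 4] - sorted_v[(n + 1) / 4]

print(iqr_default)
print(iqr_textbook)
print(all.equal(iqr_textbook, iqr_positions))
```

Neither answer is wrong; they are different conventions for locating the quartiles in a finite sample, so it is worth stating which one is used when reporting an IQR.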

Variance-covariance matrix in R

If the argument to the var() function is a data frame in which all variables are numeric, then R computes the variance-covariance matrix.

# display variance covariance matrix
var(trees) 
           Girth   Height    Volume
Girth   9.847914 10.38333  49.88812
Height 10.383333 40.60000  62.66000
Volume 49.888118 62.66000 270.20280

Note that in almost all of the above functions, an NA value causes the result to be NA. To exclude NA values, use the additional argument na.rm = TRUE (the default value of na.rm is FALSE).

Covariance and correlation in R

Covariance using R

Let $(x_i, y_i)$, $i=1,2,\cdots,n$ be $n$ pairs of observations. Then the covariance between two variables $X$ and $Y$ is denoted by $Cov(x,y)$ or $s_{xy}$ and is given by

$$ \begin{aligned} Cov(x,y)=s_{xy} &=\frac{1}{n-1}\sum_{i=1}^{n}(x_i -\overline{x})(y_i -\overline{y})\\ & =\frac{1}{n-1}\bigg(\sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\bigg) \end{aligned} $$

The sample covariance between two variables $x$ and $y$ can be calculated using the cov(x, y) function in R.

# compute covariance between Height and Volume
cov(trees$Height, trees$Volume)
[1] 62.66
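
Both forms of the covariance formula can be checked against cov(). This is a small sketch; cov_def and cov_short are names made up for this illustration.

```r
h <- trees$Height
v <- trees$Volume
n <- length(h)

# definitional form: products of deviations from the means, divided by n - 1
cov_def <- sum((h - mean(h)) * (v - mean(v))) / (n - 1)

# computational form: uses only sums and the sum of products
cov_short <- (sum(h * v) - sum(h) * sum(v) / n) / (n - 1)

print(all.equal(cov_def, cov(h, v)))
print(all.equal(cov_short, cov(h, v)))
```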

When you pass a data frame whose variables are all numeric, the cov() function returns the variance-covariance matrix.

# compute variance covariance matrix
cov(trees) 
           Girth   Height    Volume
Girth   9.847914 10.38333  49.88812
Height 10.383333 40.60000  62.66000
Volume 49.888118 62.66000 270.20280

Note that the diagonal elements are the variances and off-diagonal elements are the covariances.

The same matrix can also be obtained using the var() function.

# compute variance covariance matrix
var(trees)
           Girth   Height    Volume
Girth   9.847914 10.38333  49.88812
Height 10.383333 40.60000  62.66000
Volume 49.888118 62.66000 270.20280
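
The structure of this matrix can be inspected with a few base R calls. The sketch below is only illustrative; S, diag_vars and col_vars are names chosen for this example.

```r
S <- cov(trees)

# the diagonal holds the variance of each column
diag_vars <- unname(diag(S))
col_vars  <- unname(sapply(trees, var))
print(all.equal(diag_vars, col_vars))

# the matrix is symmetric: cov(x, y) equals cov(y, x)
print(isSymmetric(S))

# a single covariance can be picked out by row and column name
print(S["Height", "Volume"])
```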

Correlation coefficient using R

The cor(x, y) function computes the correlation coefficient between x and y (the default is method = "pearson").

# compute correlation between Height and Volume
cor(trees$Height,trees$Volume) 
[1] 0.5982497
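
The Pearson coefficient is the covariance rescaled by the two standard deviations, which ties cor() back to cov() and sd() from the earlier sections. The sketch below checks that relation and also shows the method argument; the variable names are made up for this illustration.

```r
h <- trees$Height
v <- trees$Volume

# Pearson correlation: covariance divided by the product of standard deviations
r_by_hand <- cov(h, v) / (sd(h) * sd(v))
print(all.equal(r_by_hand, cor(h, v)))

# other coefficients can be requested through the method argument
r_spearman <- cor(h, v, method = "spearman")

# Spearman's coefficient is the Pearson coefficient of the ranks
print(all.equal(r_spearman, cor(rank(h), rank(v))))
```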

When you pass a data frame whose variables are all numeric, the cor() function returns the correlation coefficient matrix.

# compute correlation matrix
cor(trees) 
           Girth    Height    Volume
Girth  1.0000000 0.5192801 0.9671194
Height 0.5192801 1.0000000 0.5982497
Volume 0.9671194 0.5982497 1.0000000

Five Number Summary in R

The fivenum(x, na.rm = TRUE) function returns Tukey's five-number summary of a numeric vector x. Unlike most of the functions above, it removes NA values by default.

A five-number summary consists of

  • smallest value,
  • the lower hinge,
  • the median,
  • the upper hinge and
  • the largest value

all of which are computed with R's function fivenum().

fivenum(trees$Girth)
[1]  8.30 11.05 12.90 15.25 20.60
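
For trees$Girth the hinges happen to coincide with the quartiles returned by quantile(), but that is not guaranteed in general. The sketch below uses a made-up four-element vector, small, to show a case where the two differ.

```r
# for Girth the two summaries agree
print(fivenum(trees$Girth))
print(quantile(trees$Girth, names = FALSE))

# for a small illustrative vector they do not
small <- c(1, 2, 3, 4)
five  <- fivenum(small)                   # hinges: 1.5 and 3.5
quart <- quantile(small, names = FALSE)   # quartiles: 1.75 and 3.25

print(five)
print(quart)
```

The hinges are the medians of the lower and upper halves of the data, whereas quantile() interpolates at the positions described in the quantiles section.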

For more details about the five-number summary, check our tutorial on how to compute five-number summary statistics in R with examples.

Cumulative Sum and Cumulative product in R

cumsum() function in R

The cumsum(x) function computes the cumulative sum of the elements of a numeric or complex object x. It generates a vector with the same length as the input vector x. The $i^{th}$ element of the result of cumsum(x) is the sum of the first $i$ elements of x.

# define a vector
x <- 1:8
# compute cumulative sum of elements
cumsum(x) 
[1]  1  3  6 10 15 21 28 36
# the same steps written with inline comments
y <- 1:8   # define a vector
cumsum(y)  # compute cumulative sum of elements
[1]  1  3  6 10 15 21 28 36
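
The description of the $i^{th}$ element can be turned into code directly. This sketch rebuilds the cumulative sum from partial sums and is meant only as an illustration of the definition; by_definition and with_na are names made up for this example.

```r
x <- 1:8

# i-th element = sum of the first i elements of x
by_definition <- sapply(seq_along(x), function(i) sum(x[1:i]))
print(all.equal(by_definition, cumsum(x)))

# the last cumulative sum equals the total
print(tail(cumsum(x), 1) == sum(x))

# once an NA is met, all later cumulative sums are NA as well
with_na <- cumsum(c(1, NA, 3))
print(with_na)
```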

cumprod() function in R

The cumprod(x) function computes the cumulative product of the elements of a numeric or complex object x. It generates a vector with the same length as the input vector x. The $i^{th}$ element of the result of cumprod(x) is the product of the first $i$ elements of x.

# define a vector
x <- c(10, 12, 16, 20)
# compute cumulative product of elements
cumprod(x)
[1]    10   120  1920 38400
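
A familiar special case may help: the cumulative product of 1, 2, ..., n gives the factorials 1!, 2!, ..., n!. The following sketch checks this and the link to prod(); the variable names are chosen only for this illustration.

```r
# cumulative product of 1:5 gives 1!, 2!, ..., 5!
facts <- cumprod(1:5)
print(facts)                                # 1 2 6 24 120
print(all.equal(facts, factorial(1:5)))

# the last cumulative product equals the overall product
x <- c(10, 12, 16, 20)
print(tail(cumprod(x), 1) == prod(x))
```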

Cumulative extremes in R

cummin() function in R

The cummin(x) function computes the cumulative minimum of the elements of a numeric object x. It generates a vector with the same length as the input vector x. The $i^{th}$ element of the result of cummin(x) is the minimum of the first $i$ elements of x.

# define a vector
x <- c(-1.24, 2.35, 1.67, -2.37, 5.45)
# compute cumulative minimum of elements
cummin(x) 
[1] -1.24 -1.24 -1.24 -2.37 -2.37

cummax() function in R

The cummax(x) function computes the cumulative maximum of the elements of a numeric object x. It generates a vector with the same length as the input vector x. The $i^{th}$ element of the result of cummax(x) is the maximum of the first $i$ elements of x.

# define a vector
x <- c(-1.24, 2.35, 1.67, -2.37, 5.45)
# compute cumulative maximum of elements
cummax(x) 
[1] -1.24  2.35  2.35  2.35  5.45
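
One way to picture the cumulative extremes is as a running fold over the vector. The sketch below rebuilds them with base R's Reduce() and accumulate = TRUE, and then combines the two into a running range; run_min, run_max and run_range are names made up for this illustration.

```r
x <- c(-1.24, 2.35, 1.67, -2.37, 5.45)

# keep every intermediate result of folding min() / max() over x
run_min <- Reduce(min, x, accumulate = TRUE)
run_max <- Reduce(max, x, accumulate = TRUE)

print(all.equal(run_min, cummin(x)))
print(all.equal(run_max, cummax(x)))

# running range: spread of the data seen so far
run_range <- cummax(x) - cummin(x)
print(run_range)
```

The built-in cummin() and cummax() are the natural choice in practice; the Reduce() version is shown only to make the definition explicit.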

Endnote

In this tutorial, you learned about some built-in statistical functions in R and how to use them.

Hopefully you enjoyed this tutorial on statistical functions in R and found the content sufficient to understand how they work.