In this tutorial, you will learn about some built-in statistical functions in R and how to use them.

## Built-in Statistical Functions in R

Some of the commonly used built-in statistical functions in R are listed below:

Function | Operation Performed |
---|---|
`sum(x)` | Sum of elements of `x` |
`prod(x)` | Product of elements of `x` |
`mean(x)` | Mean of `x` |
`weighted.mean(x,w)` | Weighted mean of `x` with weights `w` |
`median(x)` | Median of `x` |
`quantile(x,probs=)` | Quantiles of `x` |
`sd(x)` | Std. Dev. of `x` |
`var(x)` | Variance of `x` |
`IQR(x)` | Inter Quartile Range of `x` |
`max(x)` | Maximum of all the elements of `x` |
`min(x)` | Minimum of all the elements of `x` |
`range(x)` | Return minimum and maximum of `x` |
`cov(x,y)` | Covariance between `x` and `y` |
`cor(x,y)` | Correlation between `x` and `y` |
`fivenum(x)` | Returns five number summary of `x` |
`cumsum(x)` | Cumulative sum of elements of `x` |
`cumprod(x)` | Cumulative product of elements of `x` |
`cummax(x)` | Cumulative maximum of elements of `x` |
`cummin(x)` | Cumulative minimum of elements of `x` |

## Examples of Statistical Functions in R

Let us discuss how to use all the above built-in statistical functions in R with the help of examples.

### sum and prod functions in R

#### sum() function in R

The `sum()` function returns the sum of all the values present in its arguments.

```
# create a vector x
x <- c(10, 12, 14, 16, 8, 9)
# compute the sum
sum(x)
```

`[1] 69`

```
# create a vector y
y <- c(10, 12, 14, NA,16, 8, 9)
sum(y)
```

`[1] NA`

```
# sum of elements of y excluding NA
sum(y,na.rm=TRUE)
```

`[1] 69`

#### prod() function in R

The `prod()` function returns the product of all the values present in its arguments.

```
# product of elements of x
prod(x)
```

`[1] 1935360`

`prod(y)`

`[1] NA`

```
# product of elements of y excluding NA
prod(y,na.rm=TRUE)
```

`[1] 1935360`

### Mean, weighted mean and median in R

#### Sample mean using R

Let $x_i, i=1,2, \cdots , n$ be $n$ observations on variable $X$. Then the sample mean of $X$ is denoted by $\overline{x}$ and is given by

`$$\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$`

The `mean(x)` function computes the sample mean of `x`.

```
x <- c(10, 12, 14, 16, 8, 9)
## compute mean of x
mean(x)
```

`[1] 11.5`

```
y <- c(10, 12, 14, NA, 16, 8, 9)
mean(y)
```

`[1] NA`

Note that an `NA` value in `y` causes the result of `mean(y)` to be `NA`. To compute the mean after removing `NA` values, use the additional argument `na.rm=TRUE`.

```
# mean of y excluding NA
mean(y,na.rm = TRUE)
```

`[1] 11.5`
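To see what `na.rm=TRUE` changes, the following sketch recomputes the mean of `y` by hand: the missing value is dropped from both the sum and the count. The names `n_obs` and `manual_mean` are introduced here only for illustration.

```r
y <- c(10, 12, 14, NA, 16, 8, 9)

# number of non-missing observations
n_obs <- sum(!is.na(y))

# sum of the non-missing values divided by their count
manual_mean <- sum(y, na.rm = TRUE) / n_obs
manual_mean

# compare with the built-in function
all.equal(manual_mean, mean(y, na.rm = TRUE))
```

The comparison should return `TRUE`, since both computations ignore the `NA` entry in the same way.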

#### Weighted mean using R

Let $(x_i,w_i), i=1,2, \cdots , n$ be $n$ pairs of observations $x_i$ with weights $w_i$. Then the weighted mean of $X$ is denoted by $\overline{x}_w$ and is given by

`$$\overline{x}_w=\frac{1}{\sum_{i=1}^n w_i}\sum_{i=1}^{n}w_ix_i$$`

The weighted mean can be calculated using the `weighted.mean()` function in R.

```
## weighted mean of x with weight w
w <- c(1, 2, 3, 4, 2, 1)
weighted.mean(x, w)
```

`[1] 12.69231`
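As a quick check of the formula, the weighted mean can also be computed directly from its definition. This is only a sketch; `manual_wmean` is a name chosen here for illustration.

```r
# same data as in the example above
x <- c(10, 12, 14, 16, 8, 9)
w <- c(1, 2, 3, 4, 2, 1)

# weighted mean from the definition: sum(w * x) / sum(w)
manual_wmean <- sum(w * x) / sum(w)
manual_wmean

# compare with the built-in function
all.equal(manual_wmean, weighted.mean(x, w))
```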

#### Median using R

Median is the middle value of the data after arranging the data in ascending order of magnitude.

Let $x_1,x_2,\cdots, x_n$ be $n$ observations on $X$, arranged in ascending order of magnitude. The median of $X$ is denoted by $M$ and is defined as

` $$ \begin{equation*} M= \left\{ \begin{array}{ll} \text{value of }\big(\frac{n+1}{2}\big)^{th}\text{ observation}, & \hbox{if $n$ is odd;} \\ \text{average of }\big(\frac{n}{2}\big)^{th}\text{ and }\big(\frac{n}{2}+1\big)^{th} \text{ observations}, & \hbox{if $n$ is even.} \end{array} \right. \end{equation*} $$ `

The `median()` function computes the sample median.

```
# median of x
median(x)
```

`[1] 11`

```
# median of y excluding NA
median(y,na.rm=TRUE)
```

`[1] 11`
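The odd and even cases of the definition can be sketched as a small function. `manual_median()` is a hypothetical helper written only to illustrate the definition; in practice `median()` should be used.

```r
# hypothetical helper that follows the definition of the median
manual_median <- function(v) {
  v <- sort(v)                       # ascending order; sort() drops NA values
  n <- length(v)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]                   # odd n: the middle observation
  } else {
    (v[n / 2] + v[n / 2 + 1]) / 2    # even n: average of the two middle observations
  }
}

x <- c(10, 12, 14, 16, 8, 9)
manual_median(x)           # even number of observations
manual_median(c(x, 20))    # odd number of observations
```

Both calls should agree with `median()` applied to the same vectors.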

### Quantiles using R

For a sample of size $n$, the $p^{th}$ quantile, $0\leq p\leq 1$, is the $\big(p(n-1)+1\big)^{th}$ order statistic. When this position is not a whole number, R's default method (`type = 7`) interpolates linearly between the two neighbouring order statistics.

By default, the `quantile()` function returns the *minimum*, the three *quartiles* and the *maximum* value.

```
data("trees")
# quantiles for Girth
quantile(trees$Girth)
```

```
0% 25% 50% 75% 100%
8.30 11.05 12.90 15.25 20.60
```

Specific quantiles can be computed by using the additional argument `probs`.

Suppose you need to compute the $10^{th}$, $30^{th}$ and $80^{th}$ percentiles:

```
# quantiles for Girth
quantile(trees$Girth, probs = c(0.10, 0.30, 0.80))
```

```
10% 30% 80%
10.5 11.2 16.3
```

For more details about the `quantile()` function, check our tutorial on how to compute quantiles using R.
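The order-statistic rule stated at the start of this section can be sketched by hand. `manual_quantile()` is a hypothetical helper that assumes a single probability `p` and a vector without missing values; it mirrors the default `type = 7` behaviour of `quantile()` and is not a replacement for it.

```r
data("trees")

# hypothetical helper: sample quantile at position p * (n - 1) + 1
manual_quantile <- function(v, p) {
  v <- sort(v)
  n <- length(v)
  h <- p * (n - 1) + 1      # position, possibly fractional
  lo <- floor(h)
  hi <- ceiling(h)
  # linear interpolation between the neighbouring order statistics
  v[lo] + (h - lo) * (v[hi] - v[lo])
}

manual_quantile(trees$Girth, 0.25)
unname(quantile(trees$Girth, probs = 0.25))
```

For `Girth`, $n = 31$ and $p = 0.25$ give position $8.5$, so the result lies halfway between the 8th and 9th ordered values.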

### Variance and Standard deviation using R

#### Variance using R

Let `$x_i, i=1,2, \cdots , n$` be $n$ observations on variable $X$. Then the sample variance of $X$ is denoted by `$s_{x}^2$` and is given by

`$$s_x^2 =\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i -\overline{x})^2=\dfrac{1}{n-1}\bigg(\sum_{i=1}^{n}x_i^2-\frac{\big(\sum_{i=1}^n x_i\big)^2}{n}\bigg)$$`

where `$\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$` is the sample mean.

The sample variance can be calculated using the `var()` function in R.

```
## compute variance of Girth
var(trees$Girth)
```

`[1] 9.847914`
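Both forms of the variance formula can be checked against `var()`. The sketch below uses illustrative names (`g`, `v_def`, `v_short`) that are not part of R.

```r
data("trees")
g <- trees$Girth
n <- length(g)

# definitional form: squared deviations from the mean, divided by n - 1
v_def <- sum((g - mean(g))^2) / (n - 1)

# computational form: sum of squares minus (sum)^2 / n, divided by n - 1
v_short <- (sum(g^2) - sum(g)^2 / n) / (n - 1)

c(definition = v_def, shortcut = v_short, builtin = var(g))
```

All three values should agree up to floating-point rounding, which confirms that `var()` uses the $n-1$ denominator.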

#### Standard deviation using R

The sample standard deviation of $X$ is defined as the positive square root of the sample variance `$s_x^2$`, that is,

`$$s_x =\sqrt{s_x^2}$$`

The standard deviation can be calculated using the `sd()` function in R.

```
## compute standard deviation of Girth
sd(trees$Girth)
```

`[1] 3.138139`
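The relationship between `sd()` and `var()` can be confirmed in one line; this is a minimal sketch, with `g` used as an illustrative name.

```r
data("trees")
g <- trees$Girth

# positive square root of the sample variance
sqrt(var(g))

# compare with the built-in function
all.equal(sd(g), sqrt(var(g)))
```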

### Range and Inter Quartile Range Using R

#### Range using R

In statistics, the range is defined as the distance between the largest and the smallest observations (max - min), i.e., `$R = x_{max} - x_{min}$`.

In R, the `range()` function returns a vector containing the minimum and maximum of all the given arguments.

```
# compute range of Height (i.e., min and max)
range(trees$Height)
```

`[1] 63 87`

```
# compute actual range
diff(range(trees$Height))
```

`[1] 24`

#### Interquartile range using R

The interquartile range (IQR) is given by `$IQR = Q_3-Q_1$`, where `$Q_1$` is the first quartile and `$Q_3$` is the third quartile.

The formula for the $i^{th}$ quartile is

`$Q_i =$` value of the `$\bigg(\dfrac{i(n+1)}{4}\bigg)^{th}$` observation, $i=1,2,3$,

where $n$ is the total number of observations.

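The interquartile range can be calculated using the `IQR()` function in R. Note that `IQR()` takes its quartiles from `quantile()` with the default `type = 7`, so its result can differ from the formula above.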

```
## compute inter quartile range of Volume
IQR(trees$Volume)
```

`[1] 17.9`
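The definition `$IQR = Q_3-Q_1$` can be reproduced from `quantile()`. The sketch below also calls `IQR()` with `type = 6`, which places the quartiles at positions $i(n+1)/4$ as in the formula above; the name `q` is illustrative.

```r
data("trees")

# first and third quartiles with R's default method (type 7)
q <- quantile(trees$Volume, probs = c(0.25, 0.75))
q

# Q3 - Q1, which is what IQR() returns by default
unname(q[2] - q[1])

# quartiles placed at positions i * (n + 1) / 4, for comparison
IQR(trees$Volume, type = 6)
```

The two interquartile ranges need not be equal, because the two methods pick slightly different positions for the quartiles.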

### Variance-covariance matrix in R

If the argument to the `var()` function is a data frame with all numerical variables, then R computes the variance-covariance matrix.

```
# display variance covariance matrix
var(trees)
```

```
           Girth   Height    Volume
Girth   9.847914 10.38333  49.88812
Height 10.383333 40.60000  62.66000
Volume 49.888118 62.66000 270.20280
```

Note that in almost all of the above functions, an `NA` value causes the result to be `NA`. To remove `NA` values, use the additional argument `na.rm=TRUE` (the default value of `na.rm` is `FALSE`).

### Covariance and correlation in R

#### Covariance using R

Let `$(x_i, y_i)$`, `$i=1,2,\cdots,n$` be $n$ pairs of observations. Then the covariance between two variables $X$ and $Y$ is denoted by $cov(x,y)$ or `$s_{xy}$` and is given by

` $$ \begin{aligned} Cov(x,y)=s_{xy} &=\frac{1}{n-1}\sum_{i=1}^{n}(x_i -\overline{x})(y_i -\overline{y})\\ & =\frac{1}{n-1}\bigg(\sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\bigg) \end{aligned} $$ `

The sample covariance between two variables $x$ and $y$ can be calculated using the `cov(x,y)` function in R.

```
# compute covariance between Height and Volume
cov(trees$Height, trees$Volume)
```

`[1] 62.66`
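Both forms of the covariance formula can be evaluated by hand and compared with `cov()`. This is a sketch with illustrative names (`h`, `v`, `c_def`, `c_short`).

```r
data("trees")
h <- trees$Height
v <- trees$Volume
n <- length(h)

# definitional form: products of deviations from the means
c_def <- sum((h - mean(h)) * (v - mean(v))) / (n - 1)

# computational form: sum of products minus (sum of h)(sum of v) / n
c_short <- (sum(h * v) - sum(h) * sum(v) / n) / (n - 1)

c(definition = c_def, shortcut = c_short, builtin = cov(h, v))
```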

When you use a data frame having all variables of numeric type as an argument, the `cov()` function gives a variance-covariance matrix.

```
# compute variance covariance matrix
cov(trees)
```

```
           Girth   Height    Volume
Girth   9.847914 10.38333  49.88812
Height 10.383333 40.60000  62.66000
Volume 49.888118 62.66000 270.20280
```

Note that the diagonal elements are the variances and off-diagonal elements are the covariances.

The same matrix can also be obtained using the `var()` function.

```
# compute variance covariance matrix
var(trees)
```

```
           Girth   Height    Volume
Girth   9.847914 10.38333  49.88812
Height 10.383333 40.60000  62.66000
Volume 49.888118 62.66000 270.20280
```

#### Correlation coefficient using R

The `cor(x,y)` function computes the correlation coefficient between `x` and `y` (the default is `method="pearson"`).

```
# compute correlation between Height and Volume
cor(trees$Height,trees$Volume)
```

`[1] 0.5982497`
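The tutorial does not write out the formula, but Pearson's correlation coefficient is the covariance divided by the product of the two standard deviations. A minimal sketch, with `r_manual` as an illustrative name:

```r
data("trees")

# covariance scaled by both standard deviations
r_manual <- cov(trees$Height, trees$Volume) /
  (sd(trees$Height) * sd(trees$Volume))
r_manual

# compare with the built-in function
all.equal(r_manual, cor(trees$Height, trees$Volume))
```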

When you use a data frame having all variables of numeric type as an argument, the `cor()` function gives the correlation matrix.

```
# compute correlation matrix
cor(trees)
```

```
           Girth    Height    Volume
Girth  1.0000000 0.5192801 0.9671194
Height 0.5192801 1.0000000 0.5982497
Volume 0.9671194 0.5982497 1.0000000
```
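The correlation matrix is a rescaled version of the variance-covariance matrix. One way to see this is with `cov2cor()` from the `stats` package; the sketch below uses `r_from_cov` as an illustrative name.

```r
data("trees")

# rescale the variance-covariance matrix into a correlation matrix
r_from_cov <- cov2cor(cov(trees))
r_from_cov

# compare with the matrix returned by cor()
all.equal(r_from_cov, cor(trees))
```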

### Five Number Summary in R

The `fivenum(x,na.rm=TRUE)` function returns Tukey's five number summary of a numeric vector `x`.

A five-number summary consists of

- smallest value,
- the lower hinge,
- the median,
- the upper hinge and
- the largest value

all of which are computed with R's function `fivenum()`.

`fivenum(trees$Girth)`

`[1] 8.30 11.05 12.90 15.25 20.60`

For more detail about the five number summary, check our tutorial on how to compute five number summary statistics in R with examples.
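For the `Girth` data the hinges coincide with the quartiles returned by `quantile()`, but this is not guaranteed in general. The small vector `v` below is made up purely to illustrate a case where they differ.

```r
# illustrative vector with an even number of observations
v <- c(1, 2, 3, 4)

# Tukey's five number summary (minimum, hinges, median, maximum)
fivenum(v)

# minimum, quartiles and maximum from the default quantile method
unname(quantile(v))
```

Here the lower hinge is `1.5` while the first quartile from `quantile()` is `1.75`, so the two summaries should not be used interchangeably.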

### Cumulative Sum and Cumulative product in R

#### cumsum() function in R

The `cumsum(x)` function computes the cumulative sum of the elements of a numeric or complex object `x`. It generates a vector with the same length as the input vector `x`. The $i^{th}$ element of the result of `cumsum(x)` is the sum of the first $i$ elements of `x`.

```
# define a vector
x <- 1:8
# compute cumulative sum of elements
cumsum(x)
```

`[1] 1 3 6 10 15 21 28 36`

```
y <- 1:8   # define a vector
cumsum(y)  # compute cumulative sum of elements
```

`[1] 1 3 6 10 15 21 28 36`
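The description of the $i^{th}$ element can be made concrete by building the running totals with `Reduce()`. This is only a sketch of the idea; `running` is an illustrative name, and `cumsum()` remains the idiomatic choice.

```r
x <- 1:8

# accumulate partial sums: x[1], x[1] + x[2], ...
running <- Reduce(`+`, x, accumulate = TRUE)
running

# same values as cumsum(), and the last element is the total sum
all.equal(running, cumsum(x))
tail(cumsum(x), 1) == sum(x)
```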

#### cumprod() function in R

The `cumprod(x)` function computes the cumulative product of the elements of a numeric or complex object `x`. It generates a vector with the same length as the input vector `x`. The $i^{th}$ element of the result of `cumprod(x)` is the product of the first $i$ elements of `x`.

```
x<-c(10,12,16,20) # define a vector
cumprod(x) # compute cumulative prod of elements
```

`[1] 10 120 1920 38400`
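A familiar special case: the cumulative product of `1, 2, ..., n` gives the factorials `1!, 2!, ..., n!`. The sketch below checks this for an illustrative vector `k`.

```r
k <- 1:5

# cumulative products 1, 1*2, 1*2*3, ...
cumprod(k)

# compare with the factorials of 1, ..., 5
all.equal(cumprod(k), factorial(k))

# the last cumulative product equals the overall product
tail(cumprod(k), 1) == prod(k)
```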

### Cumulative extremes in R

#### cummin() function in R

The `cummin(x)` function computes the cumulative minimum of the elements of a numeric object `x`. It generates a vector with the same length as the input vector `x`. The $i^{th}$ element of the result of `cummin(x)` is the minimum of the first $i$ elements of `x`.

```
# define a vector
x <- c(-1.24, 2.35, 1.67, -2.37, 5.45)
# compute cumulative minimum of elements
cummin(x)
```

`[1] -1.24 -1.24 -1.24 -2.37 -2.37`

#### cummax() function in R

The `cummax(x)` function computes the cumulative maximum of the elements of a numeric object `x`. It generates a vector with the same length as the input vector `x`. The $i^{th}$ element of the result of `cummax(x)` is the maximum of the first $i$ elements of `x`.

```
# define a vector
x <- c(-1.24, 2.35, 1.67, -2.37, 5.45)
# compute cumulative maximum of elements
cummax(x)
```

`[1] -1.24 2.35 2.35 2.35 5.45`
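The definitions of the cumulative extremes can be spelled out by taking the minimum or maximum of the first $i$ elements for each $i$. This is a slower, illustrative sketch; `running_min` and `running_max` are names chosen here.

```r
x <- c(-1.24, 2.35, 1.67, -2.37, 5.45)

# minimum and maximum of the first i elements, for i = 1, ..., length(x)
running_min <- sapply(seq_along(x), function(i) min(x[1:i]))
running_max <- sapply(seq_along(x), function(i) max(x[1:i]))

# compare with the built-in functions
all.equal(running_min, cummin(x))
all.equal(running_max, cummax(x))
```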

## Endnote

In this tutorial you learned about some statistical functions in R and how to use these functions in R.

To learn more about other built-in functions and user-defined functions in R, please refer to the following tutorials:

- Built-in Mathematical functions in R
- Built-in Trigonometric functions in R
- Built-in Special Mathematical functions in R
- Built-in Character functions in R
- User-defined functions in R Part I
- User-defined functions in R Part II
- Functions in R

We hope you enjoyed this tutorial and that it gave you a working understanding of the built-in statistical functions in R.