In this tutorial, we will discuss the tapply() function in R with some examples. The tapply() function is available in base R.
The tapply() function in R
The tapply() function is very useful for aggregating data. That is, tapply() allows us to create group summaries based on factor levels.
The general syntax of the tapply() function is
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
where
- X: an atomic object, typically a vector
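- INDEX: a list of one or more factors, each of the same length as X
- FUN: the function to be applied
- ...: optional arguments to FUN
- simplify: if FALSE, tapply() always returns an array of mode list; if TRUE (the default) and FUN always returns a single value, the result is an array with the mode of that value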
The call tapply(X, INDEX, FUN) splits the data in X into subgroups based on the levels of the INDEX variable, then applies the function FUN to each subgroup of the data.
That is, tapply() applies FUN to X grouped by the factors in INDEX.
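The split-then-apply behaviour described above can be sketched with split() and sapply(). This is only an illustration on a made-up vector x and factor g (both hypothetical, not part of the examples in this tutorial); it is not meant to show how tapply() is implemented internally.

```r
# Hypothetical toy data, used only for this illustration
x <- c(1, 2, 3, 4, 8)
g <- factor(c("a", "b", "a", "b", "a"))

# Step 1: split x into one subgroup per level of g
groups <- split(x, g)

# Step 2: apply the function to each subgroup
by_hand <- sapply(groups, mean)

# The same group summary in a single call
with_tapply <- tapply(x, g, mean)

by_hand      # a = 4, b = 3
with_tapply  # a = 4, b = 3
```

Both approaches give the same group means; tapply() simply does the splitting and applying in one step.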
tapply() function on data frame
Example 1: tapply() function on data frame
Let us create a sample data frame to understand the use of the tapply() function on a data frame.
Name <- c("john", "gloria", "rajan", "mary", "sonam")
Gender <- factor(c("M", "F", "M", "F", "F"))
Height <- c(165, 158, 160, 157, 155)
Weight <- c(72, 65, 69, 58, 49)
df <- data.frame(Name, Gender, Height, Weight)
df
    Name Gender Height Weight
1   john      M    165     72
2 gloria      F    158     65
3  rajan      M    160     69
4   mary      F    157     58
5  sonam      F    155     49
Suppose we want to calculate the average height or average weight by gender of the respondents. We can use the tapply() function to calculate the average height by gender as follows:
tapply(df$Height,df$Gender,mean)
F M
156.6667 162.5000
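The average weight by gender follows the same pattern. The sketch below rebuilds the sample data frame so that it runs on its own, and wraps the call in with() as one possible way to avoid repeating df$; the variable name mean_weight is just an illustrative choice.

```r
# Rebuild the sample data frame from this example
Name   <- c("john", "gloria", "rajan", "mary", "sonam")
Gender <- factor(c("M", "F", "M", "F", "F"))
Height <- c(165, 158, 160, 157, 155)
Weight <- c(72, 65, 69, 58, 49)
df <- data.frame(Name, Gender, Height, Weight)

# Average weight by gender; with() looks up Weight and Gender inside df
mean_weight <- with(df, tapply(Weight, Gender, mean))
mean_weight  # F is about 57.33, M is 70.5
```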
To compute the standard deviation of weight by gender, use the tapply() function as follows:
result <- tapply(df$Weight,df$Gender,sd)
result
F M
8.020806 2.121320
class(result)
[1] "array"
Example 2: quantiles using tapply() function on data frame
Consider the built-in data frame PlantGrowth. Suppose we want to calculate quantiles of the weight variable grouped by the factor variable group from the PlantGrowth data frame.
To calculate quantiles of weight by group, we can use the tapply() function as follows:
# compute the quantiles of weight by group
tapply(PlantGrowth$weight,PlantGrowth$group, quantile, probs = c(0.25, 0.50, 0.75))
$ctrl
25% 50% 75%
4.5500 5.1550 5.2925
$trt1
25% 50% 75%
4.2075 4.5500 4.8700
$trt2
25% 50% 75%
5.2675 5.4350 5.7350
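Note that, as explained in the syntax of the tapply() function, optional arguments for FUN can be passed through the ... argument of tapply(), like probs = c(0.25, 0.50, 0.75) for the quantile() function. Because quantile() returns several values per group here, the result is a list rather than a simple vector-like array.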
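Another common use of the ... argument is passing na.rm = TRUE to a summary function. The sketch below uses a small made-up vector with a missing value (x and g are hypothetical, not part of PlantGrowth) to show the difference.

```r
# Hypothetical toy data with one missing value in group "b"
x <- c(1, 2, NA, 4, 6)
g <- factor(c("a", "a", "b", "b", "b"))

# Without na.rm, the mean of any group containing NA is NA
with_na <- tapply(x, g, mean)
with_na      # a = 1.5, b = NA

# na.rm = TRUE is forwarded to mean() through ...
without_na <- tapply(x, g, mean, na.rm = TRUE)
without_na   # a = 1.5, b = 5
```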
Example 3: tapply() Function with user-defined function
We can use a user-defined function in tapply() to compute the summary of one variable based on the levels of some factor variable.
Let us define a function for the standard error as follows:
std.error <- function(x) {
  sd(x) / sqrt(length(x))
}
Suppose we need to calculate the standard error of the weight variable grouped by the factor variable group from the PlantGrowth data frame.
To calculate standard errors of weight by group, we can use the tapply() function as follows:
# compute the standard error of weight by group
result_1 <- tapply(PlantGrowth$weight, PlantGrowth$group, std.error)
result_1
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
class(result_1)
[1] "array"
Note that when FUN returns a single value per group, the default output of the tapply() function has class array. So the elements of the output can be accessed using square brackets [ ] with an index.
# gives the second element of result_1
result_1[2]
trt1
0.2509823
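Besides positional indexing, the elements of the array can also be selected by group label. The following sketch repeats the definitions from this example so that it runs on its own; it assumes the built-in PlantGrowth data set is available, as it is in a default R session.

```r
# Standard error function and grouped result, as defined in this example
std.error <- function(x) {
  sd(x) / sqrt(length(x))
}
result_1 <- tapply(PlantGrowth$weight, PlantGrowth$group, std.error)

# Group labels carried by the one-dimensional array
names(result_1)

# Select an element by its group label instead of its position
result_1["trt1"]

# Drop the array structure to get a plain numeric vector
as.vector(result_1)
```

Selecting by label tends to be more robust than selecting by position if the order of the factor levels changes.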
Example 4: Simplified result using tapply() Function
In the example discussed above, the argument simplify took its default value TRUE. A list output can be obtained by passing the additional argument simplify = FALSE.
To calculate standard errors of weight by group and get list output, we can use the tapply() function as follows:
# compute the standard error of weight by group, keeping list output
result_2 <- tapply(PlantGrowth$weight, PlantGrowth$group,
                   std.error, simplify = FALSE)
result_2
$ctrl
[1] 0.1843897
$trt1
[1] 0.2509823
$trt2
[1] 0.139954
A component of the list can be selected using single square brackets with an index; the result is still a list, containing only that component.
# extract the second component of list
result_2[2]
$trt1
[1] 0.2509823
The value stored in a component can be extracted using double square brackets with an index.
# extract the element of second component of list
result_2[[2]]
[1] 0.2509823
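If the list output later needs to be flattened back into a numeric vector, unlist() is one option. The sketch below repeats the definitions from Examples 3 and 4 so that it runs on its own, and checks that the flattened values match the simplified result.

```r
# Definitions repeated from Examples 3 and 4
std.error <- function(x) {
  sd(x) / sqrt(length(x))
}
result_1 <- tapply(PlantGrowth$weight, PlantGrowth$group, std.error)
result_2 <- tapply(PlantGrowth$weight, PlantGrowth$group,
                   std.error, simplify = FALSE)

# result_2 is a list, with one component per group
is.list(result_2)

# Flatten the list into a numeric vector
flat <- unlist(result_2)
flat

# The values agree with the simplified output
all.equal(unname(flat), as.vector(result_1))
```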
Example 5: tapply() Function with multiple factors
The tapply() function can also be used with multiple factor variables. To group by more than one factor, pass a list of factors as the INDEX argument.
Consider the built-in data frame warpbreaks.
data("warpbreaks")
str(warpbreaks)
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
In the warpbreaks data frame, the factor variable wool has two levels (i.e., wool type A and wool type B) and the factor variable tension has three levels (i.e., L for Low, M for Medium and H for High).
Let us calculate the mean number of breaks for each combination of wool and tension. To do so, we can use the tapply() function as follows:
# with() evaluates the call inside warpbreaks without attaching it to the search path
result_3 <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))
result_3
         L        M        H
A 44.55556 24.00000 24.55556
B 28.22222 28.77778 18.77778
The mean number of breaks for wool type A at tension level L is approximately 44.56.
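With two factors the result is a two-dimensional array, so individual cells, rows and columns can be selected by level name. The sketch below recomputes the table so that it runs on its own; naming the list elements (wool = , tension = ) is an optional extra that labels the dimensions, and the object name result_named is only illustrative.

```r
# Mean breaks by wool and tension, with named grouping factors
result_named <- with(
  warpbreaks,
  tapply(breaks, list(wool = wool, tension = tension), mean)
)

# One row per wool level, one column per tension level
dim(result_named)

# A single cell: wool type A at tension level L (about 44.56)
result_named["A", "L"]

# All tension levels for wool type B
result_named["B", ]

# All wool types at tension level H
result_named[, "H"]
```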
Note that the apply functions (apply(), tapply(), sapply() and lapply()) are typically more concise than explicit loops (for loop, while loop) for this kind of grouped computation, although they are not always faster.
Endnote
In this tutorial, you learned about the tapply() function in R and how to use it on vectors, lists and data frames, with illustrations.
Hopefully you enjoyed this tutorial on the tapply() function in R and found the content sufficient to understand how the function works.