In this tutorial you will learn how to create factor variables and ordered factor variables, how to create a factor from numerical data, and how to generate factor levels in R.
Factors in R
In R, factors are used to handle categorical variables (nominal or ordinal). A factor is a variable that takes on a finite number of distinct values, called levels. In statistical modeling, categorical variables are treated differently from numeric variables.
Nominal variables are categorical variables whose order is not important, e.g., the gender ("Male" or "Female") of a respondent.
Ordinal variables are also categorical, but the order is important, e.g., the socio-economic status (SES) of a respondent ("LES", "MES" or "HES"). A respondent with "LES" earns less than a respondent with "MES" or "HES".
To create a factor variable in R, we use the factor() function.
The syntax of the factor() function is
factor(x = character(), levels,
labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA)
where
- x : a vector of data,
- levels : the set of unique values that x might take,
- labels : a character vector of labels for the levels,
- exclude : a vector of values to be excluded,
- ordered : a logical flag to determine if the levels should be ordered,
- nmax : an upper bound on the number of levels.
Creating a factor in R
The function factor() stores a variable as a factor. Internally, R stores the categorical values as a vector of integers in the range $1, 2, \cdots, k$, where $k$ is the number of unique values in the variable, and maps these integers to the categorical values.
gender<-c("Male","Male","Female","Male","Female")
gender
[1] "Male" "Male" "Female" "Male" "Female"
The above R code stores gender as a character vector. To convert it into a factor variable, use the factor() function.
# Store this as (2,2,1,2,1)
gender_f<-factor(gender)
gender_f
[1] Male Male Female Male Female
Levels: Female Male
# display the structure
str(gender_f)
Factor w/ 2 levels "Female","Male": 2 2 1 2 1
The statement gender_f<-factor(gender) stores gender_f as the integer vector (2,2,1,2,1) and associates it with 1 = Female and 2 = Male. Since the levels are not specified, R assigns the levels to the factor variable alphabetically (i.e., Female first and then Male).
# display the mode
mode(gender_f)
[1] "numeric"
# display the class
class(gender_f)
[1] "factor"
# display the levels
levels(gender_f)
[1] "Female" "Male"
# display the number of levels
nlevels(gender_f)
[1] 2
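The integer coding described above can be inspected directly. The following is a minimal sketch that rebuilds gender_f and uses typeof() and as.integer() to look at the stored codes; the variable name codes is introduced here only for illustration.

```r
# Rebuild the factor from the tutorial's data
gender <- c("Male", "Male", "Female", "Male", "Female")
gender_f <- factor(gender)

# The underlying storage type of a factor is integer
typeof(gender_f)

# Extract the integer codes (1 = Female, 2 = Male)
codes <- as.integer(gender_f)
codes

# Map the codes back to labels through the levels attribute
levels(gender_f)[codes]
```

Indexing the levels by the codes recovers the original character values, which is how R displays a factor.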
The order of the levels can be set using the levels argument of the factor() function.
gender<-c("Male","Male","Female","Male","Female")
gender
[1] "Male" "Male" "Female" "Male" "Female"
gender_2 <- factor(gender, levels = c("Male", "Female"))
gender_2
[1] Male Male Female Male Female
Levels: Male Female
From the output of the above code, it is clear that the levels are no longer alphabetical but follow the order specified in the levels argument. R now stores the factor with 1 = Male and 2 = Female.
# display the structure
str(gender_2)
Factor w/ 2 levels "Male","Female": 1 1 2 1 2
# display the class
class(gender_2)
[1] "factor"
# display attributes
attributes(gender_2)
$levels
[1] "Male" "Female"
$class
[1] "factor"
Sometimes a categorical variable is coded as numeric. For example, if Male is coded as 0 and Female as 1, the data ("M","M","F","M","F") become (0,0,1,0,1). Such coded data can also be converted to a factor using the factor() function by specifying the levels.
gen <- c(0, 0, 1, 0, 1)
fgen <- factor(gen, levels = 0:1)
fgen
[1] 0 0 1 0 1
Levels: 0 1
Levels can also be assigned using the levels() function.
# set levels
levels(fgen) <- c("M", "F")
fgen
[1] M M F M F
Levels: M F
as.numeric(fgen)
[1] 1 1 2 1 2
The function as.numeric() extracts the internal coding as the numbers 1 and 2. Note that the original input values are 0 and 1, but R internally stores the categorical values as a vector of integers starting with 1.
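Because as.numeric() returns the internal codes, it does not recover numbers that are stored as level labels. The sketch below uses a hypothetical factor year_f (not part of the tutorial's data) to show the difference, and converts through as.character() to get the labels back as numbers.

```r
# Hypothetical data: numeric values stored as factor labels
year_f <- factor(c("2010", "2005", "2010", "2020"))

# Returns the internal integer codes, not the years
internal_codes <- as.numeric(year_f)
internal_codes

# Convert to character first to recover the labels as numbers
years <- as.numeric(as.character(year_f))
years
```

The levels are sorted as 2005, 2010, 2020, so the first call gives the codes 2, 1, 2, 3, while the second gives the original years.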
Ordered factor in R
An ordered factor is used when the data are qualitative but the levels have a natural order, such as the socio-economic status of students.
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses
[1] "MES" "LES" "HES" "MES" "LES" "LES"
In the above R code, ses is a character vector. If we call the factor() function with the argument ordered=TRUE, R converts the character vector ses to an ordered factor.
# convert ses to ordered factor
ses_f <- factor(ses, ordered = TRUE)
ses_f
[1] MES LES HES MES LES LES
Levels: HES < LES < MES
Since the levels are not specified, R assigns the levels alphabetically, which gives the order HES < LES < MES rather than the intended one.
We know that ses is an ordinal variable. To get a properly ordered factor, we have to list the levels in their proper order using the levels argument.
ses_factor <- factor(ses,
levels = c("LES", "MES", "HES"),
ordered = TRUE)
ses_factor
[1] MES LES HES MES LES LES
Levels: LES < MES < HES
Now the levels of the ses_factor variable are in the proper order.
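Once the levels are in the proper order, an ordered factor supports comparisons that would not be meaningful for a nominal factor. The following is a small sketch under that assumption; the names below_hes, highest and sorted_ses are introduced here for illustration.

```r
# Rebuild the ordered factor from the tutorial's data
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses_factor <- factor(ses,
                     levels = c("LES", "MES", "HES"),
                     ordered = TRUE)

# Which respondents rank below "HES"?
below_hes <- ses_factor < "HES"
below_hes

# Highest status present in the data
highest <- max(ses_factor)
highest

# Sorting follows the level order, not the alphabet
sorted_ses <- sort(ses_factor)
sorted_ses
```

For an unordered factor, the same comparison with < returns NA with a warning, which is one practical reason to declare ordinal data as ordered.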
Creating factor from numerical data
A numerical variable can be converted to a factor by using the cut() function. It divides the range of the variable into intervals and codes each value according to the interval in which it falls.
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
24, 13, 26, 47, 48, 50)
ageGroup1 <- cut(age, breaks = c(0, 20, 30, 50))
age
[1] 10 20 25 12 30 45 50 26 24 13 26 47 48 50
The first statement creates a vector of values and stores it in age. The second statement uses cut() to divide the range of age into the intervals (0,20], (20,30] and (30,50], and stores the result as a factor variable with 3 levels.
ageGroup1
[1] (0,20] (0,20] (20,30] (0,20] (20,30] (30,50] (30,50] (20,30] (20,30]
[10] (0,20] (20,30] (30,50] (30,50] (30,50]
Levels: (0,20] (20,30] (30,50]
str(ageGroup1)
Factor w/ 3 levels "(0,20]","(20,30]",..: 1 1 2 1 2 3 3 2 2 1 ...
To create intervals that are closed on the left side, use the argument right=FALSE in the cut() function.
ageGroup2<-cut(age,breaks=c(0,20,30,50),right=FALSE)
ageGroup2
[1] [0,20) [20,30) [20,30) [0,20) [30,50) [30,50) <NA> [20,30) [20,30)
[10] [0,20) [20,30) [30,50) [30,50) <NA>
Levels: [0,20) [20,30) [30,50)
Here cut() divides the range of age into the intervals [0,20), [20,30) and [30,50) because we have used right=FALSE.
From the above output, we can see that any value that falls outside the intervals defined by breaks is stored as NA. Here the value 50 lies outside [30,50), so both occurrences of 50 become NA.
str(ageGroup2) # give structure of ageGroup2
Factor w/ 3 levels "[0,20)","[20,30)",..: 1 2 2 1 3 3 NA 2 2 1 ...
is.factor(ageGroup2)
[1] TRUE
is.ordered(ageGroup2)
[1] FALSE
attributes(ageGroup2)
$levels
[1] "[0,20)" "[20,30)" "[30,50)"
$class
[1] "factor"
class(ageGroup2)
[1] "factor"
To get the frequency table of a factor variable, we can use the table() function.
table(ageGroup1) # Tabulate the variable ageGroup1
ageGroup1
(0,20] (20,30] (30,50]
4 5 5
table(ageGroup2) # Tabulate the variable ageGroup2
ageGroup2
[0,20) [20,30) [30,50)
3 5 4
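The table for ageGroup2 sums to 12 rather than 14 because the two NA values are not counted. One way to keep the boundary value 50, sketched below, is the include.lowest argument of cut(), which with right=FALSE closes the last interval on the right. The labels argument can also replace the interval notation; the names ageGroup3 and ageGroup4 and the labels "Young", "Middle" and "Older" are illustrative assumptions, not part of the original tutorial.

```r
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)

# Left-closed intervals, with the last interval also closed on the right
ageGroup3 <- cut(age, breaks = c(0, 20, 30, 50),
                 right = FALSE, include.lowest = TRUE)
levels(ageGroup3)
table(ageGroup3)

# Default right-closed intervals with descriptive (illustrative) labels
ageGroup4 <- cut(age, breaks = c(0, 20, 30, 50),
                 labels = c("Young", "Middle", "Older"))
table(ageGroup4)
```

With include.lowest=TRUE the value 50 falls in the last interval, so no element of ageGroup3 is NA.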
Generating Factor Levels
The function gl() generates a regular series of factor levels.
The general structure of gl() is
gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)
where
- n : an integer giving the number of levels,
- k : an integer giving the number of replications,
- length : an integer giving the length of the result,
- labels : a vector of labels for the resulting factor levels,
- ordered : whether the result should be ordered or not.
gl(2,6)
[1] 1 1 1 1 1 1 2 2 2 2 2 2
Levels: 1 2
It generates two levels, 1 and 2, each replicated six times.
gl(3,4,length=14)
[1] 1 1 1 1 2 2 2 2 3 3 3 3 1 1
Levels: 1 2 3
gl(2,4,labels=c("Smoker","Non-smoker"))
[1] Smoker Smoker Smoker Smoker Non-smoker Non-smoker Non-smoker
[8] Non-smoker
Levels: Smoker Non-smoker
This generates two levels, Smoker and Non-smoker, each replicated four times.
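A common use of gl() is to lay out the grouping columns of a balanced design. The sketch below is only an illustration: the names block and treatment and their labels are hypothetical. Setting k = 1 together with length makes the levels alternate instead of appearing in runs.

```r
# Three blocks, each repeated twice: B1 B1 B2 B2 B3 B3
block <- gl(3, 2, labels = c("B1", "B2", "B3"))

# Two treatments alternating over six observations
treatment <- gl(2, 1, length = 6, labels = c("Control", "Treated"))

# Combine into a data frame describing the layout
design <- data.frame(block, treatment)
design
```

Each block then contains one Control and one Treated observation.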
Some Functions for Factors
In R, various functions are available to get information about a factor variable:
Function | Meaning |
---|---|
length() | number of elements in the factor |
nlevels() | number of levels in the factor |
levels() | returns or sets the levels of the factor |
Length of factor variable
The R code below displays the number of elements in the ses_factor factor variable.
length(ses_factor)
[1] 6
Number of Levels of factor variable
The R code below displays the number of levels in the ses_factor factor variable.
nlevels(ses_factor)
[1] 3
Display levels of factor variable
The R code below displays the levels of the ses_factor factor variable.
levels(ses_factor)
[1] "LES" "MES" "HES"
Setting levels of factor variable
Recall the fgen factor variable defined earlier, in which the levels are M and F.
fgen
[1] M M F M F
Levels: M F
Suppose we want to rename the level M to Male and the level F to Female. We can use the levels() function to set the levels of the factor variable.
levels(fgen)<-c("Male","Female")
fgen
[1] Male Male Female Male Female
Levels: Male Female
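Assigning a plain character vector to levels() relies on the position of each level. Two alternatives that make the mapping explicit are sketched below, starting again from the 0/1 coding used earlier: the labels argument of factor() sets the names at creation time, and the list form of the levels() assignment maps each new name to an old level. The variable names fgen_a and fgen_b are introduced here for illustration.

```r
gen <- c(0, 0, 1, 0, 1)

# Option 1: supply labels when the factor is created
fgen_a <- factor(gen, levels = 0:1, labels = c("Male", "Female"))
fgen_a

# Option 2: rename existing levels with a named list (new = old)
fgen_b <- factor(gen, levels = 0:1)
levels(fgen_b) <- list(Male = "0", Female = "1")
fgen_b
```

Both approaches give the same factor with levels Male and Female.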
Accessing elements of factor in R
Any particular element or elements of a factor variable in R can be accessed using square brackets [].
Accessing elements by Positive Indexing
The elements of a factor variable can be accessed using the positive index of an element's position, a sequence of positive indices (for adjacent elements), or the c() function with the positions of the elements.
Recall the factor ses_factor defined earlier.
ses_factor
[1] MES LES HES MES LES LES
Levels: LES < MES < HES
The R code below returns the $3^{rd}$ element of ses_factor.
ses_factor[3]
[1] HES
Levels: LES < MES < HES
The R code below returns the first and second elements of ses_factor.
ses_factor[1:2]
[1] MES LES
Levels: LES < MES < HES
The R code below returns the first and third elements of ses_factor.
ses_factor[c(1,3)]
[1] MES HES
Levels: LES < MES < HES
Accessing elements by Negative Indexing
The R code below returns all the elements of ses_factor except the second.
ses_factor[-2]
[1] MES HES MES LES LES
Levels: LES < MES < HES
The R code below returns all the elements except those at positions 1 through 3.
ses_factor[-(1:3)]
[1] MES LES LES
Levels: LES < MES < HES
The R code below returns all the elements of ses_factor except the first and fourth.
ses_factor[-c(1,4)]
[1] LES HES LES LES
Levels: LES < MES < HES
Accessing elements by Logical Vector
When elements are accessed using a logical vector, R returns only those elements for which the corresponding value of the logical vector is TRUE.
For example, in the R code below, R returns the first and third elements of the factor variable ses_factor because the first and third values of the logical vector are TRUE.
ses_factor[c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)]
[1] MES HES
Levels: LES < MES < HES
We can also use a logical expression to access the elements of a factor. First, R evaluates the logical expression and creates a logical vector. Then R returns those elements for which the logical expression is TRUE.
ses_factor[ses_factor!="HES"]
[1] MES LES MES LES LES
Levels: LES < MES < HES
In the above R code, the logical expression ses_factor!="HES" returns the logical vector (TRUE, TRUE, FALSE, TRUE, TRUE, TRUE). R then returns those elements of ses_factor for which the logical value is TRUE.
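Note that the subset above still lists HES among its levels even though no element has that value. When the unused level should be removed, the base R function droplevels() can be applied, as in this minimal sketch; the name no_hes is introduced here for illustration.

```r
ses_factor <- factor(c("MES", "LES", "HES", "MES", "LES", "LES"),
                     levels = c("LES", "MES", "HES"),
                     ordered = TRUE)

# Subsetting keeps all original levels
no_hes <- ses_factor[ses_factor != "HES"]
table(no_hes)      # HES is still tabulated, with a count of 0

# Remove levels that no longer occur in the data
no_hes <- droplevels(no_hes)
levels(no_hes)
table(no_hes)
```

Dropping unused levels avoids empty categories in tables and plots built from the subset.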
Membership and Coercion functions for factors
The membership and coercion functions for factor variables are as follows:
- is.factor(x) : check whether x is a factor,
- as.factor(x) : convert x to a factor,
- is.ordered(x) : check whether x is an ordered factor,
- as.ordered(x) : convert x to an ordered factor.
The R code below checks whether gender is a factor.
is.factor(gender)
[1] FALSE
The R code below coerces the character vector gender to a factor.
as.factor(gender)
[1] Male Male Female Male Female
Levels: Female Male
The R code below checks whether gender_f is a factor.
is.factor(gender_f)
[1] TRUE
The R code below checks whether gender_f is an ordered factor.
is.ordered(gender_f)
[1] FALSE
The R code below checks whether ses_factor is an ordered factor and whether it is a factor.
is.ordered(ses_factor)
[1] TRUE
is.factor(ses_factor)
[1] TRUE
Endnote
In this tutorial you learned what factors are in R and how to create factors and ordered factors.
To learn more about data structures in R, refer to the following tutorials:
- Data Types in R
- Variables and constants in R
- Data Structures in R
- Vectors in R
- Matrix in R
- Arrays in R
- Lists in R
- Data Frames in R
Hopefully you enjoyed this tutorial on factors in R and found it sufficient to understand how to create factors and ordered factors and how to generate factor levels in R.