Factors in R

In this tutorial, you will learn how to create factor variables and ordered factor variables, how to create a factor from numerical data, and how to generate factor levels in R.

Factors in R

In R, factors are used to handle categorical variables (nominal or ordinal). A factor is a variable that takes on a finite number of distinct values, called levels. In statistical modeling, categorical variables are treated differently from numeric variables.

Nominal variables are categorical variables where order is not important; e.g., the gender ("Male" or "Female") of a respondent is nominal data.

Ordinal variables are also categorical, but the order is important; e.g., the socio-economic status (SES) of a respondent ("LES", "MES" or "HES") is ordinal data. A respondent with "LES" earns less than a respondent with "MES" or "HES".

To create a factor variable in R, we use the factor() function.

The syntax of the factor() function is

factor(x = character(), levels,
labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA)

where

• x: a vector of data,
• levels : set of unique values that x might take,
• labels : character vector of labels for levels,
• exclude : a vector of values to be excluded,
• ordered : logical flag to determine if the levels should be ordered,
• nmax : an upper bound on the number of levels.
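
As a minimal sketch of how the levels, labels and exclude arguments interact, consider the following example. The vector resp and its coding ("U" for an unknown answer) are hypothetical and serve only as an illustration.

```r
# Hypothetical survey responses; "U" stands for an unknown answer
resp <- c("y", "n", "y", "U", "n")

# levels fixes the order of the categories, labels renames them
f1 <- factor(resp, levels = c("n", "y", "U"),
             labels = c("No", "Yes", "Unknown"))
levels(f1)

# exclude drops a value from the levels; matching elements become NA
f2 <- factor(resp, exclude = "U")
levels(f2)
is.na(f2)
```

In f2 the fourth element is NA because "U" was removed from the set of levels.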

Creating a factor in R

The function factor() stores a variable as a factor. Internally, R stores the categorical values as a vector of integers in the range $1,2,\cdots, k$, where $k$ is the number of unique values in the variable, and maps these integers to the category labels.

gender<-c("Male","Male","Female","Male","Female")
gender
[1] "Male"   "Male"   "Female" "Male"   "Female"

The above R code stores gender as a character vector. To convert it into a factor variable, use the factor() function.

# Store this as (2,2,1,2,1)
gender_f<-factor(gender)
gender_f
[1] Male   Male   Female Male   Female
Levels: Female Male
# display the structure
str(gender_f) 
 Factor w/ 2 levels "Female","Male": 2 2 1 2 1

The statement gender_f<-factor(gender) stores gender_f as the integer vector (2,2,1,2,1) and associates the codes with the levels 1 = Female and 2 = Male. Since the levels are not specified, R assigns them in alphabetical order (i.e., Female first and then Male).

# display the mode
mode(gender_f)
[1] "numeric"
# display the class
class(gender_f)  
[1] "factor"
# display the levels
levels(gender_f) 
[1] "Female" "Male"  
# display the number of levels
nlevels(gender_f) 
[1] 2
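
The integer coding described above can be inspected directly. The short sketch below rebuilds gender_f so that it runs on its own, then shows the storage type, the codes, and how indexing the levels by the codes recovers the original strings.

```r
gender <- c("Male", "Male", "Female", "Male", "Female")
gender_f <- factor(gender)

typeof(gender_f)        # storage type of the underlying codes
as.integer(gender_f)    # the integer codes themselves

# Indexing the levels by the codes recovers the original strings
levels(gender_f)[as.integer(gender_f)]
```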

The order of the levels can be set using the levels argument of the factor() function.

gender<-c("Male","Male","Female","Male","Female")
gender
[1] "Male"   "Male"   "Female" "Male"   "Female"
gender_2 <- factor(gender, levels = c("Male", "Female"))
gender_2
[1] Male   Male   Female Male   Female
Levels: Male Female

The output shows that the levels are no longer in alphabetical order but in the order given in the levels argument. R now stores the factor with 1 = Male and 2 = Female.

# display the structure
str(gender_2) 
 Factor w/ 2 levels "Male","Female": 1 1 2 1 2
# display the class
class(gender_2)  
[1] "factor"
# display attributes
attributes(gender_2)  
$levels
[1] "Male"   "Female"

$class
[1] "factor"

Sometimes a categorical variable is coded numerically. For example, suppose Male is coded as 0 and Female as 1, so that the data ("M","M","F","M","F") are stored as (0,0,1,0,1). Such coded data can also be converted to a factor using the factor() function by specifying the levels.

gen <- c(0, 0, 1, 0, 1)
fgen <- factor(gen, levels = 0:1)
fgen
[1] 0 0 1 0 1
Levels: 0 1

Levels can also be assigned using the levels() function.

# set levels
levels(fgen) <- c("M", "F")
fgen
[1] M M F M F
Levels: M F
as.numeric(fgen)
[1] 1 1 2 1 2

The function as.numeric() extracts the internal codes, here 1 and 2. Note that the original input values are 0 and 1, but R internally stores the categorical values as a vector of integers starting at 1.
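
Because as.numeric() returns the internal codes, recovering the original numbers from a factor with number-like levels takes an extra step. The sketch below uses the names fgen2 and fnum, which are introduced here only for illustration. It also shows that levels and labels can be combined to recode 0/1 in a single call.

```r
gen <- c(0, 0, 1, 0, 1)

# levels and labels together recode 0/1 to M/F in a single call
fgen2 <- factor(gen, levels = 0:1, labels = c("M", "F"))
fgen2

# A factor whose levels look like numbers
fnum <- factor(gen)
as.numeric(fnum)                 # internal codes 1 and 2, not 0 and 1
as.numeric(as.character(fnum))   # original values 0 and 1
```

Converting to character first makes R parse the level labels themselves rather than the codes.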

Ordered factor in R

An ordered factor is used for qualitative data whose levels have a natural order, such as the socio-economic status of students.

ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses
[1] "MES" "LES" "HES" "MES" "LES" "LES"

In the above R code, ses is a character vector. Calling the factor() function with the argument ordered=TRUE converts ses to an ordered factor.

# convert ses to ordered factor
ses_f <- factor(ses, ordered = TRUE)
ses_f
[1] MES LES HES MES LES LES
Levels: HES < LES < MES

Since the levels are not specified, R assigns them alphabetically, which gives the order HES < LES < MES.

We know that ses is an ordinal variable. To get a properly ordered factor, we have to list the levels in their proper order in the levels argument.

ses_factor <- factor(ses,
levels = c("LES", "MES", "HES"),
ordered = TRUE)
ses_factor
[1] MES LES HES MES LES LES
Levels: LES < MES < HES

Now the levels of ses_factor variable are in the proper order.
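
Once the levels are ordered correctly, comparison and sorting respect that order. The following sketch recreates ses_factor so that it is self-contained and shows a few operations that are meaningful only for ordered factors.

```r
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses_factor <- factor(ses,
                     levels = c("LES", "MES", "HES"),
                     ordered = TRUE)

ses_factor >= "MES"   # element-wise comparison against a level
max(ses_factor)       # highest status present in the data
sort(ses_factor)      # sorted by level order, not alphabetically
```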

Creating factor from numerical data

A numerical variable can be converted to a factor using the cut() function. It divides the range of the variable into intervals and codes each value according to the interval in which it falls.

age <- c(10, 20, 25, 12, 30, 45, 50, 26,
24, 13, 26, 47, 48, 50)
ageGroup1 <- cut(age, breaks = c(0, 20, 30, 50))
age
 [1] 10 20 25 12 30 45 50 26 24 13 26 47 48 50

The first statement creates a vector of values and stores it in age. The second statement uses cut() to divide the range of age into the intervals (0,20], (20,30] and (30,50] and stores the result as a factor with 3 levels.

ageGroup1 
 [1] (0,20]  (0,20]  (20,30] (0,20]  (20,30] (30,50] (30,50] (20,30] (20,30]
[10] (0,20]  (20,30] (30,50] (30,50] (30,50]
Levels: (0,20] (20,30] (30,50]
str(ageGroup1)
 Factor w/ 3 levels "(0,20]","(20,30]",..: 1 1 2 1 2 3 3 2 2 1 ...

To create an interval closed on the left side, use the argument right=FALSE in the cut() function.

ageGroup2<-cut(age,breaks=c(0,20,30,50),right=FALSE)
ageGroup2
 [1] [0,20)  [20,30) [20,30) [0,20)  [30,50) [30,50) <NA>    [20,30) [20,30)
[10] [0,20)  [20,30) [30,50) [30,50) <NA>
Levels: [0,20) [20,30) [30,50)

Here cut() divides the range of age into the intervals [0,20), [20,30) and [30,50) because we have used right=FALSE.

From the above output, we can see that any value falling outside the intervals is stored as NA. Here the two values equal to 50 become NA because the last interval [30,50) excludes its right endpoint.

str(ageGroup2) # give structure of ageGroup2
 Factor w/ 3 levels "[0,20)","[20,30)",..: 1 2 2 1 3 3 NA 2 2 1 ...
is.factor(ageGroup2)
[1] TRUE
is.ordered(ageGroup2)
[1] FALSE
attributes(ageGroup2)
$levels
[1] "[0,20)"  "[20,30)" "[30,50)"

$class
[1] "factor"
class(ageGroup2)
[1] "factor"
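
The NA values produced with right=FALSE can be avoided with the include.lowest argument of cut(), and the interval notation can be replaced through the labels argument. In the sketch below, the names ageGroup3 and ageGroup4 and the labels "Young", "Middle" and "Senior" are hypothetical choices made for illustration.

```r
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)

# With right = FALSE, include.lowest = TRUE closes the last interval
# on the right, so the value 50 is no longer dropped
ageGroup3 <- cut(age, breaks = c(0, 20, 30, 50),
                 right = FALSE, include.lowest = TRUE)
levels(ageGroup3)
anyNA(ageGroup3)

# labels replaces the interval notation with names (hypothetical ones here)
ageGroup4 <- cut(age, breaks = c(0, 20, 30, 50),
                 labels = c("Young", "Middle", "Senior"))
table(ageGroup4)
```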

To get the frequency table of a factor variable, we can use the table() function.

table(ageGroup1) # Tabulate the variable ageGroup1
ageGroup1
 (0,20] (20,30] (30,50]
      4       5       5
table(ageGroup2) # Tabulate the variable ageGroup2
ageGroup2
 [0,20) [20,30) [30,50)
      3       5       4
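
By default table() silently omits missing values, which is why the counts for ageGroup2 sum to 12 rather than 14. As a small sketch, the useNA argument makes the missing values visible, and prop.table() turns counts into proportions. The name counts is introduced here only for illustration.

```r
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)
ageGroup2 <- cut(age, breaks = c(0, 20, 30, 50), right = FALSE)

# Include a separate count for NA when any are present
counts <- table(ageGroup2, useNA = "ifany")
counts

# Proportions among the non-missing values
prop.table(table(ageGroup2))
```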

Generating Factor Levels

The function gl() generates a factor with a regular pattern of levels.

The general structure of gl() is

gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

• n : an integer giving the number of levels,
• k : an integer giving the number of replications,
• length : an integer giving the length of the result,
• labels : vector of labels for the resulting factor,
• ordered : whether the result should be ordered or not.

gl(2,6)
 [1] 1 1 1 1 1 1 2 2 2 2 2 2
Levels: 1 2

It generates two levels, 1 and 2, each repeated six times.

gl(3,4,length=14)
 [1] 1 1 1 1 2 2 2 2 3 3 3 3 1 1
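Levels: 1 2 3

With length=14, the pattern of three levels repeated four times each is recycled until 14 elements have been produced.

gl(2,4,labels=c("Smoker","Non-smoker"))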
[1] Smoker     Smoker     Smoker     Smoker     Non-smoker Non-smoker Non-smoker
[8] Non-smoker
Levels: Smoker Non-smoker

This generates two levels, Smoker and Non-smoker, each replicated four times.
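
A common use of gl() is to lay out the grouping variables of a balanced design. The sketch below is a hypothetical example (the names block and treatment and the labels "Control" and "Treated" are assumptions) that crosses two treatments with four blocks.

```r
# k = 1 with a longer length cycles through the levels repeatedly
block <- gl(4, 1, length = 8)
block

# Hypothetical balanced layout: 2 treatments crossed with 4 blocks
treatment <- gl(2, 4, labels = c("Control", "Treated"))
table(treatment, block)
```

Each treatment and block combination appears exactly once in the cross-tabulation.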

Some Functions for Factors

In R, various functions are available to get information about a factor variable:

Function     Meaning
length()     number of elements in the factor
nlevels()    number of levels in the factor
levels()     returns or sets the levels of the factor

Length of factor variable

The R code below displays the number of elements in the ses_factor factor variable.

length(ses_factor)
[1] 6

Number of Levels of factor variable

The R code below displays the number of levels in the ses_factor factor variable.

nlevels(ses_factor)
[1] 3

Display levels of factor variable

The R code below displays the levels of the ses_factor factor variable.

levels(ses_factor)
[1] "LES" "MES" "HES"

Setting levels of factor variable

Recall the fgen factor variable defined earlier, in which the levels are M and F.

fgen
[1] M M F M F
Levels: M F

Suppose we want to change the level M to Male and F to Female. We use the levels() function to set the levels of the factor variable.

levels(fgen)<-c("Male","Female")
fgen
[1] Male   Male   Female Male   Female
Levels: Male Female
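
Assigning a whole vector to levels() works by position, so the new names must be listed in the same order as the existing levels. As a hedged alternative, a single level can be renamed by matching on its current name, as in this self-contained sketch.

```r
fgen <- factor(c("M", "M", "F", "M", "F"), levels = c("M", "F"))

# Rename one level at a time by matching on its current name
levels(fgen)[levels(fgen) == "F"] <- "Female"
levels(fgen)[levels(fgen) == "M"] <- "Male"
fgen
```

This form does not depend on remembering the position of each level.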

Accessing elements of factor in R

Any element or elements of a factor in R can be accessed using square brackets [].

Accessing elements by Positive Indexing

The elements of a factor can be accessed using the positive index of a single element, a sequence of positive indices (for adjacent elements), or the c() function with the positions of the required elements.

Recall the factor ses_factor defined earlier.

ses_factor
[1] MES LES HES MES LES LES
Levels: LES < MES < HES

The R code below returns the third element of ses_factor.

ses_factor[3]
[1] HES
Levels: LES < MES < HES

The R code below returns the first and second elements of ses_factor.

ses_factor[1:2]
[1] MES LES
Levels: LES < MES < HES

The R code below returns the first and third elements of ses_factor.

ses_factor[c(1,3)]
[1] MES HES
Levels: LES < MES < HES

Accessing elements by Negative Indexing

The R code below returns all the elements of ses_factor except the second.

ses_factor[-2]
[1] MES HES MES LES LES
Levels: LES < MES < HES

The R code below returns all the elements except those at positions 1 through 3.

ses_factor[-(1:3)]
[1] MES LES LES
Levels: LES < MES < HES

The R code below returns all the elements of ses_factor except the first and fourth.

ses_factor[-c(1,4)]
[1] LES HES LES LES
Levels: LES < MES < HES

Accessing elements by Logical Vector

When a logical vector is used as the index, R returns only those elements for which the corresponding logical value is TRUE.

For example, in the R code below, R returns the first and third elements of the factor ses_factor because the first and third values of the logical vector are TRUE.

ses_factor[c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)]
[1] MES HES
Levels: LES < MES < HES

We can also use a logical expression to access the elements of a factor. R first evaluates the logical expression to create a logical vector and then returns those elements for which the expression is TRUE.

ses_factor[ses_factor!="HES"]
[1] MES LES MES LES LES
Levels: LES < MES < HES

In the above R code, the logical expression ses_factor!="HES" returns the logical vector (TRUE, TRUE, FALSE, TRUE, TRUE, TRUE). Based on this logical vector, R returns those elements of ses_factor for which the logical value is TRUE.
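
Note that in the output above the level HES is still listed even though no element has that value. As a brief sketch, the droplevels() function removes such unused levels. The names sub and sub2 are introduced here only for illustration.

```r
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses_factor <- factor(ses,
                     levels = c("LES", "MES", "HES"),
                     ordered = TRUE)

sub <- ses_factor[ses_factor != "HES"]
nlevels(sub)      # still 3: HES remains as an empty level
table(sub)

# Remove levels that no longer occur in the data
sub2 <- droplevels(sub)
levels(sub2)
```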

Membership and Coercion functions for factors

The membership and coercion functions for factor variables are as follows:

• is.factor(x) : Check whether x is factor
• as.factor(x) : Convert x to factor
• is.ordered(x): Check whether x is ordered factor
• as.ordered(x): Convert x to ordered factor

The R code below checks whether gender is a factor.

is.factor(gender)
[1] FALSE

The R code below coerces the character vector gender to a factor.

as.factor(gender)
[1] Male   Male   Female Male   Female
Levels: Female Male

The R code below checks whether gender_f is a factor.

is.factor(gender_f)
[1] TRUE

The R code below checks whether gender_f is an ordered factor.

is.ordered(gender_f)
[1] FALSE

The R code below checks whether ses_factor is an ordered factor and whether it is a factor.

is.ordered(ses_factor)
[1] TRUE
is.factor(ses_factor)
[1] TRUE
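
The list above also mentions as.ordered(), which is not demonstrated in the examples. The following minimal sketch illustrates it. The names ses_f and ses_o are used here only for illustration. Since as.ordered() has no levels argument, the level order has to be fixed beforehand when it matters.

```r
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")

# Applied to a character vector, the levels are ordered alphabetically
levels(as.ordered(ses))

# An existing factor keeps its level order when passed to as.ordered()
ses_f <- factor(ses, levels = c("LES", "MES", "HES"))
ses_o <- as.ordered(ses_f)
is.ordered(ses_o)
levels(ses_o)
```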

Endnote

In this tutorial, you learned what factors are in R and how to create factors and ordered factors, build factors from numerical data, generate factor levels, and access the elements of a factor.