Factors in R

In this tutorial you will learn how to create factor variables and ordered factor variables, how to create a factor from numerical data, and how to generate factor levels in R.

## Factors in R

In R, factors are used to handle categorical variables (nominal or ordinal). A factor is a variable that takes a finite number of distinct values, which are called its levels. In statistical modeling, categorical variables are treated differently from numeric variables.

Nominal variables are categorical variables whose values have no natural order. For example, the gender of a respondent ("Male" or "Female") is nominal data.

Ordinal variables are also categorical, but the order of the values matters. For example, the socio-economic status (SES) of a respondent ("LES", "MES" or "HES") is ordinal data: a respondent with "LES" earns less than a respondent with "MES" or "HES".

To create a factor variable in R, we use the `factor()` function.

The syntax of the `factor()` function is

```
factor(x = character(), levels,
       labels = levels, exclude = NA,
       ordered = is.ordered(x), nmax = NA)
```

where

- **x:** a vector of data,
- **levels:** the set of unique values that `x` might take,
- **labels:** a character vector of labels for the levels,
- **exclude:** a vector of values to be excluded when forming the levels,
- **ordered:** a logical flag to determine whether the levels should be treated as ordered,
- **nmax:** an upper bound on the number of levels.
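
The `labels` and `exclude` arguments are not used in the examples that follow, so here is a minimal sketch of how they behave. The vector `resp` and the code `"U"` (unknown) are assumptions made up for this illustration.

```r
# Hypothetical survey responses; "U" (unknown) is an illustrative extra code.
resp <- c("M", "F", "M", "U", "F")

# levels selects the accepted values; labels renames them in the same order.
# "U" is not listed in levels, so it becomes NA.
resp_f <- factor(resp, levels = c("M", "F"), labels = c("Male", "Female"))
resp_f
levels(resp_f)

# exclude removes the listed values from the set of levels;
# the matching elements become NA.
resp_x <- factor(resp, exclude = "U")
levels(resp_x)
is.na(resp_x)
```

In both cases the unwanted value ends up as `NA` rather than as a level of its own.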

### Creating a factor in R

The function `factor()` is used to store a variable as a factor. R internally stores the categorical values as a vector of integers in the range $1,2,\cdots, k$, where $k$ is the number of unique values in the variable, and maps these integers to the categorical values.

```
gender<-c("Male","Male","Female","Male","Female")
gender
```

`[1] "Male" "Male" "Female" "Male" "Female"`

The above R code stores `gender` as a character vector. To convert it into a factor variable, use the `factor()` function.

```
# Store this as (2,2,1,2,1)
gender_f<-factor(gender)
gender_f
```

```
[1] Male Male Female Male Female
Levels: Female Male
```

```
# display the structure
str(gender_f)
```

` Factor w/ 2 levels "Female","Male": 2 2 1 2 1`

The statement `gender_f<-factor(gender)` stores `gender_f` as the integer vector (2,2,1,2,1) and associates it with `1 = Female` and `2 = Male`. Since the levels are not specified, R assigns the levels to the factor variable alphabetically (i.e., `Female` first and then `Male`).
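
The mapping between the integer codes and the levels can be checked directly. The following is a small sketch that rebuilds `gender_f` so it runs on its own; the helper name `codes` is introduced only for this illustration.

```r
gender <- c("Male", "Male", "Female", "Male", "Female")
gender_f <- factor(gender)

# Integer codes stored internally (positions in the levels vector)
codes <- as.integer(gender_f)
codes

# Indexing the levels by the codes reconstructs the original values
levels(gender_f)[codes]
```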

```
# display the mode
mode(gender_f)
```

`[1] "numeric"`

```
# display the class
class(gender_f)
```

`[1] "factor"`

```
# display the levels
levels(gender_f)
```

`[1] "Female" "Male" `

```
# display the number of levels
nlevels(gender_f)
```

`[1] 2`

The order of the levels can be set using the `levels` argument of the `factor()` function.

```
gender<-c("Male","Male","Female","Male","Female")
gender
```

`[1] "Male" "Male" "Female" "Male" "Female"`

```
gender_2 <- factor(gender, levels = c("Male", "Female"))
gender_2
```

```
[1] Male Male Female Male Female
Levels: Male Female
```

From the output of the above code, it is clear that the levels are not in alphabetical order but in the order specified in the `levels` argument. R now stores the factor with `1 = Male` and `2 = Female`.

```
# display the structure
str(gender_2)
```

` Factor w/ 2 levels "Male","Female": 1 1 2 1 2`

```
# display the class
class(gender_2)
```

`[1] "factor"`

```
# display attributes
attributes(gender_2)
```

```
$levels
[1] "Male" "Female"
$class
[1] "factor"
```

Sometimes a categorical variable is coded numerically. For example, suppose we code `Male` as 0 and `Female` as 1, so that the data `("M","M","F","M","F")` become `(0,0,1,0,1)`. Such coded data can also be converted to a factor using the `factor()` function by specifying the levels.

```
gen <- c(0, 0, 1, 0, 1)
fgen <- factor(gen, levels = 0:1)
fgen
```

```
[1] 0 0 1 0 1
Levels: 0 1
```

Levels can also be assigned using the `levels()` function.

```
# set levels
levels(fgen) <- c("M", "F")
fgen
```

```
[1] M M F M F
Levels: M F
```

`as.numeric(fgen)`

`[1] 1 1 2 1 2`

The function `as.numeric()` extracts the internal numerical coding, here the numbers 1 and 2. Note that the original input values are 0 and 1, but R internally stores the categorical values as a vector of integers starting from 1.
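
Because `as.numeric()` returns the internal codes, recovering the original numbers from a factor whose levels are numeric needs one extra step. The sketch below uses a separate factor, `fgen_num` (a name introduced only for this example), whose levels are still `"0"` and `"1"`.

```r
gen <- c(0, 0, 1, 0, 1)
fgen_num <- factor(gen)   # levels are "0" and "1"

# as.numeric() on the factor returns the internal codes, not 0 and 1
as.numeric(fgen_num)

# Going through the level labels recovers the original values
as.numeric(as.character(fgen_num))
as.numeric(levels(fgen_num))[fgen_num]
```

Both of the last two expressions give back 0 and 1; the second converts only the levels and so does less work on long vectors.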

### Ordered factor in R

An ordered factor is used when we have qualitative data whose levels are assumed to be ordered, such as the socio-economic status of students.

```
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses
```

`[1] "MES" "LES" "HES" "MES" "LES" "LES"`

In the above R code, `ses` is a character vector. If we use the `factor()` function with the argument `ordered=TRUE`, R converts the character vector `ses` to an ordered factor.

```
# convert ses to ordered factor
ses_f <- factor(ses, ordered = TRUE)
ses_f
```

```
[1] MES LES HES MES LES LES
Levels: HES < LES < MES
```

Since the levels are not specified, R assigns the levels alphabetically, which gives the incorrect order `HES < LES < MES`.

We know that `ses` is an ordinal variable. To get a properly ordered factor, we have to list the levels in the correct order in the `levels` argument.

```
ses_factor <- factor(ses,
                     levels = c("LES", "MES", "HES"),
                     ordered = TRUE)
ses_factor
```

```
[1] MES LES HES MES LES LES
Levels: LES < MES < HES
```

Now the levels of the `ses_factor` variable are in the proper order.
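
One practical consequence of the ordering is that comparison operators and functions such as `min()`, `max()` and `sort()` follow the level order. The following sketch rebuilds `ses_factor` so it can be run on its own.

```r
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses_factor <- factor(ses, levels = c("LES", "MES", "HES"), ordered = TRUE)

# Comparison operators respect the level order LES < MES < HES
ses_factor < "HES"
ses_factor >= "MES"

# min(), max() and sort() also use the level order
min(ses_factor)
max(ses_factor)
sort(ses_factor)
```

For an unordered factor, comparisons such as `<` are not meaningful and R returns `NA` with a warning.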

### Creating factor from numerical data

A numerical variable can be converted to a factor using the `cut()` function. It divides the range of the variable into intervals and codes each value according to the interval into which it falls.

```
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)
ageGroup1 <- cut(age, breaks = c(0, 20, 30, 50))
age
```

` [1] 10 20 25 12 30 45 50 26 24 13 26 47 48 50`

The first statement creates a vector of values and stores it in `age`. The second statement uses `cut()` to divide the range of values of `age` into the intervals `(0,20]`, `(20,30]` and `(30,50]`, and stores the result as a factor variable with 3 levels.

`ageGroup1 `

```
[1] (0,20] (0,20] (20,30] (0,20] (20,30] (30,50] (30,50] (20,30] (20,30]
[10] (0,20] (20,30] (30,50] (30,50] (30,50]
Levels: (0,20] (20,30] (30,50]
```

`str(ageGroup1)`

` Factor w/ 3 levels "(0,20]","(20,30]",..: 1 1 2 1 2 3 3 2 2 1 ...`
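
The interval notation is not always convenient as a level name. As a hedged sketch, `cut()` also accepts a `labels` argument that names the resulting levels; the label text and the variable name `ageGroup_lab` below are illustrative choices, not part of the original example.

```r
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)

# One label per interval, in the same order as the intervals.
# The label text is illustrative only.
ageGroup_lab <- cut(age, breaks = c(0, 20, 30, 50),
                    labels = c("up to 20", "21 to 30", "31 to 50"))
ageGroup_lab
levels(ageGroup_lab)
```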

To create intervals that are closed on the left and open on the right, use the argument `right=FALSE` in the `cut()` function.

```
ageGroup2<-cut(age,breaks=c(0,20,30,50),right=FALSE)
ageGroup2
```

```
[1] [0,20) [20,30) [20,30) [0,20) [30,50) [30,50) <NA> [20,30) [20,30)
[10] [0,20) [20,30) [30,50) [30,50) <NA>
Levels: [0,20) [20,30) [30,50)
```

Here `cut()` divides the range of values of `age` into the intervals `[0,20)`, `[20,30)` and `[30,50)` because we have used `right=FALSE`.

From the above output, we can see that a value falling outside all of the intervals defined by `breaks` is stored as `NA`. Here the two values equal to 50 are not covered by `[30,50)`.
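
If the boundary value should be kept rather than turned into `NA`, one option is the `include.lowest` argument of `cut()`. With `right = FALSE` it closes the last interval on the right. The sketch below uses the hypothetical name `ageGroup3` for the result.

```r
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)

# With right = FALSE, include.lowest = TRUE closes the last interval
# on the right, so values equal to the largest break (50) are kept.
ageGroup3 <- cut(age, breaks = c(0, 20, 30, 50),
                 right = FALSE, include.lowest = TRUE)
ageGroup3
anyNA(ageGroup3)
table(ageGroup3)
```

Another option is simply to extend `breaks` beyond the largest observed value.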

`str(ageGroup2) # give structure of ageGroup2`

` Factor w/ 3 levels "[0,20)","[20,30)",..: 1 2 2 1 3 3 NA 2 2 1 ...`

`is.factor(ageGroup2)`

`[1] TRUE`

`is.ordered(ageGroup2)`

`[1] FALSE`

`attributes(ageGroup2)`

```
$levels
[1] "[0,20)" "[20,30)" "[30,50)"
$class
[1] "factor"
```

`class(ageGroup2)`

`[1] "factor"`

To get the frequency table of a factor variable, we can use the `table()` function.

`table(ageGroup1) # Tabulate the variable ageGroup1`

```
ageGroup1
 (0,20] (20,30] (30,50]
      4       5       5
```

`table(ageGroup2) # Tabulate the variable ageGroup2`

```
ageGroup2
 [0,20) [20,30) [30,50)
      3       5       4
```
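
The counts for `ageGroup2` add up to 12 rather than 14 because `table()` leaves out `NA` values by default. A minimal sketch of how to make the missing values visible, and how to turn counts into proportions, follows; the names `tab_na` and `tab_prop` are introduced only for this example.

```r
age <- c(10, 20, 25, 12, 30, 45, 50, 26,
         24, 13, 26, 47, 48, 50)
ageGroup2 <- cut(age, breaks = c(0, 20, 30, 50), right = FALSE)

# useNA = "ifany" adds a column for missing values when there are any
tab_na <- table(ageGroup2, useNA = "ifany")
tab_na

# Relative frequencies of the non-missing groups
tab_prop <- prop.table(table(ageGroup2))
tab_prop
```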

### Generating Factor Levels

The function `gl()` generates a factor with a regular pattern of levels.

The general structure of `gl()` is

`gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)`

where

- `n`: an integer giving the number of levels,
- `k`: an integer giving the number of replications,
- `length`: an integer giving the length of the result,
- `labels`: an optional vector of labels for the resulting factor levels,
- `ordered`: a logical indicating whether the result should be ordered or not.

`gl(2,6)`

```
[1] 1 1 1 1 1 1 2 2 2 2 2 2
Levels: 1 2
```

It generates a factor with two levels, 1 and 2, each replicated six times.

`gl(3,4,length=14)`

```
[1] 1 1 1 1 2 2 2 2 3 3 3 3 1 1
Levels: 1 2 3
```

`gl(2,4,labels=c("Smoker","Non-smoker"))`

```
[1] Smoker Smoker Smoker Smoker Non-smoker Non-smoker Non-smoker
[8] Non-smoker
Levels: Smoker Non-smoker
```

It generates two levels, `Smoker` and `Non-smoker`, each replicated four times.
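
A typical use of `gl()` is to lay out the grouping columns of a designed experiment. The sketch below is only an illustration: the data frame `design` and its column names are assumptions, not part of the tutorial's data.

```r
# Illustrative layout: 3 treatments, each observed in 2 blocks.
# The names design, treatment and block are made up for this example.
design <- data.frame(
  treatment = gl(3, 2, labels = c("A", "B", "C")),  # A A B B C C
  block     = gl(2, 1, length = 6)                  # 1 2 1 2 1 2
)
design

# Every treatment appears exactly once in every block
table(design$treatment, design$block)
```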

## Some Functions for Factors

In R, various functions are available to get information about a factor variable:

Function | Meaning |
---|---|
`length()` | number of elements in the factor |
`nlevels()` | number of levels in the factor |
`levels()` | returns the levels of the factor or sets them |

### Length of factor variable

The R code below displays the number of elements in the `ses_factor` factor variable.

`length(ses_factor)`

`[1] 6`

### Number of Levels of factor variable

The R code below displays the number of levels in the `ses_factor` factor variable.

`nlevels(ses_factor)`

`[1] 3`

### Display levels of factor variable

The R code below displays the levels of the `ses_factor` factor variable.

`levels(ses_factor)`

`[1] "LES" "MES" "HES"`

### Setting levels of factor variable

Recall the `fgen` factor variable defined earlier, in which the levels are `M` and `F`.

`fgen`

```
[1] M M F M F
Levels: M F
```

Suppose we want to rename the level `M` to `Male` and the level `F` to `Female`. We can use the `levels()` function to set the levels of the factor variable.

```
levels(fgen)<-c("Male","Female")
fgen
```

```
[1] Male Male Female Male Female
Levels: Male Female
```
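
Assigning a plain character vector replaces the levels by position, so it relies on knowing the current order of the levels. As a hedged alternative, `levels()` also accepts a named list that maps each new level to the old one. The factor `fgen2` below is a separate illustrative variable whose levels sort alphabetically as `F`, `M`.

```r
fgen2 <- factor(c("M", "M", "F", "M", "F"))   # levels sort as "F", "M"

# Positional assignment of c("Male", "Female") would label F as "Male" here,
# so map by name instead: each list element is new_level = old_level.
levels(fgen2) <- list(Male = "M", Female = "F")
fgen2
levels(fgen2)
```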

## Accessing elements of factor in R

One or more elements of a factor variable in R can be accessed using square brackets `[]`.

### Accessing elements by Positive Indexing

The elements of a factor variable can be accessed using a positive index giving the position of an element, a sequence of positive indices (for adjacent elements), or the `c()` function with the positions of several elements.

Recall the factor `ses_factor` defined earlier.

`ses_factor`

```
[1] MES LES HES MES LES LES
Levels: LES < MES < HES
```

The R code below returns the $3^{rd}$ element of `ses_factor`.

`ses_factor[3]`

```
[1] HES
Levels: LES < MES < HES
```

The R code below returns the first and second elements of `ses_factor`.

`ses_factor[1:2]`

```
[1] MES LES
Levels: LES < MES < HES
```

The R code below returns the first and third elements of `ses_factor`.

`ses_factor[c(1,3)]`

```
[1] MES HES
Levels: LES < MES < HES
```

### Accessing elements by Negative Indexing

Negative indices exclude the elements at the given positions. The R code below returns all the elements of `ses_factor` except the second.

`ses_factor[-2]`

```
[1] MES HES MES LES LES
Levels: LES < MES < HES
```

The R code below returns all the elements except those at positions 1 through 3.

`ses_factor[-(1:3)]`

```
[1] MES LES LES
Levels: LES < MES < HES
```

The R code below returns all the elements of `ses_factor` except the first and fourth.

`ses_factor[-c(1,4)]`

```
[1] LES HES LES LES
Levels: LES < MES < HES
```

### Accessing elements by Logical Vector

When a logical vector is used as the index, R returns only those elements for which the corresponding value of the logical vector is `TRUE`.

For example, in the R code below, R returns the first and third elements of the factor variable `ses_factor` because only the first and third values of the logical vector are `TRUE`.

`ses_factor[c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)]`

```
[1] MES HES
Levels: LES < MES < HES
```

We can also use a logical expression to access the elements of a factor. R first evaluates the logical expression to create a logical vector and then returns those elements for which the expression is `TRUE`.

`ses_factor[ses_factor!="HES"]`

```
[1] MES LES MES LES LES
Levels: LES < MES < HES
```

In the above R code, the logical expression `ses_factor!="HES"` returns the logical vector (TRUE, TRUE, FALSE, TRUE, TRUE, TRUE). R then returns those elements of `ses_factor` for which the logical value is `TRUE`.
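
Note that in the outputs above the subset still lists `HES` among its levels even when no element has that value. If the unused level is not wanted, it can be dropped. The following is a minimal sketch; the names `ses_sub` and `ses_clean` are introduced only for this example.

```r
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")
ses_factor <- factor(ses, levels = c("LES", "MES", "HES"), ordered = TRUE)

ses_sub <- ses_factor[ses_factor != "HES"]
levels(ses_sub)   # "HES" is still listed as a level
table(ses_sub)    # and appears with a count of 0

# droplevels() removes levels that no longer occur
ses_clean <- droplevels(ses_sub)
levels(ses_clean)

# Alternative: drop unused levels while subsetting
levels(ses_factor[ses_factor != "HES", drop = TRUE])
```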

## Membership and Coercion functions for factors

The membership and coercion functions for factor variables are as follows:

- `is.factor(x)`: check whether `x` is a factor
- `as.factor(x)`: convert `x` to a factor
- `is.ordered(x)`: check whether `x` is an ordered factor
- `as.ordered(x)`: convert `x` to an ordered factor

The R code below checks whether `gender` is a factor.

`is.factor(gender)`

`[1] FALSE`

The R code below coerces the character vector `gender` to a factor.

`as.factor(gender)`

```
[1] Male Male Female Male Female
Levels: Female Male
```

The R code below checks whether `gender_f` is a factor.

`is.factor(gender_f)`

`[1] TRUE`

The R code below checks whether `gender_f` is an ordered factor.

`is.ordered(gender_f)`

`[1] FALSE`

The R code below checks whether `ses_factor` is an ordered factor.

`is.ordered(ses_factor)`

`[1] TRUE`

`is.factor(ses_factor)`

`[1] TRUE`
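
The function `as.ordered()` is listed above but not demonstrated. The sketch below shows it on an unordered factor whose levels are already in the intended order, along with coercion back to character and integer vectors. The names `ses_f2` and `ses_o` are assumptions used only here.

```r
ses <- c("MES", "LES", "HES", "MES", "LES", "LES")

# An unordered factor with the levels listed in the intended order
ses_f2 <- factor(ses, levels = c("LES", "MES", "HES"))
is.ordered(ses_f2)

# as.ordered() keeps the existing level order and marks the factor as ordered
ses_o <- as.ordered(ses_f2)
is.ordered(ses_o)
levels(ses_o)

# Coercing back to a character vector and to the integer codes
as.character(ses_o)
as.integer(ses_o)
```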

## Endnote

In this tutorial you learned what factors are in R and how to create factors and ordered factors in R.

To learn more about data structures in R, refer to the following tutorials:

- Data Types in R
- Variables and constants in R
- Data Structures in R
- Vectors in R
- Matrix in R
- Arrays in R
- Lists in R
- Data Frames in R

Hopefully you enjoyed this tutorial on factors in R. The content should be sufficient to understand how to create factors and ordered factors and how to generate factor levels in R.