Data frames in R

In this tutorial, you will learn what a data frame is in R, how to create a data frame, and how to access variables and/or observations from a data frame.

What is a Data Frame in R?

In R, a data frame is the primary data structure for handling tabular data sets, much like a spreadsheet. A data frame is not an atomic structure: internally it is a list of equal-length vectors, one per column. Data frames resemble matrices, except that the columns are allowed to be of different types. That is, a data frame stores heterogeneous data types, whereas a matrix stores a single (homogeneous) data type.

Each row of a data frame corresponds to one observational unit, and each column corresponds to one variable.
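The contrast with a matrix can be seen in a small sketch. The object names m and df are illustrative only, and the character column assumes R 4.0 or later, where strings are not converted to factors by default.

```r
# A matrix holds a single type: mixing text and numbers
# coerces every element to character
m <- cbind(name = c("A", "B"), age = c(10, 20))
typeof(m)          # "character"

# A data frame keeps a separate type for each column
df <- data.frame(name = c("A", "B"), age = c(10, 20))
sapply(df, class)  # name is "character", age is "numeric"

# Internally, a data frame is a list of equal-length vectors
is.list(df)        # TRUE
```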

[Figure: data frames in R]

How to create a data frame in R?

In R, data frames can be created using the data.frame() function. It converts a collection of vectors, or a matrix, into a data frame.
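As a minimal sketch of the matrix case, the code below converts a small matrix into a data frame. The matrix m and its column names a, b and c are made up for illustration.

```r
# Build a small integer matrix with named columns (hypothetical data)
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# data.frame() turns each matrix column into a variable
df <- data.frame(m)
class(df)   # "data.frame"
names(df)   # "a" "b" "c"
dim(df)     # 2 3
```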

Creating an empty data frame in R

Sometimes you need to initialize an empty data frame that has only variable names and their storage types. The data.frame() function can create such an empty structure when it is given zero-length vectors.

Suppose you want to create an empty data frame with 5 variables: Name, Gender, Age, Weight and Height. The following R code creates an empty data frame with these variable names.

data1 <- data.frame(
  Name = character(),
  Gender = character(),
  Age = numeric(),
  Weight = numeric(),
  Height = numeric()
)
str(data1)
'data.frame': 0 obs. of  5 variables:
 $ Name  : chr 
 $ Gender: chr 
 $ Age   : num 
 $ Weight: num 
 $ Height: num 

Sometimes you need an empty data frame with the same structure as an existing data frame. The following R code copies only the structure of data1 into a new, empty data frame data2.

data2 <- data1[FALSE, ]
str(data2)
'data.frame': 0 obs. of  5 variables:
 $ Name  : chr 
 $ Gender: chr 
 $ Age   : num 
 $ Weight: num 
 $ Height: num 
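A minimal sketch of how such an empty structure might be used afterwards: rows are appended with rbind(), and the column types declared up front are kept. The two-column data frame here is a shortened, illustrative version of data1.

```r
# Empty data frame with two typed columns (shortened for illustration)
data1 <- data.frame(Name = character(), Age = numeric())

# Indexing with zero rows also copies only the structure
data2 <- data1[0, ]
nrow(data2)            # 0

# Append a first observation; the declared column types are preserved
data2 <- rbind(data2, data.frame(Name = "A", Age = 10))
nrow(data2)            # 1
sapply(data2, class)   # Name is "character", Age is "numeric"
```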

Creating a data frame using data.frame() function

Suppose we have some data about the students as follows:

Name  Gender  Age  Weight
A     Male    10   26
B     Female  20   35
C     Female  12   28
D     Male    14   30
E     Male    16   31
F     Female  15   29
G     Male    17   34

student <- data.frame(
  name = c("A", "B", "C", "D", "E", "F", "G"),
  gender = c('Male', 'Female', 'Female',
          'Male', 'Male', 'Female', 'Male'),
  age = c(10, 20, 12, 14, 16, 15, 17),
  weight = c(26, 35, 28, 30, 31, 29, 34))
str(student)
'data.frame': 7 obs. of  4 variables:
 $ name  : chr  "A" "B" "C" "D" ...
 $ gender: chr  "Male" "Female" "Female" "Male" ...
 $ age   : num  10 20 12 14 16 15 17
 $ weight: num  26 35 28 30 31 29 34
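A categorical variable such as gender is often stored as a factor rather than as plain text. The following is a sketch of one way to do that, wrapping the column in factor(). The name student_f is hypothetical and only the first three students are used.

```r
# Store gender as a factor (categorical variable) instead of character
student_f <- data.frame(
  name = c("A", "B", "C"),
  gender = factor(c("Male", "Female", "Female"))
)
class(student_f$gender)    # "factor"
levels(student_f$gender)   # "Female" "Male"
table(student_f$gender)    # counts per category
```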

Creating a data frame from vectors

A data frame can also be created from existing vectors. To construct a data frame from the above data, begin by constructing four vectors, one for each column of the data.

name <- c("A", "B", "C", "D", "E", "F", "G")
gender <- c("M", "F", "F", "M", "M", "F", "M")
age <- c(10, 20, 12, 14, 16, 15, 17)
weight <- c(26, 35, 28, 30, 31, 29, 34)

Use the data.frame() function to combine the four vectors into a single data frame.

The call below creates a data frame, assigns it to an object called student, and stores in it the values of the four variables name, gender, age and weight.

student <- data.frame(name, gender, age, weight)
class(student) # display class of a data frame
[1] "data.frame"

The names() function displays the name of each variable in a data frame.

The str() function displays the structure (type and first few values) of each variable in a data frame.

# display the name of the variables from the data frame
names(student)  
[1] "name"   "gender" "age"    "weight"
# Display the structure of a data frame
str(student)  
'data.frame': 7 obs. of  4 variables:
 $ name  : chr  "A" "B" "C" "D" ...
 $ gender: chr  "M" "F" "F" "M" ...
 $ age   : num  10 20 12 14 16 15 17
 $ weight: num  26 35 28 30 31 29 34
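The vectors passed to data.frame() need compatible lengths. The sketch below, in which age is deliberately one value short, shows how a mismatch can be caught with tryCatch(). The variable result is illustrative.

```r
name <- c("A", "B", "C")
age <- c(10, 20)   # deliberately one value short

# data.frame() signals an error when the vector lengths
# cannot be recycled to a common number of rows
result <- tryCatch(
  data.frame(name, age),
  error = function(e) conditionMessage(e)
)
result             # the error message as a character string
```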

Creating a data frame from list()

Create a list of students from the four vectors defined above, then create a data frame from the list using the data.frame() function.

# make a list from the vectors
student.list <- list(
  name = name,
  gender = gender,
  age = age,
  weight = weight
)
class(student.list)
[1] "list"
# make a data frame from list
student <- data.frame(student.list)
# display the structure of a data frame
str(student) 
'data.frame': 7 obs. of  4 variables:
 $ name  : chr  "A" "B" "C" "D" ...
 $ gender: chr  "M" "F" "F" "M" ...
 $ age   : num  10 20 12 14 16 15 17
 $ weight: num  26 35 28 30 31 29 34
# display dimension of a data frame
dim(student) 
[1] 7 4
attributes(student)
$names
[1] "name"   "gender" "age"    "weight"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7

Important functions for Data Frame

Some important functions related to data frame are as follows:

Function                    Description
str(dataframe)              Display the structure of a data frame
class(dataframe)            Display the class of a data frame
dim(dataframe)              Display the dimensions of a data frame
nrow(dataframe)             Display the number of rows in a data frame
ncol(dataframe)             Display the number of columns in a data frame
names(dataframe)            Display the names of the variables of a data frame
colnames(dataframe)         Display the column names of a data frame
rownames(dataframe)         Display the row names of a data frame
dimnames(dataframe)         Display a list with the row names and column names
is.data.frame(dataframe)    Check whether the argument is a data frame
as.data.frame(x)            Convert the argument x to a data frame
attributes(dataframe)       Access the attributes of a data frame
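A few of these functions in action, as a short sketch on a three-row data frame made up for illustration:

```r
student <- data.frame(name = c("A", "B", "C"), age = c(10, 20, 12))

nrow(student)                       # 3
ncol(student)                       # 2
colnames(student)                   # "name" "age"
is.data.frame(student)              # TRUE

# A matrix is not a data frame until it is converted
m <- as.matrix(student)
is.data.frame(m)                    # FALSE
is.data.frame(as.data.frame(m))     # TRUE
```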

Before using data from a data frame, it is good practice to check a summary of its structure. To get this summary, use the str() function.

# display structure of a data frame
str(student) 
'data.frame': 7 obs. of  4 variables:
 $ name  : chr  "A" "B" "C" "D" ...
 $ gender: chr  "M" "F" "F" "M" ...
 $ age   : num  10 20 12 14 16 15 17
 $ weight: num  26 35 28 30 31 29 34

If you apply the str() function to a data frame, it provides the following information:

  • number of observations
  • number of variables
  • name of each variable
  • mode (i.e., type of data) of each variable
  • few observations for each of the variables

The attributes() function returns the variable names, class and row names of a data frame.

attributes(student)
$names
[1] "name"   "gender" "age"    "weight"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7

The top and bottom rows of a data frame can be displayed using the head() and tail() functions respectively. By default, head() displays the first 6 rows and tail() displays the last 6 rows of a data frame.

# display top 6 rows of a data frame
head(student) 
  name gender age weight
1    A      M  10     26
2    B      F  20     35
3    C      F  12     28
4    D      M  14     30
5    E      M  16     31
6    F      F  15     29
# display top 3 rows of a data frame
head(student, 3) 
  name gender age weight
1    A      M  10     26
2    B      F  20     35
3    C      F  12     28
# display bottom 6 rows of a data frame
tail(student) 
  name gender age weight
2    B      F  20     35
3    C      F  12     28
4    D      M  14     30
5    E      M  16     31
6    F      F  15     29
7    G      M  17     34
# display bottom 3 rows of a data frame
tail(student, 3) 
  name gender age weight
5    E      M  16     31
6    F      F  15     29
7    G      M  17     34
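Both functions also accept a negative count, which drops rows instead of keeping them. A minimal sketch on a cut-down version of the student data:

```r
student <- data.frame(
  name = c("A", "B", "C", "D", "E", "F", "G"),
  age = c(10, 20, 12, 14, 16, 15, 17)
)

# all rows except the last 2 (rows 1 to 5)
head(student, -2)

# all rows except the first 2 (rows 3 to 7)
tail(student, -2)
```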

Accessing Elements of a data frame

Accessing Rows/Columns using index

Elements of a data frame can be accessed by specifying row number(s) and/or column number(s). For example:

  • df[i,] returns $i^{th}$ row of data frame df,
  • df[,j] returns $j^{th}$ column of data frame df and
  • df[i,j] returns $(i,j)^{th}$ element of data frame df.

Consider a data frame student defined above.

The R code below returns only the first row of the data frame student.

# returns 1st row of data frame
student[1, ]   
  name gender age weight
1    A      M  10     26

The R code below returns only the third column of the data frame student.

# returns 3rd column of data frame
student[, 3]   
[1] 10 20 12 14 16 15 17

The code above returns the third column of student, but as a vector rather than as a data frame. To keep the result as a data frame, use the drop = FALSE argument as follows:

# returns 3rd column of data frame
student[, 3, drop = FALSE]   
  age
1  10
2  20
3  12
4  14
5  16
6  15
7  17

The R code below returns the element in the second row and third column of the data frame student.

# returns value from 2nd row and 3rd column
# of data frame student
student[2, 3]  
[1] 20

The R code below returns rows 1 to 2 of the third column of the data frame student.

# returns the elements from the first 2
# rows and 3rd column of student data frame
student[1:2, 3] 
[1] 10 20

To extract non-adjacent rows or columns, use the c() (combine) function.

The R code below returns the first and third rows of the data frame student.

# returns the elements from first and
# third row of student data frame
student[c(1, 3), ] 
  name gender age weight
1    A      M  10     26
3    C      F  12     28

Negative indexing is used to omit specific row(s) and/or column(s). The R code below displays all rows of the data frame student except the second row.

# returns the elements from all
# rows except 2nd row
student[-2, ] 
  name gender age weight
1    A      M  10     26
3    C      F  12     28
4    D      M  14     30
5    E      M  16     31
6    F      F  15     29
7    G      M  17     34
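Row and column selections can be combined in a single call, and negative indexes work for columns as well. A short sketch on a four-row version of the student data (the names sel, no_name and rest are illustrative):

```r
student <- data.frame(
  name = c("A", "B", "C", "D"),
  gender = c("M", "F", "F", "M"),
  age = c(10, 20, 12, 14)
)

# rows 1 and 3, columns name and age only
sel <- student[c(1, 3), c("name", "age")]

# every column except the first
no_name <- student[, -1]

# every row except the first two
rest <- student[-(1:2), ]
```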

Accessing variables of data.frame

Variables (columns) of a data frame can also be accessed using column names.

A single variable can be retrieved in three different ways, and several variables can be retrieved together using the c() function.

Using square bracket

A single variable can be retrieved using square brackets with either the column index or the name of the variable.

# retrieve column no. 3 (age) of data frame
student[, 3]
[1] 10 20 12 14 16 15 17
# retrieve age column of data frame 
student[, "age"]
[1] 10 20 12 14 16 15 17
# retrieve "age" variable of data frame
student["age"]
  age
1  10
2  20
3  12
4  14
5  16
6  15
7  17

Using $ sign and the name of the variable

Any variable can be retrieved using the data frame name followed by $ symbol and the variable name.

# retrieve "age" column of data frame
student$age
[1] 10 20 12 14 16 15 17

Using double square bracket

A variable can also be retrieved by putting its name in double square brackets.

# retrieve age column of data frame 
student[["age"]]
[1] 10 20 12 14 16 15 17

Using c() function

More than one variable can be retrieved by passing the names to the concatenation function c().

# retrieve name, age and weight column of data frame 
student[, c("name", "age", "weight")]
  name age weight
1    A  10     26
2    B  20     35
3    C  12     28
4    D  14     30
5    E  16     31
6    F  15     29
7    G  17     34
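The access styles differ in what they return. As a minimal sketch on a three-row data frame: single brackets with a name keep the data frame, while [[ ]] and $ extract the underlying vector. The variable col is illustrative.

```r
student <- data.frame(name = c("A", "B", "C"), age = c(10, 20, 12))

class(student["age"])       # "data.frame" (one-column data frame)
class(student[["age"]])     # "numeric"    (plain vector)
class(student$age)          # "numeric"    (plain vector)

# [[ ]] also works when the column name is held in a variable
col <- "age"
mean(student[[col]])        # 14
```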

Accessing cases from data frame

Rows (observations) of a data frame can be accessed by specifying one or more row indexes in square brackets.

Selecting single row

A single row of a data frame can be accessed using its row index in square brackets.

# select 3rd row from data frame
student[3, ]
  name gender age weight
3    C      F  12     28

Selecting adjacent rows/cases

# select adjacent rows from data frame
student[1:3, ]
  name gender age weight
1    A      M  10     26
2    B      F  20     35
3    C      F  12     28

Selecting non-adjacent rows

Non-adjacent rows of a data frame can be selected using the concatenation function c().

student[c(1, 3, 5), ]
  name gender age weight
1    A      M  10     26
3    C      F  12     28
5    E      M  16     31

Selecting random sample of rows

Use the sample(x, size) function to select row indexes at random.

# select 3 rows randomly from the student data frame
k <- sample(nrow(student), 3)
student[k, ]
  name gender age weight
2    B      F  20     35
1    A      M  10     26
7    G      M  17     34

Note that the sample(x, size, replace = FALSE, prob = NULL) function selects a random sample of the specified size from x. Because the selection is random, the rows you get will usually differ from those shown above.
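If the same random rows are needed on every run, the random number generator can be seeded first with set.seed(). A sketch, where the seed value 123 is arbitrary and k1 and k2 are illustrative names:

```r
student <- data.frame(
  name = c("A", "B", "C", "D", "E", "F", "G"),
  age = c(10, 20, 12, 14, 16, 15, 17)
)

set.seed(123)                      # arbitrary seed value
k1 <- sample(nrow(student), 3)

set.seed(123)                      # same seed, same draw
k2 <- sample(nrow(student), 3)

identical(k1, k2)                  # TRUE
student[k1, ]                      # the same 3 rows on every run
```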

Conditional selection from a data frame

Often we need to extract the rows of a data frame that satisfy certain criteria.

For example, suppose we need to extract the data for female students only from the student data frame. In such a situation, instead of numeric indexes we can use a relational expression.

# display data for only female students
student[student$gender == "F", ] 
  name gender age weight
2    B      F  20     35
3    C      F  12     28
6    F      F  15     29

Suppose we need to extract the data for female students with age > 12 from the student data frame.

# display data for female students with age > 12
student[student$gender == "F" & student$age > 12, ] 
  name gender age weight
2    B      F  20     35
6    F      F  15     29

Suppose we need to extract the student data for which age > 12 and age <= 15.

# display data for age greater than 12 and at most 15
student[student$age > 12 & student$age <= 15, ] 
  name gender age weight
4    D      M  14     30
6    F      F  15     29
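Conditions can also be combined with | (or), and membership in a set of values can be tested with %in%. A short sketch on a cut-down version of the student data (by_name and extremes are illustrative names):

```r
student <- data.frame(
  name = c("A", "B", "C", "D", "E", "F", "G"),
  age = c(10, 20, 12, 14, 16, 15, 17)
)

# students whose name is one of A, C or E
by_name <- student[student$name %in% c("A", "C", "E"), ]

# students younger than 12 OR older than 16
extremes <- student[student$age < 12 | student$age > 16, ]
extremes$name   # "A" "B" "G"
```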

Extracting subset from data frame

Subsets of a data frame can be extracted using the subset() function. It returns the rows of a data frame that meet the specified condition.

# Display all columns for Female candidate only
subset(student, gender == "F") 
  name gender age weight
2    B      F  20     35
3    C      F  12     28
6    F      F  15     29
# Display all columns for age < 14
subset(student, age < 14) 
  name gender age weight
1    A      M  10     26
3    C      F  12     28
# Display all columns for age < 14 and gender = F
subset(student, age < 14 & gender == "F") 
  name gender age weight
3    C      F  12     28

Specific variables can be selected or deselected using the select argument of the subset() function.

# display only age and weight column for Female candidate.
subset(student, gender == "F", select = c(age, weight)) 
  age weight
2  20     35
3  12     28
6  15     29
# display all columns except age for male students
subset(student, gender == "M", select = -age) 
  name gender weight
1    A      M     26
4    D      M     30
5    E      M     31
7    G      M     34

Note that the subset() function in R acts as a filter: it keeps only the rows of a data frame that meet the specified condition.
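For comparison, the same filter can be written with square brackets. The sketch below, on a four-row version of the student data, suggests the two forms give the same result; by_subset and by_bracket are illustrative names.

```r
student <- data.frame(
  name = c("A", "B", "C", "D"),
  gender = c("M", "F", "F", "M"),
  age = c(10, 20, 12, 14),
  weight = c(26, 35, 28, 30)
)

# filter rows and pick columns with subset()
by_subset <- subset(student, gender == "F", select = c(age, weight))

# the same selection written with square brackets
by_bracket <- student[student$gender == "F", c("age", "weight")]

all.equal(by_subset, by_bracket)   # TRUE
```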

Adding or removing columns and rows of a data frame

Adding column using simple assignment

Suppose we need to add height data to the existing student data frame.

# create a new vector height
height <- c(155, 153, 165, 162, 158, 156, 168)

We can add the height column to the student data frame using the $ symbol as follows:

## Add height column to data frame
student$height <- height
student
  name gender age weight height
1    A      M  10     26    155
2    B      F  20     35    153
3    C      F  12     28    165
4    D      M  14     30    162
5    E      M  16     31    158
6    F      F  15     29    156
7    G      M  17     34    168
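The same assignment style can create a column computed from existing ones. A minimal sketch, assuming the heights are recorded in centimetres (the source does not state the unit) and using a hypothetical column name height_m:

```r
student <- data.frame(
  name = c("A", "B", "C"),
  height = c(155, 153, 165)
)

# derive a new column from an existing one
# (assumes height is in centimetres; height_m is a made-up name)
student$height_m <- student$height / 100
student$height_m   # 1.55 1.53 1.65
names(student)     # "name" "height" "height_m"
```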

Adding Column to a data frame using cbind() function

A column can also be added to an existing data frame using the cbind() function as follows:

# create a new vector result
result <- c("Pass", "Fail", "Pass", "Fail", "Pass",
            "Pass", "Pass")
student <- cbind(student, result)
student
  name gender age weight height result
1    A      M  10     26    155   Pass
2    B      F  20     35    153   Fail
3    C      F  12     28    165   Pass
4    D      M  14     30    162   Fail
5    E      M  16     31    158   Pass
6    F      F  15     29    156   Pass
7    G      M  17     34    168   Pass

Removing column from a data frame

A column can be removed from a data frame simply by assigning NULL to that column.

student$result <- NULL
student
  name gender age weight height
1    A      M  10     26    155
2    B      F  20     35    153
3    C      F  12     28    165
4    D      M  14     30    162
5    E      M  16     31    158
6    F      F  15     29    156
7    G      M  17     34    168

Adding row to a data frame using rbind() function

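# build the new row as a one-row data frame so that each column
# keeps its type (a c() vector would make every value character)
new <- data.frame(name = "H", gender = "M", age = 23,
                  weight = 40, height = 159)
student <- rbind(student, new)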
student
  name gender age weight height
1    A      M  10     26    155
2    B      F  20     35    153
3    C      F  12     28    165
4    D      M  14     30    162
5    E      M  16     31    158
6    F      F  15     29    156
7    G      M  17     34    168
8    H      M  23     40    159
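One thing to watch with rbind(): a row built with c() that mixes text and numbers is a character vector, so binding it can silently turn numeric columns into character columns even though the printed table looks the same. A sketch of the difference on a two-row data frame (bad and good are illustrative names):

```r
student <- data.frame(name = c("A", "B"), age = c(10, 20))

# c("H", 23) is a character vector, so age becomes character
bad <- rbind(student, c("H", 23))
class(bad$age)     # "character"

# a one-row data frame keeps the type of each column
good <- rbind(student, data.frame(name = "H", age = 23))
class(good$age)    # "numeric"
mean(good$age)     # numeric operations still work
```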

Removing rows from a data frame

Rows can be removed from a data frame by using a negative row index. To remove several rows at once, wrap the indexes in the c() function as follows:

student <- student[-c(7, 8), ]
student
  name gender age weight height
1    A      M  10     26    155
2    B      F  20     35    153
3    C      F  12     28    165
4    D      M  14     30    162
5    E      M  16     31    158
6    F      F  15     29    156
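Rows can also be removed by condition rather than by position. The sketch below, on a four-row data frame made up for illustration, keeps only the rows that do not match and then resets the row names.

```r
student <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(10, 20, 12, 14)
)

# keep only the rows whose name is NOT B or D
student <- student[!(student$name %in% c("B", "D")), ]
rownames(student)          # "1" "3" (original row numbers are kept)

# reset the row names to 1, 2, ...
rownames(student) <- NULL
rownames(student)          # "1" "2"
```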

Endnote

In this tutorial, you learned what a data frame is in R, how to create a data frame, and how to access the elements of a data frame using different methods.

Hope you enjoyed learning data frame in R. The content is more than sufficient to understand data frame in R and how to perform various operations on data frame in R.
