Session 4. The tidyverse
Questions
- What kind of pre-processing is required before data analysis?
- How can we do data cleaning/pre-processing using R?
Learning Objectives
- Introduce the concept of tidy data
- Learn to work directly with data frames
- Learn the dplyr package for manipulating data frames
Data wrangling
Data wrangling is the process of transforming and mapping data from a “raw” form (e.g., web pages, tweets, unstructured tables, or PDFs) into another format that is more appropriate and valuable for downstream purposes such as analytics. The goal of data wrangling is to ensure the quality and usability of the data. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.
The tidyverse
We will focus on a specific data format, referred to as tidy, and on a specific collection of packages that are particularly helpful for working with tidy data, referred to as the tidyverse.
Load all the tidyverse packages at once by installing and loading the tidyverse package:
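A minimal sketch of this step. The installation line is commented out because it only needs to run once and requires an internet connection:

```r
# One-time installation from CRAN (uncomment on first use)
# install.packages("tidyverse")

# Attach the core tidyverse packages for the current session
library(tidyverse)
```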
The core tidyverse packages loaded with this command are:
- ggplot2, for data visualization
- dplyr, for data manipulation
- tidyr, for data tidying
- readr, for data import
- purrr, for functional programming
- tibble, for tibbles, a modern re-imagining of data frames
- stringr, for strings
- forcats, for factors
- lubridate, for date/times
Tidy data
We say that a data table is in tidy format if:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
The murders dataset is an example of a tidy data frame.
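Its first rows can be displayed with head. This sketch assumes that murders comes from the dslabs package, as in the book listed under Reference:

```r
library(dslabs)  # assumed source of the murders dataset

# Show the first six rows of the data frame
head(murders)
```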
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65
Manipulating data frames
The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames. We’ll check three of them:
- mutate: change the data table by adding a new column
- filter: filter the data table to a subset of rows
- select: subset the data by selecting specific columns
mutate
The function mutate takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values. So, if we want to add the murder rate to the murders data frame:
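One way to write this call is sketched here. It assumes that murders comes from the dslabs package and that the rate is expressed per 100,000 people, which matches the rate values printed later in this session:

```r
library(dplyr)
library(dslabs)  # assumed source of the murders dataset

# Add a column named rate: murders per 100,000 people (assumed scaling)
murders <- mutate(murders, rate = total / population * 100000)

# Inspect the result; the new column appears last
head(murders)
```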
Notice that here we used total and population inside the function, which are objects that are not defined in our workspace. But why don’t we get an error?
This is one of dplyr’s main features. Functions in this package, such as mutate, know to look for variables in the data frame provided in the first argument. In the call to mutate above, total will have the values in murders$total. This approach makes the code much more readable.
filter
The filter function takes the data table as the first argument and then the conditional statement as the second. Like mutate, we can use the unquoted variable names from murders inside the function and it will know we mean the columns and not objects in the workspace.
To keep only the rows with a murder rate of 0.71 or lower:
filter(murders, rate <= 0.71)
##           state abb        region population total      rate
## 1        Hawaii  HI          West    1360301     7 0.5145920
## 2          Iowa  IA North Central    3046355    21 0.6893484
## 3 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 4  North Dakota  ND North Central     672591     4 0.5947151
## 5       Vermont  VT     Northeast     625741     2 0.3196211
select
If we want to view only a subset of columns, we can use the select function. In the code below we select three columns, assign this to a new object and then filter the new object:
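A self-contained sketch of these two steps. The rate column is recreated first, assuming the dslabs murders data and a rate per 100,000 people:

```r
library(dplyr)
library(dslabs)  # assumed source of the murders dataset

# Recreate the rate column so the block runs on its own
murders <- mutate(murders, rate = total / population * 100000)

# Keep three columns and store the result in a new object
new_table <- select(murders, state, region, rate)

# Filter the stored table
filter(new_table, rate <= 0.71)
```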
In the call to select, the first argument murders is an object, but state, region, and rate are variable names.
The pipe: |> or %>%
In R we can perform a series of operations, for example select and then filter, by sending the results of one function to another using what is called the pipe operator: %>%. Since R version 4.1.0, you can also use |>.
For example, to show three variables (state, region, rate) for states that have a murder rate of 0.71 or lower, we wrote this code:
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
For such an operation, we can use the pipe |> instead:
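A sketch of the piped version, under the same assumptions as before (dslabs murders data, rate per 100,000 people). The object name low_rate is introduced here only so the result can be inspected:

```r
library(dplyr)
library(dslabs)  # assumed source of the murders dataset

# Recreate the rate column so the block runs on its own
murders <- mutate(murders, rate = total / population * 100000)

# Select three columns, then filter, without naming an intermediate table
low_rate <- murders |>
  select(state, region, rate) |>
  filter(rate <= 0.71)
low_rate
```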
In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. So the following two statements are equivalent:
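As an illustration, the pair below is one possible example of such equivalent statements. The object names by_pipe and by_call are hypothetical, and murders is assumed to come from dslabs:

```r
library(dplyr)
library(dslabs)  # assumed source of the murders dataset

# With the pipe: murders is passed in as the first argument of select
by_pipe <- murders |> select(state, region)

# Without the pipe: murders is written explicitly as the first argument
by_call <- select(murders, state, region)

# Check that both statements produce the same result
identical(by_pipe, by_call)
```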
The pipe sends values to the first argument, so we can define the other arguments as if the first argument were already defined. Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument, since the dplyr functions we have described all take the data as the first argument.
The placeholder
One of the advantages of using the pipe |> is that we do not have to keep naming new objects as we manipulate the data frame. The object on the left-hand side of the pipe is used as the first argument of the function on the right-hand side of the pipe. If you want to use the passed object as an argument other than the first, you can use the placeholder operator _ (for the %>% pipe, the placeholder is .). Below is a simple example that passes the base argument to the log function. The following three are equivalent:
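A sketch of the three forms, each computing the base-2 logarithm of 8. Note that the _ placeholder of the native pipe needs R 4.2.0 or later and must be passed to a named argument. The %>% pipe is made available here by loading dplyr:

```r
library(dplyr)  # makes the %>% pipe available

# Native pipe: 2 is passed to the named argument base
2 |> log(8, base = _)

# magrittr pipe: 2 is passed wherever the dot appears
2 %>% log(8, base = .)

# No pipe: the same call written directly
log(8, base = 2)
```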
Summarizing data
We will cover two new dplyr functions that can be useful for summarizing data: summarize and group_by.
summarize
We start with a simple example based on heights. The heights dataset (from the dslabs package) includes heights and sex reported by students in an in-class survey.
The following code computes the average and standard deviation for females:
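A sketch of this computation. The column names average and standard_deviation and the object name s follow the accessor calls shown later in this section:

```r
library(dplyr)
library(dslabs)  # provides the heights dataset

# Keep only females, then collapse the table to two summary values
s <- heights |>
  filter(sex == "Female") |>
  summarize(average = mean(height), standard_deviation = sd(height))
s
```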
This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights. We get to choose the names of the columns of the resulting table.
Because the resulting table stored in s is a data frame, we can access the components with the accessor $:
s$average
## [1] 64.93942
s$standard_deviation
## [1] 3.760656
group_by
A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately.
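This can be sketched by grouping the data with group_by before calling summarize. The object name by_sex is hypothetical:

```r
library(dplyr)
library(dslabs)  # provides the heights dataset

# Split the rows by sex, then compute the summaries within each group
by_sex <- heights |>
  group_by(sex) |>
  summarize(average = mean(height), standard_deviation = sd(height))
by_sex
```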
The summarize function applies the summarization to each group separately.
Sorting data frames
When examining a dataset, it is often convenient to sort the table by different columns. We can use the order and sort functions for that. However, for ordering entire tables, the dplyr function arrange can be more useful.
## Order the states by population size
murders |>
  arrange(population) |>
  head()
##                  state abb        region population total       rate
## 1              Wyoming  WY          West     563626     5  0.8871131
## 2 District of Columbia  DC         South     601723    99 16.4527532
## 3              Vermont  VT     Northeast     625741     2  0.3196211
## 4         North Dakota  ND North Central     672591     4  0.5947151
## 5               Alaska  AK          West     710231    19  2.6751860
## 6         South Dakota  SD North Central     814180     8  0.9825837
With arrange we get to decide which column to sort by.
## The states by murder rate, from lowest to highest
murders |>
  arrange(rate) |>
  head()
##           state abb        region population total      rate
## 1       Vermont  VT     Northeast     625741     2 0.3196211
## 2 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 3        Hawaii  HI          West    1360301     7 0.5145920
## 4  North Dakota  ND North Central     672591     4 0.5947151
## 5          Iowa  IA North Central    3046355    21 0.6893484
## 6         Idaho  ID          West    1567582    12 0.7655102
Nested sorting
You can use multiple columns to order the table.
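For instance, passing two columns to arrange sorts by the first and breaks ties with the second. The sketch below orders by region and then by rate, which is an assumption inferred from the printed result. The object name top is hypothetical:

```r
library(dplyr)
library(dslabs)  # assumed source of the murders dataset

# Recreate the rate column so the block runs on its own
murders <- mutate(murders, rate = total / population * 100000)

# Sort by region first, then by murder rate within each region
top <- murders |>
  arrange(region, rate) |>
  head()
top
```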
##           state abb    region population total      rate
## 1       Vermont  VT Northeast     625741     2 0.3196211
## 2 New Hampshire  NH Northeast    1316470     5 0.3798036
## 3         Maine  ME Northeast    1328361    11 0.8280881
## 4  Rhode Island  RI Northeast    1052567    16 1.5200933
## 5 Massachusetts  MA Northeast    6547629   118 1.8021791
## 6      New York  NY Northeast   19378102   517 2.6679599
Tibbles
tibble vs. data.frame
The tbl, pronounced tibble, is a special kind of data frame. The function group_by always returns this type of data frame, as does summarize when applied to grouped data. For consistency, the dplyr manipulation verbs (select, filter, mutate, and arrange) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble.
Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are three important differences between them:
1. Tibbles display better
The print method for tibbles is more readable than that of a data frame.
2. Subsets of tibbles are tibbles
If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or scalar. For example:
class(murders[,4])
## [1] "numeric"
is not a data frame. With tibbles this does not happen:
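A sketch of the same subsetting applied to a tibble, assuming murders comes from dslabs:

```r
library(tibble)
library(dslabs)  # assumed source of the murders dataset

# Subsetting one column of a tibble with [ still returns a tibble
class(as_tibble(murders)[, 4])
## [1] "tbl_df"     "tbl"        "data.frame"
```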
This is useful in the tidyverse since functions require data frames as input.
With tibbles, if you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor $:
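For example, again assuming the dslabs murders data:

```r
library(tibble)
library(dslabs)  # assumed source of the murders dataset

# The $ accessor extracts the column as a plain vector
class(as_tibble(murders)$population)
## [1] "numeric"
```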
A related feature is that tibbles will give you a warning if you try to access a column that does not exist. If we accidentally write Population instead of population, this:
murders$Population
## NULL
returns a NULL with no warning, which can make it harder to debug. In contrast, if we try this with a tibble we get an informative warning:
as_tibble(murders)$Population
## Warning: Unknown or uninitialised column: `Population`.
## NULL
3. Tibbles can have complex entries
While data frame columns need to be vectors of numbers, strings, or logical values, tibbles can have more complex objects, such as lists or functions. Also, we can create tibbles with functions:
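A small sketch of a tibble whose second column holds functions. The object name tab and the column names id and func are chosen for illustration:

```r
library(tibble)

# c() applied to functions builds a list, which tibble stores as a list column
tab <- tibble(id = c(1, 2, 3), func = c(mean, median, sd))
tab

# Each entry of the list column is still a callable function
tab$func[[1]](c(1, 2, 3))
```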
Create a tibble
To create a data frame in the tibble format, use the tibble function.
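A sketch using the same grades data that appears in the data.frame example in this section:

```r
library(tibble)

# Each argument becomes a column, written as name = values
grades <- tibble(
  names = c("John", "Juan", "Jean", "Yao"),
  exam_1 = c(95, 80, 90, 85),
  exam_2 = c(90, 85, 85, 90))
grades
```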
Note that base R (without packages loaded) has a similar function, data.frame, that can be used to create a regular data frame rather than a tibble.
grades <- data.frame(
  names = c("John", "Juan", "Jean", "Yao"),
  exam_1 = c(95, 80, 90, 85),
  exam_2 = c(90, 85, 85, 90))
To convert a regular data frame to a tibble, you can use the as_tibble function.
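A sketch of the conversion, rebuilding the grades data frame so the block runs on its own. The object name grades_tbl is hypothetical:

```r
library(tibble)

# A regular data frame created with base R
grades <- data.frame(
  names = c("John", "Juan", "Jean", "Yao"),
  exam_1 = c(95, 80, 90, 85),
  exam_2 = c(90, 85, 85, 90))

# Convert it to a tibble and check the class of the result
grades_tbl <- as_tibble(grades)
class(grades_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"
```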
Reference
https://en.wikipedia.org/wiki/Data_wrangling
http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html
https://datacarpentry.org/R-ecology-lesson/03-dplyr.html