Session 4. The tidyverse

Questions

What kind of pre-processing is required before data analysis?
How can we do data cleaning/pre-processing using R?

Learning Objectives

Introduce the concept of tidy data
Learn to work directly with data frames
Learn the dplyr package for manipulating data frames

Data wrangling

Data wrangling is the process of transforming and mapping data from one “raw” data form (e.g., web pages, tweets, unstructured tables, or PDFs) into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure the quality and usability of the data. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.

The tidyverse

We will focus on a specific data format referred to as tidy and on a specific collection of packages that are particularly helpful for working with tidy data, referred to as the tidyverse.

Load all the tidyverse packages at once by installing and loading the tidyverse package:

library(tidyverse)

The core tidyverse packages loaded with this command are:

ggplot2, for data visualization
dplyr, for data manipulation
tidyr, for data tidying
readr, for data import
purrr, for functional programming
tibble, for tibbles, a modern re-imagining of data frames
stringr, for strings
forcats, for factors
lubridate, for date/times

Tidy data

We say that a data table is in tidy format if:

Each variable has its own column
Each observation has its own row
Each value must have its own cell

The murders dataset is an example of a tidy data frame.

       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65
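To make the three rules concrete, here is a minimal sketch that contrasts a table that is not tidy with its tidy counterpart. The `wide` table and its values are made up for illustration, and `pivot_longer` (from tidyr) is not covered in this session; it is used here only as one possible way to do the reshaping.

```r
library(tidyr)

# Hypothetical wide table: one column per year, so the variable "year"
# is spread across the column names instead of having its own column.
wide <- data.frame(
  country = c("A", "B"),
  `2010` = c(10, 20),
  `2011` = c(15, 25),
  check.names = FALSE
)

# Reshape so that each row is one (country, year) observation
# and each variable (country, year, cases) has its own column.
tidy <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "cases")
tidy
```

In the tidy version, adding another year adds rows rather than columns, which is the layout the dplyr functions below expect.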

Manipulating data frames

The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames. We’ll check three of them:

mutate: change the data table by adding a new column
filter: filter the data table to a subset of rows
select: subset the data by selecting specific columns

`mutate`

The function mutate takes the data frame as its first argument and the name and values of the new variable as its second argument, using the convention name = values. So, to add the murder rate to the murders data frame:

library(dslabs)
murders <- mutate(murders, rate = total / population * 100000)

Notice that here we used total and population inside the function, which are objects that are not defined in our workspace. But why don’t we get an error?

This is one of dplyr’s main features. Functions in this package, such as mutate, know to look for variables in the data frame provided in the first argument. In the call to mutate above, total will have the values in murders$total. This approach makes the code much more readable.
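As a rough illustration of this lookup, the sketch below computes the same column with mutate and with plain base R. The `toy` data frame is made up for the example; only its column names mirror the murders table.

```r
library(dplyr)

# Hypothetical toy data; the column names mirror the murders table.
toy <- data.frame(
  state = c("A", "B"),
  population = c(1000000, 2000000),
  total = c(10, 50)
)

# dplyr style: total and population are looked up inside toy.
with_mutate <- mutate(toy, rate = total / population * 100000)

# Base R style: every column has to be prefixed with the data frame name.
with_base <- toy
with_base$rate <- with_base$total / with_base$population * 100000

# Both approaches produce the same rate column.
all.equal(with_mutate$rate, with_base$rate)
```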

`filter`

The filter function takes the data table as the first argument and then the conditional statement as the second. Like mutate, we can use the unquoted variable names from murders inside the function and it will know we mean the columns and not objects in the workspace.

To subset the table to the rows with a murder rate at or below 0.71:

filter(murders, rate <= 0.71)
##           state abb        region population total      rate
## 1        Hawaii  HI          West    1360301     7 0.5145920
## 2          Iowa  IA North Central    3046355    21 0.6893484
## 3 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 4  North Dakota  ND North Central     672591     4 0.5947151
## 5       Vermont  VT     Northeast     625741     2 0.3196211
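filter also accepts several conditions at once. The sketch below, on a made-up `toy` data frame, shows that conditions separated by commas must all hold, and compares the result with base R logical indexing.

```r
library(dplyr)

# Hypothetical toy data for illustration.
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("West", "South", "West", "South"),
  rate = c(0.5, 0.6, 2.0, 3.1)
)

# Conditions separated by commas are combined with a logical AND.
low_west <- filter(toy, rate <= 0.71, region == "West")

# The same subset with base R logical indexing.
low_west_base <- toy[toy$rate <= 0.71 & toy$region == "West", ]

low_west$state
```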

`select`

If we want to view only a subset of columns, we can use the select function. In the code below we select three columns, assign this to a new object and then filter the new object:

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
##           state        region      rate
## 1        Hawaii          West 0.5145920
## 2          Iowa North Central 0.6893484
## 3 New Hampshire     Northeast 0.3798036
## 4  North Dakota North Central 0.5947151
## 5       Vermont     Northeast 0.3196211

In the call to select, the first argument murders is an object, but state, region, and rate are variable names.

Exercises

Using dplyr functions, write code that performs the following actions on the murders table:

Show only the states and population sizes
Show just the New York row
Remove Florida
Show the data from New York and Connecticut

Answers

select(murders, state, population)
filter(murders, state == "New York")
filter(murders, state != "Florida")
filter(murders, state %in% c("New York", "Connecticut"))

The pipe: `|>` or `%>%`

In R we can perform a series of operations, for example select and then filter, by sending the results of one function to another using what is called the pipe operator: %>%. Since R version 4.1.0, you can also use |>.

For example, to show three variables (state, region, rate) for states that have a murder rate at or below 0.71, we wrote this code:

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)

For such an operation, we can use the pipe |> instead:

murders |> 
    select(state, region, rate) |> 
    filter(rate <= 0.71)
##           state        region      rate
## 1        Hawaii          West 0.5145920
## 2          Iowa North Central 0.6893484
## 3 New Hampshire     Northeast 0.3798036
## 4  North Dakota North Central 0.5947151
## 5       Vermont     Northeast 0.3196211

In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. So the two statements below are equivalent:

16 |> 
    sqrt() |> 
    log2()
## [1] 2
log2(sqrt(16))
## [1] 2

The pipe sends values to the first argument, so we can define the other arguments as if the first argument were already defined. Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument, since the dplyr functions we have described all take the data as the first argument.
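A small sketch of this idea with a base R function: the piped value fills the first argument of round, and digits is supplied as usual. The vector `x` is made up for the example.

```r
# Requires R >= 4.1.0 for the native pipe.
x <- c(1.234, 5.678)

# The piped value fills the first argument; digits is given by name.
piped <- x |> round(digits = 1)

# The equivalent call without the pipe.
nested <- round(x, digits = 1)

identical(piped, nested)
```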

The placeholder

One of the advantages of using the pipe |> is that we do not have to keep naming new objects as we manipulate the data frame. The object on the left-hand side of the pipe is used as the first argument of the function on the right-hand side of the pipe. If you want to use the passed object as an argument other than the first, you can use the placeholder _ (for the %>% pipe the placeholder is .). The _ placeholder requires R 4.2.0 or later and must be passed to a named argument. Below is a simple example that passes the piped value as the base argument of the log function. The following three are equivalent:

log(8, base = 2)
2 |> log(8, base = _)
2 %>% log(8, base = .)
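A more typical use of the placeholder is a function whose data argument is not the first one. The sketch below assumes a made-up data frame `toy` and uses base R's lm, which takes the formula first and the data through the data argument.

```r
# Requires R >= 4.2.0 for the `_` placeholder.
# Hypothetical toy data lying exactly on the line y = 2 * x + 1.
toy <- data.frame(x = 1:5)
toy$y <- 2 * toy$x + 1

# lm() expects the formula first, so the piped data frame
# is routed to the named argument `data`.
fit <- toy |> lm(y ~ x, data = _)
coef(fit)
```

With the %>% pipe, the same call would be written with . in place of _.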

Summarizing data

We will cover two new dplyr functions that can be useful for summarizing data: summarize and group_by.

`summarize`

We start with a simple example based on heights. The heights dataset (from the dslabs package) includes heights and sex reported by students in an in-class survey.

The following code computes the average and standard deviation for females:

s <- heights |> 
  filter(sex == "Female") |>
  summarize(average = mean(height), 
            standard_deviation = sd(height))
s
##    average standard_deviation
## 1 64.93942           3.760656

This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights. We get to choose the names of the columns of the resulting table.

Because the resulting table stored in s is a data frame, we can access the components with the accessor $:

s$average
## [1] 64.93942
s$standard_deviation
## [1] 3.760656

`group_by`

A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately.

heights |> 
  group_by(sex) |>
  summarize(average = mean(height), 
            standard_deviation = sd(height))
## # A tibble: 2 × 3
##   sex    average standard_deviation
##   <fct>    <dbl>              <dbl>
## 1 Female    64.9               3.76
## 2 Male      69.3               3.61

The summarize function applies the summarization to each group separately.
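To see the per-group behavior on numbers that are easy to check by hand, here is a minimal sketch with a made-up `toy` data frame. The helper n(), which counts the rows in each group, is a dplyr function that is not otherwise covered in this session.

```r
library(dplyr)

# Hypothetical toy data: two observations per group.
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 72)
)

# One output row per group: the group size and the group mean.
by_sex <- toy |>
  group_by(sex) |>
  summarize(n = n(), average = mean(height))

by_sex
```

The result has one row for Female (average 62) and one for Male (average 70), each computed only from the rows in that group.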

Sorting data frames

When examining a dataset, it is often convenient to sort the table by different columns. We can use the order and sort functions for that. However, for ordering entire tables, the dplyr function arrange can be more useful.

## Order the states by population size
murders |>
  arrange(population) |>
  head()
##                  state abb        region population total       rate
## 1              Wyoming  WY          West     563626     5  0.8871131
## 2 District of Columbia  DC         South     601723    99 16.4527532
## 3              Vermont  VT     Northeast     625741     2  0.3196211
## 4         North Dakota  ND North Central     672591     4  0.5947151
## 5               Alaska  AK          West     710231    19  2.6751860
## 6         South Dakota  SD North Central     814180     8  0.9825837

With arrange we get to decide which column to sort by.

## The states by murder rate, from lowest to highest
murders |> 
  arrange(rate) |>
  head()
##           state abb        region population total      rate
## 1       Vermont  VT     Northeast     625741     2 0.3196211
## 2 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 3        Hawaii  HI          West    1360301     7 0.5145920
## 4  North Dakota  ND North Central     672591     4 0.5947151
## 5          Iowa  IA North Central    3046355    21 0.6893484
## 6         Idaho  ID          West    1567582    12 0.7655102

Note

Check the examples of order and sort functions:

## Sort the regions: by factor level order, then alphabetically
sort(murders$region)
sort(as.character(murders$region))

## Order the states by population size, from smallest to largest
ind <- order(murders$population)
murders$state[ind]
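The index returned by order can also be used to reorder a whole table, which is what arrange does in one step. The sketch below compares the two on a made-up `toy` data frame.

```r
library(dplyr)

# Hypothetical toy data for illustration.
toy <- data.frame(
  state = c("A", "B", "C"),
  population = c(300, 100, 200)
)

# order() returns the row positions that would sort the vector.
ind <- order(toy$population)
ind

# Base R: use the index to reorder the rows of the table.
sorted_base <- toy[ind, ]

# dplyr: arrange does the same in one call.
sorted_dplyr <- arrange(toy, population)

identical(sorted_base$state, sorted_dplyr$state)
```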

Nested sorting

You can use multiple columns to order the table.

murders |>
    arrange(region, rate) |>
    head()
##           state abb    region population total      rate
## 1       Vermont  VT Northeast     625741     2 0.3196211
## 2 New Hampshire  NH Northeast    1316470     5 0.3798036
## 3         Maine  ME Northeast    1328361    11 0.8280881
## 4  Rhode Island  RI Northeast    1052567    16 1.5200933
## 5 Massachusetts  MA Northeast    6547629   118 1.8021791
## 6      New York  NY Northeast   19378102   517 2.6679599

Exercises

Using the dplyr functions and the pipe operator, summarize the murder rate for each region in the murders table.

Answers

murders |> 
    group_by(region) |> 
    summarize(rate_by_region = sum(total)/sum(population) * 100000) |> 
    arrange(rate_by_region)
## # A tibble: 4 × 2
##   region        rate_by_region
##   <fct>                  <dbl>
## 1 Northeast               2.66
## 2 West                    2.66
## 3 North Central           2.73
## 4 South                   3.63

Tibbles

tibble vs. data.frame

The tbl, pronounced tibble, is a special kind of data frame. The functions group_by and summarize always return this type of data frame. For consistency, the dplyr manipulation verbs (select, filter, mutate, and arrange) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble.

Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are three important differences between them:

1. Tibbles display better

The print method for tibbles is more readable than that of a data frame.

2. Subsets of tibbles are tibbles

If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or scalar. For example:

class(murders[,4])
## [1] "numeric"

is not a data frame. With tibbles this does not happen:

class(as_tibble(murders)[,4])
## [1] "tbl_df"     "tbl"        "data.frame"

This is useful in the tidyverse since functions require data frames as input.

With tibbles, if you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor $:

class(as_tibble(murders)$population)
## [1] "numeric"
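The subsetting behavior can be checked on a small made-up data frame. The sketch below also shows two related options: drop = FALSE, which keeps a regular data frame from collapsing to a vector, and [[ ]], which extracts a column vector from a tibble just like $.

```r
library(tibble)

# Hypothetical toy data for illustration.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
tb <- as_tibble(df)

class(df[, 1])                # a plain vector, not a data frame
class(df[, 1, drop = FALSE])  # drop = FALSE keeps it a data frame
class(tb[, 1])                # a one-column tibble
class(tb[["a"]])              # [[ ]] extracts the column vector, like $
```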

A related feature is that tibbles will give you a warning if you try to access a column that does not exist. If we accidentally write Population instead of population, this:

murders$Population
## NULL

returns a NULL with no warning, which can make it harder to debug. In contrast, if we try this with a tibble we get an informative warning:

as_tibble(murders)$Population
## Warning: Unknown or uninitialised column: `Population`.
## NULL

3. Tibbles can have complex entries

While data frame columns are typically vectors of numbers, strings, or logical values, tibbles make it easy to store more complex objects, such as lists or functions. Also, we can create tibbles with functions:

tibble(id = c(1, 2, 3), func = c(mean, median, sd))
## # A tibble: 3 × 2
##      id func  
##   <dbl> <list>
## 1     1 <fn>  
## 2     2 <fn>  
## 3     3 <fn>
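One possible use of such a column is to call each stored function on the same input. This is a minimal sketch; the tibble `tb` and the vector `x` are made up for the example.

```r
library(tibble)

# c() on functions produces a list, so func becomes a list column.
tb <- tibble(name = c("mean", "median"), func = c(mean, median))

x <- c(1, 2, 3, 10)

# Call each stored function on x and collect the numeric results.
results <- vapply(tb$func, function(f) f(x), numeric(1))
results
```

Here the first result is the mean of x (4) and the second is its median (2.5).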

Create a tibble

To create a data frame in the tibble format, use the tibble function.

grades <- tibble(
    names = c("John", "Juan", "Jean", "Yao"), 
    exam_1 = c(95, 80, 90, 85), 
    exam_2 = c(90, 85, 85, 90))

Note that base R (without packages loaded) has a function with a similar purpose, data.frame, that creates a regular data frame rather than a tibble.

grades <- data.frame(
    names = c("John", "Juan", "Jean", "Yao"), 
    exam_1 = c(95, 80, 90, 85), 
    exam_2 = c(90, 85, 85, 90))

To convert a regular data frame to a tibble, you can use the as_tibble function.

as_tibble(grades) |> class()
## [1] "tbl_df"     "tbl"        "data.frame"

References
https://en.wikipedia.org/wiki/Data_wrangling
http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html
https://datacarpentry.org/R-ecology-lesson/03-dplyr.html