Session 5_2. Reproducible projects with R Markdown

Questions

  • What is R Markdown?
  • How can I integrate my R code with text and plots?
  • How can I convert .Rmd files to .html?

Learning Objectives

  • Create a .Rmd document containing R code, text, and plots
  • Create a YAML header to control output
  • Understand basic syntax of (R)Markdown
  • Customize code chunks to control formatting

Introduction to R Markdown

R Markdown is a flexible type of document that allows you to seamlessly combine executable R code, and its output, with text in a single document. These documents can be readily converted to multiple static and dynamic output formats, including PDF (.pdf), Word (.docx), and HTML (.html).

Creating an R Markdown file

To create a new R Markdown document in RStudio, click File -> New File -> R Markdown:

Screenshot of the New R Markdown file dialogue box in RStudio

Then click on ‘Create Empty Document’. Normally you could enter the title of your document, your name (Author), and select the type of output, but we will be learning how to start from a blank document.

Basic components of R Markdown

To control the output, a YAML header is needed:

---
title: "My Awesome Report"
author: "Alison Goff"
date: ""
output: html_document
---

The header is defined by the three hyphens at the beginning (---) and the three hyphens at the end (---). In this header, the only required field is the output:, which specifies the type of output you want. This can be an html_document, a pdf_document, or a word_document. We will use an HTML document for now.

You can delete any of the other fields you don’t need. The body of the document begins on the line after the end of the YAML header (i.e. after the second ---).

YAML is a human-readable data serialization language that is often used for writing configuration files. Depending on whom you ask, YAML stands for yet another markup language or YAML ain’t markup language (a recursive acronym), which emphasizes that YAML is for data, not documents.

YAML files are easy to read because they use indentation to determine the structure and indicate nesting. Tab characters are not allowed by design, to maintain portability across systems, so literal space characters are used for indentation instead.

Comments start with a pound or hash symbol (#). Commenting is good practice, because comments describe the intention of the code. YAML does not support multi-line comments, so each comment line must begin with the # character.
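To see how indentation and comments are interpreted, here is a minimal sketch that parses a small header from R. It assumes the yaml package is installed (rmarkdown depends on it); the header contents are illustrative, with toc being a real html_document option used here only as an example of nesting.

```r
# Assumption: the 'yaml' package is available (installed alongside rmarkdown).
library(yaml)

# An illustrative header: indentation (spaces, never tabs) expresses nesting,
# and everything after a # is a comment that the parser ignores.
header <- '
title: "My Awesome Report"   # this comment is ignored
output:
  html_document:             # nested under output
    toc: true                # nested under html_document
'

parsed <- yaml.load(header)

# The nesting in the text becomes nesting in the resulting R list.
str(parsed)
```

The printed structure shows output as a list containing html_document, which in turn contains toc, mirroring the indentation levels in the header.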

Markdown syntax

Markdown is a popular markup language that allows you to add formatting elements to text, such as bold, italics, and code.

Headers

A # in front of text indicates to Markdown that this text is a heading. Adding more #s makes the heading smaller: one # is a first-level heading, two #s a second-level heading, and so on, up to a sixth-level heading.

# Title
## Section
### Sub-section
#### Sub-sub section
##### Sub-sub-sub section
###### Sub-sub-sub-sub section

What is a markup language?

Markup languages are computer languages that are used to structure, format, or define relationships between different parts of text documents with the help of symbols or tags inserted in the document.

Bold and Italics

  • You can make text bold by surrounding it with double asterisks, **bold**, or double underscores, __bold__.
  • You can make text italic using single asterisks, *italics*, or single underscores, _italics_.
  • You can also combine bold and italics to write something really important with triple asterisks, ***really***, triple underscores, ___really___, or a combination of asterisks and underscores, **_really_** or _**really**_.

More

To create code-type font, surround the word with backticks, `code-type`. You can also create an unordered list (starting each item with -, +, or *), an ordered list (using numbers), and nested items (using tab-indenting). For more Markdown syntax see the following reference guide.

Rendering

You can render the document into HTML by clicking the Knit button at the top of the Source pane (top left). The knit function takes an input file, extracts the R code in it according to a list of patterns, evaluates the code, and writes the output to another file. If you haven’t saved the document yet, you will be prompted to do so when you knit for the first time.
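The knitting step can also be run from the console. The following is a minimal sketch, assuming the knitr package is installed; the file contents and the chunk name sum-chunk are made up for illustration. It calls knitr::knit() on a temporary .Rmd file and prints the Markdown file that knit() writes. Converting that Markdown on to HTML is the additional step the Knit button performs through rmarkdown::render(), which also requires Pandoc.

```r
library(knitr)

# Build the three-backtick fence programmatically to keep this example
# easy to copy and paste.
fence <- strrep("`", 3)

# A tiny, hypothetical R Markdown file written to a temporary location.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "# A tiny report",
  "",
  paste0(fence, "{r sum-chunk}"),
  "1 + 1",
  fence
), rmd_file)

# knit() extracts the chunk, evaluates it, and writes the code together
# with its output to a Markdown file; it returns the path of that file.
md_file <- knit(rmd_file, output = tempfile(fileext = ".md"), quiet = TRUE)

# Inspect the knitted Markdown.
knitted <- readLines(md_file)
writeLines(knitted)
```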

Writing an R Markdown report

You need to load both packages and data within your R Markdown document. It is not enough to load them from the console, because the document is knitted in a fresh R session. To load these, we will need to create a ‘code chunk’ at the top of our document (below the YAML header).

A code chunk can be inserted by clicking Code > Insert Chunk, or by using the keyboard shortcuts Ctrl+Alt+I on Windows and Linux, and Cmd+Option+I on Mac.

The syntax of a code chunk is:

```{r chunk-name}
"Here is where you place the R code that you want to run."
```

An R Markdown document knows that this text is not part of the report from the three backticks (```) that begin and end the chunk. It also knows that the code inside the chunk is R code from the r inside the curly braces ({}). After the r you can add a name for the code chunk. Naming a chunk is optional, but recommended. Each chunk name must be unique and contain only alphanumeric characters and hyphens (-).
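The naming rules above can be checked with a few lines of base R. This is only a sketch: check_chunk_names() is a hypothetical helper written for this example, not a knitr function, and the chunk names passed to it are made up.

```r
# Hypothetical helper (not part of knitr): flags chunk names that break the
# rules described above.
check_chunk_names <- function(chunk_names) {
  # Only letters, digits, and hyphens are accepted.
  valid_chars <- grepl("^[A-Za-z0-9-]+$", chunk_names)
  # Mark every occurrence of a repeated name, not just the later ones.
  repeated <- duplicated(chunk_names) | duplicated(chunk_names, fromLast = TRUE)
  data.frame(name = chunk_names, valid_chars = valid_chars, unique = !repeated)
}

result <- check_chunk_names(c("setup", "plot-1", "my chunk", "setup"))
print(result)
```

Here "my chunk" is flagged for containing a space, and both occurrences of "setup" are flagged as not unique.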

To load packages (e.g., tidyverse) and the surveys data table (from session3_2), we will insert a chunk and call it ‘setup’. Since we don’t want this code or the output to show in our knitted HTML document, we add an include = FALSE option after the code chunk name ({r setup, include = FALSE}).

library(tidyverse)
surveys <- read_csv("data/portal_data_joined.csv")

Important

The file paths you give in a .Rmd document, e.g. to load a .csv file, are relative to the .Rmd document, not the project root.
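The effect of this rule can be illustrated with base R alone. The sketch below builds a throwaway project in a temporary directory (the folder names data and reports and the file example.csv are hypothetical) and then switches the working directory to the folder that would hold the .Rmd file, which is what knitr does by default while knitting.

```r
# Build a small, hypothetical project layout in a temporary directory:
#   my_project/data/example.csv
#   my_project/reports/        (where a report.Rmd would live)
project <- tempfile("my_project")
dir.create(file.path(project, "data"), recursive = TRUE)
dir.create(file.path(project, "reports"))
write.csv(data.frame(x = 1:3),
          file.path(project, "data", "example.csv"),
          row.names = FALSE)

# Mimic knitting reports/report.Rmd: the working directory becomes the
# folder that contains the .Rmd file.
old_wd <- setwd(file.path(project, "reports"))

# A path written relative to the project root is not found from here ...
found_from_root_path <- file.exists("data/example.csv")
# ... but a path written relative to the .Rmd file is.
found_from_rmd_path <- file.exists("../data/example.csv")

# Restore the original working directory.
setwd(old_wd)

print(c(root_style = found_from_root_path, rmd_style = found_from_rmd_path))
```

If the .Rmd file sits in the project root, as in this session, the two kinds of path coincide and "data/portal_data_joined.csv" works unchanged.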

Insert table

When you add or modify code chunks in your R Markdown file, you don’t need to knit the whole document. Instead, you can run a single code chunk by clicking the green triangle in the top right corner of the chunk.

surveys %>%
    filter(!is.na(weight),           # remove missing weight
           !is.na(hindfoot_length),  # remove missing hindfoot_length
           !is.na(sex)) %>%          # remove missing sex
    group_by(plot_type) %>% 
    summarize(plots = paste(unique(plot_id), collapse = ",")) %>%
    knitr::kable(col.names = c("Plot Type", "Plot Number")) # format nicely

| Plot Type | Plot Number |
|---|---|
| Control | 2,17,12,11,22,14,4,8 |
| Long-term Krat Exclosure | 3,15,19,21 |
| Rodent Exclosure | 5,24,10,16,23,7 |
| Short-term Krat Exclosure | 18,20,6,13 |
| Spectab exclosure | 1,9 |

To format the table nicely in our output document, we use the kable() function from the knitr package. The kable() function takes the data frame produced by your R code and knits it into a nice-looking HTML table. You can also specify different aspects of the table, e.g. the column names, a caption, etc.
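As a small illustration of those extra arguments, the sketch below passes col.names and caption to kable(). It assumes knitr is installed; the data frame is a hand-typed stand-in (plot counts taken from two rows of the table above), and the caption text is made up.

```r
library(knitr)

# A small stand-in data frame: number of plots for two of the plot types
# shown in the table above.
plot_counts <- data.frame(
  plot_type = c("Control", "Rodent Exclosure"),
  n_plots   = c(8, 6)
)

# col.names replaces the raw column names; caption adds a table title.
tbl <- kable(
  plot_counts,
  col.names = c("Plot Type", "Number of Plots"),
  caption   = "Plots per treatment (illustrative)"
)

# In the console this prints as a text table; in a knitted HTML document
# the same call is rendered as an HTML table.
tbl
```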

Many different R packages can be used to generate tables. Some of the more commonly used options are listed in the table below.

| Name | Creator(s) | Description |
|---|---|---|
| condformat | Oller Moreno (2022) | Apply and visualize conditional formatting to data frames in R. It renders a data frame with cells formatted according to criteria defined by rules, using a tidy evaluation syntax. |
| DT | Xie et al. (2023) | Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’ (typically via R Markdown or Shiny). The ‘DataTables’ library has been included in this R package. |
| formattable | Ren and Russell (2021) | Provides functions to create formattable vectors and data frames. ‘Formattable’ vectors are printed with text formatting, and formattable data frames are printed with multiple types of formatting in HTML to improve the readability of data presented in tabular form rendered on web pages. |
| flextable | Gohel and Skintzos (2023) | Use a grammar for creating and customizing pretty tables. The following formats are supported: ‘HTML’, ‘PDF’, ‘RTF’, ‘Microsoft Word’, ‘Microsoft PowerPoint’ and R ‘Grid Graphics’. ‘R Markdown’, ‘Quarto’, and the package ‘officer’ can be used to produce the result files. |
| gt | Iannone et al. (2022) | Build display tables from tabular data with an easy-to-use set of functions. With its progressive approach, we can construct display tables with cohesive table parts. Table values can be formatted using any of the included formatting functions. |
| huxtable | Hugh-Jones (2022) | Creates styled tables for data presentation. Export to HTML, LaTeX, RTF, ‘Word’, ‘Excel’, and ‘PowerPoint’. Simple, modern interface to manipulate borders, size, position, captions, colours, text styles and number formatting. |
| pander | Daróczi and Tsegelskyi (2022) | Contains some functions catching all messages, ‘stdout’ and other useful information while evaluating R code and other helpers to return user specified text elements (e.g., header, paragraph, table, image, lists etc.) in ‘pandoc’ markdown or several types of R objects similarly automatically transformed to markdown format. |
| pixiedust | Nutter and Kretch (2021) | ‘pixiedust’ provides tidy data frames with a programming interface intended to be similar to the ‘ggplot2’ system of layers, with fine-tuned control over each cell of the table. |
| reactable | Lin et al. (2023) | Interactive data tables for R, based on the ‘React Table’ JavaScript library. Provides an HTML widget that can be used in ‘R Markdown’ or ‘Quarto’ documents, ‘Shiny’ applications, or viewed from an R console. |
| rhandsontable | Owen et al. (2021) | An R interface to the ‘Handsontable’ JavaScript library, which is a minimalist Excel-like data grid editor. |
| stargazer | Hlavac (2022) | Produces LaTeX code, HTML/CSS code and ASCII text for well-formatted tables that hold regression analysis results from several models side-by-side, as well as summary statistics. |
| tables | Murdoch (2022) | Computes and displays complex tables of summary statistics. Output may be in LaTeX, HTML, plain text, or an R matrix for further processing. |
| tangram | Garbett et al. (2023) | Provides an extensible formula system to quickly and easily create production quality tables. The processing steps are a formula parser, statistical content generation from data defined by a formula, and rendering into a table. |
| xtable | Dahl et al. (2019) | Coerce data to LaTeX and HTML tables. |
| ztable | Moon (2021) | Makes zebra-striped tables (tables with alternating row colors) in LaTeX and HTML formats easily from a data.frame, matrix, lm, aov, anova, glm, coxph, nls, fitdistr, mytable and cbind.mytable objects. |

Customizing chunk output

We mentioned using include = FALSE in a code chunk above. There are additional options available to customize how the code chunks are presented in the output document. The full list of R Markdown code chunk options can be found here. Below are some of the most widely used options:

| Option | Possible values | Effect |
|---|---|---|
| eval | TRUE or FALSE | Whether or not the code within the code chunk should be run. |
| echo | TRUE or FALSE | Whether or not to show the code of the chunk in the output document. echo = TRUE will show the code. |
| include | TRUE or FALSE | Whether or not the code chunk and its output should be included in the document. FALSE means that your code will run, but neither the code nor its output will show up in the document. |
| warning | TRUE or FALSE | Whether or not you want your output document to display potential warning messages produced by your code. |
| message | TRUE or FALSE | Whether or not you want your output document to display potential messages produced by your code. |
| fig.align | default, left, right, center | Where the figure from your R code chunk should be placed on the page. |

Note that the default setting for eval, echo, include, warning, and message is TRUE, while fig.align defaults to "default".
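These defaults can be inspected, and changed for every chunk at once, from R. The following is a minimal sketch, assuming knitr is installed and a fresh R session; knitr::opts_chunk$get() and knitr::opts_chunk$set() are the relevant functions, and a call like the one below would typically go in the setup chunk. The particular values chosen are only an example.

```r
library(knitr)

# Look up the current (default) values of the options from the table above.
defaults <- opts_chunk$get(
  c("eval", "echo", "include", "warning", "message", "fig.align")
)
str(defaults)

# Change the defaults for all chunks that follow: hide the code, suppress
# messages, and centre figures. Individual chunks can still override these
# in their own header.
opts_chunk$set(echo = FALSE, message = FALSE, fig.align = "center")

# Confirm the new document-wide settings.
updated <- opts_chunk$get(c("echo", "message", "fig.align"))
str(updated)
```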

Insert plots

We are using the murders data table from the dslabs package.

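library(dslabs)    # provides the murders data table
library(ggthemes)  # provides theme_economist()
library(ggrepel)   # provides geom_text_repel()
# ggplot2 and dplyr are loaded with the tidyverse in the setup chunk

# Overall US murder rate per million people
r <- murders |>
  summarize(rate = sum(total) / sum(population) * 10^6) |>
  pull(rate)

plots <- murders |> ggplot(aes(population/10^6, total, label = abb)) +
  # Dashed reference line: states on this line have the national average rate
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +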
  geom_point(aes(col=region), size = 3) +
  geom_text_repel() + 
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") + 
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") + 
  scale_color_discrete(name = "Region") +
  theme_economist()

plots

Chunk options also control how a plot appears in the output. For example, we can add a caption with the chunk option fig.cap and resize the plot using out.width and out.height:

```{r fig.cap = "Figure 1. Summary", out.width="60%", out.height="60%"}
plots
```

Figure 1. Summary


References
https://datacarpentry.org/r-socialsci/06-rmarkdown.html
http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/ggplot2.html