In this lab, we will learn how to convert data tables into a tidy form. We will learn about functions in the tidyr
package, which is part of the tidyverse
meta-package.
library(tidyverse)
library(babynames)
Goal: by the end of this lab, you will understand what tidy data is and how to convert a wide table into a narrow table, and vice versa.
pivot_longer()
Tidy data are usually long and narrow. That is, it may have many rows, but relatively few columns. However, data are often stored in spreadsheets in a wide format, which may have more columns that rows. I believe this is because the wide format is easier to see in a spreadsheet. However, it turns out that there are lots of good reasons to keep data in a tidy
format. For one, tools like dplyr
and ggplot2
expect tidy data. For two, it is basically the Third Normal Form that we mentioned last week and is standard in relational databases (more on that later). For three, now that we know how to wrangle and visualize data, we don’t need to see it all anyway.
The pivot_longer()
and pivot_wider()
functions from the tidyr
package (written by guess who?) are huge time-savers. In short, pivot_longer()
takes a data table in a wide format and converts it into a tidy format. pivot_wider()
does the opposite (and accordingly, is used less frequently).
We (unwittingly) came up with an example earlier. Recall that the total_births
data frame looks like this:
<- babynames %>%
ssa_births group_by(year) %>%
summarize(num_rows = n(), births = sum(n))
<- births %>%
total_births left_join(ssa_births, by = "year")
head(total_births)
Let’s forget about num_rows
and rename the variables.
<- total_births %>%
total_births select(-num_rows) %>%
rename(census = births.x, ssa = births.y)
Each row corresponds to a single year
, and for each year, we have two variables: census
(from the Census) and ssa
(from the SSA). This data frame is not tidy. (Why?)
When we graph them, we are forced to use a second y
aesthetic instead of a direct mapping.
ggplot(data = total_births, aes(x = year, y = census)) +
geom_line() +
geom_line(aes(y = ssa))
Note that we don’t get a legend, and it’s not obvious how we could map color to a variable.
Our data frame has 109 rows and 3 columns,
dim(total_births)
but what we want is a data frame that has 218 rows and three columns: year
, source
, and births
. This data frame would be tidy, because like is stored with like. The pivot_longer()
function will help us get there.
<- total_births %>%
births_tidy pivot_longer(-year, names_to = "source", values_to = "births")
head(births_tidy)
dim(births_tidy)
Do you see what happened? View births_tidy
in several different ways if necessary, and make sure that you understand where each piece of data went.
Now we can use the color aesthetic directly, and we get the legend for free:
ggplot(data = births_tidy, aes(x = year, y = births)) +
geom_line(aes(color = source))
Note that even though we have two sources for birth numbers, the numbers themselves represent the same kind of thing.
Now you will try this for yourself with a Gapminder data set. We’ll walk you through it step-by-step.
Pick a data set and download it as an Excel file. Open the file in Excel and then “Save As…” and select “CSV”. Make sure to save the file as a .csv
!
Read the CSV into R using read_csv()
. Pay careful attention to the path names! You code will look something like this:
<- read.csv("~/Desktop/file_from_gapminder.csv") mydata
# SAMPLE SOLUTION
# I chose "Alcohol consumption per adult 15+ (litres)".
<- read_csv("/tmp/alcohol_consumption_per_adult_15plus_litres.csv") alcohol
rename()
to change the name of the variable in the first row to "country"
. [Hint: Use backticks if you have a ghastly variable name.]# SAMPLE SOLUTION
<- alcohol %>%
alcohol rename(country = blah)
dim()
to find the dimensions of your data frame. Make a note of these.# SAMPLE SOLUTION
dim(alcohol)
country
, year
, and value
. How many rows will it have?# SAMPLE SOLUTION
# It should have $n \cdot (k-1)$ rows, where $n$ is the number of rows and $k$ is the number of variables.
pivot_longer()
to convert your data into a tidy format. Check that it has the right dimensions.# SAMPLE SOLUTION
<- alcohol %>%
alcohol_tidy pivot_longer(-country, names_to = "year", values_to = "consumption")
year
column is not stored as a number, but rather it has type character
. Verify this, and then use mutate()
and parse_number()
to fix it.# SAMPLE SOLUTION
glimpse(alcohol_tidy)
<- alcohol_tidy %>%
alcohol_tidy mutate(year = parse_number(year))
ggplot2
to plot your variable as a time series. Use color to distinguish between countries.pivot_wider()
pivot_wider()
does the opposite.
<- births_tidy %>%
wide_births pivot_wider(names_from = source, values_from = births)
head(wide_births)
dim(wide_births)
Please respond to the following prompt on Slack in the #mod-wrangling
channel.
Prompt: There are 8 exercises in Lab 9. Which one was the most challenging for you? If you got stuck somewhere along the way, tell us which Exercise you got stuck on, and what the issue was.