Tidy data

In this lab, we will learn how to convert data tables into a tidy form. We will learn about functions in the tidyr package, which is part of the tidyverse meta-package.

library(tidyverse)
library(babynames)

Goal: by the end of this lab, you will understand what tidy data is and how to convert a wide table into a narrow table, and vice versa.

Make wide tables tidy with `pivot_longer()`

Tidy data are usually long and narrow. That is, it may have many rows, but relatively few columns. However, data are often stored in spreadsheets in a wide format, which may have more columns that rows. I believe this is because the wide format is easier to see in a spreadsheet. However, it turns out that there are lots of good reasons to keep data in a tidy format. For one, tools like dplyr and ggplot2 expect tidy data. For two, it is basically the Third Normal Form that we mentioned last week and is standard in relational databases (more on that later). For three, now that we know how to wrangle and visualize data, we don’t need to see it all anyway.

The pivot_longer() and pivot_wider() functions from the tidyr package (written by guess who?) are huge time-savers. In short, pivot_longer() takes a data table in a wide format and converts it into a tidy format. pivot_wider() does the opposite (and accordingly, is used less frequently).

We (unwittingly) came up with an example earlier. Recall that the total_births data frame looks like this:

ssa_births <- babynames %>%
  group_by(year) %>%
  summarize(num_rows = n(), births = sum(n))

total_births <- births %>%
  left_join(ssa_births, by = "year")

head(total_births)

Let’s forget about num_rows and rename the variables.

total_births <- total_births %>%
  select(-num_rows) %>%
  rename(census = births.x, ssa = births.y)

Each row corresponds to a single year, and for each year, we have two variables: census (from the Census) and ssa (from the SSA). This data frame is not tidy. (Why?)

When we graph them, we are forced to use a second y aesthetic instead of a direct mapping.

ggplot(data = total_births, aes(x = year, y = census)) +
  geom_line() + 
  geom_line(aes(y = ssa))

Note that we don’t get a legend, and it’s not obvious how we could map color to a variable.

Our data frame has 109 rows and 3 columns,

dim(total_births)

but what we want is a data frame that has 218 rows and three columns: year, source, and births. This data frame would be tidy, because like is stored with like. The pivot_longer() function will help us get there.

births_tidy <- total_births %>%
  pivot_longer(-year, names_to = "source", values_to = "births")

head(births_tidy)
dim(births_tidy)

Do you see what happened? View births_tidy in several different ways if necessary, and make sure that you understand where each piece of data went.

Now we can use the color aesthetic directly, and we get the legend for free:

ggplot(data = births_tidy, aes(x = year, y = births)) +
  geom_line(aes(color = source))

Note that even though we have two sources for birth numbers, the numbers themselves represent the same kind of thing.

Exercises

Now you will try this for yourself with a Gapminder data set. We’ll walk you through it step-by-step.

Pick a data set and download it as an Excel file. Open the file in Excel and then “Save As…” and select “CSV”. Make sure to save the file as a .csv!
Read the CSV into R using read_csv(). Pay careful attention to the path names! You code will look something like this:

mydata <- read.csv("~/Desktop/file_from_gapminder.csv")

# SAMPLE SOLUTION
# I chose "Alcohol consumption per adult 15+ (litres)". 
alcohol <- read_csv("/tmp/alcohol_consumption_per_adult_15plus_litres.csv")

If necessary, use rename() to change the name of the variable in the first row to "country". [Hint: Use backticks if you have a ghastly variable name.]

# SAMPLE SOLUTION
alcohol <- alcohol %>%
  rename(country = blah)

Use dim() to find the dimensions of your data frame. Make a note of these.

# SAMPLE SOLUTION
dim(alcohol)

Take out a piece of paper and draw what you want your “tidy” data frame to look like. It should have three columns: country, year, and value. How many rows will it have?

# SAMPLE SOLUTION

# It should have $n \cdot (k-1)$ rows, where $n$ is the number of rows and $k$ is the number of variables.

Use pivot_longer() to convert your data into a tidy format. Check that it has the right dimensions.

# SAMPLE SOLUTION
alcohol_tidy <- alcohol %>%
  pivot_longer(-country, names_to = "year", values_to = "consumption")

Most likely the year column is not stored as a number, but rather it has type character. Verify this, and then use mutate() and parse_number() to fix it.

# SAMPLE SOLUTION
glimpse(alcohol_tidy)
alcohol_tidy <- alcohol_tidy %>%
  mutate(year = parse_number(year))

Use ggplot2 to plot your variable as a time series. Use color to distinguish between countries.

Make narrow tables wide with `pivot_wider()`

pivot_wider() does the opposite.

wide_births <- births_tidy %>%
  pivot_wider(names_from = source, values_from = births)
head(wide_births)
dim(wide_births)

Your learning

Please respond to the following prompt on Slack in the #mod-wrangling channel.

Prompt: There are 8 exercises in Lab 9. Which one was the most challenging for you? If you got stuck somewhere along the way, tell us which Exercise you got stuck on, and what the issue was.

Tidy data

Make wide tables tidy with pivot_longer()

Exercises

Make narrow tables wide with pivot_wider()

Your learning

Make wide tables tidy with `pivot_longer()`

Make narrow tables wide with `pivot_wider()`