In this lab, we will learn how to convert data tables into a tidy form. We will learn about functions in the
tidyr package, which is part of the
Goal: by the end of this lab, you will understand what tidy data is and how to convert a wide table into a narrow table, and vice versa.
Tidy data are usually long and narrow. That is, it may have many rows, but relatively few columns. However, data are often stored in spreadsheets in a wide format, which may have more columns that rows. I believe this is because the wide format is easier to see in a spreadsheet. However, it turns out that there are lots of good reasons to keep data in a
tidy format. For one, tools like
ggplot2 expect tidy data. For two, it is basically the Third Normal Form that we mentioned last week and is standard in relational databases (more on that later). For three, now that we know how to wrangle and visualize data, we don’t need to see it all anyway.
pivot_wider() functions from the
tidyr package (written by guess who?) are huge time-savers. In short,
pivot_longer() takes a data table in a wide format and converts it into a tidy format.
pivot_wider() does the opposite (and accordingly, is used less frequently).
We (unwittingly) came up with an example earlier. Recall that the
total_births data frame looks like this:
<- babynames %>% ssa_births group_by(year) %>% summarize(num_rows = n(), births = sum(n)) <- births %>% total_births left_join(ssa_births, by = "year") head(total_births)
Let’s forget about
num_rows and rename the variables.
<- total_births %>% total_births select(-num_rows) %>% rename(census = births.x, ssa = births.y)
Each row corresponds to a single
year, and for each year, we have two variables:
census (from the Census) and
ssa (from the SSA). This data frame is not tidy. (Why?)
When we graph them, we are forced to use a second
y aesthetic instead of a direct mapping.
ggplot(data = total_births, aes(x = year, y = census)) + geom_line() + geom_line(aes(y = ssa))
Note that we don’t get a legend, and it’s not obvious how we could map color to a variable.
Our data frame has 109 rows and 3 columns,
but what we want is a data frame that has 218 rows and three columns:
births. This data frame would be tidy, because like is stored with like. The
pivot_longer() function will help us get there.
<- total_births %>% births_tidy pivot_longer(-year, names_to = "source", values_to = "births") head(births_tidy) dim(births_tidy)
Do you see what happened? View
births_tidy in several different ways if necessary, and make sure that you understand where each piece of data went.
Now we can use the color aesthetic directly, and we get the legend for free:
ggplot(data = births_tidy, aes(x = year, y = births)) + geom_line(aes(color = source))
Note that even though we have two sources for birth numbers, the numbers themselves represent the same kind of thing.
Now you will try this for yourself with a Gapminder data set. We’ll walk you through it step-by-step.
Pick a data set and download it as an Excel file. Open the file in Excel and then “Save As…” and select “CSV”. Make sure to save the file as a
Read the CSV into R using
read_csv(). Pay careful attention to the path names! You code will look something like this:
# SAMPLE SOLUTION # I chose "Alcohol consumption per adult 15+ (litres)". <- read_csv("/tmp/alcohol_consumption_per_adult_15plus_litres.csv")alcohol
rename()to change the name of the variable in the first row to
"country". [Hint: Use backticks if you have a ghastly variable name.]
# SAMPLE SOLUTION <- alcohol %>% alcohol rename(country = blah)
dim()to find the dimensions of your data frame. Make a note of these.
# SAMPLE SOLUTION dim(alcohol)
value. How many rows will it have?
# SAMPLE SOLUTION # It should have $n \cdot (k-1)$ rows, where $n$ is the number of rows and $k$ is the number of variables.
pivot_longer()to convert your data into a tidy format. Check that it has the right dimensions.
# SAMPLE SOLUTION <- alcohol %>% alcohol_tidy pivot_longer(-country, names_to = "year", values_to = "consumption")
yearcolumn is not stored as a number, but rather it has type
character. Verify this, and then use
parse_number()to fix it.
# SAMPLE SOLUTION glimpse(alcohol_tidy) <- alcohol_tidy %>% alcohol_tidy mutate(year = parse_number(year))
ggplot2to plot your variable as a time series. Use color to distinguish between countries.
pivot_wider() does the opposite.
<- births_tidy %>% wide_births pivot_wider(names_from = source, values_from = births) head(wide_births) dim(wide_births)
Please respond to the following prompt on Slack in the
Prompt: There are 8 exercises in Lab 9. Which one was the most challenging for you? If you got stuck somewhere along the way, tell us which Exercise you got stuck on, and what the issue was.