In this lab, we will learn how to join two tables together.
library(tidyverse)
library(babynames)Goal: by the end of this lab, you will be able to use inner_join() and left_join() to combine data from two different tables.
In the babynames package, there is a table called births. It has just two columns:
head(births)These data come from the U.S. Census Bureau (see help(births) for more information). For clarity, let’s copy this table into a new object with a clearer name.
census_births <- birthsThe information about baby names is stored in the babynames table that we have seen previously. These data come from the Social Security Administration (see help(babynames) for more details). It stands to reason that the total number of births recorded by the Census Bureau should match the total number of births as recorded by the Social Security Administration. We will investigate this claim.
We have to do a little bit of work in order to condense the SSA data into the same yearly form as the Census Bureau data. We can do this with group_by() and summarize().
Note: I am also filtering to exclude data from 2012 on for pedagogical purposes. There is no practical reason to do this – I just don’t want the two tables to end in the same year.
ssa_births <- babynames %>%
filter(year < 2012) %>%
group_by(year) %>%
summarize(N = n(), births = sum(n))Now we have two separate tables with what should ostensibly be the same information: census_births and ssa_births. Note that they don’t cover the same set of years.
census_births %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
ssa_births %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)The technique for combining two tables is called a “join.” There are several different kinds of joins, which we explore in this section. Our goal is to match rows from one table that correspond to rows in a different table.
Using an inner_join(), the only rows in the result set will be those with corresponding matches in both tables.
census_births %>%
inner_join(ssa_births, by = "year") %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)Note that the set of years returns is the intersection of the two sets of years in the original tables.
Using a left_join(), the rows in census_births will always be returned, even if there is no corresponding entry in the ssa_births table. Rows with no corresponding entry have NAs that appear for the missing data.
total_births <- census_births %>%
left_join(ssa_births, by = "year")
total_births %>%
filter(is.na(births.y))The data from 2012 exists in the census_births table, but not in the ssa_births table. Note that the set of years returns is no longer just the intersection.
total_births %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)Conversely, using a right_join() will have the opposite effect. Now there are many years of SSA data that have no match in the Census data.
total_births <- census_births %>%
right_join(ssa_births, by = "year")
total_births %>%
filter(is.na(births.x))The number of years present is also neither the intersection, nor the union.
total_births %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)Switching the order of the tables and switching back to right_join() will have the same effect. That is, left_join(a, b) is equivalent to right_join(b, a).
ssa_births %>%
left_join(census_births, by = "year") %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)A full_join() returns all rows, regardless of whether they are matched.
total_births <- census_births %>%
full_join(ssa_births, by = "year")
total_births %>%
filter(is.na(births.x) | is.na(births.y))Now the set of years returned is the union of the years in the two tables.
total_births %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)With the tables joined, we can compare the birth numbers directly. While they are strongly correlated,
total_births %>%
summarize(
N = n(),
correlation = cor(births.x, births.y,
use = "complete.obs"
)
)the numbers are not the same.
library(ggplot2)
ggplot(data = total_births, aes(x = year, y = births.x)) +
geom_line() +
geom_line(aes(y = births.y))Can you do some detective work to find out why?
Hint: read the documentation for
babynames.
The following exercises use the nycflights13 package.
# sample solution
library(nycflights13)
flights %>%
group_by(carrier) %>%
summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airlines, by = c("carrier" = "carrier"))# sample solution
flights %>%
group_by(dest) %>%
summarize(N = n()) %>%
arrange(desc(N)) %>%
left_join(airports, by = c("dest" = "faa"))# sample solution
flights %>%
left_join(planes, by = c("tailnum" = "tailnum")) %>%
group_by(manufacturer, model) %>%
summarize(N = n()) %>%
arrange(desc(N))airports table)?# sample solution
flights %>%
left_join(airports, by = c("dest" = "faa")) %>%
filter(is.na(name)) %>%
group_by(dest) %>%
summarize(N = n())planes table)?# sample solution
flights %>%
left_join(planes, by = c("tailnum" = "tailnum")) %>%
filter(is.na(model)) %>%
group_by(tailnum) %>%
summarize(N = n()) %>%
arrange(desc(N))Please respond to the following prompt on Slack in the #mod-wrangling channel.
Prompt: You have now learned the basics of data wrangling. What remains unclear?