In this lab, we will learn how to join two tables together.
library(tidyverse)
library(babynames)
Goal: by the end of this lab, you will be able to use inner_join()
and left_join()
to combine data from two different tables.
In the babynames
package, there is a table called births
. It has just two columns:
head(births)
These data come from the U.S. Census Bureau (see help(births)
for more information). For clarity, let’s copy this table into a new object with a clearer name.
<- births census_births
The information about baby names is stored in the babynames
table that we have seen previously. These data come from the Social Security Administration (see help(babynames)
for more details). It stands to reason that the total number of births recorded by the Census Bureau should match the total number of births as recorded by the Social Security Administration. We will investigate this claim.
We have to do a little bit of work in order to condense the SSA data into the same yearly form as the Census Bureau data. We can do this with group_by()
and summarize()
.
Note: I am also filtering to exclude data from 2012 on for pedagogical purposes. There is no practical reason to do this – I just don’t want the two tables to end in the same year.
<- babynames %>%
ssa_births filter(year < 2012) %>%
group_by(year) %>%
summarize(N = n(), births = sum(n))
Now we have two separate tables with what should ostensibly be the same information: census_births
and ssa_births
. Note that they don’t cover the same set of years.
%>%
census_births summarize(
N = n(),
earliest = min(year),
latest = max(year)
)%>%
ssa_births summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
The technique for combining two tables is called a “join.” There are several different kinds of joins, which we explore in this section. Our goal is to match rows from one table that correspond to rows in a different table.
Using an inner_join()
, the only rows in the result set will be those with corresponding matches in both tables.
%>%
census_births inner_join(ssa_births, by = "year") %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
Note that the set of years returns is the intersection of the two sets of years in the original tables.
Using a left_join()
, the rows in census_births
will always be returned, even if there is no corresponding entry in the ssa_births
table. Rows with no corresponding entry have NA
s that appear for the missing data.
<- census_births %>%
total_births left_join(ssa_births, by = "year")
%>%
total_births filter(is.na(births.y))
The data from 2012 exists in the census_births
table, but not in the ssa_births
table. Note that the set of years returns is no longer just the intersection.
%>%
total_births summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
Conversely, using a right_join()
will have the opposite effect. Now there are many years of SSA data that have no match in the Census data.
<- census_births %>%
total_births right_join(ssa_births, by = "year")
%>%
total_births filter(is.na(births.x))
The number of years present is also neither the intersection, nor the union.
%>%
total_births summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
Switching the order of the tables and switching back to right_join()
will have the same effect. That is, left_join(a, b)
is equivalent to right_join(b, a)
.
%>%
ssa_births left_join(census_births, by = "year") %>%
summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
A full_join()
returns all rows, regardless of whether they are matched.
<- census_births %>%
total_births full_join(ssa_births, by = "year")
%>%
total_births filter(is.na(births.x) | is.na(births.y))
Now the set of years returned is the union of the years in the two tables.
%>%
total_births summarize(
N = n(),
earliest = min(year),
latest = max(year)
)
With the tables joined, we can compare the birth numbers directly. While they are strongly correlated,
%>%
total_births summarize(
N = n(),
correlation = cor(births.x, births.y,
use = "complete.obs"
) )
the numbers are not the same.
library(ggplot2)
ggplot(data = total_births, aes(x = year, y = births.x)) +
geom_line() +
geom_line(aes(y = births.y))
Can you do some detective work to find out why?
Hint: read the documentation for
babynames
.
The following exercises use the nycflights13
package.
# sample solution
library(nycflights13)
%>%
flights group_by(carrier) %>%
summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airlines, by = c("carrier" = "carrier"))
# sample solution
%>%
flights group_by(dest) %>%
summarize(N = n()) %>%
arrange(desc(N)) %>%
left_join(airports, by = c("dest" = "faa"))
# sample solution
%>%
flights left_join(planes, by = c("tailnum" = "tailnum")) %>%
group_by(manufacturer, model) %>%
summarize(N = n()) %>%
arrange(desc(N))
airports
table)?# sample solution
%>%
flights left_join(airports, by = c("dest" = "faa")) %>%
filter(is.na(name)) %>%
group_by(dest) %>%
summarize(N = n())
planes
table)?# sample solution
%>%
flights left_join(planes, by = c("tailnum" = "tailnum")) %>%
filter(is.na(model)) %>%
group_by(tailnum) %>%
summarize(N = n()) %>%
arrange(desc(N))
Please respond to the following prompt on Slack in the #mod-wrangling
channel.
Prompt: You have now learned the basics of data wrangling. What remains unclear?