dplyr
In the same way that ggplot2
provide a grammar for graphics, dplyr
provides a grammar for data transformation. A grammar is a set of rules that govern a language. In this case, dplyr
provides a grammar that will allow you to express ideas about how to transform data.
Note that a grammar consists of verbs, nouns, and direct objects.
library(tidyverse)
Goal: by the end of this lab, you will be able to use dplyr
to transform a single data table.
The creator of dplyr
(and ggplot2
), Hadley Wickham, argues that most of the operations that you need to perform on a data table can be achieve using combinations of the following five “verbs”:
select()
filter()
mutate()
(and rename()
)arrange()
summarize()
Each of the five data “verbs” takes a data frame as its first argument and returns a data frame (actually a tbl_df
). Because of this, these operations can be chained using the pipe operator (%>%
) (see below).
When in doubt, reference the data transformation cheat sheet. Pay careful attention to these pictures!
library(babynames)
head(babynames)
Find a subset of the columns using select()
:
%>%
babynames select(year, name, n) %>%
head()
Find a subset of the rows using filter()
:
%>%
babynames filter(name == "Bella") %>%
head()
Do both and assign the result to a new object:
<- babynames %>%
bella filter(name == "Bella") %>%
select(year, name, sex, n)
Check the dimensions of that object, view the first few rows, and verify what kind of object it is:
dim(bella)
head(bella)
class(bella)
Note that bella
is a tbl_df
, a tbl
, and a data.frame
. Objects in R can have more than one type!
Let’s make a quick plot of the popularity of Bella
over time.
library(ggplot2)
ggplot(data = bella, aes(x = year, y = n)) +
geom_line(aes(color = sex))
Use mutate()
to create new variables. Here, we define a variable called popular
that is TRUE
if the name was assigned to more than 1% of all babies in that year.
<- babynames %>%
babynames mutate(popular = prop > 0.01)
Use rename()
to rename a variable:
<- babynames %>%
babynames rename(is_popular = popular)
Use the new variable (and filter()
) to create a subset of the rows:
<- babynames %>%
popular filter(is_popular)
nrow(popular)
head(popular)
What are the single most popular names of all time? To find them we can arrange()
the table in descending order of the proportion of babies who got that name.
%>%
popular arrange(desc(prop))
What does the value of the prop
column in the first line of output above mean? Write one sentence to explain what it means to someone who has never taken a statistics course.
Choose a name, and find the year
in which that name was used most frequently.
# SAMPLE SOLUTION:
%>%
babynames filter(name == "Benjamin") %>%
arrange(desc(n)) %>%
head()
# SAMPLE SOLUTION:
%>%
babynames filter(year == 1989) %>%
arrange(desc(n)) %>%
head()
Think about the following question, but do not attempt to actually solve it!
The last single table verb is summarize()
and it works a bit differently. Like all of the verbs, it takes a data table and returns a data table, but on its own, summarize()
only returns a single row of output. In order to do this, it has to collapse entire columns into a single values. Thus, unless you tell summarize()
how to condense the many pieces of information in a variable into a single value, it won’t know what to do.
Here, we summarize bella
to find the greatest number of Bella
s born in a single year (to a single gender):
%>%
bella summarize(most_bellas = max(n))
Note the difference between this and simply sorting the data table.
When using summarize()
, it is almost always a good idea to count the number of rows that you collapsed. The value of this may not be immediately obvious, but it serves as a handy sanity check that I promise you will save you from lots of mistakes. We can do this using the n()
function.
Note: Do not confuse the function
n()
, which is always used inside asummarize()
command, and the variable namen
, which happens to be a column in thebabynames
table. It is a coincidence that these have the same name!
%>%
bella summarize(num_rows = n(), most_bellas = max(n))
num_rows
represent (in real-world terms) in the previous result? Explain what it means to your neighbor and argue about it until you agree.The pipe operator is provided by the magrittr
package. It is automatically loaded by dplyr
.
Note: The term
pipe
is an old allusion to the use of|
in Unix to perform analogous operations.
The expression
%>%
mydata verb(arguments)
is the same as:
verb(mydata, arguments)
In effect, function(x, args) = x %>% function(args)
.
In the grammar of data transformation, mydata
is a noun, verb()
is a verb, and arguments
are the direct objects. The pipeline is closer to how we speak in English, while the nested syntax is more like Polish Notation.
This means that instead of having to do:
select(filter(mutate(data, arguments), arguments), arguments)
You can do:
%>%
data mutate(arguments) %>%
filter(arguments) %>%
select(arguments)
Which is easier to read and understand?
arrange(select(filter(babynames, name == "Jordan"), -name), desc(prop))
# Sample solution
%>%
babynames filter(name == "Jordan") %>%
select(-name) %>%
arrange(desc(prop))
Please respond to the following prompt on Slack in the #mod-wrangling
channel.
Prompt: after completing the Single Table Analysis lab, what questions do you still have about the five verbs and/or the pipe operator?