In the same way that
ggplot2 provides a grammar for graphics,
dplyr provides a grammar for data transformation. A grammar is a set of rules that govern a language. In this case,
dplyr provides a grammar that will allow you to express ideas about how to transform data.
Note that a grammar consists of verbs, nouns, and direct objects.
Goal: by the end of this lab, you will be able to use
dplyr to transform a single data table.
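The creator of dplyr (and ggplot2), Hadley Wickham, argues that most of the operations that you need to perform on a data table can be achieved using combinations of the following five “verbs”:
select(): take a subset of the columns
filter(): take a subset of the rows
mutate(): add or modify columns
arrange(): sort the rows
summarize(): aggregate the data across rows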
Each of the five data “verbs” takes a data frame as its first argument and returns a data frame (actually a
tbl_df). Because of this, these operations can be chained using the pipe operator (
%>%) (see below).
When in doubt, reference the data transformation cheat sheet. Pay careful attention to these pictures!
Find a subset of the columns using select():
babynames %>% select(year, name, n) %>% head()
Find a subset of the rows using filter():
babynames %>% filter(name == "Bella") %>% head()
Do both and assign the result to a new object:
bella <- babynames %>% filter(name == "Bella") %>% select(year, name, sex, n)
Check the dimensions of that object, view the first few rows, and verify what kind of object it is:
dim(bella)
head(bella)
class(bella)
bella is a tbl_df, a tbl, and a data.frame. Objects in R can have more than one class!
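As a small illustration of one object carrying several classes at once, the sketch below builds a made-up two-row table and inspects it. The names toy and plain are hypothetical and are not part of the lab data; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy table standing in for `bella` (values are made up)
toy <- tibble(year = c(2000, 2001), n = c(10, 12))

class(toy)                    # a character vector holding several class names
inherits(toy, "data.frame")   # TRUE: a tibble is also a data frame
inherits(toy, "tbl_df")       # TRUE

# A plain data frame carries only one class
plain <- data.frame(year = c(2000, 2001), n = c(10, 12))
class(plain)
```

Because a tibble is also a data.frame, functions written for ordinary data frames generally accept it as well.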
Let’s make a quick plot of the popularity of
Bella over time.
library(ggplot2)
ggplot(data = bella, aes(x = year, y = n)) + geom_line(aes(color = sex))
Use mutate() to create new variables. Here, we define a variable called
popular that is
TRUE if the name was assigned to more than 1% of all babies in that year.
babynames <- babynames %>% mutate(popular = prop > 0.01)
Use rename() to rename a variable:
babynames <- babynames %>% rename(is_popular = popular)
Use the new variable (and
filter()) to create a subset of the rows:
popular <- babynames %>% filter(is_popular)
nrow(popular)
head(popular)
What are the single most popular names of all time? To find them we can
arrange() the table in descending order of the proportion of babies who got that name.
popular %>% arrange(desc(prop))
What does the value of the
prop column in the first line of output above mean? Write one sentence to explain what it means to someone who has never taken a statistics course.
Choose a name, and find the
year in which that name was used most frequently.
# SAMPLE SOLUTION:
babynames %>% filter(name == "Benjamin") %>% arrange(desc(n)) %>% head()
# SAMPLE SOLUTION:
babynames %>% filter(year == 1989) %>% arrange(desc(n)) %>% head()
Think about the following question, but do not attempt to actually solve it!
The last single table verb is
summarize() and it works a bit differently. Like all of the verbs, it takes a data table and returns a data table, but on its own,
summarize() only returns a single row of output. In order to do this, it has to collapse entire columns into a single value. Thus, unless you tell
summarize() how to condense the many pieces of information in a variable into a single value, it won’t know what to do.
Here, we summarize
bella to find the greatest number of
Bellas born in a single year (to a single gender):
bella %>% summarize(most_bellas = max(n))
Note the difference between this and simply sorting the data table.
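To make that difference concrete, here is a minimal sketch on a made-up three-row table. The table toy and its values are hypothetical; the sketch assumes dplyr is installed. Sorting keeps every row, while summarizing collapses the table to a single row.

```r
library(dplyr)

# Hypothetical toy table; the values are made up for illustration
toy <- tibble(
  year = c(2001, 2002, 2003),
  n    = c(5, 9, 7)
)

# Sorting keeps every row and every column; only the order changes
sorted <- toy %>%
  arrange(desc(n))

# Summarizing collapses the table to one row holding only what we asked for
collapsed <- toy %>%
  summarize(most = max(n))

dim(sorted)     # 3 rows, 2 columns
dim(collapsed)  # 1 row, 1 column
```

The sorted table still tells you which year had the largest count; the summarized table keeps only the count itself.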
When using summarize(), it is almost always a good idea to count the number of rows that you collapsed. The value of this may not be immediately obvious, but it serves as a handy sanity check that I promise will save you from lots of mistakes. We can do this using the n() function.
Note: Do not confuse the function n(), which is always used inside a summarize() command, and the variable name n, which happens to be a column in the babynames table. It is a coincidence that these have the same name!
bella %>% summarize(num_rows = n(), most_bellas = max(n))
What does num_rows represent (in real-world terms) in the previous result? Explain what it means to your neighbor and argue about it until you agree.
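If the two uses of n are still hard to tell apart, the following sketch may help. It uses a made-up table (toy is hypothetical, and its values are invented) that, like babynames, happens to have a column called n, and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy table that, like babynames, has a column named n
toy <- tibble(
  sex = c("F", "F", "M"),
  n   = c(10, 20, 5)
)

toy_summary <- toy %>%
  summarize(
    num_rows = n(),     # n() the function: how many rows were collapsed
    total    = sum(n),  # n the column: add up the values stored in it
    largest  = max(n)   # n the column again: the biggest stored value
  )

toy_summary
```

The parentheses are the only visual cue: n() counts rows, while a bare n refers to the numbers stored in the column.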
The pipe operator is provided by the
magrittr package. It is automatically loaded by dplyr.
Note: The term pipe is an old allusion to the use of | in Unix to perform analogous operations.
The expression
mydata %>% verb(arguments)
is the same as:
verb(mydata, arguments)
In effect, function(x, args) = x %>% function(args).
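One way to convince yourself of this equivalence is to run both forms on the same input and compare the results. The sketch below does that on a made-up table (toy and its values are hypothetical) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy table; names and counts are made up
toy <- tibble(
  name = c("Ann", "Bo", "Ann"),
  n    = c(3, 8, 6)
)

# Nested call: the data frame is the first argument
nested <- filter(toy, name == "Ann")

# Piped call: the data frame on the left is passed in as that first argument
piped <- toy %>% filter(name == "Ann")

identical(nested, piped)  # TRUE
```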
In the grammar of data transformation,
mydata is a noun,
verb() is a verb, and
arguments are the direct objects. The pipeline is closer to how we speak in English, while the nested syntax is more like Polish Notation.
This means that instead of having to do:
select(filter(mutate(data, arguments), arguments), arguments)
You can do:
data %>% mutate(arguments) %>% filter(arguments) %>% select(arguments)
Which is easier to read and understand?
arrange(select(filter(babynames, name == "Jordan"), -name), desc(prop))
# Sample solution
babynames %>% filter(name == "Jordan") %>% select(-name) %>% arrange(desc(prop))
Please respond to the following prompt on Slack in the
Prompt: after completing the Single Table Analysis lab, what questions do you still have about the five verbs and/or the pipe operator?