A Grammar for Data Transformation

In the same way that ggplot2 provide a grammar for graphics, dplyr provides a grammar for data transformation. A grammar is a set of rules that govern a language. In this case, dplyr provides a grammar that will allow you to express ideas about how to transform data.

Note that a grammar consists of verbs, nouns, and direct objects.

library(tidyverse)

Goal: by the end of this lab, you will be able to use dplyr to transform a single data table.

The Five Verbs

The creator of dplyr (and ggplot2), Hadley Wickham, argues that most of the operations that you need to perform on a data table can be achieve using combinations of the following five “verbs”:

  • select()
  • filter()
  • mutate() (and rename())
  • arrange()
  • summarize()

Each of the five data “verbs” takes a data frame as its first argument and returns a data frame (actually a tbl_df). Because of this, these operations can be chained using the pipe operator (%>%) (see below).

When in doubt, reference the data transformation cheat sheet. Pay careful attention to these pictures!

library(babynames)
head(babynames)

Find a subset of the columns using select():

babynames %>%
  select(year, name, n) %>%
  head()

Find a subset of the rows using filter():

babynames %>%
  filter(name == "Bella") %>%
  head()

Do both and assign the result to a new object:

bella <- babynames %>%
  filter(name == "Bella") %>%
  select(year, name, sex, n)

Check the dimensions of that object, view the first few rows, and verify what kind of object it is:

dim(bella)
head(bella)
class(bella)

Note that bella is a tbl_df, a tbl, and a data.frame. Objects in R can have more than one type!

Let’s make a quick plot of the popularity of Bella over time.

library(ggplot2)
ggplot(data = bella, aes(x = year, y = n)) + 
  geom_line(aes(color = sex))

Use mutate() to create new variables. Here, we define a variable called popular that is TRUE if the name was assigned to more than 1% of all babies in that year.

babynames <- babynames %>%
  mutate(popular = prop > 0.01)

Use rename() to rename a variable:

babynames <- babynames %>%
  rename(is_popular = popular)

Use the new variable (and filter()) to create a subset of the rows:

popular <- babynames %>%
  filter(is_popular)
nrow(popular)
head(popular)

What are the single most popular names of all time? To find them we can arrange() the table in descending order of the proportion of babies who got that name.

popular %>%
  arrange(desc(prop))
  1. What does the value of the prop column in the first line of output above mean? Write one sentence to explain what it means to someone who has never taken a statistics course.

  2. Choose a name, and find the year in which that name was used most frequently.

# SAMPLE SOLUTION:
babynames %>%
  filter(name == "Benjamin") %>%
  arrange(desc(n)) %>%
  head()
  1. What was the most popular name that year?
# SAMPLE SOLUTION:
babynames %>%
  filter(year == 1989) %>%
  arrange(desc(n)) %>%
  head()

Think about the following question, but do not attempt to actually solve it!

  1. In which year was that name given to M and F babies most equally (i.e. closest to a 50/50 split)?

The last single table verb is summarize() and it works a bit differently. Like all of the verbs, it takes a data table and returns a data table, but on its own, summarize() only returns a single row of output. In order to do this, it has to collapse entire columns into a single values. Thus, unless you tell summarize() how to condense the many pieces of information in a variable into a single value, it won’t know what to do.

Here, we summarize bella to find the greatest number of Bellas born in a single year (to a single gender):

bella %>%
  summarize(most_bellas = max(n))

Note the difference between this and simply sorting the data table.

When using summarize(), it is almost always a good idea to count the number of rows that you collapsed. The value of this may not be immediately obvious, but it serves as a handy sanity check that I promise you will save you from lots of mistakes. We can do this using the n() function.

Note: Do not confuse the function n(), which is always used inside a summarize() command, and the variable name n, which happens to be a column in the babynames table. It is a coincidence that these have the same name!

bella %>%
  summarize(num_rows = n(), most_bellas = max(n))
  1. What does the value of num_rows represent (in real-world terms) in the previous result? Explain what it means to your neighbor and argue about it until you agree.

The pipe operator

The pipe operator is provided by the magrittr package. It is automatically loaded by dplyr.

Note: The term pipe is an old allusion to the use of | in Unix to perform analogous operations.

The expression

mydata %>%
  verb(arguments)

is the same as:

verb(mydata, arguments)

In effect, function(x, args) = x %>% function(args).

In the grammar of data transformation, mydata is a noun, verb() is a verb, and arguments are the direct objects. The pipeline is closer to how we speak in English, while the nested syntax is more like Polish Notation.

This means that instead of having to do:

select(filter(mutate(data, arguments), arguments), arguments)

You can do:

data %>%
  mutate(arguments) %>%
  filter(arguments) %>%
  select(arguments)

Which is easier to read and understand?

  1. Re-write the following expression using the pipe operator:
arrange(select(filter(babynames, name == "Jordan"), -name), desc(prop))
# Sample solution
babynames %>%
  filter(name == "Jordan") %>%
  select(-name) %>%
  arrange(desc(prop))

Your learning

Please respond to the following prompt on Slack in the #mod-wrangling channel.

Prompt: after completing the Single Table Analysis lab, what questions do you still have about the five verbs and/or the pipe operator?