In this lab, we will learn how to use map() to iterate a function over a vector of values.
library(tidyverse)
library(babynames)
Goal: by the end of this lab, you should be able to apply operations to all items in a vector.
Part of thinking like a data scientist is recognizing when to automate a task. Computers are really good at doing things repeatedly – but not good at knowing what to do. Your job is to tell the computer what to do!
We can save a lot of time by automating certain operations. In this lab we will discuss two main ways of iterating operations over a set of values.
Recall the most_popular_year() function that we defined previously:
most_popular_year <- function(name_arg) {
  babynames %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}
We wrote this function so that we could simplify the task of finding the most popular year for a specific name. If we only wanted to perform the task once, there would be no need to write the function, since we could just write the pipeline (that makes up the body of the function). If we only wanted to perform the task a few times, then just calling the function a few times would probably be OK:
most_popular_year(name_arg = "Larry")
most_popular_year(name_arg = "Moe")
most_popular_year(name_arg = "Curly")
map()
But what if we had a long list of names? We wouldn’t want to have to copy and paste that code repeatedly. The solution is to create a vector that contains the names we’re interested in, and then use the map() function to apply the function to each item in the vector. To read more about this function, type ?map at the console.
Note: many other programming languages have a similar map() operation.
For example, the people in Ben’s family have the following names:
bens_people <- c("Benjamin", "Cory", "Alice", "Arlo")
In order to call the most_popular_year() function on each of those names, we can write:
map(bens_people, most_popular_year)
The map() function will always return a list – a data structure that we have not talked about much. In order to flatten the list, we need to use one of the map_*() functions that specifies the return data type. In this case, since most_popular_year() returns a data.frame, we use the map_df() function.
map_df(bens_people, most_popular_year)
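To see how the suffix of a map_*() function controls the return type, here is a minimal sketch on a toy vector. The vector word_vec is a hypothetical example, not part of the lab's data; nchar() and toupper() are ordinary base R functions used only for illustration.

```r
library(purrr)

# Hypothetical toy input, not taken from the babynames data
word_vec <- c("apple", "fig", "banana")

# map() always returns a list, with one element per input item
lengths_list <- map(word_vec, nchar)

# map_int() returns an integer vector instead of a list
lengths_int <- map_int(word_vec, nchar)

# map_chr() returns a character vector
shouted <- map_chr(word_vec, toupper)
```

If the function returns something that does not match the requested type, the map_*() variant stops with an error rather than silently converting, which is often what you want.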
for loops
Logically, using map() is very much like using a for loop. for loops work just fine in R, but are not part of the en vogue R coding style. Additionally, there are situations in which for loops are less efficient than map(). If you come across a problem in R and you think you want to use a for loop, ask yourself if you really need to know about the index values. If not (which is nearly always), then you can do what you want without using a for loop. Instead, write a function to perform the task once, and then use map() to iterate that function over the items you want to apply it to.
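As a rough illustration of that advice, the sketch below does the same job twice: once with a for loop and once with map_dbl(). The vector radii and the function circle_area() are hypothetical names invented for this example.

```r
library(purrr)

# Hypothetical toy input
radii <- c(1, 2, 3)

# A function that performs the task once
circle_area <- function(r) {
  pi * r^2
}

# for loop version: we allocate storage and manage the index ourselves
areas_loop <- numeric(length(radii))
for (i in seq_along(radii)) {
  areas_loop[i] <- circle_area(radii[i])
}

# map version: no index bookkeeping, and the result type is stated up front
areas_map <- map_dbl(radii, circle_area)
```

Both versions produce the same numeric vector. The loop needs the index i only to store results, which is the kind of bookkeeping that map() handles for you.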
Use map() and the function that you wrote in Exercise 1 of the Writing Functions lab to find the five most common airport destinations for Delta, American, and United.
Use map() and the function that you wrote in Exercise 3 of the Writing Functions lab to find the five most common carriers to Bradley International, Los Angeles International, and San Francisco International airports.
Suppose we want to compute the top 10 most popular names. This function will do the trick:
top10 <- function(data) {
  data %>%
    group_by(name) %>%
    summarize(births = sum(n)) %>%
    arrange(desc(births)) %>%
    head(10)
}
top10(data = babynames)
But now suppose we want to apply this function to each decade. First, we can use group_by() to set a grouping variable. You might be tempted to then use map() to iterate the function top10() over the result, but this won’t work because group_by() returns a grouped tibble, which is not a list (or vector). Instead, we use the function group_modify(), which works like map(), but on a grouped tibble instead of a list (or vector). Note that the resulting data frame has a variable called decade that indicates the first year of the corresponding decade.
top_by_decade <- babynames %>%
  mutate(decade = 10 * floor(year / 10)) %>%
  group_by(decade) %>%
  group_modify(~top10(.x))
The .x here is a placeholder that indicates the data frame corresponding to each decade. Note (from help(group_modify)): in the formula, you can use . or .x to refer to the subset of rows of .tbl for the given group.
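To make the role of .x concrete, here is a small sketch of group_modify() on a toy tibble. The tibble scores and its columns team and points are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data: two teams with a few scores each
scores <- tibble(
  team = c("a", "a", "a", "b", "b"),
  points = c(3, 9, 5, 7, 2)
)

# For each team, .x is the subset of rows for that team
# (without the grouping column); we keep its highest-scoring row
best_by_team <- scores %>%
  group_by(team) %>%
  group_modify(~ .x %>% arrange(desc(points)) %>% head(1))
```

The result has one row per team, and group_modify() adds the grouping column team back to the pieces returned for each group.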
Since top10() returns a tbl_df, the top_by_decade object is also a tbl_df that has 10 rows for each decade, all stacked on top of one another.
Examine top_by_decade. How many rows does it have? Make sure that you understand what this data frame contains and how it got there.
nrow(top_by_decade)
One row of top_by_decade is:
  decade name  births
   <dbl> <chr>  <int>
1   1890 Mary  131630
What does this say about the name Mary?
Apply top10() to each score (i.e., set of 20 consecutive years).
You should be aware that purrr is a relatively new package. Many of the map_*() functions have rough equivalents in base R like lapply(), sapply(), mapply(), etc. Like other tidyverse tools, in my professional opinion purrr makes the syntax for these operations cleaner and more consistent, in a way that is conducive to learning and professional use. For more on this, see Jenny Bryan’s syntax comparison page.
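For a quick sense of how the base R and purrr versions line up, here is a minimal side-by-side sketch. The vector nums is a hypothetical toy input; the pairing of lapply() with map() and sapply() with map_dbl() is a rough correspondence, not an exact one.

```r
library(purrr)

# Hypothetical toy input
nums <- c(1, 4, 9)

# lapply() and map() both return a list
base_list <- lapply(nums, sqrt)
purrr_list <- map(nums, sqrt)

# sapply() simplifies the result when it can, so its return type
# depends on the data; map_dbl() returns a double vector or fails
base_vec <- sapply(nums, sqrt)
purrr_vec <- map_dbl(nums, sqrt)
```

In this simple case the paired results are the same. The difference shows up with less regular inputs, where sapply() may return a list or a matrix while map_dbl() reports an error.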
Please respond to the following prompt on Slack in the #mod-programming channel.
Prompt: Share a previous experience in which map() would have been useful to you. If you don’t have one or can’t think of one, think of a situation in your life (i.e., not about programming or data) where the concept of automating the execution of a task over a list would have been useful to you.