Iteration

In this lab, we will learn how to use map() to iterate a function over a vector of values.

library(tidyverse)
library(babynames)

Goal: by the end of this lab, you should be able to apply operations to all items in a vector.

Automating repetitive tasks

Part of thinking like a data scientist is recognizing when to automate a task. Computers are really good at doing things repeatedly – but not good at knowing what to do. Your job is to tell the computer what to do!

We can save a lot of time by automating certain operations. In this lab we will discuss two main ways of iterating operations over a set of values.

Applying a function to a vector of values

Recall the most_popular_year() function that we defined previously:

most_popular_year <- function(name_arg) {
  babynames %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}

We wrote this function so that we could simplify the task of finding the most popular year for a specific name. If we only wanted to perform the task once, there would be no need to write the function, since we could just write the pipeline (that makes up the body of the function). If we only wanted to perform the task a few times, then just calling the function a few times would probably be OK:

most_popular_year(name_arg = "Larry")
most_popular_year(name_arg = "Moe")
most_popular_year(name_arg = "Curly")

`map()`

But what if we had a long list of names? We wouldn’t want to have to copy and paste that code repeatedly. The solution is to create a vector that contains the names we’re interested in, and then use the map() function to apply the function to each item in the vector. To read more about this function, type ?map at the console.

Note: many other programming languages offer a similar operation, often also called map().

For example, the people in Ben’s family have the following names:

bens_people <- c("Benjamin", "Cory", "Alice", "Arlo")

In order to call the most_popular_year() function on each of those names, we can write:

map(bens_people, most_popular_year)

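The map() function always returns a list, a data structure that we have not talked about much. To get a simpler structure back, we use one of the map_*() functions, whose suffix specifies the return data type. In this case, since most_popular_year() returns a data.frame, we use the map_df() function, which stacks the results into a single data frame. (As of purrr 1.0.0, map_df() is superseded; map() followed by list_rbind() is the recommended equivalent, but map_df() still works.)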

map_df(bens_people, most_popular_year)
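
To make the difference between these return types concrete, here is a minimal sketch on toy inputs. The helpers square() and name_length() are hypothetical stand-ins (name_length() merely mimics most_popular_year() by returning a one-row tibble), and the .id argument is an optional extra that records which input produced each row. map_df() relies on dplyr being installed, as it is with the tidyverse.

```r
library(purrr)
library(tibble)

# Hypothetical helpers, for illustration only
square <- function(x) x^2
name_length <- function(name_arg) {
  tibble(n_chars = nchar(name_arg))  # one-row tibble, like most_popular_year()
}

values <- c(1, 2, 3)
bens_people <- c("Benjamin", "Cory", "Alice", "Arlo")

# map() returns a list with one element per input
squares_list <- map(values, square)

# map_dbl() returns a plain numeric vector instead
squares_dbl <- map_dbl(values, square)

# map_df() stacks one-row tibbles into a single tibble;
# naming the input with set_names() lets .id label each row
lengths_df <- map_df(set_names(bens_people), name_length, .id = "name")
```

Printing lengths_df should show four rows with a name column alongside n_chars. The same .id trick is handy with most_popular_year(), whose output contains only the year.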

A note on `for` loops

Logically, using map() is very much like using a for loop. for loops work just fine in R, but they are not part of the currently fashionable R coding style. Additionally, there are situations in which for loops are less efficient than map(), for example when the loop grows its result one element at a time instead of pre-allocating it. If you come across a problem in R and you think you want to use a for loop, ask yourself whether you really need to know the index values. If not (which is nearly always the case), then you can do what you want without a for loop. Instead, write a function that performs the task once, and then use map() to apply that function to each item you want to process.
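
As a rough side-by-side, the sketch below computes the same result with a for loop and with map_int(). The vector words is made up for illustration; the point is only that the map version needs no index variable and no pre-allocated result.

```r
library(purrr)

# Toy input, made up for illustration
words <- c("apple", "fig", "banana")

# for loop: pre-allocate the result, then fill it in by index
lengths_loop <- integer(length(words))
for (i in seq_along(words)) {
  lengths_loop[i] <- nchar(words[i])
}

# map_int(): same computation, with no index bookkeeping
lengths_map <- map_int(words, nchar)

# Compare the two results element by element
all(lengths_loop == lengths_map)
```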

Use map() and the function that you wrote in Exercise 1 of the Writing Functions lab to find the five most common airport destinations for Delta, American, and United.
Use map() and the function that you wrote in Exercise 3 of the Writing Functions lab to find the five most common carriers to Bradley International, Los Angeles International, and San Francisco International airports.

Applying a function to a grouped data frame

Suppose we want to compute the top 10 most popular names. This function will do the trick:

top10 <- function(data) {
  data %>%
    group_by(name) %>%
    summarize(births = sum(n)) %>%
    arrange(desc(births)) %>%
    head(10)
}
top10(data = babynames)

But now suppose we want to apply this function to each decade. First, we can use group_by() to set a grouping variable. You might be tempted to then use map() to iterate the function top10() over the result, but this won’t work: group_by() returns a grouped tibble, and map() iterates over the columns of a data frame rather than over its groups. Instead, we use the function group_modify(), which works like map() but applies the function once to each group of a grouped tibble. Note that the resulting data frame has a variable called decade that indicates the first year of the corresponding decade.

top_by_decade <- babynames %>%
  mutate(decade = 10 * floor(year / 10)) %>%
  group_by(decade) %>%
  group_modify(~top10(.x))

The .x here is a placeholder that indicates the data frame corresponding to each decade. Note (from help(group_modify)):

. or .x to refer to the subset of rows of .tbl for the given group.

Since top10() returns a tbl_df, the top_by_decade object is also a tbl_df that has 10 rows for each decade, all stacked on top of one another.
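
A minimal sketch of the same pattern on toy data may help here. The tibble toy and the helper top1() are made up for illustration. The map_chr() call is included only to show that map-style functions walk over the columns of a data frame, which is why group_modify() is the tool for per-group work.

```r
library(dplyr)
library(purrr)

# Toy data, made up for illustration
toy <- tibble(
  grp   = c("a", "a", "a", "b", "b"),
  value = c(3, 1, 2, 10, 20)
)

# Hypothetical helper: keep the row with the largest value
top1 <- function(data) {
  data %>%
    arrange(desc(value)) %>%
    head(1)
}

# map-style functions iterate over columns: this returns one class per column
column_classes <- map_chr(toy, class)

# group_modify() calls top1() once per group and stacks the results,
# adding the grouping variable back as the first column
top_by_grp <- toy %>%
  group_by(grp) %>%
  group_modify(~ top1(.x))
```

Here top_by_grp should contain one row per group, just as top_by_decade contains 10 rows per decade.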

Investigate top_by_decade. How many rows does it have? Make sure that you understand what this data frame contains and how it got there.

nrow(top_by_decade)

The 11th row of top_by_decade is:

  decade  name births
   <dbl> <chr>  <int>
1   1890  Mary 131630

What does this say about the name Mary?

Apply the function top10() to each score (i.e., set of 20 consecutive years).

Other paradigms

You should be aware that purrr is a relatively new package. Many of the map_*() functions have rough equivalents in base R, such as lapply(), sapply(), and mapply(). In my professional opinion, purrr (like other tidyverse tools) makes the syntax for these operations cleaner and more consistent, in a way that is conducive to both learning and professional use. For more on this, see Jenny Bryan’s syntax comparison page.
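
For a quick feel of the correspondence, the sketch below runs the same toy computation through base R and purrr. The input vector is made up for illustration, and the pairing shown (lapply() with map(), sapply() with map_dbl()) is only a rough equivalence, since sapply() guesses its output type while map_dbl() states it explicitly.

```r
library(purrr)

# Toy input, made up for illustration
values <- c(1, 4, 9)

# lapply() and map() both return a list
base_list  <- lapply(values, sqrt)
purrr_list <- map(values, sqrt)

# sapply() simplifies to a vector when it can;
# map_dbl() always returns a double vector or fails with an error
base_vec  <- sapply(values, sqrt)
purrr_vec <- map_dbl(values, sqrt)
```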

Your learning

Please respond to the following prompt on Slack in the #mod-programming channel.

Prompt: Share a previous experience in which map() would have been useful to you. If you don’t have one or can’t think of one, think of a situation in your life (i.e., not about programming or data) where the concept of automating the execution of a task over a list would have been useful to you.