Control Flow

In this lab, we will learn how to use ifelse() for vectorized control flow, and to avoid writing for loops.

Goal: by the end of this lab, you will be able to assign values conditionally and re-write a for loop using map().

ifelse()

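The if () ... else syntax handles control flow for a single condition: it decides which block of code runs. In contrast, ifelse() is a vectorized function: it takes a logical vector and returns a vector of the same length, choosing each element from one of two alternatives according to the corresponding condition. It is often useful inside mutate().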
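
To make the distinction concrete, here is a minimal base R sketch. The vector x and the helper describe_one() are made up for illustration and are not part of the lab data.

```r
x <- c(-2, 0, 3)

# if () ... else picks one branch based on a single TRUE/FALSE value
describe_one <- function(value) {
  if (value > 0) "positive" else "not positive"
}
one_label <- describe_one(x[3])

# ifelse() checks the condition element by element and returns a vector
# with the same length as the condition
labels <- ifelse(x > 0, "positive", "not positive")
labels
```

Here labels has three elements, one per element of x, while describe_one() only ever handles one value at a time.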

Note

Note that there is also a function called if_else() that does the same thing, but is more strict about data types. You can use either function.
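
As a rough illustration of what "more strict" means, the sketch below compares the two functions on a made-up condition vector. It assumes the dplyr package is installed. The exact error message from if_else() depends on the dplyr version, so the code only records that an error occurred.

```r
library(dplyr)

cond <- c(TRUE, FALSE, NA)

# ifelse() silently coerces mixed types: the 0 becomes the string "0"
loose <- ifelse(cond, "yes", 0)

# if_else() refuses to combine a character value with a number
strict <- tryCatch(
  if_else(cond, "yes", 0),
  error = function(e) "type error"
)

# if_else() also has a `missing` argument for handling NA conditions
with_missing <- if_else(cond, "yes", "no", missing = "unknown")
```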

In the starwars data set, most characters have a species. However, there are many different species.

starwars |>
  group_by(species) |>
  count() |>
  arrange(desc(n))
# A tibble: 38 × 2
# Groups:   species [38]
   species      n
   <chr>    <int>
 1 Human       35
 2 Droid        6
 3 <NA>         4
 4 Gungan       3
 5 Kaminoan     2
 6 Mirialan     2
 7 Twi'lek      2
 8 Wookiee      2
 9 Zabrak       2
10 Aleena       1
# ℹ 28 more rows

Suppose that we wanted to lump all of the non-human and non-droid species together. We can use ifelse() to create a new variable.

sw2 <- starwars |>
  mutate(
    species_update = ifelse(
      !species %in% c("Human", "Droid"), 
      "Other", 
      species
    )
  ) |>
  select(name, species, species_update)

Note the behavior around NA. Some characters have unknown species.

starwars |>
  filter(is.na(species))
# A tibble: 4 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Jek Tono…    180   110 brown      fair       blue              NA <NA>  <NA>  
2 Gregar T…    185    85 black      dark       brown             NA <NA>  <NA>  
3 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
4 Sly Moore    178    48 none       pale       white             NA <NA>  <NA>  
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Our previous construction classified everyone who is neither human nor droid as Other, including the four characters whose species is unknown. Those characters should arguably stay NA.

sw2 |>
  group_by(species_update) |>
  count() |>
  arrange(desc(n))
# A tibble: 3 × 2
# Groups:   species_update [3]
  species_update     n
  <chr>          <int>
1 Other             46
2 Human             35
3 Droid              6

By adding NA to the set of values we match against, the condition is FALSE for characters with an unknown species, so they keep their original (NA) value.

starwars |>
  mutate(
    species_update = ifelse(
      !species %in% c("Human", "Droid", NA),
      "Other", species
    )
  ) |>
  filter(is.na(species)) |>
  select(name, species, species_update)
# A tibble: 4 × 3
  name             species species_update
  <chr>            <chr>   <chr>         
1 Jek Tono Porkins <NA>    <NA>          
2 Gregar Typho     <NA>    <NA>          
3 Cordé            <NA>    <NA>          
4 Sly Moore        <NA>    <NA>          
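
The reason this works is that %in% never returns NA. The following base R sketch uses a small made-up species vector (not the starwars data) to show how each condition treats a missing value.

```r
species <- c("Human", "Droid", "Wookiee", NA)

# NA is not in the set, so %in% gives FALSE and the negation gives TRUE:
# the missing value would be relabeled "Other"
cond_without_na <- !species %in% c("Human", "Droid")

# With NA in the set, %in% gives TRUE for the missing value and the
# negation gives FALSE, so ifelse() falls through to the original value
cond_with_na <- !species %in% c("Human", "Droid", NA)

species_update <- ifelse(cond_with_na, "Other", species)
species_update
```

Note that ! applies to the whole %in% expression, because %in% binds more tightly than !.
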
  1. Create a new variable called is_bald and set it to FALSE if the character has hair of any color, TRUE if the character has no hair, and NA if the character is a droid.
# SAMPLE SOLUTION

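starwars <- starwars |>
  mutate(
    # %in% returns FALSE (not NA) when species is missing, so only droids get NA
    is_bald = ifelse(species %in% "Droid", NA, TRUE), 
    # any hair color other than "none" means the character is not bald
    is_bald = ifelse(is_bald & hair_color != "none", 
                     FALSE, is_bald)
  )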
  1. Use the following code to check your previous answer. Pay careful attention to NAs. Do you have them in all the right places?
starwars |>
  select(hair_color, is_bald) |>
  table(useNA = "always")

for loops

As noted in the book, there are many reasons to avoid writing loops in R. I have never written a repeat loop, and a while loop is only rarely necessary. Unless you need to access indices explicitly, you can and should rewrite a for loop as a map() statement.

Warning

I will consistently and strongly encourage you to eliminate for loops in your R code.

Vectorized operations

Many operations in R are vectorized already, so you often don’t need a loop at all.

Consider generating the first 10 numbers in some integer sequences. For the perfect squares, no loop is needed, because the exponentiation operator is vectorized. Recall that vectors are built into the fundamental design of R, so things are supposed to work this way!

Note

Built-in vectorization is one of the key ideas that separates R from other programming languages.

x <- 1:10

x^2
 [1]   1   4   9  16  25  36  49  64  81 100
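
For comparison, here is a sketch of the loop that the vectorized expression replaces. The name squares_loop is introduced only for this illustration.

```r
x <- 1:10

# Loop version: pre-allocate the result, then fill it one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one expression handles the whole vector
squares_vec <- x^2

identical(squares_loop, squares_vec)
```

Both versions produce the same values, but the vectorized one is shorter and leaves the iteration to R.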

However, consider generating the Fibonacci sequence. This can’t be vectorized in the same way, because each entry depends on the previous two entries. You could write a for loop.

fib <- c(1, 1)
for (i in 3:length(x)) {
  fib[i] <- fib[i-1] + fib[i-2]
}
fib
 [1]  1  1  2  3  5  8 13 21 34 55

If we had the Fibonacci sequence already, we could use the vectorized function lag() from dplyr to decompose the sequence.

fib_df <- tibble(
  fib, 
  prev_x = lag(fib), 
  prev_prev_x = lag(fib, 2)
)
fib_df
# A tibble: 10 × 3
     fib prev_x prev_prev_x
   <dbl>  <dbl>       <dbl>
 1     1     NA          NA
 2     1      1          NA
 3     2      1           1
 4     3      2           1
 5     5      3           2
 6     8      5           3
 7    13      8           5
 8    21     13           8
 9    34     21          13
10    55     34          21

But this won’t help us generate new values in the sequence.
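
As a small check of what the lagged columns show, the base R sketch below rebuilds the two shifted vectors by hand (mimicking lag() without requiring dplyr) and confirms that each entry from the third onward is the sum of the two before it. The sequence is typed in directly so the block runs on its own.

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
n <- length(fib)

# Shift by one and by two positions, padding the front with NA
prev_x <- c(NA, fib[-n])
prev_prev_x <- c(NA, NA, fib[seq_len(n - 2)])

# The recurrence holds wherever both lagged values exist
recurrence_holds <- all(fib[3:n] == (prev_x + prev_prev_x)[3:n])
recurrence_holds
```

This verifies an existing sequence. It cannot extend it, because the lagged values are only known once the sequence itself is known.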

Using map()

Instead, we can write a recursive function that computes the \(n\)th value in the sequence, and then map() that function over x.

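fibonacci <- function(x) {
  # base cases: the first two values of the sequence are both 1
  # (|| is the scalar "or" intended for use inside if ())
  if (x == 1 || x == 2) {
    return(1L)
  } else {
    # recursive case: sum of the two preceding values
    return(fibonacci(x - 1) + fibonacci(x - 2))
  }
}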

map_int(x, fibonacci)
 [1]  1  1  2  3  5  8 13 21 34 55
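
If purrr is not available, base R's vapply() plays a similar role to map_int(). The sketch below repeats the recursive definition so that it runs on its own. Keep in mind that this simple recursion recomputes the same values many times, so it becomes slow for large inputs.

```r
# Same recursive definition as above, repeated so this block is self-contained
fibonacci <- function(x) {
  if (x == 1 || x == 2) {
    return(1L)
  }
  fibonacci(x - 1) + fibonacci(x - 2)
}

x <- 1:10

# vapply() applies the function to each element and checks that every
# result is a single integer, much like map_int()
fib_base <- vapply(x, fibonacci, integer(1))
fib_base
```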

Choosing a paradigm

Generally, when you have a vector x as input, and you want to produce a vector y of the same length as output, you can use one of two paradigms:

  • If the operation can be vectorized, write a function that will take the whole input vector x and compute the whole y vector at once. I suspect that this will be the most efficient method in nearly every case.
  • If the operation can’t be vectorized, write a function that will compute a single value of y for a single value of x, and then map() that function over x.

Only if NEITHER of these is possible should you write a for loop!
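
The two paradigms might look like the following sketch, which assumes the purrr package is installed. Both functions are hypothetical examples rather than part of the lab. The second one, collatz_steps(), counts how many steps the "halve if even, otherwise triple and add one" rule takes to reach 1. It handles one value at a time and uses a while loop internally, since the number of steps is not known in advance.

```r
library(purrr)

x <- c(3, 8, 15)

# Paradigm 1: the operation is vectorized, so the function takes the
# whole vector and returns the whole result at once
double_it <- function(v) {
  v * 2
}
doubled <- double_it(x)

# Paradigm 2: the function only makes sense for a single value,
# so we map() it over x
collatz_steps <- function(n) {
  steps <- 0L
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1L
  }
  steps
}
steps <- map_int(x, collatz_steps)
steps
```

Neither call site contains a for loop: the first relies on R's built-in vectorization and the second on map_int().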

Recall that we saw map() previously in the context of list-columns.

  1. Use the vectorized nchar() function to compute the number of characters in each character’s name, without writing any kind of loop.
# SAMPLE SOLUTION
nchar(starwars$name)
 [1] 14  5  5 11 11  9 18  5 17 14 16 14  9  8  6 21 14 16  4  9  9  5  5 16  5
[26]  6 10 12 21  9 12 11 13 13 13 12 10  8  5  7 13 14 10 11 11 12  8  7 14 10
[51] 12  9  9 10 11 11  8 10 12  5 11 17 15 13  5  5 19 10 10 15  7  7 10  6 10
[76]  8  8  8  7 15  9 10  4  3 11  3 14
  1. Now compute the same output, but using map_int() and nchar(). Make sure you understand the difference between these two approaches.
# SAMPLE SOLUTION

map_int(starwars$name, nchar)
 [1] 14  5  5 11 11  9 18  5 17 14 16 14  9  8  6 21 14 16  4  9  9  5  5 16  5
[26]  6 10 12 21  9 12 11 13 13 13 12 10  8  5  7 13 14 10 11 11 12  8  7 14 10
[51] 12  9  9 10 11 11  8 10 12  5 11 17 15 13  5  5 19 10 10 15  7  7 10  6 10
[76]  8  8  8  7 15  9 10  4  3 11  3 14
  1. Now use map_int() and length() to compute a numeric vector of the number of vehicles associated with each character.
# SAMPLE SOLUTION

map_int(starwars$vehicles, length)
 [1] 2 0 0 0 1 0 0 0 0 1 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
[39] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
[77] 0 1 0 0 0 0 0 0 0 0 0
  1. Use map() and nchar() to compute the total number of characters in the names of the starships associated with each character. For example, Luke Skywalker primarily flew an X-wing fighter, but also briefly piloted an Imperial shuttle in Return of the Jedi. His starships list contains "X-wing" and "Imperial shuttle", so the total number of characters is 6 + 16 = 22.
# SAMPLE SOLUTION

map_int(starwars$starships, ~sum(nchar(.x)))
 [1] 22  0  0 15  0  0  0  0  6 96 53  0 33 33  0  0  6  6  0  0  7  0  0 17  0
[26]  0  0  6  0 17  0  0  0 48  0  0  0 20  0  0  0  0  8  0  0  0  0  0  0  0
[51]  0  0  0  0  0  0 16  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[76]  0  0 24  0  0  0  0  0  0  6  0  0
# SAMPLE SOLUTION

starwars |>
  pull(starships) |>
  map(nchar) |>
  map_int(sum)
 [1] 22  0  0 15  0  0  0  0  6 96 53  0 33 33  0  0  6  6  0  0  7  0  0 17  0
[26]  0  0  6  0 17  0  0  0 48  0  0  0 20  0  0  0  0  8  0  0  0  0  0  0  0
[51]  0  0  0  0  0  0 16  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[76]  0  0 24  0  0  0  0  0  0  6  0  0
  1. Rewrite the following for loop as a call to map(). The output should be a list of length 2.
mpg_by_year <- group_split(mpg, year)

mods <- list()

for (i in seq_along(mpg_by_year)) {
  mods[[i]] <- lm(hwy ~ displ + cyl, data = mpg_by_year[[i]])
}
# SAMPLE SOLUTION

map(mpg_by_year, ~lm(hwy ~ displ + cyl, data = .x))
[[1]]

Call:
lm(formula = hwy ~ displ + cyl, data = .x)

Coefficients:
(Intercept)        displ          cyl  
   35.95548     -3.67442     -0.08285  


[[2]]

Call:
lm(formula = hwy ~ displ + cyl, data = .x)

Coefficients:
(Intercept)        displ          cyl  
    40.5275      -0.4355      -2.5437  
# SAMPLE SOLUTION
map(mpg_by_year, lm, formula = "hwy ~ displ + cyl")
[[1]]

Call:
.f(formula = "hwy ~ displ + cyl", data = .x[[i]])

Coefficients:
(Intercept)        displ          cyl  
   35.95548     -3.67442     -0.08285  


[[2]]

Call:
.f(formula = "hwy ~ displ + cyl", data = .x[[i]])

Coefficients:
(Intercept)        displ          cyl  
    40.5275      -0.4355      -2.5437  

Engagement

Prompt: What #questions do you still have about control flow and/or loops?