Subsetting

In this lab, we will learn how to subset vectors and lists.

Goal: by the end of this lab, you will understand why different subset operators return objects of different types.

Operators

The three main subset operators are [, [[ and $. In addition, you will see functions that perform subsetting operations like:

  • dplyr::filter(): the tidyverse way to select a subset of the rows of a data frame.
  • dplyr::select(): the tidyverse way to select a subset of the columns of a data frame.
  • dplyr::pull(): a tidyverse function analogous to [[.data.frame.
  • purrr::pluck(): a tidyverse function analogous to [[.
Note

Technically, [, [[ and $ are also functions.
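
Because the operators are ordinary functions, they can also be called by name using backticks. The following is a minimal sketch on a made-up list x (not part of the lab data) that compares the operator form with the function-call form:

```r
# Toy list, invented for illustration
x <- list(a = 1:3, b = "hello")

# Operator form and function-call form give identical results
identical(x["a"], `[`(x, "a"))      # TRUE
identical(x[["a"]], `[[`(x, "a"))   # TRUE
identical(x$a, `$`(x, a))           # TRUE
```

You will rarely write code this way, but it helps explain why these operators can be passed to functions such as lapply().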

It’s probably best to avoid these other similar functions:

  • subset(): the base R way to select a subset of the rows of a data frame.
  • rvest::pluck(): similar to purrr::pluck() but not as good.
  • magrittr::extract(): a wrapper around [.
  • magrittr::extract2(): a wrapper around [[.

This post is helpful. If x is a list, then: [embedded image not available]

I like this one, too: [embedded image not available]
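
For a small list, the return types can also be checked directly in the console. This is a minimal sketch using a made-up list x, not anything from the lab data:

```r
# Toy list, invented for illustration
x <- list(a = 1:3, b = "hello", c = list(TRUE))

class(x[1])     # "list": `[` keeps the container
class(x[[1]])   # "integer": `[[` extracts the element itself
class(x$a)      # "integer": `$` extracts by name, like `[[`
length(x[1:2])  # 2: `[` can return several elements at once
```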

Indexing

Key idea: there are six different ways to index a vector (or list).

The three most commonly used ways are:

  • with a numeric vector that selects the elements by index
  • with a logical vector that selects the elements that are TRUE
  • with a character vector that selects the elements by name
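
The sketch below tries each kind of index on a made-up named vector. It assumes that the other three ways the key idea refers to are negative integers, zero, and an empty index, as described in Advanced R. The list above does not name them, so treat that part as an assumption.

```r
# Toy named vector, invented for illustration
x <- c(a = 10, b = 20, c = 30, d = 40)

# The three common ways (all three select elements a and c)
x[c(1, 3)]                      # numeric: by position
x[c(TRUE, FALSE, TRUE, FALSE)]  # logical: keep where TRUE
x[c("a", "c")]                  # character: by name

# Three less common ways (assumed, see the note above)
x[-c(2, 4)]  # negative integers: drop elements 2 and 4
x[0]         # zero: a zero-length vector of the same type
x[]          # nothing: the whole vector
```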

Note that indexing by a logical vector will generally return an object that is no longer than the original, whereas indexing by a numeric vector can return an object of any length, because positions may be repeated.
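
A quick way to see the difference in lengths is to try both kinds of index on a short vector. This sketch uses a made-up vector x:

```r
# Toy vector, invented for illustration
x <- c(10, 20, 30)

# Logical index of the same length: the result can only shrink
length(x[c(TRUE, FALSE, TRUE)])  # 2

# A shorter logical index is recycled, so the result is at most length(x)
length(x[TRUE])                  # 3

# Numeric index: positions may be repeated, or there may be none at all
length(x[c(1, 1, 1, 2, 3, 3)])   # 6
length(x[integer(0)])            # 0
```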

Using dplyr, we would normally find the blue-eyed characters using filter().

starwars |>
  filter(eye_color == "blue")

Instead, we’ll use the base R functionality for subsetting vectors. First, we compute a logical vector that indicates whether each character has blue eyes.

lgl <- starwars$eye_color == "blue"
lgl
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[13]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
[37] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
[61]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE
  1. Use [ and lgl to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION

starwars[lgl, ]
# A tibble: 19 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172  77   blond      fair       blue            19   male  mascu…
 2 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 3 Beru Wh…    165  75   brown      light      blue            47   fema… femin…
 4 Anakin …    188  84   blond      fair       blue            41.9 male  mascu…
 5 Wilhuff…    180  NA   auburn, g… fair       blue            64   male  mascu…
 6 Chewbac…    228 112   brown      unknown    blue           200   male  mascu…
 7 Jek Ton…    180 110   brown      fair       blue            NA   <NA>  <NA>  
 8 Lobot       175  79   none       light      blue            37   male  mascu…
 9 Mon Mot…    150  NA   auburn     fair       blue            48   fema… femin…
10 Qui-Gon…    193  89   brown      fair       blue            92   male  mascu…
11 Finis V…    170  NA   blond      fair       blue            91   male  mascu…
12 Ric Olié    183  NA   brown      fair       blue            NA   male  mascu…
13 Adi Gal…    184  50   none       dark       blue            NA   fema… femin…
14 Mas Ame…    196  NA   none       blue       blue            NA   male  mascu…
15 Cliegg …    183  NA   brown      fair       blue            82   male  mascu…
16 Luminar…    170  56.2 black      yellow     blue            58   fema… femin…
17 Barriss…    166  50   black      yellow     blue            40   fema… femin…
18 Jocasta…    167  NA   white      fair       blue            NA   fema… femin…
19 Tarfful     234 136   brown      brown      blue            NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Alternatively, we could use the which() function to return an integer vector of the corresponding indices.

num <- which(starwars$eye_color == "blue")
num
 [1]  1  6  7 11 12 13 18 25 27 31 33 38 54 58 61 63 64 73 79
  1. Use [ and num to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION

starwars[num, ]
# A tibble: 19 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172  77   blond      fair       blue            19   male  mascu…
 2 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 3 Beru Wh…    165  75   brown      light      blue            47   fema… femin…
 4 Anakin …    188  84   blond      fair       blue            41.9 male  mascu…
 5 Wilhuff…    180  NA   auburn, g… fair       blue            64   male  mascu…
 6 Chewbac…    228 112   brown      unknown    blue           200   male  mascu…
 7 Jek Ton…    180 110   brown      fair       blue            NA   <NA>  <NA>  
 8 Lobot       175  79   none       light      blue            37   male  mascu…
 9 Mon Mot…    150  NA   auburn     fair       blue            48   fema… femin…
10 Qui-Gon…    193  89   brown      fair       blue            92   male  mascu…
11 Finis V…    170  NA   blond      fair       blue            91   male  mascu…
12 Ric Olié    183  NA   brown      fair       blue            NA   male  mascu…
13 Adi Gal…    184  50   none       dark       blue            NA   fema… femin…
14 Mas Ame…    196  NA   none       blue       blue            NA   male  mascu…
15 Cliegg …    183  NA   brown      fair       blue            82   male  mascu…
16 Luminar…    170  56.2 black      yellow     blue            58   fema… femin…
17 Barriss…    166  50   black      yellow     blue            40   fema… femin…
18 Jocasta…    167  NA   white      fair       blue            NA   fema… femin…
19 Tarfful     234 136   brown      brown      blue            NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  1. Compute the length of lgl and num. Are they the same? Why or why not?

  2. Make sure that you understand the difference between what is happening in the first two exercises.
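
One further difference between the two kinds of index shows up when the data contain missing values. The sketch below uses a made-up character vector (the names eye, is_blue and idx_blue are invented here) to show that an NA in a logical index produces an NA in the result, whereas which() silently drops it:

```r
# Toy vector with a missing value, invented for illustration
eye <- c("blue", NA, "brown", "blue")

is_blue  <- eye == "blue"         # TRUE NA FALSE TRUE
idx_blue <- which(eye == "blue")  # 1 4

eye[is_blue]   # "blue" NA "blue": the NA in the index yields an NA element
eye[idx_blue]  # "blue" "blue": which() drops NA as well as FALSE
```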

Resampling

In addition to subsetting, you can also use index vectors to resample, or even oversample, a vector.

For example, we could double the previous results by repeating the index vector.

# note error!
starwars[c(lgl, lgl), ]
Error in `starwars[c(lgl, lgl), ]`:
! Can't subset rows with `c(lgl, lgl)`.
✖ Logical subscript `c(lgl, lgl)` must be size 1 or 87, not 174.
# works, but not necessarily as intended -- output suppressed
# as.data.frame(starwars)[c(lgl, lgl), ]

# no error
starwars[c(num, num), ]
# A tibble: 38 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 3 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 4 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 5 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 6 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 7 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
 8 Lobot       175    79 none       light      blue            37   male  mascu…
 9 Mon Mot…    150    NA auburn     fair       blue            48   fema… femin…
10 Qui-Gon…    193    89 brown      fair       blue            92   male  mascu…
# ℹ 28 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
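
The same ideas can be tried on a plain vector. This is a minimal sketch with a made-up vector x: it shows why repeating an integer index duplicates elements while an over-long logical index does not, and how sample() can generate an index for bootstrap-style resampling. The seed and sample size are arbitrary choices.

```r
# Toy vector, invented for illustration
x <- c("a", "b", "c")

# Repeating an integer index repeats the selected elements
x[c(1, 3, 1, 3)]  # "a" "c" "a" "c"

# A logical index longer than x does not repeat anything:
# positions 4 and 6 do not exist, so they come back as NA
x[c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)]  # "a" "c" NA NA

# Bootstrap-style resample: draw positions with replacement
set.seed(1)  # arbitrary seed so the draw is reproducible
idx <- sample(length(x), size = 5, replace = TRUE)
x[idx]
```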

Application

Remember that a data frame is just a list of vectors (of the same length)! Thus, the subsetting rules governing lists also apply to data frames.
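
The list-like nature of a data frame can be checked directly. This sketch uses a small made-up base R data frame df rather than the starwars tibble:

```r
# Toy data frame, invented for illustration
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  stringsAsFactors = FALSE
)

is.list(df)  # TRUE
length(df)   # 2: one list element per column
names(df)    # "name" "height"

class(df[1])    # "data.frame": `[` keeps the container, as with a list
class(df[[1]])  # "character": `[[` extracts the column vector
```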

  1. What is the type of the result of starwars["name"]?
# SAMPLE SOLUTION

class(starwars["name"])
[1] "tbl_df"     "tbl"        "data.frame"
  1. What is the type of the result of starwars[["name"]]?
# SAMPLE SOLUTION

class(starwars[["name"]])
[1] "character"
  1. What is the type of the result of starwars$name?
# SAMPLE SOLUTION

class(starwars$name)
[1] "character"

Storing the names of variables in vectors can be counter-intuitive. Note that [ will work, $ will not, and [[ will work only with vectors of length 1.

vars <- c("name", "height")

# works
starwars[, vars]
# A tibble: 87 × 2
   name               height
   <chr>               <int>
 1 Luke Skywalker        172
 2 C-3PO                 167
 3 R2-D2                  96
 4 Darth Vader           202
 5 Leia Organa           150
 6 Owen Lars             178
 7 Beru Whitesun Lars    165
 8 R5-D4                  97
 9 Biggs Darklighter     183
10 Obi-Wan Kenobi        182
# ℹ 77 more rows
# doesn't work
starwars[[vars]]
Error in `starwars[[vars]]`:
! Can't extract column with `vars`.
✖ Subscript `vars` must be size 1, not 2.
# doesn't work
starwars$vars
Warning: Unknown or uninitialised column: `vars`.
NULL

The behavior is also different when the vector of names is of length one.

my_var <- c("name")

# works
starwars[, my_var]

# works!
starwars[[my_var]]

# doesn't work
starwars$my_var
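
The reason $ fails here is that it does not evaluate what comes after it: it looks for an element literally named my_var. This is a minimal sketch on a made-up list x:

```r
# Toy list, invented for illustration
x <- list(name = "Luke", height = 172)
my_var <- "name"

x[[my_var]]  # "Luke": my_var is evaluated to "name" first
x$my_var     # NULL: looks for an element literally called "my_var"
x$name       # "Luke"
```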

These inconsistencies are among the many reasons to use select() and its selection helpers instead.

?tidyselect::select_helpers
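
As one possible illustration of the select() approach, the sketch below passes a character vector of column names through all_of() and any_of(), two selection helpers that dplyr re-exports from tidyselect. The name "not_a_column" is made up to show the lenient behavior of any_of().

```r
library(dplyr)

vars <- c("name", "height")

# all_of() states explicitly that vars is a character vector of column names
starwars |>
  select(all_of(vars))

# any_of() is the lenient variant: names that do not exist are skipped
starwars |>
  select(any_of(c(vars, "not_a_column")))
```

The same code works whether vars has one element or several, which avoids the length-dependent behavior seen with [[.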
  1. Why does starwars[[vars]] throw an error, but starwars[[my_var]] works? What is the logical inconsistency in the first case?

Engagement

Take a minute to think about what questions you still have about subsetting. Review the questions that other students have recently posted in the #questions channel and either:

  • respond (e.g., react, comment, clarify, or answer)
  • post a new question