Vectors

In this lab, we will learn how to investigate the underlying data structures of R objects.

Goal: by the end of this lab, you will be able to determine the base type of any object.

Attributes

Objects in R can have attributes. Use the attributes() function to figure out what they are.

library(dplyr)  # provides the starwars data set
attributes(starwars)
$names
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
[11] "species"    "films"      "vehicles"   "starships" 

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87

$class
[1] "tbl_df"     "tbl"        "data.frame"

Unlike in many other programming languages, attributes in R – including the class of an object – are changeable!
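As a minimal sketch of what "changeable" means, the same mechanics can be tried on a plain vector before touching starwars. The attribute name "units" and the class name "my_made_up_class" are invented for illustration only.

```r
# A plain numeric vector starts with no attributes.
x <- c(1.5, 2.5, 3.5)
attributes(x)                  # NULL

# attr<- attaches arbitrary metadata; "units" is a made-up attribute name.
attr(x, "units") <- "parsecs"
attr(x, "units")               # "parsecs"

# The class is just another attribute, so it can be overwritten the same way.
attr(x, "class") <- "my_made_up_class"
class(x)                       # "my_made_up_class"

# The underlying storage type is unaffected by the class attribute.
typeof(x)                      # "double"
```

Changing the class attribute relabels the object but does not touch the data stored in it.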

  1. Use the assignment operator (<-) and the attr() function to change the class of starwars to sds_is_awesome.
# SAMPLE SOLUTION

attr(starwars, "class") <- "sds_is_awesome"
attributes(starwars)
$names
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
[11] "species"    "films"      "vehicles"   "starships" 

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87

$class
[1] "sds_is_awesome"
  2. Is starwars a data.frame now? How do you know? Try to select() a column.
# SAMPLE SOLUTION

starwars |>
  select(name)
  3. Once you’re done playing around with attributes, use rm(starwars) to delete the bad copy. Now run starwars again. Why does this work?
# SAMPLE SOLUTION

rm(starwars)
starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
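
One way to reason about why this works is sketched below with the built-in constant pi instead of starwars, so that the example needs no packages. Assigning to an existing name creates a new object in your workspace that masks the original; removing it uncovers the original again. The same mechanism presumably applies to starwars, whose original copy lives in the dplyr package.

```r
# Assigning to the name creates a NEW object in the current environment.
# The original in the base package is untouched, only masked.
pi <- 3
pi                               # 3: the workspace copy is found first
exists("pi", inherits = FALSE)   # TRUE: a copy lives in the workspace

# rm() deletes only the workspace copy.
rm(pi)
exists("pi", inherits = FALSE)   # FALSE: the workspace copy is gone
pi                               # 3.141593: R finds the original again
```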

S3 classes

S3 is the name of the simplest and most common object-oriented paradigm in R. We’ll learn more about S3 later. For now, we’ll explore common vector classes that are not atomic.

Note first that starwars has multiple classes, and these classes are ordered.

class(starwars)
[1] "tbl_df"     "tbl"        "data.frame"

The basic data type of starwars is a list, because all tbl_dfs and data.frames are lists.

typeof(starwars)
[1] "list"
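
The distinction between class and type can also be seen on a small base R data frame. This is a sketch with a made-up data frame df, not part of the lab's data.

```r
# A small data frame built with base R only.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

class(df)    # "data.frame": the S3 class attribute
typeof(df)   # "list": the underlying storage type
is.list(df)  # TRUE

# Stripping the class attribute leaves a plain named list of columns
# (the row.names attribute is still attached).
plain <- unclass(df)
class(plain)  # "list"
names(plain)  # "x" "y"
```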

When you type starwars at the console, what actually gets called is print(starwars). That is, the default action when you type the name of an object is to run the print() command on that object.

Thus, when you type starwars, R runs print(starwars). Because print() is a generic function and the first class of starwars is tbl_df, R looks for a method called print.tbl_df(). If it can’t find one, it looks for print.tbl(), then for print.data.frame(), and finally falls back to print.default().
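
The lookup order described above can be tried out on a toy object. In this sketch the class names "cadet" and "pilot" and both print methods are hypothetical, defined only for illustration.

```r
# Hypothetical classes, ordered from most to least specific.
x <- structure(list(name = "Wedge"), class = c("cadet", "pilot"))

# Only a method for the second class exists, so dispatch skips "cadet"
# and lands on print.pilot().
print.pilot <- function(x, ...) {
  cat("<pilot>", x$name, "\n")
  invisible(x)
}
print(x)

# Once a method for the first class exists, print(x) runs it instead.
print.cadet <- function(x, ...) {
  cat("<cadet>", x$name, "\n")
  invisible(x)
}
print(x)
```

Typing x at the console would go through the same dispatch, since it calls print(x) implicitly.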

In this case, there are print() methods defined for tbl and data.frame. Note the difference between:

starwars
print.data.frame(starwars)
  1. Examine the output of print.data.frame(starwars) and as.data.frame(starwars). Are they the same? How do the two calls differ in what is actually executed?

  2. Examine the output of as.numeric(starwars$name) and as.numeric(factor(starwars$name)). What is going on?

# SAMPLE SOLUTION

x <- factor(starwars$name)
attributes(x)
$levels
 [1] "Ackbar"                "Adi Gallia"            "Anakin Skywalker"     
 [4] "Arvel Crynyd"          "Ayla Secura"           "Bail Prestor Organa"  
 [7] "Barriss Offee"         "BB8"                   "Ben Quadinaros"       
[10] "Beru Whitesun Lars"    "Bib Fortuna"           "Biggs Darklighter"    
[13] "Boba Fett"             "Bossk"                 "C-3PO"                
[16] "Captain Phasma"        "Chewbacca"             "Cliegg Lars"          
[19] "Cordé"                 "Darth Maul"            "Darth Vader"          
[22] "Dexter Jettster"       "Dooku"                 "Dormé"                
[25] "Dud Bolt"              "Eeth Koth"             "Finis Valorum"        
[28] "Finn"                  "Gasgano"               "Greedo"               
[31] "Gregar Typho"          "Grievous"              "Han Solo"             
[34] "IG-88"                 "Jabba Desilijic Tiure" "Jango Fett"           
[37] "Jar Jar Binks"         "Jek Tono Porkins"      "Jocasta Nu"           
[40] "Ki-Adi-Mundi"          "Kit Fisto"             "Lama Su"              
[43] "Lando Calrissian"      "Leia Organa"           "Lobot"                
[46] "Luke Skywalker"        "Luminara Unduli"       "Mace Windu"           
[49] "Mas Amedda"            "Mon Mothma"            "Nien Nunb"            
[52] "Nute Gunray"           "Obi-Wan Kenobi"        "Owen Lars"            
[55] "Padmé Amidala"         "Palpatine"             "Plo Koon"             
[58] "Poe Dameron"           "Poggle the Lesser"     "Quarsh Panaka"        
[61] "Qui-Gon Jinn"          "R2-D2"                 "R4-P17"               
[64] "R5-D4"                 "Ratts Tyerel"          "Raymus Antilles"      
[67] "Rey"                   "Ric Olié"              "Roos Tarpals"         
[70] "Rugor Nass"            "Saesee Tiin"           "San Hill"             
[73] "Sebulba"               "Shaak Ti"              "Shmi Skywalker"       
[76] "Sly Moore"             "Tarfful"               "Taun We"              
[79] "Tion Medon"            "Wat Tambor"            "Watto"                
[82] "Wedge Antilles"        "Wicket Systri Warrick" "Wilhuff Tarkin"       
[85] "Yarael Poof"           "Yoda"                  "Zam Wesell"           

$class
[1] "factor"
as.numeric(x)
 [1] 46 15 62 21 44 54 10 64 12 53  3 84 17 33 30 35 82 38 86 56 13 34 14 43 45
[26]  1 50  4 83 51 61 52 27 55 37 69 70 68 81 73 60 75 20 11  5 65 25 29  9 48
[51] 40 41 26  2 71 85 57 49 31 19 18 59 47  7 24 23  6 36 87 22 42 78 39 63 80
[76] 72 74 32 77 66 76 79 28 67 58  8 16
as.numeric(starwars$name)
Warning: NAs introduced by coercion
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA NA NA NA NA NA NA NA NA NA
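
A smaller case may make the pattern in the output above easier to read. This sketch uses a made-up three-element vector: a factor stores integer codes that index into its levels, while a character vector has no numbers to recover.

```r
f <- factor(c("Yoda", "Ackbar", "Yoda"))

typeof(f)      # "integer": a factor is an integer vector with attributes
levels(f)      # "Ackbar" "Yoda": sorted, not in order of appearance
as.numeric(f)  # 2 1 2: the position of each value within levels(f)

# Character strings are converted by parsing their text as numbers,
# so names become NA (the warning is suppressed here).
suppressWarnings(as.numeric(c("Yoda", "12")))  # NA 12
```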

List-columns

Since data.frames are lists, their columns can be objects of arbitrary type. In particular, they can be lists.

The films column in starwars is a list-column. Each entry is a character vector of the movies in which the corresponding character has appeared.

films <- starwars |> 
  pull(films)
films

Note that the length() of films is 87, but that each entry in films is a character vector of arbitrary length. To see these lengths, we have to map() over the entries in films.

library(purrr)  # provides map_int()
length(films)
[1] 87
map_int(films, length)
 [1] 5 6 7 4 5 3 3 1 1 6 3 2 5 4 1 3 3 1 5 5 3 1 1 2 1 2 1 1 1 1 1 3 1 3 2 1 1 1
[39] 2 1 1 2 1 1 3 1 1 1 1 3 3 3 2 2 2 1 3 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 2 1 1
[77] 2 1 1 2 2 1 1 1 1 1 1
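
To see how a list-column is put together, here is a minimal base R sketch with a made-up two-row data frame. The vapply() call is offered as a rough base R analogue of map_int(films, length), not as what purrr does internally.

```r
# Toy data frame; the film lists are deliberately incomplete.
df <- data.frame(name = c("R2-D2", "Yoda"))

# A list with one element per row becomes a list-column.
df$films <- list(c("A New Hope", "Return of the Jedi"), "The Phantom Menace")

is.list(df$films)   # TRUE
length(df$films)    # 2: one entry per row
df$films[[1]]       # the character vector stored in row 1

# Apply length() to each entry, insisting on one integer per entry.
vapply(df$films, length, integer(1))  # 2 1
```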

nest() and unnest()

List-columns can be expanded by unnest(). This has the effect of lengthening the data frame (sort of like an accordion). Each row is repeated once for each element of its entry in the list-column.

Note that each row in starwars corresponds to one character, while films stores the list of films that character has appeared in. If we unnest() the data frame by expanding out the films, we get a data frame that is much longer, because each row now represents one character in one film.

library(tidyr)
starwars |>
  unnest(films)
# A tibble: 173 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
 2 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
 3 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
 4 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
 5 Luke Sk…    172    77 blond      fair       blue              19 male  mascu…
 6 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 7 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 8 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 9 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
10 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
# ℹ 163 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <chr>,
#   vehicles <list>, starships <list>

Note that films is no longer a list-column – it’s now a character vector.
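
In base R terms, the expansion above amounts roughly to repeating each row once per element of its list entry and then flattening the list. The sketch below uses a made-up two-row data frame and illustrates the idea only. It is not how tidyr implements unnest().

```r
# Toy data frame with a list-column (film lists deliberately incomplete).
df <- data.frame(name = c("R2-D2", "Yoda"))
df$films <- list(c("A New Hope", "Return of the Jedi"), "The Phantom Menace")

# Repeat each row index once per element of its list entry ...
idx <- rep(seq_len(nrow(df)), times = lengths(df$films))
long <- df[idx, "name", drop = FALSE]

# ... then flatten the list-column into an ordinary character vector.
long$films <- unlist(df$films)
rownames(long) <- NULL
long
```

The result has three rows, one per character-film pair, and films is now a plain character column.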

The nest() function performs the opposite operation of “rolling up” the data frame to create a new list-column.
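
A small round trip may help before experimenting on starwars. This sketch assumes the tidyr and tibble packages are installed and uses a made-up three-row tibble. Note that nest() stores small tibbles in the new list-column, rather than the character vectors found in starwars$films.

```r
library(tidyr)
library(tibble)

# Toy long-format data: one row per character-film pair.
long <- tibble(
  name = c("R2-D2", "R2-D2", "Yoda"),
  film = c("A New Hope", "Return of the Jedi", "The Phantom Menace")
)

# Roll the film column up into a list-column called films.
nested <- long |>
  nest(films = film)
nested            # one row per character

# Expanding the list-column again restores the long format.
nested |>
  unnest(films)
```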

  1. Experiment with list-columns by expanding and contracting the other list-columns in the starwars data frame.

Mapping over list columns

Suppose now we want to add the number of films for each character to the starwars data set. A simple mutate() like this will not throw an error, but also won’t do what we want.

oops <- starwars |>
  mutate(num_films = length(films)) |>
  arrange(desc(num_films)) |>
  select(name, num_films)
oops
# A tibble: 87 × 2
   name               num_films
   <chr>                  <int>
 1 Luke Skywalker            87
 2 C-3PO                     87
 3 R2-D2                     87
 4 Darth Vader               87
 5 Leia Organa               87
 6 Owen Lars                 87
 7 Beru Whitesun Lars        87
 8 R5-D4                     87
 9 Biggs Darklighter         87
10 Obi-Wan Kenobi            87
# ℹ 77 more rows

This just made all of the entries equal to length(films).

all(oops$num_films == length(starwars$films))
[1] TRUE

To get this right, we need to map() inside our mutate().

starwars |>
  mutate(num_films_actual = map_int(films, length)) |>
  arrange(desc(num_films_actual)) |>
  select(name, num_films_actual, films)
# A tibble: 87 × 3
   name           num_films_actual films    
   <chr>                     <int> <list>   
 1 R2-D2                         7 <chr [7]>
 2 C-3PO                         6 <chr [6]>
 3 Obi-Wan Kenobi                6 <chr [6]>
 4 Luke Skywalker                5 <chr [5]>
 5 Leia Organa                   5 <chr [5]>
 6 Chewbacca                     5 <chr [5]>
 7 Yoda                          5 <chr [5]>
 8 Palpatine                     5 <chr [5]>
 9 Darth Vader                   4 <chr [4]>
10 Han Solo                      4 <chr [4]>
# ℹ 77 more rows
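
The difference between the two attempts above comes down to whether the function sees the whole column or one entry at a time. Here is a base R sketch on a made-up two-row data frame, with lengths() standing in for map_int(films, length).

```r
# Toy data frame with a list-column of unequal lengths.
df <- data.frame(name = c("R2-D2", "Finn"))
df$films <- list(c("a", "b", "c"), "a")

# length() sees the whole column: a single number, recycled to every row.
df$wrong <- length(df$films)
df$wrong   # 2 2

# lengths() works entry by entry: one number per row.
df$right <- lengths(df$films)
df$right   # 3 1
```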

Engagement

Take a minute to think about what questions you still have about vectors. Review what questions have been posted (in the #questions channel) recently by other students and either:

  • respond (e.g., react, comment, clarify, or answer)
  • post a new question

Here is a prompt to prime your thinking:

Where did you get stuck in this lab?