Is starwars a data.frame now? How do you know? Try to select() a column.
# SAMPLE SOLUTION
starwars |>
  select(name)
Once you’re done playing around with attributes, use rm(starwars) to delete the bad copy. Now run starwars again. Why does this work?
# SAMPLE SOLUTION
rm(starwars)
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
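One way to explore why this works is to ask R where it finds the name starwars before and after the rm() call. The sketch below assumes the bad copy was assigned in the global environment, and it uses a placeholder string as a hypothetical stand-in for whatever modified object you created.

```r
library(dplyr)

# Hypothetical stand-in for the modified copy: any object assigned to the
# name `starwars` in the global environment masks the one shipped with dplyr.
starwars <- "not a tibble anymore"
find("starwars")   # every attached environment that holds this name

# Remove the global copy; the packaged data set was never touched.
rm(starwars)
find("starwars")
class(starwars)
```

After rm(), the only remaining match should be the copy attached by the dplyr package, which is why typing starwars prints the original tibble again.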
S3 classes
S3 is the name of the simplest and most common object-oriented paradigm in R. We’ll learn more about S3 later. For now, we’ll explore common vector classes that are not atomic.
Note first that starwars has multiple classes, and these classes are ordered.
class(starwars)
[1] "tbl_df" "tbl" "data.frame"
The basic data type of starwars is a list, because all tbl_dfs and data.frames are lists.
typeof(starwars)
[1] "list"
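The distinction between class and underlying type can be checked directly. This is a minimal sketch using only base R functions on the starwars tibble; the name plain is introduced here purely for illustration.

```r
library(dplyr)

# The class attribute is an ordered character vector layered on top of the list.
attr(starwars, "class")
inherits(starwars, "data.frame")   # is "data.frame" one of the classes?
is.list(starwars)                  # is the underlying storage a list?

# Stripping the class attribute leaves a plain named list of columns.
plain <- unclass(starwars)
class(plain)
length(plain)                      # one element per column
```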
When you type starwars at the console, what actually gets called is print(starwars). That is, the default action when you type the name of an object is to run the print() command on that object.
Thus, when you type starwars, R runs print(starwars). Because print() is a generic function and starwars is a tbl_df, R first looks for a method called print.tbl_df(). If it can’t find one, it looks for print.tbl(), then for print.data.frame(), and finally for print.default().
In this case, there are print() methods defined for tbl and data.frame. Note the difference between:
starwars
print.data.frame(starwars)
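The lookup order described above can be imitated with a toy generic. This is only a sketch: describe() and its methods are hypothetical names made up for this example, and obj is an empty list given the same class vector as a tibble.

```r
# A toy generic: UseMethod() dispatches on the class vector of its argument.
describe <- function(x, ...) UseMethod("describe")

describe.data.frame <- function(x, ...) "data.frame method"
describe.default    <- function(x, ...) "default method"

# Hypothetical object whose class vector mimics the ordering used by tibbles.
obj <- structure(list(), class = c("tbl_df", "tbl", "data.frame"))

# No describe.tbl_df() or describe.tbl() exists yet, so R falls through
# to describe.data.frame().
describe(obj)

# Defining a more specific method changes which one is found first.
describe.tbl <- function(x, ...) "tbl method"
describe(obj)

# Objects with no matching class end up at the default method.
describe(1:3)
```

Calling a method by its full name, as in print.data.frame(starwars), skips this search and runs that one method regardless of the class vector.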
Examine the output of print.data.frame(starwars) and as.data.frame(starwars). Are they the same? What is the difference in what is actually executed in each case?
Examine the output of as.numeric(starwars$name) and as.numeric(factor(starwars$name)). What is going on?
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA NA NA NA NA NA NA NA NA NA
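A smaller example may make the contrast easier to see. The vector x below is hypothetical toy data standing in for starwars$name, and the warning from the first conversion is suppressed only to keep the output short.

```r
# Hypothetical toy vector standing in for starwars$name.
x <- c("Yoda", "Chewbacca", "Yoda")

# Character strings that do not look like numbers become NA (with a warning).
suppressWarnings(as.numeric(x))

# A factor stores integer codes that index into its sorted levels,
# and as.numeric() returns those codes rather than the labels.
f <- factor(x)
levels(f)
as.numeric(f)
```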
List-columns
Since data.frames are lists, their columns can be objects of arbitrary type. In particular, they can be lists.
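As a quick illustration, a list-column can be built by hand. The tibble below is made-up toy data, not part of starwars, and the name characters is chosen only for this sketch.

```r
library(tibble)

# Hypothetical toy data: the `films` column is a list, so each row can hold
# a vector of a different length.
characters <- tibble(
  name  = c("A", "B"),
  films = list(c("Film 1", "Film 2"), "Film 3")
)

characters
typeof(characters$films)
```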
The films column in starwars is a list-column. Each entry is a character vector naming the movies that the corresponding character has appeared in.
films <- starwars |>
  pull(films)
films
Note that the length() of films is 87, but that each entry in films is a character vector of arbitrary length. To see these lengths, we have to map() over the entries in films.
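A minimal sketch of that mapping step, assuming the purrr package is available. map_int() is used here instead of plain map() so that the result is an integer vector rather than a list.

```r
library(dplyr)
library(purrr)

films <- starwars |>
  pull(films)

length(films)             # one entry per character (row)
map_int(films, length)    # number of films for each character
```

Base R's lengths(films) should give the same counts without purrr.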
List-columns can be expanded by unnest(). This has the effect of lengthening the data frame (sort of like an accordion). Each row is repeated once for every element of its entry in the list-column.
Note that each row in starwars corresponds to one character, while films stores the list of films that character has appeared in. If we unnest() the data frame by expanding out the films, we get a data frame that is much longer, because each row now represents one character in one film.
library(tidyr)
starwars |>
  unnest(films)
# A tibble: 173 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Luke Sk… 172 77 blond fair blue 19 male mascu…
3 Luke Sk… 172 77 blond fair blue 19 male mascu…
4 Luke Sk… 172 77 blond fair blue 19 male mascu…
5 Luke Sk… 172 77 blond fair blue 19 male mascu…
6 C-3PO 167 75 <NA> gold yellow 112 none mascu…
7 C-3PO 167 75 <NA> gold yellow 112 none mascu…
8 C-3PO 167 75 <NA> gold yellow 112 none mascu…
9 C-3PO 167 75 <NA> gold yellow 112 none mascu…
10 C-3PO 167 75 <NA> gold yellow 112 none mascu…
# ℹ 163 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <chr>,
# vehicles <list>, starships <list>
Note that films is no longer a list-column – it’s now a character vector.
The nest() function performs the opposite operation of “rolling up” the data frame to create a new list-column.
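The round trip might look like the sketch below. It keeps only the name and films columns to make the result easy to inspect; the names long and rolled are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Keep only two columns so the round trip is easy to inspect.
long <- starwars |>
  select(name, films) |>
  unnest(films)

# nest() collects the `films` values for each name into a list-column;
# each entry is a one-column tibble rather than a bare character vector.
rolled <- long |>
  nest(films = films)

nrow(long)
nrow(rolled)
```

Note that the rolled-up column is not identical to the original: nest() stores small tibbles, whereas the original films column stores character vectors.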
Experiment with list-columns by expanding and contracting the other list-columns in the starwars data frame.
Mapping over list columns
Suppose now we want to add the number of films for each character to the starwars data set. A simple mutate() will not throw an error, but it also won’t do what we want.
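The code for this step is not reproduced here, so the following is a sketch under assumptions: the naive attempt is taken to be a direct call to length(films) inside mutate(), and the column name num_films is hypothetical. The second pipeline maps over the list-column instead.

```r
library(dplyr)
library(purrr)

# Naive attempt (an assumption about what "a simple mutate()" means here):
# length(films) is the length of the whole column, so every row gets
# the same value.
naive <- starwars |>
  mutate(num_films = length(films)) |>
  select(name, num_films, films)

# Mapping over the list-column computes one length per entry.
actual <- starwars |>
  mutate(num_films_actual = map_int(films, length)) |>
  select(name, num_films_actual, films) |>
  arrange(desc(num_films_actual))

naive
actual
```

The table that follows has the shape produced by the second pipeline.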
# A tibble: 87 × 3
name num_films_actual films
<chr> <int> <list>
1 R2-D2 7 <chr [7]>
2 C-3PO 6 <chr [6]>
3 Obi-Wan Kenobi 6 <chr [6]>
4 Luke Skywalker 5 <chr [5]>
5 Leia Organa 5 <chr [5]>
6 Chewbacca 5 <chr [5]>
7 Yoda 5 <chr [5]>
8 Palpatine 5 <chr [5]>
9 Darth Vader 4 <chr [4]>
10 Han Solo 4 <chr [4]>
# ℹ 77 more rows
Engagement
Take a minute to think about what questions you still have about vectors. Review what questions have been posted (in the #questions channel) recently by other students and either:
respond (e.g., react, comment, clarify, or answer)