Vectors

Mini-Lecture 2

Ben Baumer

Smith College

2024-09-05

Slack review

Garbage collection

When does R decide that an object is no longer needed and can be garbage collected?

  • Garbage collection is lazy
  • Your OS reclaims memory from R only when it needs it
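A minimal base R sketch of what "lazy" means here: rm() removes a name, not the memory behind it, and that memory is reclaimed at some later collection. The vector size is arbitrary, and calling gc() by hand is for illustration only.

```r
# Allocate a large numeric vector (about 80 MB; size chosen arbitrarily)
x <- numeric(1e7)
size_x <- object.size(x)
print(size_x, units = "MB")

# rm() deletes the binding; the memory is now merely eligible for collection
rm(x)

# R would collect on its own the next time it needs memory;
# forcing a collection here is for demonstration only
invisible(gc())
```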

Tibbles

I can’t seem to figure out why there is such a difference in the memory usage of a tibble vs. a data frame. Is it because a data frame is more like a proper file, whereas a tibble is like a preview?

library(tidyverse)
library(lobstr)
obj_size(as_tibble(iris)) - obj_size(iris)
136 B
obj_size(attr(iris, "class"))
120 B
obj_size(attr(as_tibble(iris), "class"))
256 B
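The 136 B gap matches 256 B - 120 B, the difference between the two class attributes. A short sketch (assuming the tibble package is installed) shows what those attributes contain:

```r
library(tibble)

# A data frame carries a single class string
class(iris)
# "data.frame"

# A tibble carries three, which accounts for the larger class attribute
class(as_tibble(iris))
# "tbl_df" "tbl" "data.frame"
```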

Wide vs. long

I understand why “long” and “wide” data would take up different amounts of memory in Exercise 7, but I am curious why long format data takes up more. Intuitively, I would have thought that fewer vectors would take up less space than more vectors, even if they’re longer.

dim(iris)
[1] 150   5
iris_long <- iris |>
  pivot_longer(-Species, names_to = "type", values_to = "measurement")

dim(iris_long)
[1] 600   3
# ratio of sizes; it is unitless, and the "B" below is only how obj_size() results print
obj_size(iris_long) / obj_size(iris)
1.93 B
prod(dim(iris_long)) / prod(dim(iris))
[1] 2.4
  • There is overhead because pivot_longer() adds a new variable

Wide vs. long (cont’d)

  • But you’re right otherwise (about factors, at least)!
iris |>
  map_dbl(obj_size)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
        1248         1248         1248         1248         1248 
iris_long |> 
  map_dbl(obj_size)
    Species        type measurement 
       3048        5104        4848 
class(iris_long$Species)
[1] "factor"
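The size of the measurement column follows from its length: a double costs 8 bytes per element on top of a fixed header. A rough base R check, comparing doubles only, since object.size() and lobstr::obj_size() can disagree for other types:

```r
# measurement holds 4 x 150 = 600 doubles; each wide numeric column holds 150
wide_col <- numeric(150)
long_col <- numeric(600)

# The gap is 450 extra doubles at 8 bytes each
as.numeric(object.size(long_col)) - as.numeric(object.size(wide_col))
# 3600, matching 4848 - 1248 in the output above
```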

When does it matter?

When does object size start to make a noticeable difference in the efficiency/speed of the code? For example, is there a number of rows/columns at which a long data frame would slow your code down significantly compared with a wide one, or does it depend entirely on what kind of program you’re running?

  • Sounds like a great project!

Vectors

Clarification

  • Lists always store references to other objects
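One way to see this, sketched with lobstr (already loaded earlier in these slides): put the same vector into a list twice and compare the addresses the elements point to. The object names are illustrative.

```r
library(lobstr)

x <- c(1, 2, 3)
l <- list(x, x)   # both elements refer to the same vector

# obj_addrs() reports the address each list element points to
addrs <- obj_addrs(l)
addrs[[1]] == addrs[[2]]

# The shared list is smaller than a list holding two separate copies
obj_size(l)
obj_size(list(c(1, 2, 3), c(1, 2, 3)))
```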

Coercion

character → double → integer → logical

  • Makes more sense to me that the arrows go the other way!
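Whichever way the arrows are drawn, combining types with c() produces the more flexible type. A quick base R check:

```r
typeof(c(TRUE, 1L))    # "integer":   logical is coerced to integer
typeof(c(1L, 2.5))     # "double":    integer is coerced to double
typeof(c(2.5, "a"))    # "character": double is coerced to character

# Mixing all of them ends at character
mixed <- c(TRUE, 2.5, "a")
mixed                  # "TRUE" "2.5" "a"
```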

data.frames and tibbles

Differences

  • tibble() never coerces an input
  • tibble() won’t transform non-syntactic names
  • tibble() only recycles vectors of length 1
  • tibble() allows arguments to refer to variables created earlier in the same call
  • [ always returns a tibble
  • $ doesn’t do partial matching
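A few of these differences can be checked side by side. This is a small sketch with made-up column names, assuming the tibble package is installed:

```r
library(tibble)

df <- data.frame(`my var` = 1:3, xyz = c(10, 20, 30))
tb <- tibble(`my var` = 1:3, xyz = c(10, 20, 30))

names(df)   # "my.var" "xyz": data.frame() rewrites the non-syntactic name
names(tb)   # "my var" "xyz": tibble() keeps it

class(df[, "xyz"])   # "numeric": a single column drops to a vector
class(tb[, "xyz"])   # "tbl_df" "tbl" "data.frame"

df$xy                     # partial matching returns the xyz column
suppressWarnings(tb$xy)   # NULL; tibble warns about the unknown column

# Later arguments can refer to earlier ones
tb2 <- tibble(x = 1:3, y = x * 2)
```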

Method-oriented programming?

  • Suppose we have an instrument object called violin and a method called play()
  • in Java:
Instrument myViolin = new Violin();
myViolin.play();
  • but in R:
# constructor sets class attribute to "instrument"
my_violin <- violin()

# generic function dispatches on class attribute
play(my_violin)
# what actually happens!!!
play.instrument(my_violin)
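For the R calls above to run, the constructor, generic, and method all have to exist. Here is a minimal base R sketch; the function bodies are hypothetical and only the names come from the slide.

```r
# Constructor: builds the object and sets its class attribute (body is illustrative)
violin <- function() {
  structure(list(name = "violin"), class = "instrument")
}

# Generic: UseMethod() dispatches on the class attribute of x
play <- function(x, ...) UseMethod("play")

# Method for class "instrument"
play.instrument <- function(x, ...) {
  paste("Playing the", x$name)
}

my_violin <- violin()
play(my_violin)              # dispatches to play.instrument()
play.instrument(my_violin)   # same result, called directly
```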

List-columns

  • Where have you seen this before?
  • sf objects have a geometry list-column
  • fitting many models
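As a sketch of the "many models" case: fit one model per group and keep the fitted objects in a list-column. The formula and column names here are chosen only for illustration, and the tibble package is assumed to be installed.

```r
library(tibble)

# One data frame per species
by_species <- split(iris, iris$Species)

# One row per species; the fit column is a list of lm objects
models <- tibble(
  species = names(by_species),
  fit = unname(lapply(
    by_species,
    function(d) lm(Sepal.Length ~ Petal.Length, data = d)
  ))
)

typeof(models$fit)   # "list"
nrow(models)         # 3
```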

sf list-columns

library(sf)
library(macleish)
boundary <- macleish_layers[["boundary"]]

boundary |>
  as_tibble()
# A tibble: 1 × 2
    area                                                                geometry
  [acre]                                                           <POLYGON [°]>
1   255. ((-72.68133 42.45536, -72.68108 42.45539, -72.68111 42.45549, -72.6811…
boundary$geometry |> class()
[1] "sfc_POLYGON" "sfc"        
boundary$geometry |> typeof()
[1] "list"

nest() and unnest()

library(tidyr)
library(dplyr)  # provides the starwars data
nrow(starwars)
[1] 87
starwars_person_film <- starwars |>
  unnest(films)

nrow(starwars_person_film)
[1] 173
starwars_person_film |>
  nest(films = films) |>  # name the new list-column explicitly
  nrow()
[1] 87
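The same round trip is easier to follow on a toy table. This is a sketch with made-up data, assuming tidyr and tibble are installed; note that after nest() each element of the list-column is a small tibble rather than a character vector.

```r
library(tidyr)
library(tibble)

# Two people; films is a list-column of character vectors (made-up data)
people <- tibble(
  name = c("A", "B"),
  films = list(c("f1", "f2"), "f3")
)

# unnest() gives one row per person-film pair
long <- unnest(people, films)
nrow(long)   # 3

# nest() collapses back to one row per person
back <- nest(long, films = films)
nrow(back)   # 2
```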

Now

Work on

  • Lab #2: Vectors

  • Reading quiz on Moodle by Sunday night at 11:59 pm