Names and Values
library(lobstr)
In this lab, we will learn how to use the lobstr package to get information about objects in our environment.
Goal: by the end of this lab, you will be able to determine whether an operation makes a copy, and compute the amount of memory each object occupies.
Measuring memory
First, load the lobstr package. This package contains many functions that will make it easier for us to figure out what R is doing under the hood.
Your workspace should be empty.
- Use ls() to list the objects in your workspace. If it is not empty, use the broom icon in the Environment tab in RStudio to empty it.
# SAMPLE SOLUTION
ls()
[1] "posted"
Before we do anything, how much memory is being used by our R session?
mem_used()
67.87 MB
Recall that a byte is eight bits. A byte is a very small amount of information, typically used to store one character. A kilobyte is 1000 bytes, a megabyte is 1000 kilobytes, and so on. You should familiarize yourself briefly with the orders of magnitude of data.
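As a small worked example of these units, the sketch below converts the raw data of 150 double-precision numbers into kilobytes and megabytes. The variable names are made up for illustration, and the figure ignores the small header that R attaches to every vector, so an actual measurement will come out slightly larger.

```r
# Unit conversions using the decimal definitions from the text.
bytes_per_double <- 8            # one double-precision number = 8 bytes = 64 bits
n_values <- 150                  # e.g. the length of one numeric column of iris

data_bytes <- n_values * bytes_per_double
data_kb <- data_bytes / 1000     # bytes -> kilobytes
data_mb <- data_kb / 1000        # kilobytes -> megabytes

data_bytes   # 1200
data_kb      # 1.2
data_mb      # 0.0012
```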
Now suppose we add some things to our workspace. We can add objects, functions, or load packages. Does loading a package increase the memory used by our session?
- Use the library() command to load the broom package. Then check the memory usage with mem_used(). Does loading a package increase the amount of memory used?
# SAMPLE SOLUTION
library(broom)
mem_used()
68.79 MB
What about loading a data set?
- Use the data() command to load the iris data set. Does that increase the memory usage?
# SAMPLE SOLUTION
data(iris)
mem_used()
68.99 MB
- Use obj_size() to measure the amount of memory that iris takes up. Was the increase you observed previously equal to this amount?
# SAMPLE SOLUTION
obj_size(iris)
7.20 kB
Making copies
As much fun as it is to make copies, each copy occupies memory. Generally, we want to minimize the amount of memory that our code needs to run.
Let’s store the amount of memory we are currently using.
before <- mem_used()
Note the memory location of the iris data frame.
ref(iris)
█ [1:0x5624ff402128] <df[,5]>
├─Sepal.Length = [2:0x5624fdc4bd10] <dbl>
├─Sepal.Width = [3:0x5624ff2e85b0] <dbl>
├─Petal.Length = [4:0x5624f8d45240] <dbl>
├─Petal.Width = [5:0x5624fd6e37d0] <dbl>
└─Species = [6:0x5624fead6230] <fct>
Note that we can bind a second name my_iris to the iris data frame, without making a copy.
my_iris <- iris
ref(my_iris)
█ [1:0x5624ff402128] <df[,5]>
├─Sepal.Length = [2:0x5624fdc4bd10] <dbl>
├─Sepal.Width = [3:0x5624ff2e85b0] <dbl>
├─Petal.Length = [4:0x5624f8d45240] <dbl>
├─Petal.Width = [5:0x5624fd6e37d0] <dbl>
└─Species = [6:0x5624fead6230] <fct>
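The same behavior can be seen on a small vector. This is a minimal sketch, assuming the lobstr package is installed and the code is run as a plain script; x and y are illustrative names. Binding a second name leaves both names pointing at one object, and the addresses only diverge once the object is modified through one of the names.

```r
library(lobstr)

x <- c(1, 2, 3)
y <- x                          # binds a second name to the same vector
same_before <- obj_addr(x) == obj_addr(y)

y[[3]] <- 4                     # modifying through y triggers copy-on-modify
same_after <- obj_addr(x) == obj_addr(y)

same_before  # TRUE: one object, two names
same_after   # FALSE: y now points to a modified copy
x            # 1 2 3, the original is unchanged
```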
Now let’s change the data frame in a way that forces a copy to be made.
my_iris <- my_iris |>
  mutate(sepal_area = Sepal.Length * Sepal.Width)
ref(my_iris)
█ [1:0x5624f93d0178] <df[,6]>
├─Sepal.Length = [2:0x5624fdc4bd10] <dbl>
├─Sepal.Width = [3:0x5624ff2e85b0] <dbl>
├─Petal.Length = [4:0x5624f8d45240] <dbl>
├─Petal.Width = [5:0x5624fd6e37d0] <dbl>
├─Species = [6:0x5624fead6230] <fct>
└─sepal_area = [7:0x5624fcbd0dc0] <dbl>
Note that the memory locations of my_iris and iris are not the same anymore. However, the memory locations of the underlying vectors are the same!
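This kind of shallow copy can be checked programmatically. The sketch below assumes lobstr is installed and uses base R's `$<-` on a toy data frame rather than mutate(), so the details may differ from the dplyr pipeline; df1 and df2 are illustrative names. It compares the address of each data frame and the addresses of the columns they have in common.

```r
library(lobstr)

df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- df1
df2$c <- df2$a * df2$b          # adding a column yields a new data frame object

# The data frames themselves now live at different addresses...
frames_same <- obj_addr(df1) == obj_addr(df2)

# ...but the columns a and b are expected to still be shared.
cols_shared <- unname(obj_addrs(df1)) == unname(obj_addrs(df2)[1:2])

frames_same
cols_shared
```

Only the list that holds the columns is duplicated; the column vectors themselves are reused until one of them is modified.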
- Use before and mem_used() to calculate how much extra memory the copy of my_iris occupies.
# SAMPLE SOLUTION
mem_used() - before
1.83 MB
- Is the difference you observed above equal to the size of the new column we created? Why or why not?
# SAMPLE SOLUTION
obj_size(my_iris$sepal_area)
1.25 kB
obj_size(iris)
7.20 kB
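One way to see how much memory two objects really occupy together is to pass both to obj_size(), which counts shared components only once. The sketch below assumes lobstr is installed and rebuilds my_iris with base R instead of mutate(), which is a substitution made here to keep the example self-contained. Note also that a difference in mem_used() reflects everything the session allocated in between (for example, code loaded the first time a function is called), so it need not match the size of any single object.

```r
library(lobstr)

my_iris <- iris
my_iris$sepal_area <- my_iris$Sepal.Length * my_iris$Sepal.Width

# Sum of the two sizes measured separately (shared columns counted twice).
separate <- as.numeric(obj_size(iris)) + as.numeric(obj_size(my_iris))

# Size of both objects measured together (shared columns counted once).
together <- as.numeric(obj_size(iris, my_iris))

together < separate
```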
Different representations of the same data may have different memory footprints. Suppose we change the iris data set into its long format.
iris_long <- iris |>
  pivot_longer(-Species, names_to = "type", values_to = "measurement")
- Does iris_long take up the same amount of memory as iris? Why or why not?
# SAMPLE SOLUTION
obj_size(iris_long)
13.93 kB
obj_size(iris)
7.20 kB
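A rough back-of-the-envelope calculation helps explain the gap. The sketch below assumes a 64-bit build of R, where a double takes 8 bytes, a factor code (an integer) takes 4 bytes, and each element of a character vector is an 8-byte pointer into R's shared string pool. It counts only the raw column data and ignores vector headers, names, factor levels, and the strings themselves, so the totals are lower than what obj_size() reports.

```r
n_wide <- 150            # rows in the wide iris data frame
n_long <- n_wide * 4     # each row becomes four rows in the long format

bytes_dbl <- 8           # double
bytes_int <- 4           # integer code behind a factor
bytes_ptr <- 8           # pointer stored per character element (64-bit assumption)

# Wide: four double columns plus one factor column.
wide_bytes <- 4 * n_wide * bytes_dbl + n_wide * bytes_int

# Long: factor column (Species), character column (type), double column (measurement).
long_bytes <- n_long * bytes_int + n_long * bytes_ptr + n_long * bytes_dbl

wide_bytes   # 5400
long_bytes   # 12000
```

The measurements themselves take the same space in both layouts; the long format pays extra for repeating Species four times and for the new type column.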
We know that tibbles are like data.frames. Do they take up the same amount of memory?
before <- mem_used()
iris_tbl <- iris_long |>
  as_tibble()
mem_used() - before
56 B
class(iris_tbl)
[1] "tbl_df" "tbl" "data.frame"
Does converting a data.frame to a tbl force a copy?
ref(iris_tbl)
█ [1:0x5624fcb38ff8] <tibble[,3]>
├─Species = [2:0x5624f9257ee0] <fct>
├─type = [3:0x5624fd5053b0] <chr>
└─measurement = [4:0x5624fecfa640] <dbl>
ref(iris_long)
█ [1:0x5624f9464c98] <tibble[,3]>
├─Species = [2:0x5624f9257ee0] <fct>
├─type = [3:0x5624fd5053b0] <chr>
└─measurement = [4:0x5624fecfa640] <dbl>
- Discuss how using a tibble changes the memory footprint relative to using a data.frame.
Tracing memory
Unfortunately, due to various complications and optimizations, it’s not always possible to reason ahead of time about whether R will make a copy of an object. Instead, we can use the tracemem() function to have R tell us whether it makes a copy and why.
First, note the memory location of iris.
tracemem(iris)
[1] "<0x5624ff402128>"
We are now tracing this memory location. Some types of computations we make on iris do not require making a copy.
iris |>
  pull(Petal.Length) |>
  mean()
[1] 3.758
iris |>
  select(contains("Petal")) |>
  head()
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4
However, if we modify iris using mutate(), a copy does get made.
iris |>
  mutate(petal_area = Petal.Length * Petal.Width) |>
  as_tibble()
tracemem[0x5624ff402128 -> 0x5624f937f6a8]: initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main
tracemem[0x5624f937f6a8 -> 0x5624f937f718]: dplyr_new_list initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main
tracemem[0x5624f937f718 -> 0x5624f937f7f8]: dplyr_new_list initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main
tracemem[0x5624ff402128 -> 0x5624f92d1928]: new_data_frame vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main
tracemem[0x5624f92d1928 -> 0x5624f92d1998]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species petal_area
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 0.28
2 4.9 3 1.4 0.2 setosa 0.28
3 4.7 3.2 1.3 0.2 setosa 0.26
4 4.6 3.1 1.5 0.2 setosa 0.3
5 5 3.6 1.4 0.2 setosa 0.28
6 5.4 3.9 1.7 0.4 setosa 0.68
7 4.6 3.4 1.4 0.3 setosa 0.42
8 5 3.4 1.5 0.2 setosa 0.3
9 4.4 2.9 1.4 0.2 setosa 0.28
10 4.9 3.1 1.5 0.1 setosa 0.15
# ℹ 140 more rows
- Experiment with different operations after invoking tracemem(). Can you get a feel for what operations induce copies?
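A small starting point for this kind of experiment is sketched below, using only base R on a toy vector; x and y are illustrative names. It assumes your build of R supports memory profiling, which the capabilities("profmem") check guards against. When run as a plain script, a tracemem message is expected on the first modification of y, and none on the second, because by then y is the only name bound to the copy.

```r
x <- c(1, 2, 3)

# tracemem() is only available if R was built with memory profiling support.
if (capabilities("profmem")) tracemem(x)

y <- x          # second name for the same vector: no copy, no message
y[[1]] <- 10    # first modification: R copies the vector and reports it
y[[2]] <- 20    # y is now the only name for the copy, so it can change in place

if (capabilities("profmem")) untracemem(x)   # stop tracing

x   # 1 2 3
y   # 10 20 3
```

Running the same lines inside RStudio or knitr may print additional messages, since those environments can hold extra references to objects.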
Garbage collection
Garbage collection is the process of reclaiming memory that is no longer being used. R does this automatically, but you can force the issue with gc().
gc()
          used (Mb) gc trigger  (Mb) max used (Mb)
Ncells 1070652 57.2    1899792 101.5  1530102 81.8
Vcells 1900029 14.5    8388608  64.0  2725143 20.8
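To see memory actually being reclaimed, the sketch below creates a large vector, removes the only name bound to it, and compares mem_used() before and after. It assumes lobstr is installed; big, with_big, and after_gc are illustrative names. The explicit gc() call is there for illustration only, since R would collect the vector on its own when it needs the space.

```r
library(lobstr)

big <- numeric(1e7)                  # 1e7 doubles, roughly 80 MB
with_big <- as.numeric(mem_used())   # session memory while big exists

rm(big)                              # drop the only name bound to the vector
invisible(gc())                      # request a collection now
after_gc <- as.numeric(mem_used())

freed_mb <- (with_big - after_gc) / 1e6
freed_mb                             # approximately 80
```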
Engagement
Take a minute to think about what questions you still have about names, values, and copies. Review what questions have been posted (in the #questions channel) recently by other students and either:
- respond (e.g., react, comment, clarify, or answer)
- post a new question