fertile

Creating optimal conditions for reproducible data analysis in R

Benjamin S. Baumer
Statistical & Data Sciences
Smith College
(joint work with Audrey Bertin)

JSM, 2022-08-07
https://bit.ly/3cVEmvU

I have two problems

I want to do reproducible work, but…

My students:

  • are learning coding for the first time

  • are used to WYSIWYG, new to the source + output paradigm

  • don’t understand file paths

My collaborators:

  • have customized installations

  • write R in dialects

  • don’t understand file paths

The files are in the computer?

Concepts in file paths

  • Absolute paths
  • Relative paths
  • / vs. \
  • root directories
  • Filesystem Hierarchy Standard
  • Local file vs. file on the Internet
  • Just because your Markdown knits on your computer doesn’t mean it will knit on mine!

Jenny will light your computer on fire!

Project-oriented workflow

Proactive path hygiene

file.exists("~/Desktop/data.csv")
[1] TRUE
read.csv("~/Desktop/data.csv") |> head()
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
library(fertile)
read.csv("~/Desktop/data.csv")
Error in `check_path_absolute_shim()`:
! Detected absolute paths. Absolute paths are not reproducible and will likely only work on your computer. If you would like to continue anyway, please execute the following command: utils::read.csv('~/Desktop/data.csv')

Checking path portability

is_path_portable("~/Desktop/data.csv")
[1] FALSE
is_path_portable("/home/bbaumer/data.csv")
[1] FALSE
is_path_portable("../data.csv")
[1] FALSE
is_path_portable("data.csv")
[1] TRUE
is_path_portable("data/data.csv")
[1] TRUE
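
The results above are consistent with a simple rule: a path is portable when it is relative and stays inside the project directory. A minimal base R sketch of such a rule follows. `looks_portable()` is a hypothetical helper written for illustration; it is not how fertile implements `is_path_portable()`, which may apply other criteria.

```r
# Hypothetical helper, not part of fertile: flags paths that are absolute,
# home-relative, or that climb out of the project with "..".
looks_portable <- function(path) {
  # Absolute on Unix ("/..."), home-relative ("~..."),
  # a Windows drive ("C:...") or a UNC path ("\\\\server...")
  is_absolute <- grepl("^(/|~|[A-Za-z]:|\\\\\\\\)", path)
  # Any ".." component leaves the current directory tree
  parts <- strsplit(path, "[/\\\\]")
  climbs_out <- vapply(parts, function(p) any(p == ".."), logical(1))
  !is_absolute & !climbs_out
}

paths <- c("~/Desktop/data.csv", "/home/bbaumer/data.csv", "../data.csv",
           "data.csv", "data/data.csv")
looks_portable(paths)
# [1] FALSE FALSE FALSE  TRUE  TRUE
```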

No more!

read_csv("/Users/gregoryjmatthews/projects/openWAR/")
Error in `check_path_absolute_shim()`:
! Detected absolute paths. Absolute paths are not reproducible and will likely only work on your computer. If you would like to continue anyway, please execute the following command: readr::read_csv('/Users/gregoryjmatthews/projects/openWAR/')

Keep you safe from Jenny!

setwd("~/Desktop")
Error in `setwd()`:
! setwd() is likely to break reproducibility. Use here::here() instead.
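
The error message points to `here::here()`. A short sketch of what that replacement might look like, assuming the here package is installed; `data/data.csv` is a placeholder file name, not a file from the talk.

```r
# Instead of setwd("~/Desktop") followed by read.csv("data.csv"),
# build the path from the project root.
# Assumes the 'here' package is installed; "data/data.csv" is a placeholder.
if (requireNamespace("here", quietly = TRUE)) {
  csv_path <- here::here("data", "data.csv")
} else {
  # Fallback: a plain relative path, resolved against the working directory
  csv_path <- file.path("data", "data.csv")
}

# Only read the file if it actually exists in this project
if (file.exists(csv_path)) {
  head(read.csv(csv_path))
}
```

The working directory is never changed, so the same script can run from any machine that has a copy of the project.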

How does it work?

  • Shims for many common functions
search()
 [1] ".GlobalEnv"        "package:fertile"   "package:forcats"  
 [4] "package:stringr"   "package:dplyr"     "package:purrr"    
 [7] "package:readr"     "package:tidyr"     "package:tibble"   
[10] "package:ggplot2"   "package:tidyverse" "tools:quarto"     
[13] "package:stats"     "package:graphics"  "package:grDevices"
[16] "package:utils"     "package:datasets"  "package:methods"  
[19] "Autoloads"         "package:base"     
find("read.csv")
[1] "package:fertile" "package:utils"  
find("ggsave")
[1] "package:fertile" "package:ggplot2"
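
The `find()` output shows fertile's `read.csv()` sitting in front of the one from utils. The general shim pattern can be sketched as below. `read_csv_shim()` is a hypothetical name and a simplified path test, for illustration only; it is not fertile's source code.

```r
# Hypothetical shim, for illustration only (not fertile's source code).
# It is called like utils::read.csv(), checks the path first,
# and then hands off to the real function.
read_csv_shim <- function(file, ...) {
  if (is.character(file) && grepl("^(/|~|[A-Za-z]:)", file)) {
    stop("Detected absolute path: ", file,
         ". To continue anyway, call utils::read.csv() directly.",
         call. = FALSE)
  }
  utils::read.csv(file, ...)
}

# A relative path passes through to utils::read.csv()
tmp <- basename(tempfile(fileext = ".csv"))
write.csv(head(mtcars), tmp, row.names = FALSE)
n_rows <- nrow(read_csv_shim(tmp))
unlink(tmp)

# An absolute path is stopped before any file is read
result <- tryCatch(read_csv_shim("~/Desktop/data.csv"),
                   error = function(e) conditionMessage(e))
result
```

Because a package attached later appears earlier on the search path, attaching a package that exports such a function is enough for it to be found before `utils::read.csv()`.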

Logging

log_report()
# A tibble: 2 × 4
  path                                      path_abs   func  timestamp          
  <chr>                                     <chr>      <chr> <dttm>             
1 ~/Desktop/data.csv                        ~/Desktop… util… 2022-08-07 15:37:12
2 /Users/gregoryjmatthews/projects/openWAR/ <NA>       read… 2022-08-07 15:37:12

Retroactive Checks

list_checks()
 [1] "has_tidy_media"          "has_tidy_images"        
 [3] "has_tidy_code"           "has_tidy_raw_data"      
 [5] "has_tidy_data"           "has_tidy_scripts"       
 [7] "has_readme"              "has_no_lint"            
 [9] "has_proj_root"           "has_no_nested_proj_root"
[11] "has_only_used_files"     "has_clear_build_chain"  
[13] "has_no_absolute_paths"   "has_only_portable_paths"
[15] "has_no_randomness"       "has_well_commented_code"
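
Each check answers a yes/no question about a project directory. A minimal sketch of what one such check could look like in base R; `check_has_readme()` is a hypothetical function, not fertile's `has_readme()`, and its matching rule is an assumption.

```r
# Hypothetical check, for illustration only (not fertile's has_readme()).
# Returns TRUE when the top level of `path` contains a file named README,
# with or without an extension, ignoring case.
check_has_readme <- function(path = ".") {
  files <- list.files(path)
  any(grepl("^readme(\\.[a-z0-9]+)?$", files, ignore.case = TRUE))
}

# Try it on a throwaway directory
proj <- tempfile("demo_project")
dir.create(proj)
before <- check_has_readme(proj)            # no README yet
file.create(file.path(proj, "README.md"))
after <- check_has_readme(proj)             # README.md now present
unlink(proj, recursive = TRUE)
c(before = before, after = after)
```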

Like R CMD check, but for analysis

proj_check("~/Dropbox/marshmallow/")
# A tibble: 1 × 2
  culprit   expr                        
  <chr>     <glue>                      
1 README.md fs::file_create('README.md')
# A tibble: 1 × 1
  path_abs                                             
  <chr>                                                
1 /home/bbaumer/Dropbox/marshmallow/fertile-badges.html


A specific check

proj_check_some("~/Dropbox/marshmallow/", contains("README"))
# A tibble: 1 × 2
  culprit   expr                        
  <chr>     <glue>                      
1 README.md fs::file_create('README.md')


I now have two slightly smaller problems

The real hero: Audrey Bertin ’21

  • Smith College ’21

  • Junior Data Scientist
    MassMutual

  • M.A. Computer Science
    Data Science concentration
    UMass-Amherst

  • Amstat News profile

Learn more

THANK YOU!!!