12  Working with Large Data

Authors: Jim Albert (Bowling Green State University), Benjamin S. Baumer (Smith College), and Max Marchi (Cleveland Guardians)

12.1 Introduction

In Chapter 11, we set up and populated a MySQL database using the RMariaDB package for R. That database contained game logs and play-by-play data gathered from Retrosheet. In this chapter, we expand on this work by building databases of pitch-by-pitch Statcast data downloaded from Baseball Savant. Along the way, in addition to MySQL we explore several different ways to store and represent a dataset that spans four years of play (2020–2023). Finally, we compare and contrast the strengths and weaknesses of these various approaches and take these databases for a few test drives.

In Section C.10, we show how to use the statcast_search() function from the baseballr package to download Statcast data from Baseball Savant. Because there is a row of data for each pitch thrown in Major League Baseball, these data can quickly become large enough to complicate our workflow. For the most part, the data that we have used in this book has been small, in the sense that it only comprises a few kilobytes or megabytes, which is far less than the amount of physical memory (i.e., random access memory) in most personal computers. Even the Retrosheet data, which covers every play, is less than 100 MB for a full season. However, as we will see below, the Statcast data is an order of magnitude larger, and will occupy several hundred megabytes per season, and several gigabytes over multiple seasons. Working with data of this magnitude can impede or overwhelm a personal computer if one is not careful. As such, in this chapter we introduce new tools for acquiring multiple years' worth of Statcast data and then storing that data efficiently. Mercifully, regardless of what data storage format we choose, the magic of dplyr and dbplyr makes the process of analyzing the data the same.

12.2 Acquiring a Year’s Worth of Statcast Data

To further explore many sabermetric ideas, it’s useful to collect the in-play data for all plays in a particular season. One can acquire the relevant data via the statcast_search() function provided by baseballr. However, because this function only returns at most 25,000 observations, and there were more than 700,000 pitches thrown in 2023, compiling the full season’s worth of data takes some effort. To that end we show how one can iterate the process of downloading Statcast data on a daily basis, and then combine that data into the full-season data sets that we use in this book.

On a typical day during the regular season, around 4,500 pitches are thrown. In a typical week, then, more than 30,000 pitches are thrown, which exceeds the 25,000-observation limit of statcast_search(), so we can't safely download data on a weekly basis. As a workaround, we write the following function to download Statcast data for a single day and write that data to an appropriately-named CSV in a specified directory dir. The code for the statcast_daily() function is included in the abdwr3edata package, and is reproduced here.

abdwr3edata::statcast_daily
function(the_date = lubridate::now(), dir = getwd()) {
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  
  filename <- paste0("sc_", lubridate::as_date(the_date), ".csv")
  file_path <- fs::path(dir, filename)
  
  # if the file already exists, read it. 
  if (file.exists(file_path)) {
    x <- file_path |>
      readr::read_csv() |>
      suppressMessages()
    if (nrow(x) > 0) {
      message(
        paste(
          "Found", nrow(x), "observations in", file_path, "..."
        )
      )
    }
    return(NULL)
  }
  
  # the file doesn't exist or doesn't have data, get it
  message(paste("Retrieving data from", the_date))
  x <- baseballr::statcast_search(
    start_date = lubridate::as_date(the_date),
    end_date = lubridate::as_date(the_date), 
    player_type = "batter"
  ) |>
    dplyr::filter(game_type == "R")
  
  if (nrow(x) > 0) {
    message(paste("Writing", file_path, "..."))
    x |>
      readr::write_csv(file = fs::path(dir, filename))
  }
  return(NULL)
}
<bytecode: 0x62609af3d328>
<environment: namespace:abdwr3edata>

In this case, we don’t want to store the data in the working directory. We want to store it in a directory named statcast_csv, and we need to create that directory if it doesn’t exist already. Using the path() function from the fs package ensures that the file paths we create will be valid on any operating system.

library(fs)
data_dir <- here::here("data_large")
statcast_dir <- path(data_dir, "statcast_csv")
if (!dir.exists(statcast_dir)) {
  dir.create(statcast_dir)
}

Now, we create a vector of dates for which we want to download Statcast data. For example, regular season games during the 2023 season were played between March 30 and October 1. These were the 89th and 274th days of the year, respectively. To extend statcast_daily() to a full season, we'll have to iterate the function over a series of dates. We can use the parse_date_time() function from the lubridate package to convert these day-of-the-year integers, paired with the year, into a vector of dates.

mlb_2023_dates <- 89:274 |>
  paste(2023) |>
  parse_date_time("%j %Y") |>
  as_date()
head(mlb_2023_dates)
[1] "2023-03-30" "2023-03-31" "2023-04-01" "2023-04-02"
[5] "2023-04-03" "2023-04-04"

The walk() function from the purrr package will allow us to successively apply the statcast_daily() function to each of these 186 days. walk() is similar to map() but doesn’t return anything. Since statcast_daily() always returns NULL, the effect is the same. Note that since statcast_daily() writes a CSV file for each day, you can safely run this function over and over again without duplicating any work. The code for the statcast_season() function shown below pulls these steps together. Note that if we want to be able to use this function over multiple years, we need to be a little bit more conservative with our choice of beginning and ending dates. While these dates vary from one year to another, they will typically be close to the first of April and the first of November.
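
For instance, a sketch of applying statcast_daily() directly to each of the 2023 dates (assuming the mlb_2023_dates vector and the statcast_dir directory created above) would look like this:

# A sketch: write one CSV of regular-season Statcast data per day in 2023.
mlb_2023_dates |>
  purrr::walk(abdwr3edata::statcast_daily, dir = statcast_dir)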

abdwr3edata::statcast_season
function(
    year = lubridate::year(lubridate::now()), dir = getwd()
) {
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  
  mlb_days <- 80:280
  mlb_dates <- mlb_days |>
    paste(year) |>
    lubridate::parse_date_time("%j %Y") |>
    lubridate::as_date()
  
  mlb_dates |>
    purrr::walk(statcast_daily, dir)
}
<bytecode: 0x62609d158570>
<environment: namespace:abdwr3edata>

Once the directory is populated with daily CSV files, we can read them all into one large data frame with a call to read_csv(). This functionality is encapsulated in the statcast_read_csv() function.

abdwr3edata::statcast_read_csv
function(dir = getwd(), pattern = "*.csv") {
  dir |>
    list.files(pattern = pattern, full.names = TRUE) |>
    readr::read_csv() |>
    dplyr::bind_rows()
}
<bytecode: 0x62609d35dff8>
<environment: namespace:abdwr3edata>

Putting it all together, we run the statcast_season() function to verify that we’ve downloaded all of the data for the 2023 season, and then once that is complete, we run the statcast_read_csv() function to read the various CSV files we’ve downloaded.

library(abdwr3edata)

# skip this step while building the book!
# statcast_season(2023, dir = statcast_dir)

sc2023 <- statcast_dir |>
  statcast_read_csv(pattern = "2023.+\\.csv")

To check the validity of the data, we can spot-check certain statistics against their known values. First, if each of the 30 teams plays 162 games, we should see 30 × 162 / 2 = 2430 total games. Second, according to Baseball-Reference, there were 5,868 home runs hit in 2023. How many do we see in our data?

sc2023 |>
  group_by(game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(events == "home_run", na.rm = TRUE)
  )
# A tibble: 1 × 4
  game_type num_games num_pitches num_hr
  <chr>         <int>       <int>  <int>
1 R              2430      717945   5868

Our data appears to be accurate for the full season.

If we want to acquire multiple years of Statcast data, we simply iterate the statcast_season() function over multiple years.
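
For example, a sketch of that iteration for the 2020–2023 seasons (assuming the same statcast_dir directory of daily CSV files) is:

# A sketch: download each of the 2020-2023 seasons, one day at a time.
2020:2023 |>
  purrr::walk(statcast_season, dir = statcast_dir)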

In the remainder of this chapter, we analyze data across the 2020–2023 seasons.

12.3 Storing Large Data Efficiently

Once we’re satisfied with the integrity of our data, we should store the data in a format that makes it easy to work with in the future. Downloading the data for each day took a long time, and there is no reason to keep pinging the Baseball Savant servers once we already have a copy of the data.

As noted above, a full season of Statcast data contains over 700,000 rows and nearly 100 variables. This takes up about half of a gigabyte of space in memory in R.

sc2023 |>
  object.size() |>
  print(units = "MB")
500.7 Mb

While most computers now have at least 16 GB of memory, working with data of this magnitude (especially if you want to work with data across multiple seasons) can quickly become burdensome. We can help ourselves by writing the data to disk and storing it in an efficient format.

We saw above that the data frame containing a full season's worth of Statcast data occupies about half a gigabyte in memory. We built that data frame by combining many small CSVs, each containing one day's worth of data. How much space do those CSV files occupy on disk?

statcast_dir |>
  list.files(full.names = TRUE) |>
  str_subset("2023") |>
  file.size() |>
  sum() / 1024^2
[1] 373

In this case, the CSVs occupy about 75% as much space on disk as the data do when read into memory. That’s fine—but we can do better!

The main advantage of CSVs is that they are human-readable and editable, which makes them easy to understand, and because they are so commonly-used, they can be read and written by virtually any program designed for working with data. However, CSVs are not a very space-efficient file format. Let’s explore some different options.

12.3.1 Using R’s internal data format

Suppose now that we want to work with multiple years of Statcast data. For example, suppose we want to investigate pandemic-era trends using data from the 2020–2023 seasons. Should we store these data in four files or one? If we keep the data in four files (one for each year), then analyzing it across multiple years will be cumbersome, since we would have to read the data into R separately, and then combine the resulting data frames into one large data frame. That large data frame will occupy multiple gigabytes of memory, which could make things sluggish. Alternatively, we could do this once, and then write one big *.rds file that contained all four years' worth of data. But then that file would be large, and we'd have to redo that process every time new data came in.

Once the data has been read into R, we can use the write_rds() function to write the full-season data frame to disk for safekeeping. This uses R’s internal binary data storage format, which is much more space-efficient than CSVs. However, the .rds format is designed to work with R, so while it’s great if you’re going to be working exclusively in R, it isn’t that useful if you want to share your data with users of other applications (like Python).

The statcast_write_rds() function wraps the write_rds() function, but first splits the data into groups based on the year. It will write a different, appropriately-named *.rds file for each year present in the data frame that you give it.

abdwr3edata::statcast_write_rds
function(x, dir = getwd(), ...) {
  tmp <- x |>
    dplyr::group_by(year = lubridate::year(game_date))
  years <- tmp |>
    dplyr::group_keys() |> 
    dplyr::pull(year)
  tmp |>
    dplyr::group_split() |>
    rlang::set_names(years) |>
    purrr::map(
      ~readr::write_rds(
        .x, 
        file = fs::path(
          dir,
          paste0(
            "statcast_", 
            max(lubridate::year(dplyr::pull(.x, game_date))), 
            ".rds"
          )
        ), 
        compress = "xz",
#        ...
      )
    )
  list.files(dir, pattern = "*.rds", full.names = TRUE)
}
<bytecode: 0x62609da30938>
<environment: namespace:abdwr3edata>
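
As a usage sketch (not evaluated while building the book), a call like the following writes one compressed *.rds file per season present in the data frame into the statcast_rds directory whose contents we examine next. The variable name statcast_rds_dir is ours; here we pass sc2023, so only the 2023 file would be written, while running the same call on a multi-year data frame would produce the four files listed below.

statcast_rds_dir <- path(data_dir, "statcast_rds")
if (!dir.exists(statcast_rds_dir)) {
  dir.create(statcast_rds_dir)
}
sc2023 |>
  statcast_write_rds(dir = statcast_rds_dir)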

In this case, the directory of CSV files (373 MB for the 2023 files) takes up nearly six times as much space on disk as the single .rds file that contains the exact same data (about 64 MB)!

disk_space_rds <- data_dir |>
  path("statcast_rds") |>
  dir_info(regexp = "*.rds") |>
  select(path, size) |>
  mutate(
    path = path_file(path),
    format = "rds"
  )
disk_space_rds
# A tibble: 4 × 3
  path                     size format
  <chr>             <fs::bytes> <chr> 
1 statcast_2020.rds       23.4M rds   
2 statcast_2021.rds       63.1M rds   
3 statcast_2022.rds         63M rds   
4 statcast_2023.rds       63.8M rds   

12.3.2 Using Apache Arrow and Apache Parquet

Apache Parquet is a file format that, when combined with the software framework Apache Arrow, provides a slick, scalable solution to the problem we raised above about how to store our data. The Parquet format is not as space-efficient as the *.rds format, but it is cross-platform and scalable, in that it will automatically chunk the data into partitions based on a grouping variable (in this case, the year). The arrow package for R provides a dplyr-compatible interface that allows us to work with data in the Parquet format very easily. Because Arrow is column-oriented (as opposed to row-oriented), it can be very fast. Please see the chapter on Arrow in Wickham, Çetinkaya-Rundel, and Grolemund (2023) for more information about Arrow.

One nice feature of arrow is that it can read an entire directory of CSVs using the open_dataset() function.

library(arrow)
sc_arrow <- statcast_dir |>
  open_dataset(format = "csv")
dim(sc_arrow)
[1] 2399921      92

Note that while the Arrow object sc_arrow behaves like a data frame—in this case containing nearly 2.4 million rows!—it takes up almost no space in memory.

sc_arrow |>
  object.size()
504 bytes

This is possible because the data is still on disk in the form of the CSVs. It hasn’t been read into R’s memory yet. That won’t stop us from querying the data, however.

summary_arrow <- sc_arrow |>
  group_by(year = year(game_date), game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(events == "home_run", na.rm = TRUE)
  ) 
summary_arrow |>
  collect()
# A tibble: 4 × 5
# Groups:   year [4]
   year game_type num_games num_pitches num_hr
  <int> <chr>         <int>       <int>  <int>
1  2020 R               898      263584   2304
2  2021 R              2429      709852   5944
3  2022 R              2430      708540   5215
4  2023 R              2430      717945   5868

So far we have just worked with an Arrow object that was backed by a directory of CSVs. We can write the Arrow data frame in the Parquet format using the write_dataset() function. Note that since we used the group_by() function first, we’ll get one Parquet file for each year. This is a form of file-based partitioning that can provide significant performance advantages (Wickham, Çetinkaya-Rundel, and Grolemund 2023).

statcast_parquet <- path(data_dir, "statcast_parquet")
if (!dir.exists(statcast_parquet)) {
  dir.create(statcast_parquet)
}
sc_arrow |>
  group_by(year = year(game_date)) |>
  write_dataset(path = statcast_parquet, format = "parquet")

The write_dataset() function automatically creates a directory structure that partitions the data set into separate files of about 100 MB for each full season. This is just more than a quarter of the disk space occupied by the CSVs.

disk_space_parquet <- statcast_parquet |>
  dir_info(recurse = TRUE, glob = "*.parquet") |>
  select(path, size) |>
  mutate(
    format = "parquet",
    path = path_rel(path, start = statcast_parquet)
  )
disk_space_parquet
# A tibble: 4 × 3
  path                            size format 
  <fs::path>               <fs::bytes> <chr>  
1 year=2020/part-0.parquet       37.4M parquet
2 year=2021/part-0.parquet      101.5M parquet
3 year=2022/part-0.parquet      101.1M parquet
4 year=2023/part-0.parquet      102.4M parquet

12.3.3 Using DuckDB

While Arrow provides a dplyr interface that allows you to work seamlessly with Arrow objects in R as if they were data frames, Arrow is not SQL-based. Thus, while Arrow and Parquet are cross-platform, you would need another interface to write SQL queries against them.

Another fast, cross-platform alternative that is SQL-based is DuckDB. Like SQLite, DuckDB has a server-less architecture that can store data in memory or write its database files locally. This makes it a great option for someone who wants a SQL-interface but doesn’t want to set up or maintain a SQL server. You can learn more about DuckDB in Wickham, Çetinkaya-Rundel, and Grolemund (2023).

DuckDB also implements a dplyr interface, so we set up a database connection in the same way we set up any DBI-compatible SQL database connection: using dbConnect(). However, in this case we want to write the database files to disk so that we can use them again in the future, and so that we can compare their sizes to the other storage formats. The dbdir argument specifies the path to a DuckDB database file that will be created for us if it doesn’t exist already.

statcast_duckdb <- path(data_dir, "statcast_duckdb")
if (!dir.exists(statcast_duckdb)) {
  dir.create(statcast_duckdb)
}
library(duckdb)
con_duckdb <- dbConnect(
  drv = duckdb(),
  dbdir = path(statcast_duckdb, "statcast.ddb")
)

Initially, our DuckDB database doesn’t contain any tables, so we use the dbWriteTable() function to copy the contents of the Arrow object to our DuckDB object.1

con_duckdb |>
  dbWriteTable("events", collect(sc_arrow), overwrite = TRUE)

Now, we can use our familiar dplyr interface to access the DuckDB database.

sc_ddb <- con_duckdb |>
  tbl("events")

summary_duckdb <- sc_ddb |>
  group_by(year = year(game_date), game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(as.numeric(events == "home_run"), na.rm = TRUE)
  )
summary_duckdb
# Source:   SQL [4 x 5]
# Database: DuckDB v0.10.0 [bbaumer@Linux 6.6.10-76060610-generic:R 4.3.3//home/bbaumer/Dropbox/git/abdwr3e/data_large/statcast_duckdb/statcast.ddb]
# Groups:   year
   year game_type num_games num_pitches num_hr
  <dbl> <chr>         <dbl>       <dbl>  <dbl>
1  2020 R               898      263584   2304
2  2021 R              2429      709852   5944
3  2022 R              2430      708540   5215
4  2023 R              2430      717945   5868

While both the arrow and duckdb packages provide dplyr interfaces, only duckdb works with SQL tools like dbGetQuery().

con_duckdb |>
  dbGetQuery("
    SELECT game_date, pitch_type, release_speed, pitcher 
    FROM events 
    WHERE release_speed > 100 AND events = 'home_run'
    LIMIT 6;
  ")
   game_date pitch_type release_speed pitcher
1 2021-04-10         FF           100  594798
2 2021-05-18         SI           100  621237
3 2021-06-27         FF           100  543037
4 2021-06-29         SI           101  621237
5 2021-07-09         FC           100  661403
6 2021-07-16         FC           100  661403

The storage footprint of the DuckDB database is similar to that of the CSVs, but we will see in Section 12.4 that the performance is excellent.

disk_space_duckdb <- statcast_duckdb |>
  dir_info(recurse = TRUE, glob = "*.ddb") |>
  select(path, size) |>
  mutate(
    format = "duckdb",
    path = path_rel(path, start = statcast_duckdb)
  )
disk_space_duckdb
# A tibble: 1 × 3
  path                size format
  <fs::path>   <fs::bytes> <chr> 
1 statcast.ddb       1.28G duckdb

12.3.4 Using MySQL

Finally, we can use the MariaDB (MySQL) database that we set up in Section 11.2.

library(dbplyr)
library(RMariaDB)
con_mariadb <- dbConnect(MariaDB(), group = "abdwr")

Just as we did with DuckDB, we first copy the data to the MySQL server using dbWriteTable().

con_mariadb |>
  dbWriteTable("events", collect(sc_arrow), overwrite = TRUE)

Now we can query the database using the dplyr interface or write SQL queries.

sc_maria <- con_mariadb |>
  tbl("events")
summary_maria <- sc_maria |>
  group_by(year = year(game_date), game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(events == "home_run", na.rm = TRUE)
  )
summary_maria
# Source:   SQL [4 x 5]
# Database: mysql  [abdwr@localhost:NA/abdwr]
# Groups:   year
   year game_type num_games num_pitches num_hr
  <int> <chr>       <int64>     <int64>  <dbl>
1  2020 R               898      263584   2304
2  2021 R              2429      709852   5944
3  2022 R              2430      708540   5215
4  2023 R              2430      717945   5868

Determining the storage footprint for a MySQL database is a bit more complicated, but the size of the events.ibd and events.frm files in the output below provide a lower bound on the size of the events table. The storage footprint here is about the same as those of the original CSV files and DuckDB.

disk_space_mariadb <- "/var/lib/mysql/abdwr/" |>
  dir_info(glob = "*events.*") |>
  select(path, size) |>
  mutate(
    format = "mariadb",
    path = path_rel(path, start = "/var/lib/mysql/abdwr/")
  )
disk_space_mariadb
# A tibble: 2 × 3
  path              size format 
  <fs::path> <fs::bytes> <chr>  
1 events.frm       3.92K mariadb
2 events.ibd       1.22G mariadb

12.4 Performance Comparison

In this chapter, we have explored five different data storage formats (CSV, *.rds, Parquet, DuckDB, and MariaDB), as well as their corresponding R object interfaces, all of which are compatible with dplyr. Computing performance is often measured in terms of three quantities: computational speed, memory footprint, and disk storage footprint. We consider these three criteria in turn.

12.4.1 Computational speed

First, we compare the performance in terms of querying speed. We use the mark() function from the bench package to compare the amount of time it takes to compute the Statcast summary statistics for our 2020–2023 data. The five classes of objects are: 1) a tbl (data frame), which stores the data in memory; 2) an Arrow object backed by the CSV files; 3) an Arrow object backed by the Parquet files, partitioned by year; 4) a DuckDB object; and 5) a MariaDB object.

First, we set up the tbl interface and query across all four years of data.

sc_tbl <- statcast_dir |>
  statcast_read_csv()

summary_tbl <- sc_tbl |>
  group_by(year = year(game_date), game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(events == "home_run", na.rm = TRUE)
  )

Second, we set up an arrow object to read from the Parquet files we created earlier, and to make use of the file-based partitioning based on year.

sc_arrow_part <- statcast_parquet |>
  open_dataset(partitioning = "year")

summary_arrow_part <- sc_arrow_part |>
  group_by(year = year(game_date), game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(events == "home_run", na.rm = TRUE)
  )

Now we can benchmark the query execution times.

library(bench)
res <- mark(
  tbl = summary_tbl,
  arrow_csv = summary_arrow |> collect(),
  arrow_part = summary_arrow_part |> collect(),
  duckdb = summary_duckdb |> collect(),
  mariadb = summary_maria |> collect(),
  check = FALSE
) |>
  arrange(median)
res
# A tibble: 5 × 6
  expression      min   median    `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>        <dbl> <bch:byt>    <dbl>
1 tbl         15.95ns  20.95ns 46835446.           0B     0   
2 duckdb      68.68ms  77.56ms       13.2     490.9KB     5.30
3 arrow_part 231.06ms 235.03ms        4.26     47.6KB     0   
4 arrow_csv     1.18s    1.18s        0.847    47.9KB     0   
5 mariadb       4.28s    4.28s        0.233   185.3KB     0   

The performance can vary greatly depending on the hardware and software configuration of the computer. On this machine, which has 12 CPUs and 32 gigabytes of RAM, the results indicate that duckdb is significantly faster than the other disk-backed interfaces. The tbl entry is not a fair comparison: summary_tbl was computed eagerly when we defined it, so the benchmark merely returns an object that is already in memory, which is why it registers in nanoseconds. In this case, our query ran across all four partitions, so the partitioning scheme of Parquet wasn't that useful. Still, the Arrow object that was backed by Parquet files was about 5 times faster than the Arrow object backed by the CSV files. However, the DuckDB instance was still about 3 times faster than the Arrow/Parquet object. Even the Arrow object backed by the CSV files was about 4 times faster than RMariaDB, which was by far the worst performer.

In Section 12.5, we explore whether the performance of the Arrow/Parquet object will improve when querying the data for individual seasons separately.

12.4.2 Memory footprint

Second, we note that while the other objects occupy a negligible amount of space in memory, the tbl object takes up 1.6 Gb of RAM! This is because, as noted above, the tbl object stores the data in memory, while the other objects leave the data on disk and only query the relevant data when prompted. The size of these objects in bytes is shown below. Thus, the superior performance of the tbl object comes at a cost.

list(
  "tbl" = sc_tbl, 
  "arrow" = sc_arrow, 
  "duckdb" = sc_ddb,
  "mariadb" = sc_maria
) |>
  map_int(object.size)
       tbl      arrow     duckdb    mariadb 
1752447264        504      44424      11792 

12.4.3 Disk storage footprint

Third, in terms of data storage space, the *.rds format is the most compact. However, it comes with its own limitations: *.rds only works with R. The Parquet files take up about 60% more space than the *.rds files, but they work across platforms and they implement seamless partitioning. The DuckDB files take up nearly four times as much disk space as the Parquet files (and this is before we have built any indexes!), but they are faster to query.
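
Note that the disk_space_csv tibble used below is not constructed in the code shown above; a sketch of how it could be built from the daily CSV directory, mirroring the other disk_space_* objects, is:

# A sketch: tabulate the sizes of the daily CSV files in the same
# shape as the other disk_space_* tibbles.
disk_space_csv <- statcast_dir |>
  dir_info(glob = "*.csv") |>
  select(path, size) |>
  mutate(
    path = path_file(path),
    format = "csv"
  )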

disk_space <- bind_rows(
  disk_space_csv,
  disk_space_rds,
  disk_space_parquet,
  disk_space_duckdb,
  disk_space_mariadb
)
disk_space |>
  group_by(format) |>
  summarize(footprint = sum(size)) |>
  arrange(desc(footprint))
# A tibble: 5 × 2
  format    footprint
  <chr>   <fs::bytes>
1 duckdb        1.28G
2 mariadb       1.22G
3 csv           1.22G
4 parquet     342.46M
5 rds         213.36M

12.4.4 Overall guidelines

In our experiment with a 2.4 million row data set, on this particular computer using these particular data, we confirmed the obvious: reading the data into a tbl object in memory leads to the fastest computational performance while requiring the largest footprint in memory. Among the interfaces that read the data from disk, duckdb provided the fastest computational performance, while RMariaDB offered the slowest. Neither reduced the footprint on disk substantially relative to the original CSV files. Arrow using the Parquet storage format provided medium performance with a medium footprint on disk. R's internal *.rds storage format was the most compact, but the least versatile.

This leads us to the following guidelines as we consider these options in broader practice.

  • If your data is small (i.e., less than a couple hundred megabytes), just use CSV because it’s easy, cross-platform, and versatile. The fact that it is not space-efficient won’t matter because your data is small anyway.
  • If your data is larger than a couple hundred megabytes and you’re just working in R (either by yourself or with a few colleagues), use .rds because it’s space-efficient and optimized for R. (Note that this is how we chose to store much of the Retrosheet and Statcast data we use in this book.) Reading these files into tbl objects will lead to fast performance, and the data probably aren’t large enough to eat up enough of your computer’s memory for you to notice.
  • If your data is around a gigabyte or more and you need to share your data files across different platforms (i.e., not just R but also Python, etc.) and you don’t want to use a SQL-based RDBMS, store your data in the Parquet format and use the arrow package. Parquet is cross-platform and Arrow scales better than .rds. Both the performance and the storage footprint are likely to meet your needs. The file-based partitioning scheme may or may not help you, depending on how you are querying the database.
  • If you want to work in SQL with a local data store, use DuckDB, because it offers more features and better performance than RSQLite, and doesn’t require a server-client architecture that can be cumbersome to set up and maintain.
  • If you have access to a RDBMS server (hopefully maintained by a professional database administrator), use the appropriate DBI interface (e.g., RMariaDB, RPostgreSQL, etc.) to connect to it. A well-oiled server with sufficient resources will easily outperform anything you can do on your personal computer.

Making the appropriate choice for your situation will depend on weighing these factors carefully.

12.5 Launch Angles and Exit Velocities, Revisited

In the previous edition of this book, we created a data graphic that showed how wOBA varied as a function of launch angle and exit velocity for all batted balls in the 2017 season. That data graphic appeared on the cover of the second edition. Here, we produce a similar graphic across our four years of data. In doing so, we revisit the performance of our fastest data interfaces, and compare how well they work in practice.

Recall that while DuckDB stores its data in one large file, Arrow employs a file-based partitioning scheme that writes our data to a separate file for each year. Consider what happens when we ask the database to give us all the data for a particular player, say Pete Alonso, in a particular year, say 2020. DuckDB has to look in the whole big file for these data, but Arrow only has to look at the file for 2020, which in this case is less than one quarter the size of all of the files together. Then it only needs to look for Alonso’s data within that file. Because the file is smaller, it should be faster to find the relevant data.

Conversely, if we were to ask for Alonso’s data across all four years, the file-based partitioning wouldn’t do any good, because we’d have to consult all of the files anyway.

The following function pulls the data we need for any given batter and set of years.

read_bip_data <- function(tbl, begin, end = begin, 
                          batter_id = 624413) {
  x <- tbl |>
    mutate(year = year(game_date)) |>
    group_by(year) |>
    filter(type == "X", year >= begin, year <= end) |> 
    select(
      year, game_date, batter, launch_speed, launch_angle,
      estimated_ba_using_speedangle,
      estimated_woba_using_speedangle
    ) 
  if (!is.null(batter_id)) {
    x <- x |>
      filter(batter == batter_id)
  }
  x |>
    collect()
}

First, we compare the computational performance for pulling Pete Alonso’s balls in play for a single season, 2020.

mark(
  tbl = nrow(read_bip_data(sc_tbl, 2020)),
  arrow = nrow(read_bip_data(sc_arrow_part, 2020)),
  duckdb = nrow(read_bip_data(sc_ddb, 2020)),
  iterations = 5
) |>
  arrange(median)
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 duckdb        114ms    121ms      7.86    1.09MB     3.15
2 arrow         236ms    255ms      3.85   395.2KB     5.39
3 tbl           420ms    428ms      2.29  280.26MB     0   

Amazingly, both the duckdb and arrow objects outperform the tbl object, which is remarkable given that the tbl object is stored in R's memory while the others are reading from files stored on disk. This result reveals how well these highly optimized technologies work in practice. It's also worth noting that while the duckdb object is still about twice as fast as the arrow object, the latter greatly improved its performance relative to the previous comparison. This is because the file-based partitioning scheme pays off here: we are querying data from only a single year.

If, instead, we query across years, this performance boost disappears, and duckdb is once again several times faster than arrow.

mark(
  tbl = nrow(read_bip_data(sc_tbl, 2021, 2023)),
  arrow = nrow(read_bip_data(sc_arrow_part, 2021, 2023)),
  duckdb = nrow(read_bip_data(sc_ddb, 2021, 2023)),
  iterations = 5
) |>
  arrange(median)
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 duckdb       89.7ms    105ms      9.54     994KB    3.82 
2 arrow       453.7ms    462ms      2.04     363KB    2.86 
3 tbl         548.5ms    589ms      1.69     520MB    0.339

12.5.1 Launch angles over time

We’re now ready to make our plot, and compare the relationship of wOBA to both launch angle and exit velocity, but now over time. Just as we did in the second edition, we add some helpful guidelines to our plot.

guidelines <- tibble(
  launch_angle = c(10, 25, 50),
  launch_speed = 40,
  label = c("Ground balls", "Line drives", "Flyballs")
)

Since duckdb proved to be our best performer, we’ll use it to pull the data and draw the plot. Note that we use the slice_sample() function to avoid plotting all the data.

ev_plot <- sc_ddb |>
  read_bip_data(2020, 2023, batter_id = NULL) |>
  # for speed
  slice_sample(prop = 0.2) |>
  ggplot(
    aes(
      x = launch_speed, 
      y = launch_angle, 
      color = estimated_woba_using_speedangle
    )
  ) + 
  geom_hline(
    data = guidelines, aes(yintercept = launch_angle), 
    color = "black", linetype = 2
  ) + 
  geom_text(
    data = guidelines, 
    aes(label = label, y = launch_angle - 4), 
    color = "black", hjust = "left"
  ) + 
  geom_point(alpha = 0.05) +
  scale_color_viridis_c("BA") +
  scale_x_continuous(
    "Exit velocity (mph)", 
    limits = c(40, 120)
  ) + 
  scale_y_continuous(
    "Launch angle (degrees)", 
    breaks = seq(-75, 75, 25)
  ) +
  facet_wrap(vars(year))

A few observations stand out in Figure 12.1. First, line drives have a very high probability of being a hit, but the likelihood depends on how hard the ball comes off the bat and how high it is going. Second, there is a “sweet spot” where nearly all batted balls are hits. These form a white pocket centered around a launch angle of about 25 degrees and an exit velocity of at least 100 mph. As we will see later, these are often home runs. The contention is that batters are optimizing their swings to produce batted balls with these properties.

ev_plot +
  guides(color = guide_colorbar(title = "wOBA"))
Figure 12.1: Estimated wOBA as a function of launch angle and exit velocity, 2020–2023.

Do you see any differences across the various years? Other than the fact that there is less data in 2020 due to the pandemic-shortened season, the relationship appears to be about the same.

12.6 Further Reading

The abdwr3edata package contains all of the functions displayed in this chapter. Chapter 11 of this book covers the use of a MySQL database to explore park factors. In Section 11.2, we illustrate how to set up and use a MySQL server, and in Section 11.6, we discuss how to build your own baseball database.

Chapter 21 of Wickham, Çetinkaya-Rundel, and Grolemund (2023) covers databases, and explicates the dplyr interface to duckdb and other DBI-compliant databases. Chapter 22 discusses Arrow and Parquet in greater detail.

12.7 Exercises

1. Home runs on fast pitches

Use the database constructed in this chapter to find the total number of home runs that were hit on fastballs traveling in excess of 100 mph across the four seasons 2020–2023. What percentage of the total number of home runs hit in each season were hit on such fastballs?

2. Stolen base percentages by velocity

Use the database constructed in this chapter to compute the stolen base success rate for every pitch speed, rounded to the nearest 1 mph. Separate your analysis by steals of 2nd base versus 3rd base. Does pitch speed appear to be correlated with stolen base success rate?


  1. The arrow package contains a function called to_duckdb() that will create a DuckDB object from an existing Arrow object, but we choose not to use that because in this case we want to copy the data.↩︎
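
As a sketch of that zero-copy alternative (not used in this chapter), the Arrow dataset could be registered with DuckDB and queried lazily:

# A sketch: expose the Arrow dataset to DuckDB without copying it,
# then query it lazily via dplyr.
sc_arrow |>
  arrow::to_duckdb() |>
  dplyr::count(game_type) |>
  dplyr::collect()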