Appendix A — Retrosheet Files Reference

Authors
Affiliations

Bowling Green State University

Smith College

Max Marchi

Cleveland Guardians

A.1 Downloading Play-by-Play Files

A.1.1 Introduction

The play-by-play data files for every Major League season between 1913 and 2022 are currently available on the Retrosheet website at https://www.retrosheet.org/game.htm. Clicking on a single year, say 1950, downloads a compressed (.zip) file containing two sets of files: one with information on the plays for the home games of each team, and another with the rosters of the players for each team. This appendix illustrates the easiest way to work with Retrosheet files.

A.1.2 Chadwick

Henry Chadwick was a sportswriter who is credited with inventing the box score, batting average, and earned run average. The special software tools designed to process Retrosheet data are named in his honor. These tools are maintained by Ted Turocy, and are available at https://github.com/chadwickbureau/chadwick. Please follow the installation instructions for Chadwick. The repository contains binaries suitable for Windows users, while Linux and Mac users can compile their own versions of these tools by downloading and compiling the source code.

The particular component of Chadwick that we need to generate Retrosheet play-by-play data is called cwevent. It is a program that is run at the command line. If it is installed and working properly, you can simply type cwevent at the command line and see output like this.

cwevent
Chadwick expanded event descriptor, version 0.10.0
  Type 'cwevent -h' for help.
Copyright (c) 2002-2023
Dr T L Turocy, Chadwick Baseball Bureau (ted.turocy@gmail.com)
This is free software, subject to the terms of the GNU GPL 
license.

If you have installed Chadwick and you get an error when you run cwevent, it is probably one of two problems, both of which involve setting path environment variables. If the error says command not found (or something of that nature), then your operating system cannot find the cwevent binary, probably because your PATH environment variable does not include the directory where the cwevent binary is. On this Ubuntu machine,[1] we can find the correct path by typing:

which cwevent
/usr/local/bin/cwevent

You can check the current value of the PATH environment variable at the command line using echo.

echo $PATH

Use the export directive to append the path to the cwevent binary to the current PATH environment variable.

export PATH=$PATH:/usr/local/bin
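
The same check can also be made from within an R session. The following is a minimal sketch using only base R; add_to_path() is a hypothetical helper written for this illustration (it is not part of baseballr or Chadwick), and /usr/local/bin is simply the directory found on the machine described above.

```r
# Append a directory to PATH for the current R session, unless it is
# already listed. `add_to_path()` is a hypothetical helper.
add_to_path <- function(dir) {
  sep <- .Platform$path.sep
  parts <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!dir %in% parts) {
    Sys.setenv(PATH = paste(c(parts, dir), collapse = sep))
  }
  invisible(Sys.getenv("PATH"))
}

# Sys.which() returns an empty string when a program is not on PATH
if (!nzchar(Sys.which("cwevent"))) {
  add_to_path("/usr/local/bin")  # assumed install location
}
Sys.which("cwevent")
```

A change made with Sys.setenv() lasts only for the current R session, so it has to be repeated (or placed in a startup file) for later sessions.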

If cwevent runs, but throws an error, the most likely problem is that cwevent can’t find the Chadwick shared libraries. You can solve this problem by setting the LD_LIBRARY_PATH environment variable. Note that environment variables are system-specific. On this Ubuntu machine, we can find the Chadwick shared libraries using the find command.

find /usr/local -name "libchadwick*"
/usr/local/lib/libchadwick.la
/usr/local/lib/libchadwick.a
/usr/local/lib/libchadwick.so
/usr/local/lib/libchadwick.so.0
/usr/local/lib/libchadwick.so.0.0.0

So in order for cwevent to work, the LD_LIBRARY_PATH environment variable needs to include /usr/local/lib. You can set the LD_LIBRARY_PATH environment variable using export as we did above, or, to set environment variables from within R, use the Sys.getenv() and Sys.setenv() functions. The safe_add_ld_path() function we present below is similar to the chadwick_ld_library_path() function in the baseballr package.

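safe_add_ld_path <- function(path_new = "/usr/local/lib") {
  # split the current value into its unique, non-empty directories
  path_old <- Sys.getenv("LD_LIBRARY_PATH")
  path_old_parts <- path_old |>
    stringr::str_split_1(":") |>
    unique()
  path_old_parts <- path_old_parts[path_old_parts != ""]
  # prepend the new directory only if it is not already listed
  if (!path_new %in% path_old_parts) {
    path_new_parts <- c(path_new, path_old_parts)
    Sys.setenv(
      LD_LIBRARY_PATH = paste0(path_new_parts, collapse = ":")
    )
  }
  # return the (possibly updated) value
  Sys.getenv("LD_LIBRARY_PATH")
}
safe_add_ld_path()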
[1] "/opt/R/4.4.0/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu"

A.1.3 Downloading data for one or more seasons

Once you have Chadwick installed and working properly, the retrosheet_data() function from the baseballr package makes obtaining Retrosheet play-by-play data a breeze.

The retrosheet_data() function takes three optional arguments that help you manage your data. Note that this function calls cwevent, and so it will not work if you haven’t set up Chadwick properly as in Section A.1.2.

To download the Retrosheet data we use in this book, type:

retro_data <- baseballr::retrosheet_data(
  here::here("data_large/retrosheet"),
  c(1992, 1996, 1998, 2016)
)

This will download and process four years' worth of play-by-play data and return a list of length four, with each item containing a list of length two. Each of the four items in retro_data corresponds to one of the four years we specified. The two items for each year are data frames: one called events that stores the play-by-play data, and another called rosters that stores the rosters.

To isolate the play-by-play data for a single year, use the pluck() function.

retro1992 <- retro_data |>
  pluck("1992") |>
  pluck("events")
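
To work with several seasons at once, the nested list can be flattened into a single data frame. The sketch below uses base R and a small mock object that only imitates the structure described above; the game and player identifiers in it are made up, and real data would come from retrosheet_data().

```r
# Mock object with the same shape as the list described in the text:
# one element per season, each holding `events` and `rosters`.
# All values here are invented for illustration.
retro_mock <- list(
  "1992" = list(
    events = data.frame(
      game_id = c("ATL199204070", "ATL199204070"),
      event_id = 1:2
    ),
    rosters = data.frame(player_id = "aaaa001")
  ),
  "1996" = list(
    events = data.frame(game_id = "BOS199604020", event_id = 1L),
    rosters = data.frame(player_id = "bbbb001")
  )
)

# Pull out the events data frame for every season
events_by_year <- lapply(retro_mock, function(x) x$events)

# Stack the seasons, recording which season each row came from
all_events <- do.call(
  rbind,
  Map(
    function(df, yr) cbind(season = as.integer(yr), df),
    events_by_year, names(events_by_year)
  )
)
table(all_events$season)
```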

A.1.4 Saving the data

You probably don’t want to have to download and process all of this data every time you want to use it. Now that you have it stored in R, the best way to save it for later use is by writing the data frame for each year to disk using R’s internal data storage format. You can do this with the write_rds() function.

retro1992 |>
  write_rds(
    file = here::here("data/retro1992.rds"), 
    compress = "xz"
  )

Be sure to use the compress argument; write_rds() does not compress by default, and compression significantly reduces the size of the file.

If you want to build a whole database of Retrosheet data (for many years), iterate the process described above and combine it with our illustration of how to build a SQL database in Chapter 11 and our discussion of working with large data in Chapter 12.
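
As a rough sketch of that iteration, the function below writes the events data frame for each season to its own compressed file. save_events_by_year() is a hypothetical helper, the mock data are made up, and base R's saveRDS() is used so that the example has no package dependencies; write_rds() could be substituted.

```r
# Write one compressed .rds file per season.
# `save_events_by_year()` is a hypothetical helper for illustration.
save_events_by_year <- function(retro_list, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(dir, paste0("retro", names(retro_list), ".rds"))
  for (i in seq_along(retro_list)) {
    saveRDS(retro_list[[i]]$events, file = paths[i], compress = "xz")
  }
  invisible(paths)
}

# Mock input imitating the list returned for two seasons (invented values)
retro_mock <- list(
  "1992" = list(events = data.frame(game_id = "ATL199204070", event_id = 1L)),
  "1996" = list(events = data.frame(game_id = "BOS199604020", event_id = 1L))
)

paths <- save_events_by_year(retro_mock, file.path(tempdir(), "retrosheet"))
basename(paths)

# Read one season back in
retro1992_again <- readRDS(paths[1])
```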

A.1.5 The function parse_retrosheet_pbp()

In previous versions of this book, we included a function called parse_retrosheet_pbp() that could be used to download and process Retrosheet data. This function has been superseded by the retrosheet_data() function from the baseballr package, and we no longer recommend using it. However, if you are interested in working through its logic, the code is available through the abdwr3edata package.

abdwr3edata::parse_retrosheet_pbp
function(season) {
  download_retrosheet(season)
  unzip_retrosheet(season)
  create_csv_file(season)
  create_csv_roster(season)
  cleanup()
}
<bytecode: 0x5556446d3da8>
<environment: namespace:abdwr3edata>

You may want to access the various helper functions for further detail. For example:

abdwr3edata::download_retrosheet
function(season) {
  # get zip file from retrosheet website
  utils::download.file(
    url = paste0(
      "http://www.retrosheet.org/events/", season, "eve.zip"
    ),
    destfile = file.path(
      "retrosheet", "zipped",
      paste0(season, "eve.zip")
    )
  )
}
<bytecode: 0x555644881c80>
<environment: namespace:abdwr3edata>

A.1.6 Alternatives to Chadwick

The retrosheet package (Douglas and Scriven 2024) provides an alternative method for bringing Retrosheet data into R without an external dependency on the Chadwick software through its getRetrosheet() function. However, these data are returned as a list of lists (rather than a list of data frames), and thus can be considerably more cumbersome to analyze.

A.2 Retrosheet Event Files: a Short Reference

As we mentioned in Chapter 1, Retrosheet event files come in a format expressly devised for them, and require the use of software tools to convert them into a format suitable for data analysis. Retrosheet provides such software tools at https://www.retrosheet.org/tools.htm and a step-by-step example at https://www.retrosheet.org/stepex.txt for performing the conversion.

Chadwick provides similar tools for parsing Retrosheet event files that have been used for creating the play-by-play files used in this book (see Section A.1.2).

Chadwick tools generate a line for each play in the Retrosheet event files, consisting of 97 “regular” columns (the same that are obtained using the tools provided by Retrosheet) plus 63 “extended” fields, which together give easy access to all of the information contained in the Retrosheet event files. Going through every one of the more than 150 columns generated by the Chadwick tools is beyond the scope of this book, and thus we point to the documentation on the Chadwick website for the full list.[2] In this section we present the main fields describing an event and the state of the game when it happens.

Please note that while the Chadwick tools return variable names in all caps, the retrosheet_data() function uses snake_case for variable names.

A.2.1 Game and event identifiers

The games are identified in Retrosheet event files by 12-character strings (the GAME_ID column): the first three characters identify the home team, the following eight characters indicate the date when the game took place (in the YYYYMMDD format), and the last character is used to distinguish games of doubleheaders (thus “1” indicates the first game, “2” the second game, and “0” means only one game was played on the day).

Events are numbered sequentially within each game (column EVENT_ID), so every single action in the Retrosheet database can be uniquely identified by the combination of the game identifier and the event identifier.
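
Because the identifier has a fixed layout, its parts can be extracted by position. The following base R sketch assumes the snake_case column name game_id; parse_game_id() is a hypothetical helper and the two identifiers are made-up examples that merely follow the format.

```r
# Split 12-character game identifiers into their three components.
# `parse_game_id()` is a hypothetical helper for illustration.
parse_game_id <- function(game_id) {
  data.frame(
    game_id = game_id,
    home_team = substr(game_id, 1, 3),
    game_date = as.Date(substr(game_id, 4, 11), format = "%Y%m%d"),
    game_number = as.integer(substr(game_id, 12, 12))
  )
}

# Made-up identifiers: a single game and the second game of a doubleheader
games <- parse_game_id(c("NYN201604080", "SLN201607202"))
games

# A unique key for one event combines the game and event identifiers
event_key <- paste("NYN201604080", 17, sep = "_")
```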

A.2.2 The state of the game

Several fields are helpful for defining the state of the game when a particular event happened. The inning and the team on offense are stored in the INN_CT and BAT_HOME_ID fields, respectively. The latter field takes the value “0” (away team batting, i.e., top of the inning) or “1” (home team batting, i.e., bottom of the inning). The visitor score and the home score are recorded in the AWAY_SCORE_CT and HOME_SCORE_CT fields.

The number of outs before the play is indicated in the OUTS_CT column, while the situation of runners on base is coded in the field START_BASES_CD, using numbers from 0 to 7 as shown in Table A.1.[3]

Table A.1: Retrosheet coding for the situation of runners on base.
Code  Base occupancy
0     Empty
1     1B only
2     2B only
3     1B & 2B
4     3B only
5     1B & 3B
6     2B & 3B
7     Loaded
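
The codes in Table A.1 behave like a sum of 1 (runner on first), 2 (runner on second), and 4 (runner on third), so each base can be recovered with integer arithmetic. A minimal sketch in base R, where decode_bases() is a hypothetical helper:

```r
# Recover the occupancy of each base from the 0-7 base-state code.
# `decode_bases()` is a hypothetical helper for illustration.
decode_bases <- function(code) {
  code <- as.integer(code)
  data.frame(
    code = code,
    on_1b = code %% 2L == 1L,
    on_2b = (code %/% 2L) %% 2L == 1L,
    on_3b = (code %/% 4L) %% 2L == 1L
  )
}

bases <- decode_bases(0:7)
bases
```

Applied to the start_bases_cd column, the three logical columns make it easy to filter on, say, any situation with a runner on third.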

The actual description of the event resides in the EVENT_TX column, consisting of a string describing the outcome of the play (e.g., strikeout, single, etc.), some additional details (e.g., the type and location of the batted ball), and the advancement of any runner on base. Several columns are generated by decoding the EVENT_TX string:

  • EVENT_CD is a numeric code reflecting the basic event; Table A.2 displays the codes for the possible plays coded in this column.

  • BAT_EVENT_FL is a flag indicating whether an event is a batting event, in which case it is labeled as T. Non-batting events include, for example, stolen bases, wild pitches and, generally, any event that does not mark the end of a plate appearance.

  • H_CD is a numeric code indicating the base hit type, going from 1 for a single to 4 for a home run.

  • BATTEDBALL_CD is a single-character code denoting the batted ball type. It can assume one of the following values: G (ground ball), L (line drive), F (fly ball), P (pop-up). Note that for most of the seasons in the Retrosheet database, the batted ball type is reported only for plate appearances ending with the batter making an out, and is not available for base hits.

  • BATTEDBALL_LOC_TX is a string indicating the batted ball location, coded according to the diagram shown at https://www.retrosheet.org/location.htm. Note that this information is available for a limited number of seasons.

  • FLD_CD is a numeric code denoting the fielder first touching a batted ball, coded with the conventional baseball fielding notation going from 1 (the pitcher) to 9 (the right fielder).

Table A.2: Retrosheet coding for the type of event.
Code  Event type
2     Generic Out
3     Strikeout
4     Stolen Base
5     Defensive Indifference
6     Caught Stealing
8     Pickoff
9     Wild Pitch
10    Passed Ball
11    Balk
12    Other Advance
13    Foul Error
14    Nonintentional Walk
15    Intentional Walk
16    Hit By Pitch
17    Interference
18    Error
19    Fielder's Choice
20    Single
21    Double
22    Triple
23    Home Run
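
One way to make the numeric codes readable is a named lookup vector built from Table A.2. This is a sketch in base R under the assumption that the column is called event_cd (the snake_case form); the events data frame is a made-up sample, not Retrosheet data.

```r
# Lookup vector from event code to event type, transcribed from Table A.2
event_labels <- c(
  "2" = "Generic Out", "3" = "Strikeout", "4" = "Stolen Base",
  "5" = "Defensive Indifference", "6" = "Caught Stealing",
  "8" = "Pickoff", "9" = "Wild Pitch", "10" = "Passed Ball",
  "11" = "Balk", "12" = "Other Advance", "13" = "Foul Error",
  "14" = "Nonintentional Walk", "15" = "Intentional Walk",
  "16" = "Hit By Pitch", "17" = "Interference", "18" = "Error",
  "19" = "Fielder's Choice", "20" = "Single", "21" = "Double",
  "22" = "Triple", "23" = "Home Run"
)

# Invented sample of event codes
events <- data.frame(event_cd = c(3L, 20L, 2L, 23L, 14L, 4L))

# Index the lookup by the code converted to a character string
events$event_type <- unname(event_labels[as.character(events$event_cd)])

# Codes 20 through 23 are the four kinds of base hit
events$is_hit <- events$event_cd %in% 20:23
events
```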

The sequence of pitches is recorded in the PITCH_SEQ_TX column and was addressed in Chapter 6, where Table 6.1 displays how the different pitch outcomes are coded. Several columns are generated from this one, indicating counts of the various types of pitch outcomes, as displayed in Table A.3.

Table A.3: Columns reporting counts of various pitch types.
Column name             Column description
PA_BALL_CT              No. of balls in plate appearance
PA_CALLED_BALL_CT       No. of called balls in plate appearance
PA_INTENT_BALL_CT       No. of intentional balls in plate appearance
PA_PITCHOUT_BALL_CT     No. of pitchouts in plate appearance
PA_HITBATTER_BALL_CT    No. of pitches hitting batter in plate appearance
PA_OTHER_BALL_CT        No. of other balls in plate appearance
PA_STRIKE_CT            No. of strikes in plate appearance
PA_CALLED_STRIKE_CT     No. of called strikes in plate appearance
PA_SWINGMISS_STRIKE_CT  No. of swinging strikes in plate appearance
PA_FOUL_STRIKE_CT       No. of foul balls in plate appearance
PA_INPLAY_STRIKE_CT     No. of balls in play in plate appearance
PA_OTHER_STRIKE_CT      No. of other strikes in plate appearance

A.3 Parsing Retrosheet Pitch Sequences

A.3.1 Introduction

Chapter 6 showed how to compute, using regular expressions, whether a plate appearance went through either a 1-0 or a 0-1 count. Here we provide the code to retrieve the same information for every possible balls/strikes count.

A.3.2 Setup

We first load Retrosheet data for the 2016 season.

retro2016 <- read_rds(here::here("data/retro2016.rds"))

Then a new column, sequence, is created in which the pitch sequence is reported, stripped of any character not indicating an actual pitch to the batter.[4]

retro2016 <- retro2016 |>
  mutate(sequence = gsub("[.>123+*N]", "", pitch_seq_tx))
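
To see what this substitution does, it can be applied to a few strings directly. The pitch sequences below are made up for illustration, and clean_sequence() is a hypothetical wrapper around the same gsub() call.

```r
# Remove the characters that do not represent a pitch to the batter.
# `clean_sequence()` is a hypothetical wrapper around the gsub() call above.
clean_sequence <- function(pitch_seq_tx) {
  gsub("[.>123+*N]", "", pitch_seq_tx)
}

# Invented pitch sequences
raw <- c("CBFX", "B1.>S*BX", "+1CN.B")
cleaned <- clean_sequence(raw)
cleaned
```

Inside the square brackets the characters ., +, and * are taken literally, so no escaping is needed.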

A.3.3 Evaluating every count

Every plate appearance starts with a 0-0 count. The code for both the 1-0 and 0-1 counts was described in Chapter 6.

retro2016 <- retro2016 |>
  mutate(
    c00 = TRUE,
    c10 = grepl("^[BIPV]", sequence),
    c01 = grepl("^[CFKLMOQRST]", sequence)
  )

A number inside the curly brackets indicates the exact number of times the preceding expression has to be repeated in the string to match. The following lines look for plate appearances going through the counts 2-0, 3-0, and 0-2.

retro2016 <- retro2016 |>
  mutate(
    c20 = grepl("^[BIPV]{2}", sequence),
    c30 = grepl("^[BIPV]{3}", sequence), 
    c02 = grepl("^[CFKLMOQRST]{2}", sequence)
  )
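
A quick check of these three patterns on a handful of made-up sequences shows how the quantifier behaves. This sketch uses base R only.

```r
# Invented, already-cleaned pitch sequences
sequences <- c("BBX", "BCBX", "BBBB", "CCS", "CFBX")

check <- data.frame(
  sequence = sequences,
  c20 = grepl("^[BIPV]{2}", sequences),
  c30 = grepl("^[BIPV]{3}", sequences),
  c02 = grepl("^[CFKLMOQRST]{2}", sequences)
)
check
```

Because the patterns are anchored only at the start, a sequence that reaches 3-0 also matches the 2-0 pattern, which is the desired behavior: the plate appearance went through both counts.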

The | (vertical bar) character is used to separate alternatives. The following lines parse the sequence string looking for the different sequences that can lead to 1-1, 2-1, and 3-1 counts.

b <- "[BIPV]"
s <- "[CFKLMOQRST]"
retro2016 <- retro2016 |>
  mutate(
    c11 = grepl(
      paste0("^", s, b, "|^", b, s), sequence
    ),
    c21 = grepl(
      paste0("^", s, b, b, 
             "|^", b, s, b, 
             "|^", b, b, s), sequence
    ), 
    c31 = grepl(
      paste0("^", s, b, b, b, 
             "|^", b, s, b, b, 
             "|^", b, b, s, b, 
             "|^", b, b, b, s), sequence
    )
  )
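
The alternation can be verified on a few made-up sequences. The sketch below rebuilds the 1-1 and 2-1 patterns with the same character classes and applies them with base R.

```r
b <- "[BIPV]"
s <- "[CFKLMOQRST]"

# Patterns for the 1-1 and 2-1 counts, built as in the text
p11 <- paste0("^", s, b, "|^", b, s)
p21 <- paste0("^", s, b, b, "|^", b, s, b, "|^", b, b, s)

# Invented, already-cleaned pitch sequences
seq11 <- c("CBX", "BCS", "BBX", "CCB", "X")
seq21 <- c("CBBX", "BCBX", "BBCX", "BBBX", "CBCX")

through_11 <- grepl(p11, seq11)
through_21 <- grepl(p21, seq21)
through_11
through_21
```

Each alternative carries its own ^ anchor; without it, the later alternatives could match anywhere in the string rather than only at the start of the plate appearance.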

On two-strike counts, batters can indefinitely foul pitches off without affecting the count. In the lines below, sequences reaching two strikes before reaching the desired number of balls feature the [FR]* expression, denoting a foul ball[5] happening any number of times, including zero, as indicated by the asterisk.

retro2016 <- retro2016 |>
  mutate(
    c12 = grepl(
      paste0("^", b, s, s, 
             "|^", s, b, s, 
             "|^", s, s, "[FR]*", b), sequence
    ), 
    c22 = grepl(
      paste0("^", b, b, s, s, 
             "|^", b, s, b, s, 
             "|^", b, s, s, "[FR]*", b,
             "|^", s, b, b, s, 
             "|^", s, b, s, "[FR]*", b,
             "|^", s, s, "[FR]*", b, "[FR]*", b), 
      sequence
    ), 
    c32 = grepl(
      paste0("^",  s, "*", b, s, 
             "*", b, s, "*", b), sequence
    ) & grepl(
      paste0("^", b, "*", s, b, "*", s), sequence
    )
  )
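
As a cross-check on the regular expressions, the count can also be tracked by walking through a sequence one pitch at a time. The function below is a sketch of that alternative, not the method used in this book: counts_reached() is a hypothetical helper, it follows the same assumption as above that only F and R leave a two-strike count unchanged, and it stops at the first character that is neither a ball nor a strike.

```r
# Walk a cleaned pitch sequence and record every count it passes through.
# `counts_reached()` is a hypothetical helper for illustration.
counts_reached <- function(sequence) {
  balls_chr <- c("B", "I", "P", "V")
  strikes_chr <- c("C", "F", "K", "L", "M", "O", "Q", "R", "S", "T")
  fouls_chr <- c("F", "R")
  pitches <- strsplit(sequence, "", fixed = TRUE)[[1]]
  balls <- 0L
  strikes <- 0L
  reached <- "0-0"
  for (p in pitches) {
    if (p %in% balls_chr) {
      balls <- balls + 1L
    } else if (p %in% strikes_chr) {
      # a foul with two strikes leaves the count unchanged
      if (strikes == 2L && p %in% fouls_chr) next
      strikes <- strikes + 1L
    } else {
      break  # ball in play, hit by pitch, or anything else ends the walk
    }
    # ball four or strike three ends the plate appearance
    if (balls > 3L || strikes > 2L) break
    reached <- c(reached, paste0(balls, "-", strikes))
  }
  reached
}

# Invented sequence: strike, ball, two fouls, two balls, ball in play
full_count <- counts_reached("CBFFBBX")
full_count
```

Comparing the output of such a function with the c12, c22, and c32 columns for a sample of plate appearances is one way to spot sequences that the regular expressions handle differently.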

The retrosheet_add_counts() function in the abdwr3edata package contains all of the code necessary to compute these counts.


  1. On Windows, the analogous DOS command is where.

  2. The documentation for all the software tools is available at https://chadwick.sourceforge.net/doc/cwtools.html. In particular, the tool for processing the event files (cwevent) is documented at https://chadwick.sourceforge.net/doc/cwevent.html#cwtools-cwevent.

  3. An analogous column named END_BASES_CD contains the base state at the end of the play, coded in the same way.

  4. See Table 6.1 in Chapter 6 for reference.

  5. F encodes a foul ball, R a foul ball on a pitchout. See Table 6.1.