Appendix A — Retrosheet Files Reference
A.1 Downloading Play-by-Play Files
A.1.1 Introduction
The play-by-play data files for every Major League season between 1913 and 2022 are currently available on the Retrosheet website at https://www.retrosheet.org/game.htm. Clicking on a single year, say 1950, downloads a compressed (.zip) file containing a collection of files: one set of files containing information on the plays for the home games for all teams, and another set of files giving the rosters of the players for each team. This appendix illustrates the easiest way to work with Retrosheet files.
A.1.2 Chadwick
Henry Chadwick was a sportswriter who is credited with inventing the box score, batting average, and earned run average. The special software tools designed to process Retrosheet data are named in his honor. These tools are maintained by Ted Turocy and are available at https://github.com/chadwickbureau/chadwick. Please follow the installation instructions for Chadwick. The repository contains binaries suitable for Windows users, while Linux and Mac users can build the tools from the source code.
The particular component of Chadwick that we need to generate Retrosheet play-by-play data is called cwevent. It is a program that is run at the command line. If it is installed and working properly, you can simply type cwevent at the command line and see output like this.
Chadwick expanded event descriptor, version 0.10.0
Type 'cwevent -h' for help.
Copyright (c) 2002-2023
Dr T L Turocy, Chadwick Baseball Bureau (ted.turocy@gmail.com)
This is free software, subject to the terms of the GNU GPL
license.
If you have installed Chadwick and you get an error when you run cwevent, it is probably one of two problems, both of which involve setting path environment variables. If the error says command not found (or something of that nature), then your operating system cannot find the cwevent binary, probably because your PATH environment variable does not include the directory where the cwevent binary is. On this Ubuntu machine[1], we can find the correct path by typing:
which cwevent
/usr/local/bin/cwevent
You can check the current value of the PATH environment variable at the command line using echo.
echo $PATH
Use the export directive to append the directory containing the cwevent binary to the current PATH environment variable.
export PATH=$PATH:/usr/local/bin
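An R session may not see the same PATH as your terminal, so it can be worth repeating the check from within R. The following is a minimal sketch using only base R; the messages are our own wording, and the snippet assumes nothing beyond the binary being named cwevent.

```r
# Ask R where (or whether) it can see the cwevent binary on its PATH.
# Sys.which() returns an empty string when the program is not found.
cwevent_path <- Sys.which("cwevent")

if (nzchar(cwevent_path)) {
  message("cwevent found at: ", cwevent_path)
} else {
  message(
    "cwevent not found; the PATH seen by R is: ",
    Sys.getenv("PATH")
  )
}
```

If the binary is not found here but is found in the terminal, the two environments are probably using different values of PATH.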
If cwevent runs, but throws an error, the most likely problem is that cwevent can’t find the Chadwick shared libraries. You can solve this problem by setting the LD_LIBRARY_PATH environment variable. Note that environment variables are system-specific. On this Ubuntu machine, we can find the Chadwick shared libraries using the find command.
find /usr/local -name "libchadwick*"
/usr/local/lib/libchadwick.la
/usr/local/lib/libchadwick.a
/usr/local/lib/libchadwick.so
/usr/local/lib/libchadwick.so.0
/usr/local/lib/libchadwick.so.0.0.0
So in order for cwevent to work, the LD_LIBRARY_PATH environment variable needs to include /usr/local/lib. You can set the LD_LIBRARY_PATH environment variable using export as we did above, or, to set environment variables from within R, use the Sys.getenv() and Sys.setenv() functions. The safe_add_ld_path() function we present below is similar to the chadwick_ld_library_path() function in the baseballr package.
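safe_add_ld_path <- function(path_new = "/usr/local/lib") {
  # Split the current LD_LIBRARY_PATH into its component directories
  # (str_split_1() comes from the stringr package)
  path_old <- Sys.getenv("LD_LIBRARY_PATH")
  path_old_parts <- path_old |>
    str_split_1(":") |>
    unique()
  # Prepend the new directory only if it is not already present
  if (!path_new %in% path_old_parts) {
    path_new_parts <- c(path_new, path_old_parts)
    Sys.setenv(
      LD_LIBRARY_PATH = paste0(path_new_parts, collapse = ":")
    )
  }
  # Return the (possibly updated) value
  Sys.getenv("LD_LIBRARY_PATH")
}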
safe_add_ld_path()
[1] "/opt/R/4.4.1/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu"
A.1.3 Downloading data for one or more seasons
Once you have Chadwick installed and working properly, the retrosheet_data() function from the baseballr package makes obtaining Retrosheet play-by-play data a breeze.
The retrosheet_data() function takes three optional arguments that help you manage your data. Note that this function calls cwevent, and so it will not work if you haven’t set up Chadwick properly as in Section A.1.2.
To download the Retrosheet data we use in this book, type:
retro_data <- baseballr::retrosheet_data(
  here::here("data_large/retrosheet"),
  c(1992, 1996, 1998, 2016)
)
This will download and process four years’ worth of play-by-play data and return a list of length four, with each item containing a list of length two. Each of the four items in retro_data corresponds to one of the four years we specified. The two items in each of those years are data frames: one called events that stores the play-by-play data, and another called rosters that stores the rosters.
To isolate the play-by-play data for a single year, use the pluck() function.
retro1992 <- retro_data |>
  pluck("1992") |>
  pluck("events")
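To make the nested structure concrete, here is a small sketch built on a toy list that mimics the shape described above. The object toy_retro and all of its column names and values are invented for illustration; only the events and rosters element names come from the description of retrosheet_data().

```r
# Toy stand-in for the object returned by retrosheet_data(): a named list
# with one element per season, each holding `events` and `rosters`.
# All values below are made up for illustration.
toy_retro <- list(
  "1992" = list(
    events  = data.frame(game_id = c("G1", "G1"), event_id = 1:2),
    rosters = data.frame(player_id = "p001", team_id = "AAA")
  ),
  "1996" = list(
    events  = data.frame(game_id = "G2", event_id = 1L),
    rosters = data.frame(player_id = "p002", team_id = "BBB")
  )
)

# Base R equivalent of pluck("1992") |> pluck("events")
events_1992 <- toy_retro[["1992"]][["events"]]
nrow(events_1992)

# Pull the events data frame out of every season at once
all_events <- lapply(toy_retro, function(season) season[["events"]])
names(all_events)
```

Indexing by the year as a character string matters here: the seasons are stored by name, not by position.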
A.1.4 Saving the data
You probably don’t want to have to download and process all of this data every time you want to use it. Now that you have it stored in R, the best way to save it for later use is by writing the data frame for each year to disk using R’s internal data storage format. You can do this with the write_rds() function.
retro1992 |>
  write_rds(
    file = here::here("data/retro1992.rds"),
    compress = "xz"
  )
Be sure to use the compress argument: it will significantly reduce the size of the file on disk.
If you want to build a whole database of Retrosheet data (for many years), iterate the process described above and combine it with our illustration of how to build a SQL database in Chapter 11 and our discussion of working with large data in Chapter 12.
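As a rough sketch of what iterating the process might look like, the code below writes one compressed file per season from a toy list. The toy_retro object, the temporary output directory, and the file naming are assumptions made for illustration, and base R's saveRDS() and readRDS() are used in place of write_rds() so that the example runs without extra packages.

```r
# Toy stand-in for the list returned by retrosheet_data() (values made up)
toy_retro <- list(
  "1992" = list(events = data.frame(event_id = 1:3)),
  "1996" = list(events = data.frame(event_id = 1:2))
)

# Write to a temporary directory so the example leaves no files behind
out_dir <- tempdir()

# Save the events data frame of each season to its own .rds file
saved_files <- vapply(names(toy_retro), function(year) {
  path <- file.path(out_dir, paste0("retro", year, ".rds"))
  saveRDS(toy_retro[[year]][["events"]], file = path, compress = "xz")
  path
}, character(1))

# Read one season back to confirm the round trip
retro1992_again <- readRDS(saved_files[["1992"]])
```

Keeping one file per season means a later analysis can load only the years it needs.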
A.1.5 The function parse_retrosheet_pbp()
In previous versions of this book, we included a function called parse_retrosheet_pbp() that could be used to download and process Retrosheet data. This function has been superseded by the retrosheet_data() function from the baseballr package, and we no longer recommend using it. However, if you are interested in working through its logic, the code is available through the abdwr3edata package.
abdwr3edata::parse_retrosheet_pbp
function(season) {
  download_retrosheet(season)
  unzip_retrosheet(season)
  create_csv_file(season)
  create_csv_roster(season)
  cleanup()
}
<bytecode: 0x555727287090>
<environment: namespace:abdwr3edata>
You may want to access the various helper functions for further detail. For example:
abdwr3edata::download_retrosheet
function(season) {
  # get zip file from retrosheet website
  utils::download.file(
    url = paste0(
      "http://www.retrosheet.org/events/", season, "eve.zip"
    ),
    destfile = file.path(
      "retrosheet", "zipped",
      paste0(season, "eve.zip")
    )
  )
}
<bytecode: 0x555727484028>
<environment: namespace:abdwr3edata>
A.1.6 Alternatives to Chadwick
The retrosheet package (Douglas and Scriven 2024) provides an alternative method for bringing Retrosheet data into R without an external dependency on the Chadwick software through its getRetrosheet() function. However, these data are returned as a list of lists (rather than a list of data frames), and thus can be considerably more cumbersome to analyze.
A.2 Retrosheet Event Files: a Short Reference
As we mentioned in Chapter 1, Retrosheet event files come in a format expressly devised for them, and require software tools to convert them into a format suitable for data analysis. Retrosheet provides such software tools at https://www.retrosheet.org/tools.htm and a step-by-step example at https://www.retrosheet.org/stepex.txt for performing the conversion.
Chadwick provides similar tools for parsing Retrosheet event files, and these were used to create the play-by-play files used in this book (see Section A.1.2).
Chadwick tools generate a line for each play in the Retrosheet event files, consisting of 97 “regular” columns (the same that are obtained using the tools provided by Retrosheet) plus 63 “extended” fields, making it easy to access all of the information contained in the Retrosheet event files. Going through every one of the more than 150 columns generated by the Chadwick tools is beyond the scope of this book, and thus we point to the documentation on the Chadwick website for the full list.[2] In this section we present the main fields describing an event and the state of the game when it happens.
Please note that while the Chadwick tools return variable names in all caps, the retrosheet_data() function uses snake_case for variable names (for example, PITCH_SEQ_TX becomes pitch_seq_tx).
A.2.1 Game and event identifiers
The games are identified in Retrosheet event files by 12-character strings (the GAME_ID column): the first three characters identify the home team, the following eight characters indicate the date when the game took place (in the YYYYMMDD format), and the last character is used to distinguish games of doubleheaders (thus “1” indicates the first game, “2” the second game, and “0” means only one game was played on the day).
Events are numbered sequentially within each game (column EVENT_ID), thus every single action in the Retrosheet database can be uniquely identified by the combination of the game identifier and the event identifier.
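Because the identifier has a fixed layout, its parts can be recovered by position. The following is a minimal base R sketch; the two identifiers are made up to show the format and are not meant to correspond to particular games.

```r
# Two illustrative game identifiers (invented to show the layout)
game_id <- c("NYA201604050", "BOS201608161")

# Characters 1-3: home team
home_team <- substr(game_id, 1, 3)

# Characters 4-11: game date in YYYYMMDD format
game_date <- as.Date(substr(game_id, 4, 11), format = "%Y%m%d")

# Character 12: 0 = single game, 1 or 2 = game of a doubleheader
game_number <- as.integer(substr(game_id, 12, 12))

data.frame(game_id, home_team, game_date, game_number)
```

The same substr() calls apply unchanged to a game_id column inside mutate().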
A.2.2 The state of the game
Several fields are helpful for defining the state of the game when a particular event happened. The inning and the team on offense are stored in the INN_CT and BAT_HOME_ID fields, respectively. The latter field can assume values “0” (away team batting, i.e., top of the inning) or “1” (home team batting, bottom of the inning). The visitor score and the home score are recorded in the AWAY_SCORE_CT and HOME_SCORE_CT fields.
The number of outs before the play is indicated in the OUTS_CT column, while the situation of runners on base is coded in the field START_BASES_CD, using numbers from 0 to 7 as shown in Table A.1.[3]
Code | Bases occupancy |
---|---|
0 | Empty |
1 | 1B only |
2 | 2B only |
3 | 1B & 2B |
4 | 3B only |
5 | 1B & 3B |
6 | 2B & 3B |
7 | Loaded |
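The codes in Table A.1 follow a binary pattern: first base contributes 1, second base 2, and third base 4. The sketch below uses that pattern to recover which bases are occupied; the variable names on_1b, on_2b, on_3b, and runners are our own choices for illustration.

```r
# All possible values of the base-state code
start_bases_cd <- 0:7

# Each base corresponds to one binary digit of the code
on_1b <- start_bases_cd %% 2 == 1
on_2b <- (start_bases_cd %/% 2) %% 2 == 1
on_3b <- (start_bases_cd %/% 4) %% 2 == 1

# Number of runners on base for each code
runners <- on_1b + on_2b + on_3b

data.frame(start_bases_cd, on_1b, on_2b, on_3b, runners)
```

For example, code 5 decomposes as 1 + 4, which matches the "1B & 3B" row of the table.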
The actual description of the event resides in the EVENT_TX column, consisting of a string describing the outcome of the play (e.g., strikeout, single, etc.), some additional details (e.g., the type and location of the batted ball), and the advancement of any runner on base. Several columns are generated by decoding the EVENT_TX string:
- EVENT_CD is a numeric code reflecting the basic event; Table A.2 displays the codes for the possible plays coded in this column.
- BAT_EVENT_FL is a flag indicating whether an event is a batting event, in which case it is labeled as T. Non-batting events include, for example, stolen bases, wild pitches and, generally, any event that does not mark the end of a plate appearance.
- H_CD is a numeric code indicating the base hit type, going from 1 for a single to 4 for a home run.
- BATTEDBALL_CD is a single character code denoting the batted ball type. It can assume one of the following values: G (ground ball), L (line drive), F (fly ball), P (pop-up). Note that for most of the seasons in the Retrosheet database, the batted ball type is reported only for plate appearances ending with the batter making an out, while it is not available on base hits.
- BATTEDBALL_LOC_TX is a string indicating the batted ball location, coded according to the diagram shown at https://www.retrosheet.org/location.htm. Note that this information is available for a limited number of seasons.
- FLD_CD is a numeric code denoting the fielder first touching a batted ball, coded with the conventional baseball fielding notation going from 1 (the pitcher) to 9 (the right fielder).
Code | Event type |
---|---|
2 | Generic Out |
3 | Strikeout |
4 | Stolen Base |
5 | Defensive Indifference |
6 | Caught Stealing |
8 | Pickoff |
9 | Wild Pitch |
10 | Passed Ball |
11 | Balk |
12 | Other Advance |
13 | Foul Error |
14 | Nonintentional Walk |
15 | Intentional Walk |
16 | Hit By Pitch |
17 | Interference |
18 | Error |
19 | Fielder Choice |
20 | Single |
21 | Double |
22 | Triple |
23 | Homerun |
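One convenient way to use Table A.2 is as a lookup vector that turns numeric codes into labels. The sketch below does this in base R; the vector event_cd is a made-up sample standing in for the event_cd column, and the names event_labels and is_hit are our own.

```r
# Lookup vector built from Table A.2: names are codes, values are labels
event_labels <- c(
  "2" = "Generic Out", "3" = "Strikeout", "4" = "Stolen Base",
  "5" = "Defensive Indifference", "6" = "Caught Stealing",
  "8" = "Pickoff", "9" = "Wild Pitch", "10" = "Passed Ball",
  "11" = "Balk", "12" = "Other Advance", "13" = "Foul Error",
  "14" = "Nonintentional Walk", "15" = "Intentional Walk",
  "16" = "Hit By Pitch", "17" = "Interference", "18" = "Error",
  "19" = "Fielder Choice", "20" = "Single", "21" = "Double",
  "22" = "Triple", "23" = "Homerun"
)

# A made-up sample of event codes
event_cd <- c(3, 20, 4, 23, 14)

# Translate codes to labels by name
event_type <- unname(event_labels[as.character(event_cd)])
event_type

# Codes 20 through 23 are the base hits
is_hit <- event_cd %in% 20:23
is_hit
```

Indexing by as.character(event_cd) is deliberate: indexing by the number itself would select by position rather than by code.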
The sequence of pitches is recorded in the PITCH_SEQ_TX column and has been addressed in Chapter 6, where Table 6.1 displays how the different pitch outcomes are coded. Several columns are generated from this one, indicating counts of the various types of pitch outcomes, as displayed in Table A.3.
Column name | Column description |
---|---|
PA_BALL_CT | No. of balls in plate appearance |
PA_CALLED_BALL_CT | No. of called balls in plate appearance |
PA_INTENT_BALL_CT | No. of intentional balls in plate appearance |
PA_PITCHOUT_BALL_CT | No. of pitchouts in plate appearance |
PA_HITBATTER_BALL_CT | No. of pitches hitting batter in plate appearance |
PA_OTHER_BALL_CT | No. of other balls in plate appearance |
PA_STRIKE_CT | No. of strikes in plate appearance |
PA_CALLED_STRIKE_CT | No. of called strikes in plate appearance |
PA_SWINGMISS_STRIKE_CT | No. of swinging strikes in plate appearance |
PA_FOUL_STRIKE_CT | No. of foul balls in plate appearance |
PA_INPLAY_STRIKE_CT | No. of balls in play in plate appearance |
PA_OTHER_STRIKE_CT | No. of other strikes in plate appearance |
A.3 Parsing Retrosheet Pitch Sequences
A.3.1 Introduction
Chapter 6 showed how to compute, using regular expressions, whether a plate appearance went through either a 1-0 or a 0-1 count. Here we provide the code to retrieve the same information for every possible balls/strikes count.
A.3.2 Setup
We first load Retrosheet data for the 2016 season.
retro2016 <- read_rds(here::here("data/retro2016.rds"))
Then a new column, sequence, is created in which the pitch sequence is reported, stripped of any character not indicating an actual pitch to the batter.[4]
retro2016 <- retro2016 |>
  mutate(sequence = gsub("[.>123+*N]", "", pitch_seq_tx))
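To see what this substitution does, the sketch below applies the same regular expression to a few toy pitch strings. The strings are invented for illustration and the name seq_clean is our own; inside the square brackets, characters such as the period, plus sign, and asterisk are treated literally.

```r
# Made-up pitch strings containing some non-pitch markers
pitch_seq_tx <- c("CBFX", "B1>S.FX", "*B+2CN")

# Remove every character listed inside the square brackets
seq_clean <- gsub("[.>123+*N]", "", pitch_seq_tx)
seq_clean
```

Only the pitch characters survive, so the position of each remaining character corresponds to one pitch to the batter.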
A.3.3 Evaluating every count
Every plate appearance starts with a 0-0 count. The code for both the 1-0 and 0-1 counts was described in Chapter 6.
A number inside the curly brackets indicates the exact number of times the preceding expression has to be repeated in the string to match. The following lines look for plate appearances going through the counts 2-0, 3-0, and 0-2.
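As a minimal sketch of the curly-bracket quantifier, the patterns below test a few toy pitch sequences for the 2-0, 3-0, and 0-2 counts. The character classes b and s are the ball and strike classes used in the rest of this section; the vector toy_sequences is invented for illustration and stands in for the sequence column.

```r
b <- "[BIPV]"         # pitch codes counted as balls
s <- "[CFKLMOQRST]"   # pitch codes counted as strikes

# Made-up cleaned pitch sequences
toy_sequences <- c("BBX", "BBBB", "CSFX", "BCX", "CBS")

# {2} and {3} require exactly that many repetitions at the start (^)
c20 <- grepl(paste0("^", b, "{2}"), toy_sequences)
c30 <- grepl(paste0("^", b, "{3}"), toy_sequences)
c02 <- grepl(paste0("^", s, "{2}"), toy_sequences)

data.frame(toy_sequences, c20, c30, c02)
```

Because these counts can only be reached by consecutive balls or consecutive strikes from the start of the plate appearance, a single repeated class anchored at the beginning of the string is enough.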
The | (vertical bar) character is used to separate alternatives. The following lines parse the sequence string looking for the different sequences that can lead to 1-1, 2-1, and 3-1 counts.
b <- "[BIPV]"
s <- "[CFKLMOQRST]"
retro2016 <- retro2016 |>
  mutate(
    c11 = grepl(
      paste0("^", s, b, "|^", b, s), sequence
    ),
    c21 = grepl(
      paste0("^", s, b, b,
             "|^", b, s, b,
             "|^", b, b, s), sequence
    ),
    c31 = grepl(
      paste0("^", s, b, b, b,
             "|^", b, s, b, b,
             "|^", b, b, s, b,
             "|^", b, b, b, s), sequence
    )
  )
On two-strike counts, batters can indefinitely foul pitches off without affecting the count. In the lines below, sequences reaching two strikes before reaching the desired number of balls feature the [FR]* expression, denoting a foul ball[5] happening any number of times, including zero, as indicated by the asterisk.
retro2016 <- retro2016 |>
  mutate(
    c12 = grepl(
      paste0("^", b, s, s,
             "|^", s, b, s,
             "|^", s, s, "[FR]*", b), sequence
    ),
    c22 = grepl(
      paste0("^", b, b, s, s,
             "|^", b, s, b, s,
             "|^", b, s, s, "[FR]*", b,
             "|^", s, b, b, s,
             "|^", s, b, s, "[FR]*", b,
             "|^", s, s, "[FR]*", b, "[FR]*", b),
      sequence
    ),
    c32 = grepl(
      paste0("^", s, "*", b, s,
             "*", b, s, "*", b), sequence
    ) & grepl(
      paste0("^", b, "*", s, b, "*", s), sequence
    )
  )
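To check the role of the [FR]* expression, the sketch below applies the 1-2 pattern from the code above to a handful of toy sequences. The sequences are invented for illustration, and the names pattern_12, toy_sequences, and reached_12 are our own.

```r
b <- "[BIPV]"         # pitch codes counted as balls
s <- "[CFKLMOQRST]"   # pitch codes counted as strikes

# The three orderings that lead to a 1-2 count
pattern_12 <- paste0(
  "^", b, s, s,
  "|^", s, b, s,
  "|^", s, s, "[FR]*", b
)

# Made-up cleaned pitch sequences
toy_sequences <- c("CSFFBX", "CSX", "BCS", "CBFX", "BBCX")

reached_12 <- grepl(pattern_12, toy_sequences)
data.frame(toy_sequences, reached_12)
```

The first sequence reaches 1-2 only because the two fouls after the second strike are absorbed by [FR]*, while the second sequence ends on a ball in play at 0-2 and never sees a ball.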
The retrosheet_add_counts() function in the abdwr3edata package contains all of the code necessary to compute these counts.
abdwr3edata::retrosheet_add_counts
[1] On Windows, the analogous DOS command is where.
[2] The documentation for all the software tools is available at https://chadwick.sourceforge.net/doc/cwtools.html. In particular, the tool for processing the event files (cwevent) is documented at https://chadwick.sourceforge.net/doc/cwevent.html#cwtools-cwevent.
[3] An analogous column named END_BASES_CD contains the base state at the end of the play, coded in the same way.
[5] F encodes a foul ball, R a foul ball on a pitchout. See Table 6.1.