Appendix C — Statcast Data Reference
C.1 Introduction
Statcast is the current state-of-the-art tracking system used in all Major League ballparks since the 2015 season. This system is used to track the movements of the baseball and all players on the field at 20,000 frames per second. Using the Statcast system, we can learn about the speed, direction, and distance traveled by players. For example, this system allows for precise evaluation of a defensive player’s movement towards a batted ball.
Currently some of the Statcast data is available through the Baseball Savant website, which downloads the data from MLB Advanced Media. The R package baseballr has special functions for downloading Statcast pitch-by-pitch data from Baseball Savant. We discuss these in Section C.10. The purpose of this reference is to describe the variables that overlap with variables available in the Retrosheet play-by-play and now-defunct PITCHf/x datasets (see Appendix B), and to describe the new “off the bat” variables available from Statcast.
C.2 Cross-referencing with Other Data Sources
The People table in the Lahman database is a useful resource for cross-referencing players across several data sources such as the Baseball-Reference website and the Retrosheet files. Unfortunately, it currently does not contain a column for the MLBAM player identifier; thus the People table is not useful for merging Statcast data with information coming from other sources. The best way to cross-reference player identifiers across these systems is by using The Register at Chadwick Baseball Bureau (https://github.com/chadwickbureau/register/). There, one finds a link for the download of a zip file containing a register of players, managers, and umpires at any professional level (in addition to the Major Leagues, this includes the Minor and Independent Leagues, Winter Leagues, Japanese and Korean top levels, and the Negro Leagues).
Simpler still is to use the chadwick_player_lu() function from the baseballr package, as we did in Section 7.5. Since this file takes a minute to download and process, we can store a local copy using the write_rds() function.
master_id <- baseballr::chadwick_player_lu() |>
  write_rds(
    here::here("data/chadwick_register.rds"), compress = "xz"
  )
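As a rough illustration of the cross-referencing step, the sketch below attaches Retrosheet and Baseball-Reference identifiers to Statcast batter ids. It assumes the register exposes the columns key_mlbam, key_retro, and key_bbref. The two small data frames are toy stand-ins, and every identifier in them other than 457705 is a made-up placeholder. Base R's merge() is used only to keep the sketch free of extra packages; any join function would do.

```r
# Toy stand-in for the Chadwick register (placeholder ids, not real lookups)
register <- data.frame(
  key_mlbam = c(457705, 111111),
  key_retro = c("retro_a", "retro_b"),
  key_bbref = c("bbref_a", "bbref_b")
)

# Toy stand-in for Statcast rows; `batter` holds the MLBAM identifier
pitches <- data.frame(
  batter = c(457705, 457705, 111111),
  pitch_type = c("SL", "FF", "CH")
)

# Left join: keep every pitch and add the matching register columns
pitches_ids <- merge(
  pitches, register,
  by.x = "batter", by.y = "key_mlbam",
  all.x = TRUE
)
pitches_ids
```

With the real register, the same join would be applied to the data frame returned by chadwick_player_lu() (or read back from the stored file).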
C.3 Game Situation Variables
Many of the variables concern the game situation at the time of the pitch (see Table C.1). These variables include the date, inning, and number of outs. The identities of all players on the field together with the identities of the baserunners are included. With respect to the specific plate appearance, the dataset includes the pitch number, the number of balls and strikes, and the batting side and throwing hand of the pitcher.
Name | Description |
---|---|
game_date | Date of game |
batter | Id of the batter |
pitcher | Id of the pitcher |
stand | Side of the batter |
p_throws | Throwing hand of pitcher |
home_team | Code for home team |
away_team | Code for visiting team |
balls | Number of current balls |
strikes | Number of current strikes |
on_3b | Id of baserunner on third base |
on_2b | Id of baserunner on second base |
on_1b | Id of baserunner on first base |
outs_when_up | Current number of outs |
inning | Current inning |
inning_topbot | Top or bottom of inning |
pos1_person_id | Id of pitcher |
pos2_person_id | Id of catcher |
pos3_person_id | Id of first baseman |
pos4_person_id | Id of second baseman |
pos5_person_id | Id of third baseman |
pos6_person_id | Id of shortstop |
pos7_person_id | Id of left fielder |
pos8_person_id | Id of center fielder |
pos9_person_id | Id of right fielder |
pitch_number | Number of pitch in PA |
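The game situation variables are often combined into compact summaries. The following sketch, which uses made-up rows and assumes that an empty base is recorded as NA, builds a count string from balls and strikes and a three-digit base state from on_1b, on_2b, and on_3b. The new column names count and bases are illustrative choices, not Statcast variables.

```r
# Toy rows with made-up values; NA is assumed to mean the base is empty
situations <- data.frame(
  balls = c(0, 3, 1),
  strikes = c(0, 2, 2),
  on_1b = c(NA, 457705, 457705),
  on_2b = c(NA, NA, 111111),
  on_3b = c(NA, NA, NA)
)

# Count written as "balls-strikes"
situations$count <- paste(situations$balls, situations$strikes, sep = "-")

# Base state as three 0/1 digits for first, second, and third base
situations$bases <- paste0(
  as.integer(!is.na(situations$on_1b)),
  as.integer(!is.na(situations$on_2b)),
  as.integer(!is.na(situations$on_3b))
)
situations
```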
C.4 Pitch Variables
Similar to the PITCHf/x system, this Statcast dataset contains information about each pitch. The variables in Table C.2 include the release point of the pitch, its speed in miles per hour, and movement in the horizontal and vertical directions. The location of the pitch in the zone is recorded and it is classified into a particular region using the zone variable. Using a classification method, the pitch type is recorded. See Table C.3 for the decoding of the abbreviations.
Name | Description |
---|---|
pitch_type | code for pitch type |
pitch_name | pitch type |
description | description of outcome of pitch |
release_speed | speed of pitch (mph) when released |
effective_speed | speed of pitch (mph) when crossing plate |
release_pos_x | x-coordinate of release point of pitch |
release_pos_y | y-coordinate of release point of pitch |
release_pos_z | z-coordinate of release point of pitch |
zone | zone location of pitch |
pfx_x | horizontal movement of pitch |
pfx_z | vertical movement of pitch |
sz_top | vertical location of top of strike zone |
sz_bot | vertical location of bottom of strike zone |
plate_x | horizontal location of pitch |
plate_z | vertical location of pitch |
vx0 | x-coordinate of pitch velocity |
vy0 | y-coordinate of pitch velocity |
vz0 | z-coordinate of pitch velocity |
ax | x-coordinate of pitch acceleration |
ay | y-coordinate of pitch acceleration |
az | z-coordinate of pitch acceleration |
release_spin_rate | spin rate |
spin_axis | spin direction |
pitch_type | pitch_name |
---|---|
CH | Changeup |
CS | Slow Curve |
CU | Curveball |
EP | Eephus |
FA | Other |
FC | Cutter |
FF | 4-Seam Fastball |
FO | Forkball |
FS | Split-Finger |
KC | Knuckle Curve |
KN | Knuckleball |
PO | Pitch Out |
SC | Screwball |
SI | Sinker |
SL | Slider |
ST | Sweeper |
SV | Slurve |
NA | NA |
Here are more detailed descriptions of the pitch variables.
release_speed and effective_speed: Speed in miles per hour at the release point and when the ball crosses the front of home plate.
sz_top and sz_bot: Vertical coordinates for the top and the bottom of the strike zone of the batter currently at the plate. Both variables are expressed as feet from the ground and they are manually recorded at the beginning of every at-bat.
pfx_x and pfx_z: Horizontal and vertical movement of the pitch compared to a theoretical pitch of the same speed with no spin-induced movement. Both variables are measured in inches.
plate_x and plate_z: Horizontal and vertical location of the pitch, measured when the pitch crosses the front of home plate. The coordinate system is centered on the middle of home plate at ground level and is viewed from the catcher/umpire point of view; thus a positive value of plate_x indicates the pitch crosses the plate to the right of its middle and a negative value to the left. A negative value of plate_z indicates a pitch that bounced before reaching home plate. Both plate_x and plate_z are measured in feet.
release_pos_x, release_pos_y, release_pos_z: Coordinates indicating the calculated position of the ball at the release point. The release_pos_y parameter indicates the distance from home plate and is generally set at 50 feet from home plate; researchers have found 55 feet as a distance that better approximates the true release point of the pitch and it is thus advisable to recalculate the coordinates at the 55 foot mark, as illustrated in Section C.5. All three coordinates are expressed in the same coordinate system as plate_x and plate_z, with release_pos_x giving the left and right position and release_pos_z the height of the release point.
vx0, vy0, and vz0: Components of the pitch velocity in three dimensions, measured at release in feet per second.
ax, ay, and az: Components of the pitch acceleration in three dimensions, measured at release in \(ft/s^2\).
release_spin_rate: Spin rate of the ball in revolutions per minute.
spin_axis: Direction of the spin of the ball, where 0° indicates pure topspin and 180° indicates pure backspin.
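To show how the location and strike zone variables fit together, here is a minimal sketch that flags whether the center of the ball crosses inside the zone. It assumes the zone spans the 17-inch width of home plate and the interval from sz_bot to sz_top, and it ignores the radius of the ball. The helper in_zone() and the pitch values are hypothetical.

```r
# Half the width of home plate in feet (the plate is 17 inches wide)
plate_half_width <- 17 / 12 / 2

# Hypothetical helper: TRUE when the center of the ball is inside the zone
in_zone <- function(plate_x, plate_z, sz_top, sz_bot) {
  abs(plate_x) <= plate_half_width &
    plate_z >= sz_bot &
    plate_z <= sz_top
}

# Made-up pitch locations (feet): inside, too wide, too low
flags <- in_zone(
  plate_x = c(0.1, 1.2, -0.5),
  plate_z = c(2.5, 2.5, 0.8),
  sz_top = 3.4,
  sz_bot = 1.6
)
flags
```

A stricter rule would widen each boundary by the radius of the ball, since a pitch that clips the edge of the zone is also a strike.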
C.5 Calculating the Pitch Trajectory
As seen in the previous sections, Statcast tracks data on location, velocity, and acceleration of a pitch. Using the kinematics equation for constant acceleration, the position of the ball at a given time \(t\) can be determined by the following equations:
\[ x = x_{0} + v_{x0}\,t + \frac{1}{2} a_{x} t^{2} \] \[ y = y_{0} + v_{y0}\,t + \frac{1}{2} a_{y} t^{2} \] \[ z = z_{0} + v_{z0}\,t + \frac{1}{2} a_{z} t^{2} \]
The previous equations are translated to R with use of the following function pitchloc().¹
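The listing for pitchloc() is not reproduced here, so what follows is only a sketch of one form it could take. It applies constant-acceleration kinematics to each coordinate and follows the way pitch_trajectory() calls it: time is the first argument and the result holds x, y, and z. The pitch parameters in the example call are made up.

```r
# Sketch of pitchloc(): position of the ball t seconds after the initial
# measurement point, assuming constant acceleration in each direction
pitchloc <- function(t, x0, ax, vx0, y0, ay, vy0, z0, az, vz0) {
  x <- x0 + vx0 * t + 0.5 * ax * t^2
  y <- y0 + vy0 * t + 0.5 * ay * t^2
  z <- z0 + vz0 * t + 0.5 * az * t^2
  if (length(t) == 1) {
    c(x, y, z)        # single time: vector of length 3
  } else {
    cbind(x, y, z)    # several times: one row per time
  }
}

# Made-up pitch parameters (feet, feet per second, feet per second squared)
loc <- pitchloc(
  t = 0.1,
  x0 = -2, ax = -10, vx0 = 5,
  y0 = 50, ay = 25, vy0 = -130,
  z0 = 6, az = -20, vz0 = -5
)
loc
```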
The function pitch_trajectory() calculates the trajectory of a pitch from release point to home plate at specified time intervals (the default choice of the argument interval is 0.01 seconds).
pitch_trajectory <- function(x0, ax, vx0,
                             y0, ay, vy0, z0, az, vz0,
                             interval = 0.01) {
  # time at which the pitch reaches y = 0 (root of the y equation)
  cross_p <- (-1 * vy0 - sqrt(I(vy0 ^ 2) - 2 * y0 * ay)) / ay
  # evaluate pitchloc() on a grid of times from 0 up to cross_p
  tracking <- t(
    sapply(
      seq(0, cross_p, interval),
      pitchloc,
      x0 = x0, ax = ax, vx0 = vx0,
      y0 = y0, ay = ay, vy0 = vy0,
      z0 = z0, az = az, vz0 = vz0
    )
  )
  colnames(tracking) <- c("x", "y", "z")
  tracking <- data.frame(tracking)
  return(tracking)
}
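The same quadratic can be solved for any distance from the plate, which gives one way to recalculate the release coordinates at the 55 foot mark. The sketch below is an assumption about how that could be done, not a listing from the original code. The helper position_at_y() is hypothetical, the parameters are made up, and the choice of root assumes the usual signs (vy0 negative, ay positive), as in the computation of cross_p.

```r
# Hypothetical helper: ball position when it is `y_target` feet from the plate
position_at_y <- function(y_target, x0, ax, vx0,
                          y0, ay, vy0, z0, az, vz0) {
  # solve y0 + vy0 * t + 0.5 * ay * t^2 = y_target for t
  t <- (-vy0 - sqrt(vy0^2 - 2 * ay * (y0 - y_target))) / ay
  c(
    t = t,
    x = x0 + vx0 * t + 0.5 * ax * t^2,
    y = y0 + vy0 * t + 0.5 * ay * t^2,
    z = z0 + vz0 * t + 0.5 * az * t^2
  )
}

# Made-up parameters measured at y0 = 50 feet
rel55 <- position_at_y(
  y_target = 55,
  x0 = -2, ax = -10, vx0 = 5,
  y0 = 50, ay = 25, vy0 = -130,
  z0 = 6, az = -20, vz0 = -5
)
rel55
```

The time component is negative here because the 55 foot mark is reached before the ball passes the 50 foot point at which the parameters are expressed.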
C.6 Play Event Variables
Although each row of the data set represents a pitch, several variables in Table C.4 record the outcome of the plate appearance. The type variable indicates if the pitch is a strike, a ball, or put in play. The events and des variables describe the outcome of the plate appearance, while the description variable (see Table C.2) describes the outcome of the individual pitch.
Name | Description |
---|---|
type | ball or strike or ball in play |
events | outcome of plate appearance |
des | detailed description of outcome of plate appearance |
C.7 Batted Ball Variables
One special aspect of the Statcast dataset is the inclusion of variables about balls that are put into play described in Table C.5. These variables include the exit velocity and launch angle off of the bat, the \((x, y)\)-coordinates of the location of the batted ball, and its estimated distance from home plate. A barrel is a way of categorizing a well-hit ball with good combinations of exit velocity and launch angle.
Name | Description |
---|---|
hit_distance_sc | distance away (ft.) that ball lands |
hc_x | x location of batted ball when it lands |
hc_y | y location of batted ball when it lands |
launch_speed | speed of ball as it comes off of the bat |
launch_angle | vertical angle at which ball leaves bat |
barrel | classification for batted-ball events whose comparable hit types have led to a minimum .500 AVG and 1.500 SLG |
The batted location variables hc_x and hc_y are related to the spray angle \(\phi\) by the equation
\[ \phi = \arctan \left(\frac{{hc_x}-125.42}{198.27-{hc_y}}\right) \,. \]
We show this graphically in Figure C.1.
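The spray angle formula translates directly into R. In this minimal sketch the helper spray_angle() is a hypothetical name, the conversion from radians to degrees is a convenience rather than part of the formula, and the coordinates in the example are made up.

```r
# Hypothetical helper: spray angle in degrees from the hit coordinates
spray_angle <- function(hc_x, hc_y) {
  phi <- atan((hc_x - 125.42) / (198.27 - hc_y))  # radians
  phi * 180 / pi
}

# Made-up coordinates: straight up the middle, then 45 degrees to one side
angles <- spray_angle(hc_x = c(125.42, 175.42), hc_y = c(100, 148.27))
angles
```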
C.8 Derived Variables
Based on the batted ball variables, Statcast has developed several metrics that help in understanding the quality of a specific batted ball, shown in Table C.6. Based on the launch speed and launch angle, one variable, estimated_ba_using_speedangle, gives the estimated probability of a base hit, and a second variable, estimated_woba_using_speedangle, provides the estimate of the weighted on-base percentage for this batted ball.
Name | Description |
---|---|
estimated_ba_using_speedangle | estimated hit probability |
estimated_woba_using_speedangle | estimated woba value |
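Because these estimates are attached to individual batted balls, averaging them gives a summary of contact quality. The sketch below uses made-up values and assumes that pitches not put in play carry NA in both columns. The summary names xba_contact and xwoba_contact are illustrative only.

```r
# Made-up values; NA is assumed to mark pitches that were not put in play
batted <- data.frame(
  estimated_ba_using_speedangle = c(0.10, NA, 0.85, 0.35, NA),
  estimated_woba_using_speedangle = c(0.09, NA, 1.20, 0.33, NA)
)

# Average over batted balls only, dropping the missing rows
xba_contact <- mean(batted$estimated_ba_using_speedangle, na.rm = TRUE)
xwoba_contact <- mean(batted$estimated_woba_using_speedangle, na.rm = TRUE)
c(xba = xba_contact, xwoba = xwoba_contact)
```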
C.9 Defense Variables
Statcast also includes information about the defensive alignments of the teams, shown in Table C.7. The if_fielding_alignment variable indicates if the defensive infield is “standard”, “infield shift” (three or more infielders on the same side of second base), or “strategic positioning”. The of_fielding_alignment variable can be “standard”, “strategic”, or “4th outfielder”. Currently, there is some debate about the value of these new defensive alignments and the inclusion of these variables can help determine the effectiveness of these strategies.
Name | Description |
---|---|
if_fielding_alignment | infield positioning |
of_fielding_alignment | outfield positioning |
C.10 Acquiring Statcast Data
The statcast_search() function from the baseballr package will allow you to download Statcast data from Baseball Savant over a specified period of time, or for a particular player. For example, Andrew McCutchen, Freddie Freeman, and José Altuve recorded their 2000th career hits on June 11, June 25, and August 19, 2023, respectively. To retrieve data for McCutchen from the three days before through the three days after his hit, we pass that date range and his MLB player identifier to statcast_search(). There are various ways to find McCutchen’s MLB player identifier (see Section C.2), which in this case is 457705.
library(baseballr)
library(dplyr)  # provides filter() and select()
mccutchen <- statcast_search(
  start_date = "2023-06-08",
  end_date = "2023-06-14",
  playerid = 457705
)
mccutchen |>
  filter(game_date == "2023-06-11", events == "single") |>
  select(pitch_type, release_speed, release_spin_rate)
# A tibble: 1 × 3
  pitch_type release_speed release_spin_rate
  <chr>              <dbl>             <dbl>
1 SL                  85.8              2502
McCutchen’s 2000th hit came off of an 86 mph slider spinning at 2500 revolutions per minute.
Please see Section 12.2 for information about how to store one or more years of Statcast data.
1. The code in this section has been slightly adapted from https://code.google.com/p/r-pitchfx/.