Appendix C — Statcast Data Reference

Authors
Affiliations

Bowling Green State University

Smith College

Max Marchi

Cleveland Guardians

C.1 Introduction

Statcast is the current state-of-the-art tracking system, used in all Major League ballparks since the 2015 season. The system tracks the movements of the baseball and all players on the field at 20,000 frames per second. From these data we can measure how fast, in what direction, and how far each player moves. For example, the system allows for precise evaluation of a defensive player’s movement towards a batted ball.

Some of the Statcast data are currently available through the Baseball Savant website, which obtains the data from MLB Advanced Media. The R package baseballr has special functions for downloading Statcast pitch-by-pitch data from Baseball Savant. We discuss these in Section C.10. The purpose of this reference is to describe the variables that overlap with those available in the Retrosheet play-by-play and now-defunct PITCHf/x datasets (see Appendix B), and to describe the new “off the bat” variables available from Statcast.

C.2 Cross-referencing with Other Data Sources

The People table in the Lahman database is a useful resource for cross-referencing players across several data sources such as the Baseball-Reference website and the Retrosheet files. Unfortunately, it currently does not contain a column for the MLBAM player identifier; thus the People table is not useful for merging Statcast data with information coming from other sources. The best way to cross-reference player identifiers across these systems is to use The Register at Chadwick Baseball Bureau (https://github.com/chadwickbureau/register/). There, one finds a link to download a zip file containing a register of players, managers, and umpires at every professional level (in addition to the Major Leagues, this covers the Minor and Independent Leagues, Winter Leagues, the top Japanese and Korean leagues, and the Negro Leagues).

Simpler still is to use the chadwick_player_lu() function from the baseballr package, as we did in Section 7.5. Since this file takes a minute to download and process, we can store a local copy using the write_rds() function.

master_id <- baseballr::chadwick_player_lu() |>
  readr::write_rds(
    here::here("data/chadwick_register.rds"), compress = "xz"
  )
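Once the register is available, attaching other identifiers to Statcast rows amounts to a join on the MLBAM identifier. The following is a minimal sketch on two small made-up data frames rather than the real register: the column names key_mlbam, key_retro, and key_bbref follow the Chadwick register, but every id shown is an invented placeholder. Base R merge() is used here only so that the example has no package dependencies.

```r
# Toy stand-in for the Chadwick register (all ids are invented placeholders).
register <- data.frame(
  key_mlbam = c(100001L, 100002L, 100003L),
  key_retro = c("aaaab001", "bbbbc001", "ccccd001"),
  key_bbref = c("aaaab01", "bbbbc01", "ccccd01"),
  stringsAsFactors = FALSE
)

# Toy stand-in for Statcast rows; the batter column holds MLBAM ids.
statcast <- data.frame(
  batter = c(100002L, 100003L, 100002L, 999999L),
  events = c("single", "strikeout", "home_run", "walk"),
  stringsAsFactors = FALSE
)

# Left join: keep every Statcast row, even when the id is not in the register.
linked <- merge(
  statcast, register,
  by.x = "batter", by.y = "key_mlbam",
  all.x = TRUE, sort = FALSE
)
```

Rows whose batter id has no match in the register keep NA in the key_retro and key_bbref columns, which makes unmatched players easy to spot.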

C.3 Game Situation Variables

Many of the variables concern the game situation at the time of the pitch (see Table C.1). These variables include the date, inning, and number of outs. The identities of all players on the field, together with the identities of the baserunners, are included. With respect to the specific plate appearance, the dataset includes the pitch number, the number of balls and strikes, the batting side of the batter, and the throwing hand of the pitcher.

Table C.1: Game situation variables from Statcast.
Name Description
game_date Date of game
batter Id of the batter
pitcher Id of the pitcher
stand Side of the batter
p_throws Throwing hand of pitcher
home_team Code for home team
away_team Code for visiting team
balls Number of current balls
strikes Number of current strikes
on_3b Id of baserunner on third base
on_2b Id of baserunner on second base
on_1b Id of baserunner on first base
outs_when_up Current number of outs
inning Current inning
inning_topbot Top or bottom of inning
pos1_person_id Id of pitcher
pos2_person_id Id of catcher
pos3_person_id Id of first baseman
pos4_person_id Id of second baseman
pos5_person_id Id of third baseman
pos6_person_id Id of shortstop
pos7_person_id Id of left fielder
pos8_person_id Id of center fielder
pos9_person_id Id of right fielder
pitch_number Number of pitch in PA
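As an illustration of how the situation variables combine, the sketch below builds a ball-strike count label and a runners-on-base code from the columns in Table C.1. The rows are invented, and the code assumes that an empty base appears as NA in the corresponding on_1b, on_2b, or on_3b column.

```r
# Toy rows using column names from Table C.1 (ids are invented).
# Assumption: a base is empty when its on_* column is NA.
pitches <- data.frame(
  balls   = c(0L, 3L, 1L),
  strikes = c(0L, 2L, 2L),
  on_1b   = c(NA, 100001L, 100001L),
  on_2b   = c(NA, NA, 100002L),
  on_3b   = c(NA, NA, 100003L)
)

# Ball-strike count as a label such as "3-2".
pitches$count <- paste(pitches$balls, pitches$strikes, sep = "-")

# Runner state as a three-character string (first, second, third base),
# with "1" for an occupied base and "0" for an empty one.
pitches$bases <- paste0(
  as.integer(!is.na(pitches$on_1b)),
  as.integer(!is.na(pitches$on_2b)),
  as.integer(!is.na(pitches$on_3b))
)
```

Combined with outs_when_up, a code of this kind identifies the base-out state at the time of each pitch.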

C.4 Pitch Variables

As with the PITCHf/x system, the Statcast dataset contains information about each pitch. The variables in Table C.2 include the release point of the pitch, its speed in miles per hour, and its movement in the horizontal and vertical directions. The location of the pitch is recorded, and the zone variable classifies that location into a particular region. The pitch type is assigned by a classification method; see Table C.3 for the decoding of the abbreviations.

Table C.2: Pitch variables from Statcast.
Name Description
pitch_type code for pitch type
pitch_name pitch type
description description of outcome of pitch
release_speed speed of pitch (mph) when released
effective_speed derived speed of pitch (mph), adjusted for release extension
release_pos_x x-coordinate of release point of pitch
release_pos_y y-coordinate of release point of pitch
release_pos_z z-coordinate of release point of pitch
zone zone location of pitch
pfx_x horizontal movement of pitch
pfx_z vertical movement of pitch
sz_top vertical location of top of strike zone
sz_bot vertical location of bottom of strike zone
plate_x horizontal location of pitch
plate_z vertical location of pitch
vx0 x-coordinate of pitch velocity
vy0 y-coordinate of pitch velocity
vz0 z-coordinate of pitch velocity
ax x-coordinate of pitch acceleration
ay y-coordinate of pitch acceleration
az z-coordinate of pitch acceleration
release_spin_rate spin rate
spin_axis spin direction

Table C.3: The pitch_type and pitch_name variables used by Statcast.
pitch_type pitch_name
CH Changeup
CS Slow Curve
CU Curveball
EP Eephus
FA Other
FC Cutter
FF 4-Seam Fastball
FO Forkball
FS Split-Finger
KC Knuckle Curve
KN Knuckleball
PO Pitch Out
SC Screwball
SI Sinker
SL Slider
ST Sweeper
SV Slurve
NA NA

Here are more detailed descriptions of the pitch variables.

  • release_speed and effective_speed: Speed of the pitch in miles per hour. release_speed is the speed at the release point; effective_speed is a derived speed that accounts for the extension of the pitcher's release.

  • sz_top and sz_bot: Vertical coordinates for the top and the bottom of the strike zone of the batter currently at the plate. Both variables are expressed as feet from the ground and they are manually recorded at the beginning of every at-bat.

  • pfx_x and pfx_z: Horizontal and vertical movement of the pitch compared to a theoretical pitch of the same speed with no spin-induced movement. In the Baseball Savant data both variables are measured in feet.

  • plate_x and plate_z: Horizontal and vertical location of the pitch, measured when the pitch crosses the front of home plate. The coordinate system is centered on the middle of home plate and at ground level and viewed from the catcher/umpire point of view, thus a positive value of plate_x indicates the pitch crosses the plate to the right of its middle and a negative value to the left. A negative value of plate_z indicates a pitch that bounced before reaching home plate. Both plate_x and plate_z variables are measured in feet.

  • release_pos_x, release_pos_y, release_pos_z: Coordinates indicating the calculated position of the ball at the release point. The release_pos_y variable is the distance from home plate and is generally set at 50 feet. Researchers have found that 55 feet better approximates the true release point of the pitch, so it is advisable to recalculate the coordinates at the 55-foot mark, as illustrated in Section C.5. release_pos_x and release_pos_z are the horizontal (left-right) position and the height of the release point, in the same coordinate system as plate_x and plate_z.

  • vx0, vy0, and vz0: Components of the pitch velocity in three dimensions, measured at release in feet per second.

  • ax, ay, and az: Components of the pitch acceleration in three dimensions, measured at release in \(ft/s^2\).

  • release_spin_rate: Spin rate of the ball in revolutions per minute.

  • spin_axis: Direction of the spin of the ball in degrees, where 0° indicates pure topspin and 180° indicates pure backspin.
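The location and strike zone variables can be combined to flag pitches that pass through the zone. The sketch below is a simplified geometric check, not the rule Statcast uses for its zone variable: it treats the zone as a rectangle bounded by the edges of the 17-inch-wide plate and by sz_bot and sz_top, and it ignores the radius of the ball. The function name in_zone() and the example values are hypothetical.

```r
# Half the width of home plate in feet (the plate is 17 inches wide).
half_plate <- (17 / 12) / 2

# TRUE when the center of the ball passes through the rectangle bounded by
# the plate edges and the batter's sz_bot and sz_top. The ball's radius is
# ignored, so pitches that only clip the edge are counted as outside.
in_zone <- function(plate_x, plate_z, sz_top, sz_bot) {
  abs(plate_x) <= half_plate & plate_z >= sz_bot & plate_z <= sz_top
}

# Invented example values, all in feet.
in_zone(
  plate_x = c(0.2, -1.1, 0.5),
  plate_z = c(2.5, 2.5, 3.9),
  sz_top = 3.4,
  sz_bot = 1.6
)
# TRUE FALSE FALSE
```

The second pitch misses off the left edge of the plate and the third is above sz_top, so only the first is inside the rectangle.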

C.5 Calculating the Pitch Trajectory

As seen in the previous sections, Statcast tracks data on location, velocity, and acceleration of a pitch. Using the kinematics equation for constant acceleration, the position of the ball at a given time \(t\) can be determined by the following equations:

\[ x = x_{0} + v_{x0}\,t + \frac{1}{2} a_{x} t^{2} \] \[ y = y_{0} + v_{y0}\,t + \frac{1}{2} a_{y} t^{2} \] \[ z = z_{0} + v_{z0}\,t + \frac{1}{2} a_{z} t^{2} \]

These equations are implemented in R by the following function, pitchloc() (see footnote 1).

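# Position of the ball at time t (in seconds), assuming constant
# acceleration along each axis.
pitchloc <- function(t, x0, ax, vx0, 
                     y0, ay, vy0, z0, az, vz0) {
  x <- x0 + vx0 * t + 0.5 * ax * t^2
  y <- y0 + vy0 * t + 0.5 * ay * t^2
  z <- z0 + vz0 * t + 0.5 * az * t^2
  # Return a vector for a single time, or a matrix with one row per time
  if (length(t) == 1) {
    loc <- c(x, y, z)
  } else {
    loc <- cbind(x, y, z)
  }
  return(loc)
}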

The function pitch_trajectory() calculates the trajectory of a pitch from release point to home plate at specified time intervals (the default choice of the argument interval is 0.01 seconds).

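pitch_trajectory <- function(x0, ax, vx0, 
                             y0, ay, vy0, z0, az, vz0,
                             interval = 0.01) {
  # Time at which the ball reaches the front of home plate (y = 0):
  # the root of y0 + vy0 * t + 0.5 * ay * t^2 = 0 nearest to t = 0
  cross_p <- (-1 * vy0 - sqrt(vy0^2 - 2 * y0 * ay)) / ay
  # Evaluate the location on a grid of times from the start to the plate
  tracking <- t(
    sapply(
      seq(0, cross_p, interval), 
      pitchloc, 
      x0 = x0, ax = ax, vx0 = vx0, 
      y0 = y0, ay = ay, vy0 = vy0, 
      z0 = z0, az = az, vz0 = vz0
    )
  )
  # One row per time step, with columns x, y, and z
  colnames(tracking) <- c("x", "y", "z")
  tracking <- data.frame(tracking)
  return(tracking)
}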
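The cross_p line in pitch_trajectory() is the quadratic formula applied to the \(y\) equation with \(y = 0\). The sketch below shows that step on its own and reuses it to move the initial conditions from 50 feet to 55 feet, as suggested in Section C.4. The helper names time_at_y() and state_at() are hypothetical, and the numeric values are invented, chosen only to have plausible magnitudes for a fastball.

```r
# Time at which the ball is y_target feet from home plate, obtained from
# y_target = y0 + vy0 * t + 0.5 * ay * t^2. Of the two roots, this is the
# one nearest t = 0 when the ball moves toward the plate (vy0 < 0).
time_at_y <- function(y_target, y0, vy0, ay) {
  (-vy0 - sqrt(vy0^2 - 2 * ay * (y0 - y_target))) / ay
}

# Position and velocity along one axis at time t, constant acceleration.
state_at <- function(t, p0, v0, a) {
  c(pos = p0 + v0 * t + 0.5 * a * t^2, vel = v0 + a * t)
}

# Invented values for one pitch: feet, feet per second, feet per second^2.
y0 <- 50;   vy0 <- -130; ay <- 28
x0 <- -1.5; vx0 <- 5;    ax <- -10
z0 <- 6.0;  vz0 <- -4;   az <- -20

# Flight time from y = 50 ft to the front of home plate (y = 0).
t_plate <- time_at_y(0, y0, vy0, ay)

# Time at which the ball was at y = 55 ft; negative, because that point
# comes before the 50-foot reference.
t_55 <- time_at_y(55, y0, vy0, ay)

# Re-based position and velocity at 55 ft for each axis.
x_55 <- state_at(t_55, x0, vx0, ax)
y_55 <- state_at(t_55, y0, vy0, ay)
z_55 <- state_at(t_55, z0, vz0, az)
```

Because the acceleration is treated as constant, the re-based positions and velocities at 55 feet can be passed to pitch_trajectory() together with the original ax, ay, and az.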

C.6 Play Event Variables

Although each row of the dataset represents a pitch, several variables in Table C.4 record the outcome of the plate appearance. The type variable indicates whether the pitch was a ball, a strike, or put in play. The events and des variables describe the outcome of the plate appearance, while the description variable (Table C.2) describes the outcome of the individual pitch.

Table C.4: Play event variables.
Name Description
type ball or strike or ball in play
events outcome of plate appearance
des detailed description of outcome of plate appearance
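Because the data are recorded at the pitch level, plate appearance summaries require keeping only the pitch that ends each plate appearance. The sketch below uses invented rows and rests on two assumptions about the coding: that type takes the values "B", "S", and "X" (ball, strike, in play), and that events is missing on every pitch except the last one of a plate appearance.

```r
# Toy pitch-level rows (values invented). Assumptions: type is coded
# "B" (ball), "S" (strike), "X" (in play), and events is NA except on
# the pitch that ends the plate appearance.
pitches <- data.frame(
  type   = c("B", "S", "X", "S", "S", "S", "B", "X"),
  events = c(NA, NA, "single", NA, NA, "strikeout", NA, "field_out"),
  stringsAsFactors = FALSE
)

# Pitch-level summary: number of balls, strikes, and balls in play.
type_counts <- table(pitches$type)

# Plate-appearance-level summary: keep only the final pitch of each PA.
pa_final <- pitches[!is.na(pitches$events), ]
event_counts <- table(pa_final$events)
```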

C.7 Batted Ball Variables

One special aspect of the Statcast dataset is the inclusion of variables describing balls that are put into play, listed in Table C.5. These variables include the exit velocity and launch angle off the bat, the \((x, y)\)-coordinates of the location of the batted ball, and its estimated distance from home plate. A barrel is a batted ball whose combination of exit velocity and launch angle marks it as well hit.

Table C.5: Batted ball variables.
Name Description
hit_distance_sc distance away (ft.) that ball lands
hc_x x location of batted ball when it lands
hc_y y location of batted ball when it lands
launch_speed speed of ball as it comes off of the bat
launch_angle vertical angle at which ball leaves bat
barrel classification assigned to batted balls whose comparable hit types (similar exit velocity and launch angle) have produced at least a .500 AVG and 1.500 SLG

The batted ball location variables hc_x and hc_y are related to the spray angle \(\phi\) by the equation \[ \phi = \arctan \left(\frac{\mathrm{hc\_x} - 125.42}{198.27 - \mathrm{hc\_y}}\right) \,. \] We show this graphically in Figure C.1.

Figure C.1: Relationship of Statcast variables hc_x and hc_y with the spray angle \(\phi\).
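The spray angle equation translates directly into R. The sketch below is one way to write it, with the hypothetical function name spray_angle() and a conversion from radians to degrees added for readability; the two example points are chosen so that the result is easy to check by hand.

```r
# Spray angle in degrees from the hit coordinates, using the constants in
# the equation above. Zero corresponds to a ball hit straight up the middle.
spray_angle <- function(hc_x, hc_y) {
  atan((hc_x - 125.42) / (198.27 - hc_y)) * 180 / pi
}

# hc_x equal to 125.42 gives an angle of 0.
spray_angle(hc_x = 125.42, hc_y = 100)

# Equal offsets of 50 units in both directions give an angle of about 45.
spray_angle(hc_x = 175.42, hc_y = 148.27)
```

Since atan() returns radians, the multiplication by 180 / pi can be dropped if the angle is only used in further trigonometric calculations.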

C.8 Derived Variables

Based on the batted ball variables, Statcast has developed several metrics that help in understanding the quality of a specific batted ball, shown in Table C.6. Based on the launch speed and launch angle, one variable estimated_ba_using_speedangle gives the estimated probability of a base hit, and a second variable estimated_woba_using_speedangle provides the estimate of the weighted on-base percentage for this batted ball.

Table C.6: Statcast derived variables.
Name Description
estimated_ba_using_speedangle estimated hit probability
estimated_woba_using_speedangle estimated wOBA value

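Since both derived variables are defined for each batted ball, a common use is to average them over a batter's balls in play. The following sketch does this with base R aggregate() on a few invented rows; the batter ids and metric values are placeholders, not real data.

```r
# Toy batted balls (ids and values invented); one row per ball in play.
batted <- data.frame(
  batter = c(100001L, 100001L, 100002L, 100002L),
  estimated_ba_using_speedangle   = c(0.10, 0.70, 0.30, 0.50),
  estimated_woba_using_speedangle = c(0.09, 1.20, 0.25, 0.55)
)

# Mean estimated hit probability and estimated wOBA on contact per batter.
expected <- aggregate(
  cbind(estimated_ba_using_speedangle, estimated_woba_using_speedangle) ~ batter,
  data = batted,
  FUN = mean
)
```

The result has one row per batter, with the two averaged columns keeping their original names.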
C.9 Defense Variables

Statcast also includes information about the defensive alignments of the teams, shown in Table C.7. The if_fielding_alignment variable indicates whether the infield alignment is “standard”, “infield shift” (three or more infielders on the same side of second base), or “strategic positioning”. The of_fielding_alignment variable can be “standard”, “strategic”, or “4th outfielder”. Currently, there is some debate about the value of these new defensive alignments, and the inclusion of these variables can help determine the effectiveness of these strategies.

Table C.7: Statcast defensive alignment variables.
Name Description
if_fielding_alignment infield positioning
of_fielding_alignment outfield positioning

  1. The code in this section has been slightly adapted from https://code.google.com/p/r-pitchfx/.