5  Value of Plays Using Run Expectancy

Authors
Affiliations

Bowling Green State University

Smith College

Max Marchi

Cleveland Guardians

5.1 The Run Expectancy Matrix

An important concept in sabermetrics research is the run expectancy matrix. As each base (first, second, and third) can be either empty or occupied by a runner, there are \(2 \times 2 \times 2 = 8\) possible arrangements of runners on the three bases. The number of outs can be 0, 1, or 2 (three possibilities), and so there are a total of 8 \(\times\) 3 = 24 possible arrangements of runners and outs. For each combination of runners on base and outs, we are interested in computing the average number of runs scored in the remainder of the inning. When these average runs are arranged as a table classified by runners and outs, the display is often called the run expectancy matrix.

We use R to compute this matrix from play-by-play data for the 2016 season. This matrix is used to define the change in expected run value (often simply “run value”) of a batter’s plate appearance. We then explore the distribution of average run values for all batters in the 2016 season. The run values for José Altuve are used to help contextualize their meaning across players. We continue by exploring how players in different positions in the batting lineup perform with respect to this criterion. The notion of expected run value is helpful for understanding the relative benefit of different batting plays and we explore the value of a home run and a single. We conclude the chapter by using the run expectancy matrix and run values to understand the benefit of stealing a base and the cost of being caught stealing.

5.2 Runs Scored in the Remainder of the Inning

We begin by reading into R the play-by-play data we downloaded from Retrosheet for the 2016 season as retro2016. See Section A.1.3 for instructions for how to create the file retro2016.rds. (The retro2016 dataset is also available in the abdwr3edata package.)

retro2016 <- read_rds(here::here("data/retro2016.rds"))

At a given plate appearance, there is potential to score runs. Clearly, this potential is greater with runners on base, specifically runners in scoring position (second or third base), and when there are few outs. This potential for runs is estimated by computing the average number of runs scored in the remainder of the inning for each combination of runners on base and number of outs over some period of time. Certainly, the average runs scored depends on many variables such as home versus away, the current score, the pitching, and the defense. But this runs potential represents the opportunity to create runs in a typical situation during an inning and is a useful baseline against which to measure the contributions of players.

To compute the number of runs scored in the remainder of the inning, we need to know the total runs scored by both teams during the plate appearance and also the total runs scored by the teams at the end of the specific half-inning. The runs scored in the remainder of the inning (denoted by runs_roi) is the difference \[ runs_{roi} = runs_{\text{Total in Inning}} - runs_{\text{So far in Inning}}. \]

We create several new variables using the mutate() function: runs_before is equal to the sum of the visitor’s score (away_score_ct) and the home team’s score home_score_ct at each plate appearance, and half_inning uses the paste() function to combine the game id, the inning, and the team at bat, creating a unique identification for each half-inning of every game. Also, we create a new variable runs_scored that gives the number of runs scored for each play. (The variables bat_dest_id, run1_dest_id, run2_dest_id, and run3_dest_id give the destination bases for the batter and each runner, and runs are scored for each destination base that exceeds 3.)

retro2016 <- retro2016 |> 
  mutate(
    runs_before = away_score_ct + home_score_ct,
    half_inning = paste(game_id, inn_ct, bat_home_id),
    runs_scored = 
      (bat_dest_id > 3) + (run1_dest_id > 3) + 
      (run2_dest_id > 3) + (run3_dest_id > 3)
  )

We wish to compute the maximum total score for each half-inning, combining home and visitor scores. We accomplish this by using the summarize() function, after grouping by half_inning. In the summarize() function, outs_inning is the number of outs for each half-inning, runs_inning is the total runs scored in each half-inning, runs_start is the score at the beginning of the half-inning, and max_runs is the maximum total score in a half-inning, which is the sum of the initial total runs and the runs scored. These summary data are stored in the new data frame half_innings.

half_innings <- retro2016 |>
  group_by(half_inning) |>
  summarize(
    outs_inning = sum(event_outs_ct), 
    runs_inning = sum(runs_scored),
    runs_start = first(runs_before),
    max_runs = runs_inning + runs_start
  )

We use the inner_join() function to merge the data frames data2016 and half_innings. Then the runs scored in the remainder of the inning (new variable runs_roi) can be computed by taking the difference of max_runs and runs.

retro2016 <- retro2016 |>
  inner_join(half_innings, by = "half_inning") |>
  mutate(runs_roi = max_runs - runs_before)

5.3 Creating the Matrix

Now that the runs scored in the remainder of the inning variable has been computed for each plate appearance, it is straightforward to compute the run expectancy matrix.

Currently, there are three variables base1_run_id, base2_run_id, and base3_run_id containing the player codes of the baserunners (if any) who are respectively on first, second, or third base. We create a new three-digit variable bases where each digit is either 1 or 0 if the corresponding base is respectively occupied or empty. The state variable adds the number of outs to the bases variable. One particular value of state would be “011 2”, which indicates that there are currently runners on second and third base with two outs. A second state value “100 0” indicates there is a runner at first with no outs.

retro2016 <- retro2016 |>
  mutate(
    bases = paste0(
      if_else(base1_run_id == "", 0, 1),
      if_else(base2_run_id == "", 0, 1),
      if_else(base3_run_id == "", 0, 1)
    ),
    state = paste(bases, outs_ct)
  )

We want to only consider plays in our data frame where there is a change in the runners on base, number of outs, or runs scored. We create three new variables is_runner1, is_runner2, is_runner3, which indicate, respectively, if first base, second base, and third base are occupied after the play. (The function as.numeric() converts a logical variable to a numeric variable.) The variable new_outs is the number of outs after the play, new_bases indicates bases occupied, and new_state provides the runners on each base and the number of outs after the play.1

retro2016 <- retro2016 |>
  mutate(
    is_runner1 = as.numeric(
      run1_dest_id == 1 | bat_dest_id == 1
    ),
    is_runner2 = as.numeric(
      run1_dest_id == 2 | run2_dest_id == 2 | 
        bat_dest_id == 2
    ),
    is_runner3 = as.numeric(
      run1_dest_id == 3 | run2_dest_id == 3 |
        run3_dest_id == 3 | bat_dest_id == 3
    ),
    new_outs = outs_ct + event_outs_ct,
    new_bases = paste0(is_runner1, is_runner2, is_runner3),
    new_state = paste(new_bases, new_outs)
  )

We use the filter() function to restrict our attention to plays where either there is a change between state and new_state (indicated by the not equal logical operator != or there are runs scored on the play.

changes2016 <- retro2016 |> 
  filter(state != new_state | runs_scored > 0)

Before the run expectancies are computed, one final adjustment is necessary. The play-by-play database includes scoring information for all half-innings during the 2016 season, including partial half-innings at the end of the game where the winning run is scored with less than three outs. In our computation of run expectancies, we want to work only with complete half-innings where three outs are recorded. We use the filter() function to extract the data from the half-innings in changes2016 with exactly three outs—the new data frame is named changes2016_complete. (By removing the incomplete innings, we are introducing a small bias since these innings are not complete due to the scoring of at least one run.)

changes2016_complete <- changes2016 |>
  filter(outs_inning == 3)

We compute the expected number of runs scored in the remainder of the inning (the run expectancy) for each of the 24 bases/outs situations by use of the summarize() function, grouping by bases and outs_ct and employing the mean() function. We define store the resulting data frame as erm_2016.

erm_2016 <- changes2016_complete |> 
  group_by(bases, outs_ct) |>
  summarize(mean_run_value = mean(runs_roi))

To display these run values as an 8 \(\times\) 3 matrix, we use the pivot_wider() function.

erm_2016 |>
  pivot_wider(
    names_from = outs_ct, 
    values_from = mean_run_value, 
    names_prefix = "Outs="
  )
# A tibble: 8 × 4
# Groups:   bases [8]
  bases `Outs=0` `Outs=1` `Outs=2`
  <chr>    <dbl>    <dbl>    <dbl>
1 000      0.498    0.268    0.106
2 001      1.35     0.937    0.372
3 010      1.13     0.673    0.312
4 011      1.93     1.36     0.548
5 100      0.858    0.512    0.220
6 101      1.72     1.20     0.478
7 110      1.44     0.921    0.414
8 111      2.11     1.54     0.695

To see how the run expectancy values have changed over time, we input the 2002 season values as reported in Albert and Bennett (2003) in the matrix erm_2002. We display the 2016 and 2002 expectancies side-by-side for purposes of comparison using bind_cols() in Table 5.1.

erm_2002 <- tibble(
  "OLD=0" = c(.51, 1.40, 1.14,  1.96, .90, 1.84, 1.51, 2.33), 
  "OLD=1" = c(.27,  .94,  .68,  1.36, .54, 1.18,  .94, 1.51),
  "OLD=2" = c(.10,  .36,  .32,   .63, .23,  .52,  .45,  .78)
  )

out <- erm_2016 |>
  pivot_wider(
    names_from = outs_ct, 
    values_from = mean_run_value, 
    names_prefix = "NEW="
  ) |>
  bind_cols(erm_2002)
Table 5.1: Comparison of expected run values between 2016 (three columns labeled “NEW”) and 2002 (three columns labeled “OLD”).
2016
2002
bases NEW=0 NEW=1 NEW=2 OLD=0 OLD=1 OLD=2
000 0.50 0.27 0.11 0.51 0.27 0.10
001 1.35 0.94 0.37 1.40 0.94 0.36
010 1.13 0.67 0.31 1.14 0.68 0.32
011 1.93 1.36 0.55 1.96 1.36 0.63
100 0.86 0.51 0.22 0.90 0.54 0.23
101 1.72 1.20 0.48 1.84 1.18 0.52
110 1.44 0.92 0.41 1.51 0.94 0.45
111 2.11 1.54 0.70 2.33 1.51 0.78

It is somewhat remarkable that these run expectancy values have not changed much over the recent history of baseball. This indicates there have been few changes in the average run scoring tendencies of MLB teams between 2002 and 2016.

5.4 Measuring Success of a Batting Play

When a player comes to bat with a particular runner and out situation, the run expectancy matrix tells us the average number of runs a team will score in the remainder of the half-inning. Based on the outcome of the plate appearance, the state (runners on base and outs) will change and there will be a updated run expectancy value. We estimate the value of the plate appearance, called the run value, by computing the difference in run expectancies of the old and new states plus the number of runs scored on the particular play. \[ \text{RUN VALUE} = \text{RUNS}_{\text{New state}} - \text{RUNS}_{\text{Old state}} + \text{RUNS}_{\text{Scored on Play}} \]

We compute the run values for all plays in the original data frame retro2016 using the following R code. First, we use the left_join() function to match the expected run values for the beginning of each plate appearance. Note that we do this by matching the bases and outs_ct variables in retro2016 to those in the run expectancy matrix erm_2016. This creates a new variable mean_run_value, which we promptly rename rv_start. Next, we do this again; this time matching the new_bases and new_outs variables to the run expectancy matrix to create the variable rv_end. It is important to use a left_join() (rather than an inner_join()) here, since the three out states are not present in erm_2016. The run expectancy of a situation with three outs is obviously zero, so we use the replace_na() function to set these run values to zero.

Thus, in the dataset retro2016, the variable rv_start is defined to be the run expectancy of the current state, and the variable rv_end is defined to be the run expectancy of the new state. The new variable run_value is set equal to the difference in rv_end and rv_start plus runs_scored.

retro2016 <- retro2016 |>
  left_join(erm_2016, join_by("bases", "outs_ct")) |>
  rename(rv_start = mean_run_value) |>
  left_join(
    erm_2016, 
    join_by(new_bases == bases, new_outs == outs_ct)
  ) |>
  rename(rv_end = mean_run_value) |>
  replace_na(list(rv_end = 0)) |>
  mutate(run_value = rv_end - rv_start + runs_scored)

5.5 José Altuve

To better understand run values, let’s focus on the plate appearances for the great hitter José Altuve for the 2016 season. To find Altuve’s player id, we use the People data frame from the Lahman package and use the filter() function to extract the retroID. The pull() function extracts the vector retroID from the data frame.

library(Lahman)
altuve_id <- People |> 
  filter(nameFirst == "Jose", nameLast == "Altuve") |>
  pull(retroID)

We then use the filter() function to isolate a data frame altuve of Altuve’s plate appearances, where the batter id (variable bat_id) is equal to altuve_id. We wish to consider only the batting plays where Altuve was the hitter, so we also select the rows where the batting flag (variable bat_event_fl) is true.2

altuve <- retro2016 |> 
  filter(
    bat_id == altuve_id,
    bat_event_fl == TRUE
  )

How did Altuve do in his first three plate appearances this season? To answer this, we display the first three rows of the data frame altuve, showing the original state, new state, and run value variables:

altuve |> 
  select(state, new_state, run_value) |>
  slice_head(n = 3)
# A tibble: 3 × 3
  state new_state run_value
  <chr> <chr>         <dbl>
1 000 1 000 2        -0.162
2 000 1 100 1         0.244
3 000 1 000 2        -0.162

On his first plate appearance, there were no runners on base with one out. The outcome of this plate appearance was no runners on with two outs, indicating that Altuve got out, and the run value for this play was \(-0.162\) runs. In his second plate appearance, the bases were again empty with one out. Here Altuve got on base, and the run value in the transition from “000 1” to “100 1” was 0.244 runs. In the third plate appearance, Altuve again got out in a bases empty, one-out situation, and the run value was \(-0.162\) runs.

When one evaluates the run values for any player, there are two primary questions. First, we need to understand the player’s opportunities for producing runs. What were the runner/outs situations for the player’s plate appearances? Second, what did the batter do with these opportunities to score runs? The batter’s success or lack of success on these opportunities can be measured in relation to these run values.

Let’s focus on the runner states to understand Altuve’s opportunities. Since a few of the counts of the runners/outs states over the 32 outcomes are close to zero, we focus on the runners on base variable bases. We apply the n() function within summarize() to tabulate the runners state for all of Altuve’s plate appearances.

altuve |> 
  group_by(bases) |> 
  summarize(N = n())
# A tibble: 8 × 2
  bases     N
  <chr> <int>
1 000     417
2 001      24
3 010      60
4 011      18
5 100     128
6 101      22
7 110      40
8 111       8

We see that Altuve generally was batting with the bases empty (000) or with only a runner on first (100). Most of the time, Altuve was batting with no runners in scoring position.

How did Altuve perform with these opportunities? Using the geom_jitter() geometric object, we construct a jittered scatterplot that shows the run values for all plate appearances organized by the runners state (see Figure 5.1). Jittering the points in the horizontal direction is helpful in showing the density of run values. We also add a horizontal line at the value zero to the graph—points above the line (below the line) correspond to positive (negative) contributions.

ggplot(altuve, aes(bases, run_value)) +
  geom_jitter(width = 0.25, alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  xlab("Runners on base")
Figure 5.1: Stripchart of run values of José Altuve for all 2016 plate appearances as a function of the runners state. The points have been jittered since there are many plate appearances with identical run values.

When the bases were empty (000), the range of possible run values was relatively small. For this state, the large cluster of points at a negative run value corresponds to the many occurrences when Altuve got an out with the bases empty. The cluster of points at (000) at the value 1 corresponds to Altuve’s home runs with the bases empty. (A home run with runners empty will not change the bases/outs state and the value of this play is exactly one run.) For other situations, say the bases-loaded situation (111), there is much more variation in the run values. For one plate appearance, the state moved from 111 1 to 111 2, indicating that Altuve got out with the bases loaded with a run value of \(-0.84\). In contrast, Altuve did hit a double with the bases loaded with one out, and the run value of this outcome was 1.82.

To understand Altuve’s total run production for the 2016 season, we use the summarize() function together with the sum() and n() functions to compute the number of opportunities and sum of run values for each of the runners situations.

runs_altuve <- altuve |> 
  group_by(bases) |>
  summarize(
    PA = n(),
    total_run_values = sum(run_value)
  )
runs_altuve
# A tibble: 8 × 3
  bases    PA total_run_values
  <chr> <int>            <dbl>
1 000     417          10.1   
2 001      24           4.06  
3 010      60           0.0695
4 011      18           3.43  
5 100     128          10.2   
6 101      22           1.34  
7 110      40           5.62  
8 111       8          -0.0968

We see, for example, that Altuve came to bat with the runners empty 417 times, and his total run value contribution to these 417 PAs was \(10.10\) runs. Altuve didn’t appear to do particularly well with runners in scoring position. For example, there were 60 PAs where he came to bat with a runner on second base, and his net contribution in runs for this situation was \(0.07\) runs. Altuve’s total runs contribution for the 2016 season can be computed by summing the last column of this data frame. This measure of batting performance is known as RE24, since it represents the change in run expectancy over the 24 base/out states (Appelman 2008).

runs_altuve |> 
  summarize(RE24 = sum(total_run_values))
# A tibble: 1 × 1
   RE24
  <dbl>
1  34.7

It is not surprising that Altuve has a positive total contribution in his PAs in 2016, but it is difficult to understand the size of 34.7 runs unless this value is compared with the contribution of other players. In the next section, we will see how Altuve compares to all hitters in the 2016 season.

5.6 Opportunity and Success for All Hitters

The run value estimates can be used to compare the batting effectiveness of players. We focus on batting plays, so we construct a new data frame retro2016_bat that is the subset of the main data frame retro2016 where the bat_event_fl variable is equal to TRUE:

retro2016_bat <- retro2016 |> 
  filter(bat_event_fl == TRUE)

It is difficult to compare the RE24 of two players at face value, since they have different opportunities to create runs for their teams. One player in the middle of the batting order may come to bat many times when there are runners in scoring position and good opportunities to create runs. Other players towards the bottom of the batting order may not get the same opportunities to bat with runners on base. One can measure a player’s opportunity to create runs by the sum of the runs potential state (variable rv_start) over all of his plate appearances. We can summarize a player’s batting performance in a season by the total number of plate appearances, the sum of the runs potentials, and the sum of the run values.

The R function summarize() is helpful in obtaining these summaries. We initially group the data frame retro2016_bat by the batter id variable bat_id, and for each batter, compute the total run value RE24, the total starting runs potential runs_start, and the number of plate appearances PA.

run_exp <- retro2016_bat |> 
  group_by(bat_id) |>
  summarize(
    RE24 = sum(run_value),
    PA = length(run_value),
    runs_start = sum(rv_start)
  )

The data frame run_exp contains batting data for both pitchers and non-pitchers. It seems reasonable to restrict attention to non-pitchers, since pitchers and non-pitchers have very different batting abilities. Also we limit our focus on the players who are primarily starters on their teams. One can remove pitchers and non-starters by focusing on batters with at least 400 plate appearances. We create a new data frame run_exp_400 by an application of the filter() function. We display the first few rows by use of the slice_head() function.

run_exp_400 <- run_exp |> 
  filter(PA >= 400)
run_exp_400 |>
  slice_head(n = 6)
# A tibble: 6 × 4
  bat_id     RE24    PA runs_start
  <chr>     <dbl> <int>      <dbl>
1 abrej003  13.6    695       336.
2 alony001  -5.28   532       249.
3 altuj001  34.7    717       346.
4 andet001 -11.5    431       205.
5 andre001  17.7    568       257.
6 aokin001  -1.91   467       229.

Is there a relationship between batters’ opportunities and their success in converting these opportunities to runs? To answer this question, we construct a scatterplot of run opportunity (runs_start) against run value (RE24) for these hitters with at least 400 at bats (see Figure 5.2) using the geom_point() function. To help see the pattern in this scatterplot, we use the geom_smooth() function to add a LOESS smoother to the scatterplot. To interpret this graph, it is helpful to add a horizontal line (using the geom_hline() function) at 0—points above this line correspond to hitters who had a total positive run value contribution in the 2016 season.

plot1 <- ggplot(run_exp_400, aes(runs_start, RE24)) +
  geom_point() + 
  geom_smooth() +
  geom_hline(yintercept = 0, color = "red")
plot1
Figure 5.2: Scatterplot of total run value against the runs potential for all players in the 2016 season with at least 400 plate appearances. A smoothing curve is added to the scatterplot—this shows that players who had more run potential tend to have large run values.

From viewing Figure 5.2, we see that batters with larger values of runs_start tend to have larger runs contributions. But there is a wide spread in the run values for these players. In the group of players who have runs_start values between 300 and 350, four of these players actually have negative runs contributions and other players created over 60 runs in the 2016 season.

From the graph, we see that only a limited number of players created more than 40 runs for their teams. Who are these players? For labeling purposes, we extract the nameLast and retroID variables from the Lahman People data frame and merge this information with the run_exp_400 data frame. Using the geom_text_repel() function, we add point labels to the previous scatterplot for these outstanding hitters. This function from the ggrepel package plots the labels so there is no overlap. (See Figure 5.3.)

run_exp_400 <- run_exp_400 |>
  inner_join(People, by = c("bat_id" = "retroID"))
library(ggrepel)
plot1 + 
  geom_text_repel(
    data = filter(run_exp_400, RE24 >= 40), 
    aes(label = nameLast)
  )
Figure 5.3: Scatterplot of RE24 against the runs potential for all players in the 2016 season with unusual players identified.

From Figure 5.3, we learn that the best hitters in terms of RE24 are Mike Trout (66.43), David Ortiz (60.82), Freddie Freeman (47.19), Joey Votto (46.95), Josh Donaldson (46.22), and Nolan Arenado (46.0).

5.7 Position in the Batting Lineup

Managers like to put their best hitters in the middle of the batting lineup. Traditionally, a team’s “best hitter” bats third and the cleanup hitter in the fourth position is the best batter for advancing runners on base. What are the batting positions of the hitters in our sample? Specifically, are the best hitters using the run value criterion the ones who bat in the middle of the lineup?

A player may bat in several positions in the lineup during the season. We define a player’s batting position as the position that he bats most frequently. We first merge the retro2016 and run_exp_400 data frames into the data frame regulars. Then by grouping regulars by the variables bat_id and bat_lineup_id, we find the frequency of each batting position for each player. Then by applications of the arrange() and mutate() functions, we define position to be the most frequent batting position. We add this new variable to the run_exp_400 data frame.

regulars <- retro2016 |>
  inner_join(run_exp_400, by = "bat_id")

positions <- regulars |> 
  group_by(bat_id, bat_lineup_id) |>
  summarize(N = n()) |> 
  arrange(desc(N)) |> 
  mutate(position = first(bat_lineup_id))

run_exp_400 <- run_exp_400 |>
  inner_join(positions, by = "bat_id")

In the following R code, the players’ run opportunities are plotted against their RE24 values using geom_text() with position as the label variable. (See Figure 5.4.)

ggplot(run_exp_400, aes(runs_start, RE24, label = position)) +
  geom_text() +
  geom_hline(yintercept = 0, color = "red") +
  geom_point(
    data = filter(run_exp_400, bat_id == altuve_id),
    size = 4, shape = 16, color = crcblue
  )
Figure 5.4: Scatterplot of total run value against the runs potential for all players in the 2016 season with at least 400 plate appearances. The points are labeled by the position in the batting lineup and the large point corresponds to José Altuve.

From Figure 5.4, we better understand the relationship between batting position, run opportunities, and run values. The best hitters—the ones who create a large number of runs—generally bat third, fourth, and fifth in the batting order. The number of runs created by the leadoff (first) and second batters in the lineup are much smaller than the runs created by the best hitters in the middle (third and fourth positions) of the lineup. There are some surprises from this general pattern of batting positions. For example, there are some cleanup hitters (position 4) displayed who have mediocre values of runs created.

How does José Altuve and his total run value of 34.7 compare among the group of hitters with at least 400 plate appearances? We had saved Altuve’s batter id in the value altuve_id. Figure 5.4 uses another application of the geom_point() function to display Altuve’s (runs_start, RE24) value by a large solid dot. In this particular season (2016), Altuve was one of the better hitters in terms of creating runs for his team.

5.8 Run Values of Different Base Hits

There are many applications of run values in studying baseball. Here we look at the value of a home run and a single from the perspective of creating runs.

One criticism of batting average is that it gives equal value to the four possible base hits (single, double, triple, and home run). One way of distinguishing the values of the base hits is to assign the number of bases reached: 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run. Alternatively, slugging percentage is the total number of bases divided by the number of at-bats. But it is not clear that the values 1, 2, 3, and 4 represent a reasonable measure of the value of the four possible base hits. We can get a better measure of the importance of these base hits by the use of run values.

5.8.1 Value of a home run

Let’s focus on the value of a home run from a runs perspective. We extract the home run plays from the data frame retro2016 using the event_cd play event variable. An event_cd value of 23 corresponds to a home run. Using the filter() function with the event_cd == 23 condition, we create a new data frame home_runs with the home run plays.

home_runs <- retro2016 |> 
  filter(event_cd == 23)

What are the runners/outs states for the home runs hit during the 2016 season? We answer this question using the table() function.

home_runs |>
  select(state) |>
  table()
state
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 
 1530   957   845    12    39    61    98   150   158    24 
011 1 011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 
   37    39   319   357   340    28    74    63    82   131 
110 2 111 0 111 1 111 2 
  156    18    44    48 

We compute the relative frequencies using the prop.table() function, and use the round() function to round the values to three decimal spaces.

home_runs |>
  select(state) |>
  table() |>
  prop.table() |>
  round(3)
state
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 
0.273 0.171 0.151 0.002 0.007 0.011 0.017 0.027 0.028 0.004 
011 1 011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 
0.007 0.007 0.057 0.064 0.061 0.005 0.013 0.011 0.015 0.023 
110 2 111 0 111 1 111 2 
0.028 0.003 0.008 0.009 

We see from this table that the fraction of home runs hit with the bases empty is \(0.273 +0.171+ 0.151 = 0.595\). So over half of the home runs are hit with no runners on base.

Overall, what is the run value of a home run? We answer this question by computing the average run value of all the home runs in the data frame home_runs.

mean_hr <- home_runs |>
  summarize(mean_run_value = mean(run_value))
mean_hr
# A tibble: 1 × 1
  mean_run_value
           <dbl>
1           1.38

What are the run values of these home runs? We already observed in the analysis of Altuve’s data that the run value of a home run with the bases empty is one. We construct a histogram of the run values for all home runs using the geom_histogram() function (see Figure 5.5).

ggplot(home_runs, aes(run_value)) +
  geom_histogram() + 
  geom_vline(
    data = mean_hr, aes(xintercept = mean_run_value), 
    color = "red", linewidth = 1.5
  ) +
  annotate(
    "text", 1.7, 2000, 
    label = "Mean Run\nValue", color = "red"
  )
Figure 5.5: Histogram of the run values of the home runs hit during the 2016 season. The vertical line shows the location of the mean run value of a home run.

It is obvious from this graph that most home runs (the ones with the bases empty) have a run value of one. But there is a cluster of home runs with values between 1.5 and 2.0, and there is a small group of home runs with run values exceeding three.

Which runner/out situations lead to the most valuable home runs? Using the arrange() function, we display the row of the data frame corresponding to the largest run value.

home_runs |> 
  arrange(desc(run_value)) |>
  select(state, new_state, run_value) |>
  slice_head(n = 1)
# A tibble: 1 × 3
  state new_state run_value
  <chr> <chr>         <dbl>
1 111 2 000 2          3.41

As one might expect, the most valuable home run occurs when there are bases loaded with two outs. The run value of this home run is 3.41.

Using the geom_vline() function, we draw a vertical line on the graph showing the mean run value and a label to this line (see Figure 5.5). This average run value is pretty small in relation to the value of a two-out grand slam, but this value partially reflects the fact that most home runs are hit with the bases empty.

5.8.2 Value of a single

Run values can also be used to evaluate the benefit of a single. Unlike a home run, the run value of a single will depend both on the initial state (runners and outs) and on the final state. The final state of a home run will always have the bases empty; in contrast, the final state of a single will depend on the movement of any runners on base.

We use the filter() function to select the plays where event_cd equals 20 (corresponding to a single); the new data frame is called singles. We construct a histogram of the run values for all of the singles in the 2016 season in Figure 5.6. As in the case of the home run, it is straightforward to compute the mean run value of a single. We display this mean value on the histogram in Figure 5.6.

singles <- retro2016 |> 
  filter(event_cd == 20)

mean_singles <- singles |>
  summarize(mean_run_value = mean(run_value))

ggplot(singles, aes(run_value)) + 
  geom_histogram(bins = 40) +
  geom_vline(
    data = mean_singles, color = "red", 
    aes(xintercept = mean_run_value), linewidth = 1.5
  ) +
  annotate(
    "text", 0.8, 4000, 
    label = "Mean Run\nValue", color = "red"
  )
Figure 5.6: Histogram of the run values of the singles hit during the 2016 season. The vertical line shows the location of the mean run value of a single.

Looking at the histogram of run values of the single, there are three large spikes between 0 and 0.5. These large spikes can be explained by constructing a frequency table of the beginning state.

singles |>
  select(state) |>
  table()
state
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 
 6920  4767  3763    71   323   354   516   745   864    90 
011 1 011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 
  224   208  1665  1974  1765   159   321   368   364   697 
110 2 111 0 111 1 111 2 
  729   115   280   257 

We see that most of the singles occur with the bases empty, and the three tall spikes in the histogram, as one moves from left to right in Figure 5.6, correspond to singles with no runners on and two outs, one out, and no outs. The small cluster of run values in the interval 0.5 to 2.0 correspond to singles hit with runners on base.

What is the most valuable single from the run value perspective? We use the arrange() function to find the beginning and end states for the single that resulted in the largest run value.

singles |> 
  arrange(desc(run_value)) |>
  select(state, new_state, run_value) |>
  slice_head(n = 1)
# A tibble: 1 × 3
  state new_state run_value
  <chr> <chr>         <dbl>
1 111 2 001 2          2.68

In this particular play, the hitter came to bat with the bases loaded and two outs, and the final state was a runner on third with two outs. How could this have happened with a single? The data frame does contain a brief description of the play. But from the data frame we identify the play happening during the bottom of the 8th inning of a game between the Orioles and Yankees on June 5, 2016. We check with Baseball-Reference to find the following play description:

Single to CF (Ground Ball thru SS-2B); Trumbo Scores; Davis Scores; Pena Scores/unER/Adv on E8 (throw); to 3B/Adv on throw

So evidently, the center fielder made an error on the fielding of the single that allowed all three runners to score and the batter to reach third base.

At the other extreme, by another use of the arrange() function, we identify two plays that achieved the smallest run value.

singles |> 
  arrange(run_value) |>
  select(state, new_state, run_value) |>
  slice(1)
# A tibble: 1 × 3
  state new_state run_value
  <chr> <chr>         <dbl>
1 010 0 100 1        -0.621

How could the run value of a single be negative six tenths of a run? With further investigation, we find that in each case, there was a runner on second who was hit by the ball in play and was called out.

In this case, we see that the mean value of a single is approximately equal to the run value when a single is hit with the bases empty with no outs. It is interesting that the run value of a single can be large (in the 1 to 2 range). These large run values reflect the fact that the benefit of the single depends on the advancement of the runners.

5.9 Value of Base Stealing

The run expectancy matrix is also useful in understanding the benefits of stealing bases. When a runner attempts to steal a base, there are two likely outcomes—either the runner will be successful in stealing the base or the runner will be caught stealing. Overall, is there a net benefit to attempting to steal a base?

The variable event_cd gives the code of the play and codes of 4 and 6 correspond respectively to a stolen base (SB) or caught stealing (CS). Using the filter() function, we create a new data frame stealing that consists of only the plays where a stolen base is attempted.

stealing <- retro2016 |> 
  filter(event_cd %in% c(4, 6))

By use of the summarize() and n() functions, we find the frequencies of the SB and CS outcomes.

stealing |> 
  group_by(event_cd) |> 
  summarize(N = n()) |> 
  mutate(pct = N / sum(N))
# A tibble: 2 × 3
  event_cd     N   pct
     <int> <int> <dbl>
1        4  2213 0.756
2        6   713 0.244

Among all stolen base attempts, the proportion of stolen bases is equal to 2213 / (2213 + 713) = 0.756.

What are common runners/outs situations for attempting a stolen base? We answer this by constructing a frequency table for the state variable.

stealing |> 
  group_by(state) |> 
  summarize(N = n())
# A tibble: 16 × 2
   state     N
   <chr> <int>
 1 001 1     1
 2 001 2     1
 3 010 0    37
 4 010 1   124
 5 010 2   102
 6 011 1     1
 7 100 0   559
 8 100 1   708
 9 100 2   870
10 101 0    37
11 101 1    99
12 101 2   219
13 110 0    30
14 110 1    84
15 110 2    53
16 111 1     1

We see that stolen base attempts typically happen with a runner only on first (state 100). But there are a wide variety of situations where runners attempt to steal.

Every stolen base attempt has a corresponding run value that is stored in the variable run_value. This run value reflects the success of the attempt (either SB or CS) and the situation (runners and outs) where this attempt occurs. Using the geom_histogram() function, we construct a histogram of all of the runs created for all the stolen base attempts in Figure 5.7. The color of the bar indicates the success or failure of the attempt.

ggplot(stealing, aes(run_value, fill = factor(event_cd))) + 
  geom_histogram() +
  scale_fill_manual(
    name = "event_cd",
    values = crc_fc,
    labels = c("Stolen Base (SB)", "Caught Stealing (CS)")
  )
Figure 5.7: Histogram of the run values of all steal attempts during the 2016 season.

Generally, all of the successful SBs have positive run value, although most of the values fall in the interval from 0 to 0.3. In contrast, the unsuccessful CSs (as expected) have negative run values. In further exploration, one can show that the three spikes for negative run values correspond to CS when there is only a runner on first with 0, 1, and 2 outs.

Let’s focus on the benefits of stolen base attempts in a particular situation. We create a new data frame that gives the attempted stealing data when there is a runner on first base with one out (state “100 1”).

stealing_1001 <- stealing |> 
  filter(state == "100 1")

By tabulating the event_cd variable, we see the runner successfully stole 498 times out of 498 + 210 attempts for a success rate of 70.3%.

stealing_1001 |> 
  group_by(event_cd) |> 
  summarize(N = n()) |> 
  mutate(pct = N / sum(N))
# A tibble: 2 × 3
  event_cd     N   pct
     <int> <int> <dbl>
1        4   498 0.703
2        6   210 0.297

Another way to look at the outcome is to look at the frequencies of the new_state variable.

stealing_1001 |> 
  group_by(new_state) |> 
  summarize(N = n()) |> 
  mutate(pct = N / sum(N))
# A tibble: 4 × 3
  new_state     N     pct
  <chr>     <int>   <dbl>
1 000 1         1 0.00141
2 000 2       211 0.298  
3 001 1        39 0.0551 
4 010 1       457 0.645  

This provides more information than simply recording a stolen base. On 457 occurrences, the runner successfully advanced to second base. On an additional 39 occurrences, the runner advanced to third. Perhaps this extra base was due to a bad throw from the catcher or a misplay by the infielder. More can be learned about the details of these plays by further examination of the other variables.

We are most interested in the value of attempting stolen bases in this situation—we address this by computing the mean run value of all of the attempts with a runner on first with one out.

stealing_1001 |> 
  summarize(Mean = mean(run_value))
# A tibble: 1 × 1
     Mean
    <dbl>
1 0.00723

Stolen base attempts are worthwhile, although the value overall is about 0.007 runs per attempt. Of course, the actual benefit of the attempt depends on the success or failure and on the situation (runners and outs) where the stolen base is attempted.

5.10 Further Reading and Software

Lindsey (1963) was the first researcher to analyze play-by-play data in the manner described in this chapter. Using data collected by his father for the 1959–60 season, Lindsey obtained the run expectancy matrix that gives the average number of runs in the remainder of the inning for each of the runners/outs situations. Chapters 7 and 9 of Albert and Bennett (2003) illustrate the use of the run expectancy matrix to measure the value of different base hits and to assess the benefits of stealing and sacrifice hits. Tango, Lichtman, and Dolphin (2007), in their Toolshed chapter, describe the run expectancy table as one of the fundamental tools used throughout their book. Also, run expectancy plays a major role in the essays in Keri and Baseball Prospectus (2007).

Baumer, Jensen, and Matthews (2015) use these run expectancies in the computation of WAR (wins above replacement) measures for players. The website http://www.fangraphs.com/library/misc/war/ introduces WAR, a useful way of summarizing a player’s total contribution to his team.

5.11 Exercises

1. Run Values of Hits

In Section 5.8, we found the average run value of a home run and a single.

  1. Use similar R code as described in Section 5.8 for the 2016 season data to find the mean run values for a double, and for a triple.

  2. Albert and Bennett (2003) use a regression approach to obtain the weights 0.46, 0.80, 1.02, and 1.40 for a single, double, triple, and home run, respectively. Compare the results from Section 5.8 and part (a) with the weights of Albert and Bennett.

2. Value of Different Ways of Reaching First Base

There are three different ways for a runner to get on base: a single, walk (BB), or hit-by-pitch (HBP). But these three outcomes have different run values due to the different advancement of the runners on base. Use run values based on data from the 2016 season to compare the benefit of a walk, a hit-by-pitch, and a single when there is a single runner on first base.

3. Comparing Two Players with Similar OBPs

Adam Eaton (Retrosheet batter id eatoa002) and Starling Marte (Retrosheet batter id marts002) both had 0.362 on-base percentages during the 2016 season. By exploring the run values of these two payers, investigate which player was really more valuable to his team. Can you explain the difference in run values in terms of traditional batting statistics such as AVG, SLG, or OBP?

4. Create Probability of Scoring a Run Matrix

In Section 5.3, we illustrate the construction of the run expectancy matrix from 2016 season data. Suppose instead that one was interested in computing the proportion of times when at least one run was scored for each of the 24 possible bases/outs situations. Use R to construct this probability of scoring matrix.

5. Runner Advancement with a Single

Suppose one is interested in studying how runners move with a single.

  1. Using the filter() function, select the plays when a single was hit. (The value of event_cd for a single is 20.) Call the new data frame singles.

  2. Use the group_by() and summarize() functions with the data frame singles to construct a table of frequencies of the variables state (the beginning runners/outs state) and new_state (the final runners/outs state).

  3. Suppose there is a single runner on first base. Using the table from part (b), explore where runners move with a single. Is it more likely for the lead runner to move to second, or to third base?

  4. Suppose instead there are runners on first and second. Explore where runners move with a single. Estimate the probability a run is scored on the play.

6. Hitting Evaluation of Players by Run Values

Choose several players who were good hitters in the 2016 season. For each player, find the run values and the runners on base for all plate appearances. As in Figure 5.1, construct a graph of the run values against the runners on base. Was this particular batter successful when there were runners in scoring position?


  1. The logic for computing the present and future state is encoded in the retrosheet_add_states() function in the abdwr3edata package. See the run_expectancy_code() function from the baseballr package for similar functionality.↩︎

  2. The variable bat_event_fl distinguishes batting events from non-batting events such as steals and wild pitches.↩︎