9  Simulation

Authors
Affiliations

Bowling Green State University

Smith College

Max Marchi

Cleveland Guardians

9.1 Introduction

A baseball season consists of a collection of games between teams, where each game consists of nine innings, and a half-inning consists of a sequence of plate appearances. Because of this clean structure, the sport can be represented by relatively simple probability models. Simulations from these models are helpful in understanding different characteristics of the game.

One attractive aspect of the R system is its ability to simulate from a wide variety of probability distributions. In this chapter, we illustrate the use of R functions to simulate a game consisting of a large number of plate appearances. Also, we use R to simulate the game-to-game competition between teams during an entire season.

Section 9.2 focuses on simulating the events in a baseball half-inning using a special probability model called a Markov chain. The runners on base and the number of outs define a state and this probability model describes movements between states until one reaches three outs. The movement or transition probabilities are found using actual data from the 2016 season. By simulating many half-innings using this model, one gets a basic understanding of the pattern of run scoring.

Section 9.3 describes a simulation of an entire baseball season using the Bradley-Terry probability model. Teams are assigned talents from a bell-shaped (normal) distribution and a season of baseball games is played using win probabilities based on the talents. By simulating many seasons, one learns about the relationship between a team’s talent and its performance in a 162-game season. We describe simulating the post-season series and assess the probability that the “best” team, that is, the team with the best ability actually wins the World Series.

9.2 Simulating a Half Inning

9.2.1 Markov chains

A Markov chain is a special type of probability model useful for describing movement between locations, called states. In the baseball context, a state is viewed as a description of the runners on base and the number of outs in an inning. Each of the three bases can be occupied by a runner or not, and so there are 2 \(\times\) 2 \(\times\) 2 = 8 possible runner situations. Since there are three possible numbers of outs (0, 1, or 2), there are 8 \(\times\) 3 = 24 possible runner and outs states. If we include the 3 outs state, there are a total of 25 possible states during a half-inning of baseball (see Section 5.1).

In a Markov chain, a matrix of transition probabilities is used to describe how one moves between the different states. For example, suppose that there are currently runners on first and second with one out. Based on the outcome of the plate appearance, the state can change. For example, the batter may hit a single; the runner on second scores and the runner on first moves to third. In this case, the new state is runners on first and third with one out. Or maybe the batter will strike out, and the new state is runners on first and second with two outs. By looking at a specific row in the transition probability matrix, one learns about the probability of moving to first and third with one out, or moving to first and second with two outs, or any other possible state.

In a Markov chain, there are two types of states: transition states and absorbing states. Once one moves into an absorbing state, one remains there and can’t return to other transition states. In a half-inning of baseball, since the inning is over when there are 3 outs, this 3-outs state acts as an absorbing state.

There are some special assumptions in a Markov chain model. We assume that the probability of moving to a new state only depends on the current state. So any baseball events that happened before the current runners and outs situation are not relevant in finding the probabilities1. In other words, this model assumes there is not a momentum effect in batting through an inning. Also we are assuming that the probabilities of these movements are the same for all teams, against all pitchers, and for all innings during a game. Clearly, this assumption that all teams are average is not realistic, but we will address this issue in one of the other sections of this chapter.

There are several attractive aspects of using a Markov chain to model a half-inning of baseball. First, the construction of the transition probability matrix is easily done with 2016 season data using computations from Chapter 5. One can use the model to play many half-innings of baseball, and the run scoring patterns that are found resemble the actual run scoring of actual MLB baseball. Last, there are special properties of Markov chains that simplify some interesting calculations, such as the number of players who come to bat during an inning.

9.2.2 Review of work in run expectancy

To construct the transition matrix for the Markov chain, one needs to know the frequencies of transitions from the different runners/outs states to other possible runners/outs states. One can obtain these frequencies using the Retrosheet play-by-play data from a particular season. Here we review the work from Chapter 5.

We begin by reading in play-by-play data for the 2016 season, creating the data frame retro2016.

library(tidyverse)
retro2016 <- read_rds(here::here("data/retro2016.rds"))

First, we use the retrosheet_add_states() function that we wrote in Chapter 5 and stored in the abdwr3edata package. This function adds a number of useful new variables to retro2016. Recall that in particular, we now have a new variable state (which gives the runner locations and the number of outs at the beginning of each play), and a another new variable new_state (which contains the same information at the conclusion of the play).

library(abdwr3edata)
retro2016 <- retro2016 |>
  retrosheet_add_states()

Next, we create the variable half_inning_id as a unique identifier for each half-inning in each baseball game. The new variable runs gives the number of runs scored in each play. The new data frame half_innings contains data aggregated over each half-inning of baseball played in 2016.

half_innings <- retro2016 |>
  mutate(
    runs = away_score_ct + home_score_ct,
    half_inning_id = paste(game_id, inn_ct, bat_home_id)
  ) |>
  group_by(half_inning_id) |>
  summarize(
    outs_inning = sum(event_outs_ct), 
    runs_inning = sum(runs_scored),
    runs_start = first(runs),
    max_runs = runs_inning + runs_start
  )

By using the filter() function from the dplyr package, we focus on plays where there is a change in the state or in the number of runs scored. By another application of filter(), we restrict attention to complete innings where there are three outs and where there is a batting event; the new dataset is called retro2016_complete. Here non-batting plays such as steals, caught stealing, wild pitches, and passed balls are ignored. There is obviously some consequence of removing these non-batting plays from the viewpoint of run production, and this issue is discussed later in this chapter.

retro2016_complete <- retro2016 |> 
  mutate(
    half_inning_id = paste(game_id, inn_ct, bat_home_id)
  ) |>
  inner_join(half_innings, join_by(half_inning_id)) |>
  filter(state != new_state | runs_scored > 0) |> 
  filter(outs_inning == 3, bat_event_fl)

In our definition of the new_state variable, we recorded the runner locations when there were three outs. The runner locations don’t matter, so we recode new_state to always have the value 3 when the number of outs is equal to 3. The str_replace() function replaces the regular expression [0-1]{3} 3—which matches any three character binary string followed by a space and a 3—with 3.

retro2016_complete <- retro2016_complete |>
  mutate(new_state = str_replace(new_state, "[0-1]{3} 3", "3"))

9.2.3 Computing the transition probabilities

Now that the state and new_state variables are defined, one can compute the frequencies of all possible transitions between states using the table() function. The matrix of counts is T_matrix. There are 24 possible values of the beginning state state, and 25 values of the final state new_state including the 3-outs state.

T_matrix <- retro2016_complete |>
  select(state, new_state) |>
  table()
dim(T_matrix)
[1] 24 25

This matrix can be converted to a probability matrix by use of the prop.table() function. The resulting matrix is denoted by P_matrix.

P_matrix <- prop.table(T_matrix, 1)
dim(P_matrix)
[1] 24 25

Finally, we add a row to this transition probability matrix corresponding to transitions from the 3-out state. When the inning reaches 3 outs, then it stays at 3 outs, so the probability of staying in this state is 1.

P_matrix <- P_matrix |>
  rbind("3" = c(rep(0, 24), 1))

The matrix P_matrix now has two important properties that allow it to model transitions between states in a Markov chain: 1) it is square, and; 2) the entries in each of its rows sum to 1.

dim(P_matrix)
[1] 25 25
P_matrix |>
  apply(MARGIN = 1, FUN = sum)
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 
    1     1     1     1     1     1     1     1     1     1 
011 1 011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 
    1     1     1     1     1     1     1     1     1     1 
110 2 111 0 111 1 111 2     3 
    1     1     1     1     1 

To better understand this transition matrix, we display the transition probabilities starting at the “000 0” state, no runners and no outs below. (Only the positive probabilities are shown and the as_tibble() and pivot_longer() functions are used to display the probabilities vertically.) The most likely transitions are to the “no runners, one out” state with probability 0.676 and to the “runner on first, no outs” state with probability 0.235. The probability of moving from the “000 0” state to the “000 0” state is 0.033; in other words, the chance of a home run with no runners on with no outs is 0.033.

P_matrix |>
  as_tibble(rownames = "state") |>
  filter(state == "000 0") |>
  pivot_longer(
    cols = -state, 
    names_to = "new_state", 
    values_to = "Prob" 
  ) |>
  filter(Prob > 0)
# A tibble: 5 × 3
  state new_state    Prob
  <chr> <chr>       <dbl>
1 000 0 000 0     0.0334 
2 000 0 000 1     0.676  
3 000 0 001 0     0.00563
4 000 0 010 0     0.0503 
5 000 0 100 0     0.235  

Let’s contrast this with the possible transitions starting from the “010 2” state, runner on second with two outs. The most likely transitions are “3 outs” (probability 0.650), “runners on first and second with two outs” (probability 0.156), and “runner on first with 2 outs” (probability 0.074).

P_matrix |>
  as_tibble(rownames = "state") |>
  filter(state == "010 2") |>
  pivot_longer(
    cols = -state, 
    names_to = "new_state", 
    values_to = "Prob" 
  ) |>
  filter(Prob > 0)
# A tibble: 8 × 3
  state new_state     Prob
  <chr> <chr>        <dbl>
1 010 2 000 2     0.0233  
2 010 2 001 2     0.00587 
3 010 2 010 2     0.0576  
4 010 2 011 2     0.000451
5 010 2 100 2     0.0745  
6 010 2 101 2     0.0325  
7 010 2 110 2     0.156   
8 010 2 3         0.650   

9.2.4 Simulating the Markov chain

One can simulate this Markov chain model a large number of times to obtain the distribution of runs scored in a half-inning of 2016 baseball. The first step is to construct a matrix giving the runs scored in all possible transitions between states. Let \(N_{runners}\) denote the number of runners in a state and \(O\) denote the number of outs. Because every player who has already batted in the inning is either on base, out, or has scored, for a batting play, the number of runs scored is equal to \[ runs = (N_{runners}^{(b)} + O^{(b)} + 1) - (N_{runners}^{(a)} + O^{(a)}). \]

In other words, the runs scored is the sum of runners and outs before (b) the play minus the sum of runners and outs after (a) the play plus one. For example, suppose there are runners on first and second with one out, and after the play, there is a runner on second with two outs. The number of runs scored is equal to \[ runs = (2 + 1 + 1) - (1 + 2) = 1. \]

We define a new function num_havent_scored() which takes a state as input and returns the sum of the number of runners and outs. We then apply this function across all the possible states (using the map_int() function) and the corresponding sums are stored in the vector runners_out.

num_havent_scored <- function(s) {
  s |>
    str_split("") |>
    pluck(1) |>
    as.numeric() |>
    sum(na.rm = TRUE)
}

runners_out <- T_matrix |>
  row.names() |>
  set_names() |>
  map_int(num_havent_scored)

The outer() function with the - (subtraction) operation performs the runs calculation for all possible pairs of states and the resulting matrix is stored in the matrix R_runs. If one inspects the matrix R_runs, one will notice some negative values and some strange large positive values. But this is not a concern since the corresponding transitions, for example a movement between a “000 0” state and a “000 2” state in one batting play, are not possible. To make the matrix square, we add an additional column of zeros to this run matrix using the cbind() function.

R_runs <- outer(
  runners_out + 1, 
  runners_out, 
  FUN = "-"
) |>
  cbind("3" = rep(0, 24))

We are now ready to simulate a half-inning of baseball using a new function simulate_half_inning(). The inputs are the probability transition matrix P, the run matrix R, and the starting state s (an integer between 1 and 24). The output is the number of runs scored in the half-inning.

simulate_half_inning <- function(P, R, start = 1) {
  s <- start
  path <- NULL
  runs <- 0
  while (s < 25) {
    s_new <- sample(1:25, size = 1, prob = P[s, ])
    path <- c(path, s_new)
    runs <- runs + R[s, s_new]
    s <- s_new
  }
  runs
}

There are two key statements in this simulation. If the current state is s, the function sample() will simulate a new state using the s row in the transition matrix P; the new state is denoted s_new. The total number of runs scored in the inning is updated using the value in the s row and the s_new column of the runs matrix R.

Using the map_int() function, one can simulate a large number of half-innings of baseball. In the below code, we simulate 10,000 half-innings starting with no runners and no outs (state 1), collecting the runs scored in the vector simulated_runs. The set.seed() function sets the random number seed so the reader can reproduce the results of this particular simulation by running this code.

set.seed(111653)
simulated_runs <- 1:10000 |>
  map_int(~simulate_half_inning(T_matrix, R_runs))

To find the possible runs scored in a half-inning, we use the table() function to tabulate the values in simulated_runs.

table(simulated_runs)
simulated_runs
   0    1    2    3    4    5    6    7    8    9 
7364 1437  651  324  126   50   34   10    2    2 

In our 10,000 simulations, five or more runs scored in 50 + 34 + 10 + 2 + 2 = 98 half-innings, so the chance of scoring five or more runs would be 98 / 10,000 = 0.0098. This calculation can be checked using the sum() function.

sum(simulated_runs >= 5) / 10000
[1] 0.0098

We compute the mean number of runs scored by applying the mean() function to simulated_runs.

mean(simulated_runs)
[1] 0.477

Over the 10,000 half-innings, an average of 0.477 runs were scored.

To understand the runs potential of different runners and outs situations, one can repeat this simulation procedure for other starting states. We write a function runs_j() to compute the mean number of runs scored starting with state j. Using the map_int() function, we apply the function runs_j() over all of the possible starting states 1 through 24. The output is a vector of mean runs scored stored in the mean_run_value column. These values are displayed below as a simulated expected run matrix (see Section 5.1).

runs_j <- function(j) {
  1:10000 |>
    map_int(~simulate_half_inning(T_matrix, R_runs, j)) |>
    mean()
}

erm_2016_mc <- tibble(
  state = row.names(T_matrix), 
  mean_run_value = map_dbl(1:24, runs_j)
) |>
  mutate(
    bases = str_sub(state, 1, 3),
    outs_ct = as.numeric(str_sub(state, 5, 5))
  ) |>
  select(-state)
  
erm_2016_mc |>
  pivot_wider(names_from = outs_ct, values_from = mean_run_value)
# A tibble: 8 × 4
  bases   `0`   `1`   `2`
  <chr> <dbl> <dbl> <dbl>
1 000   0.481 0.255 0.103
2 001   1.32  0.925 0.338
3 010   1.14  0.640 0.295
4 011   1.93  1.30  0.474
5 100   0.855 0.500 0.211
6 101   1.71  1.14  0.425
7 110   1.39  0.875 0.406
8 111   2.19  1.46  0.667

Recall that our simulation model is based only on batting plays. To understand the effect of non-batting plays (stealing, caught stealing, wild pitches, etc.) on run scoring, we compare this run expectancy matrix with the one found in Chapter 5 using all batting and non-batting plays. Their difference is the contribution of non-batting plays to the average number of runs scored.

erm_2016 <- read_rds(here::here("data/erm2016.rds"))

erm_2016 |>
  inner_join(erm_2016_mc, join_by(bases, outs_ct)) |>
  mutate(
    run_value_diff = round(mean_run_value.x - mean_run_value.y, 2)
  ) |>
  select(bases, outs_ct, run_value_diff) |>
  pivot_wider(names_from = outs_ct, values_from = run_value_diff)
# A tibble: 8 × 4
# Groups:   bases [8]
  bases   `0`   `1`   `2`
  <chr> <dbl> <dbl> <dbl>
1 000    0.02  0.01  0   
2 001    0.03  0.01  0.03
3 010    0     0.03  0.02
4 011    0     0.06  0.07
5 100    0     0.01  0.01
6 101    0.02  0.06  0.05
7 110    0.06  0.05  0.01
8 111   -0.08  0.07  0.03

Note that most of the values of the difference are positive, indicating that these non-batting plays generally do create runs. We note that the largest values tend to occur in situations when there is a runner on third who can score on a wild pitch or passed ball.

9.2.5 Beyond run expectancy

By using properties of Markov chains, it is straightforward to use the transition matrix to learn more about the movement through the runners/outs states.

By multiplying the probability matrix P_matrix by itself three times, we can learn about the likelihood of the state of the inning after three plate appearances. In R, matrix multiplication is indicated by the %*% symbol. The result is stored in the matrix P_matrix_3.

P_matrix_3 <- P_matrix %*% P_matrix %*% P_matrix

The first row of P_matrix_3 gives the probabilities of being in each of the 25 states after three hitters starting at the “000 0” state. We round these values to three decimal places, sort from largest to smallest, and display the largest values.

P_sorted <- P_matrix_3 |>
  as_tibble(rownames = "state") |>
  filter(state == "000 0") |>
  pivot_longer(
    cols = -state, names_to = "new_state", values_to = "Prob"
  ) |>
  arrange(desc(Prob))
P_sorted |> 
  slice_head(n = 6)
# A tibble: 6 × 3
  state new_state   Prob
  <chr> <chr>      <dbl>
1 000 0 3         0.372 
2 000 0 100 2     0.241 
3 000 0 110 1     0.0815
4 000 0 010 2     0.0739
5 000 0 000 2     0.0529
6 000 0 001 2     0.0286

After three PAs, the most likely outcomes are three outs (probability 0.372), runner on first with 2 outs (probability 0.241), and runners on first and second with one out (probability 0.081).

It is also easy to learn about the number of visits to all runner-outs states. Define the matrix Q to be the 24-by-24 submatrix found from the transition matrix by removing the last row and column (the three outs state). By subtracting the matrix Q from the identity matrix and taking the inverse of the result, we obtain the fundamental matrix N of an absorbing Markov chain. (The diag() function is used to construct the identity matrix and the function solve() takes the matrix inverse.)

Q <- P_matrix[-25, -25]
N <- solve(diag(rep(1, 24)) - Q)

To understand the fundamental matrix, we display the beginning entries of the first row of the matrix.

N_0000 <- round(N["000 0", ], 2)
head(N_0000, n = 6)
000 0 000 1 000 2 001 0 001 1 001 2 
 1.05  0.75  0.60  0.01  0.03  0.05 

Starting at the beginning of the inning (the “000 0” state), the average number of times the inning will be in the “000 0” state is 1.05, the average number of times in the “000 1” state is 0.75, the average number of times in the “000 2” state is 0.6, and so on. By using the sum() function, we find the average number of states that are visited.

sum(N_0000)
[1] 4.27

In other words, the average number of plate appearances in a half-inning (before three outs) is 4.27.

We can compute the average number of batting plays until three outs for all starting states by multiplying the fundamental matrix N by a column vector of ones. The vector of average number of plays is stored in the variable avg_num_plays and eight values of this vector are displayed.

avg_num_plays <- N %*% rep(1, 24) |>
  t() |>
  round(2)
avg_num_plays[,1:8]
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 
 4.27  2.87  1.46  4.33  2.99  1.53  4.34  2.93 

This tells us the length of the remainder of the inning, on average, starting with each possible state. For example, starting at the bases empty, one out state, we expect on average to have 2.87 more batters. In contrast, with a runner on third with two outs, we expect to have 1.53 more batters.

9.2.6 Transition probabilities for individual teams

The transition probability matrix describes movements between states for an average team. Certainly, these probabilities will vary for teams of different batting abilities, and the probabilities will also vary against teams of different pitching abilities. We focus on different batting teams and discuss how to obtain good estimates of the transition probabilities for all teams.

To get the relevant data, a new variable batting_team needs to be defined that gives the batting team in each half-inning. By use of the str_sub() function, we define the home team variable home_team_id, and an if_else() function is used to define the batting team.

retro2016_complete <- retro2016_complete |>
  mutate(
    home_team_id = str_sub(game_id, 1, 3),
    batting_team = if_else(
      bat_home_id == 0,
      away_team_id, 
      home_team_id
    )
  )

By use of the group_by() and count() functions, we construct a data frame T_team giving the counts of each team in the transitions from the current to new states.

T_team <- retro2016_complete |>
  group_by(batting_team, state, new_state) |>
  count()

For example, the filtering for batting_team equal to ANA gives the transition counts for Anaheim in the 2016 season.

T_team |>
  filter(batting_team == "ANA") |> 
  slice_head(n = 6)
# A tibble: 192 × 4
# Groups:   batting_team, state, new_state [192]
   batting_team state new_state     n
   <chr>        <chr> <chr>     <int>
 1 ANA          000 0 000 0        40
 2 ANA          000 0 000 1      1007
 3 ANA          000 0 001 0         9
 4 ANA          000 0 010 0        75
 5 ANA          000 0 100 0       359
 6 ANA          000 1 000 1        31
 7 ANA          000 1 000 2       720
 8 ANA          000 1 001 1         3
 9 ANA          000 1 010 1        54
10 ANA          000 1 100 1       261
# ℹ 182 more rows

If one is interested in comparing run productions for different batting teams, it is necessary to make some adjustments to the team transition probability matrices to get realistic predictions of performance. To illustrate the problem, we focus on transitions from the “100 2” state. We store the transition counts in the data frame T_team_S using the tally() function and display a few rows of this table below for six of the teams.

T_team_S <- retro2016_complete |> 
  filter(state == "100 2") |>
  group_by(batting_team, state, new_state) |>
  tally()

T_team_S |> 
  ungroup() |>
  sample_n(size = 6)
# A tibble: 6 × 4
  batting_team state new_state     n
  <chr>        <chr> <chr>     <int>
1 MIN          100 2 010 2        15
2 CIN          100 2 101 2        19
3 NYN          100 2 101 2        14
4 SEA          100 2 010 2         6
5 DET          100 2 011 2         9
6 DET          100 2 101 2        15

For some of the less common transitions, there is much variability in the counts across teams and this causes the corresponding team transition probabilities to be unreliable. If \(p^{TEAM}\) represents the team’s transition probabilities for a particular team, and \(p^{ALL}\) are the average transition probabilities, then a better estimate at the team’s probabilities has the form \[ p^{EST} = \frac{n}{n + K} p^{TEAM} + \frac{K}{n + K} p^{ALL}, \] where \(n\) is the number of transitions for the team and \(K\) is a smoothing count. The description of the methodology is beyond the scope of this book, but in this case a smoothing count of \(K = 1274\) leads to a good estimate at the team’s transition probabilities. (The choice of \(K\) depends on the starting state.)

This method is illustrated for Washington’s transition counts starting from the “100 2” state. In the data frame T_WAS, the transition counts are stored in the variable n, and the corresponding proportions are stored in p. Similarly, for all teams, n and p are the counts and proportions in the data frame T_all.

T_WAS <- T_team_S |>
  filter(batting_team == "WAS") |>
  mutate(p = n / sum(n))

T_all <- retro2016_complete |> 
  filter(state == "100 2") |>
  group_by(new_state) |> 
  tally() |>
  mutate(p = n / sum(n))

We compute the improved estimate at Washington’s transition proportions using the formula and store the results in p_EST. The three sets of proportions (Washington, overall, and improved) are displayed in a data frame.

T_WAS |>
  inner_join(T_all, by = "new_state") |>
  mutate(
    p_EST = (n.x / (1274 + n.x)) * p.x + (1274 / (1274 + n.x)) * p.y
  )  |>
  select(batting_team, new_state, p.x, p.y, p_EST)
# A tibble: 8 × 6
# Groups:   batting_team, state [1]
  state batting_team new_state     p.x      p.y    p_EST
  <chr> <chr>        <chr>       <dbl>    <dbl>    <dbl>
1 100 2 WAS          000 2     0.0319  0.0291   0.0291  
2 100 2 WAS          001 2     0.00532 0.00577  0.00577 
3 100 2 WAS          010 2     0.0213  0.0220   0.0219  
4 100 2 WAS          011 2     0.0213  0.0220   0.0219  
5 100 2 WAS          100 2     0.00266 0.000775 0.000776
6 100 2 WAS          101 2     0.0452  0.0435   0.0435  
7 100 2 WAS          110 2     0.184   0.195    0.194   
8 100 2 WAS          3         0.689   0.682    0.683   

Note that the improved transition proportions are a compromise between the team’s proportions and the overall values. For example, for a transition from the state “100 2” to “010 2”, the Washington value is 0.0213, the overall value is 0.0220, and the improved value 0.0219 falls between the Washington and overall values. This method is especially helpful for particular transitions such as “100 2” to “100 2”, which may not occur for one team in this season but for which we know there is a positive chance of these transitions happening in the future.

This smoothing method can be applied for all teams and all rows of the transition matrix to obtain improved estimates of teams’ probability transition matrices. With the team transition matrices computed in this way, one can explore the run-scoring behavior of individual batting teams.

9.3 Simulating a Baseball Season

9.3.1 The Bradley-Terry model

An attractive method of modeling paired comparison data such as baseball games is the Bradley-Terry model. We illustrate this modeling technique via simulation for the 1968 Major League Baseball season when the regular season and playoff system had a relatively simple structure. It is straightforward to adapt these methods to the present baseball season with a more complicated schedule and playoff system.

In 1968, there were 20 teams, 10 in the National League and 10 in the American League. Suppose each team has a talent or ability to win a game. The talents for the 20 teams are represented by the values \(T_1, ..., T_{20}\). We assume that the talents are distributed from a normal curve model with mean 0 and standard deviation \(s_T\). A team of average ability would have a talent value close to zero, “good” teams would have positive talents, and bad teams would have negative talents. Suppose team \(A\) plays team \(B\) in a single game. By the Bradley-Terry model, the probability team \(A\) wins the game is given by the logistic function \[ P(A \, wins) = \frac{\exp(T_A)}{\exp(T_A) + \exp(T_B)}. \]

This model is closely related to the log5 method developed by Bill James in his Baseball Abstract books in the 1980s (see, for example, James (1982)). If \(P_A\) and \(P_B\) are the winning percentages of teams \(A\) and \(B\), then James’ formula is given by \[ P(A \, wins) = \frac{P_A / (1- P_A)}{P_A / (1- P_A) + P_B / (1- P_B)}. \] Comparing the two formulas, one sees that the log5 method is a special case of the Bradley-Terry model where a team’s talent \(T\) is set equal to the log odds of winning \(\log(P / (1 - P))\). A team with a talent \(T = 0\) will win (in the long run) half of its games (\(P = 0.5\)). In contrast, a team with talent \(T = 0.2\) will win (using the log5 values) approximately \(55\%\) of its games and a team with talent \(T = -0.2\) will win \(45\%\) of its games.

Using this model, one can simulate a baseball season as follows.

  1. Construct the 1968 baseball schedule. In this season, each of the 10 teams in each league play each other team in the same league 18 games, where 9 games are played in each team’s ballpark. (There was no interleague play in 1968.)
  2. Simulate 20 talents from a normal distribution with mean 0 and standard deviation \(s_T\). The value of \(s_T\) is chosen so that the simulated season winning percentages from this model resemble the actual winning percentages during this season.
  3. Using the probability formula and the talent values, one computes the probabilities that the home team wins all games. By a series of coin flips with these probabilities, one determines the winners of all games.
  4. Determine the winner of each league (ties need to be broken by some random mechanism) and play a best-of-seven World Series using winning probabilities computed using the Bradley-Terry model and the two talent numbers.

9.3.2 Making up a schedule

The first step in the simulation is to construct the schedule of games. We wrote a short function make_schedule() to help with this task. The inputs are the vector of team names teams and the number of games k that will be played between two teams in the first team’s home park. The output is a data frame where each row corresponds to a game and Home and Visitor give the names of the home and visiting teams. The rep() function, which generates repeated copies of a vector, is used several times in this function.

make_schedule <- function(teams, k) {
  num_teams <- length(teams)
  Home <- rep(rep(teams, each = num_teams), k)
  Visitor <- rep(rep(teams, num_teams), k)
  tibble(Home = Home, Visitor = Visitor) |>
    filter(Home != Visitor)
}

This function is used to construct the schedule for the 1968 season. Two vectors NL and AL are constructed containing abbreviations for the National League and American League teams. We apply the function make_schedule() twice, once for each league, using k = 9 since one team hosts another team nine games. We use the list_rbind() function to paste together the NL and AL schedules, creating the data frame schedule.

library(Lahman)
teams_68 <- Teams |>
  filter(yearID == 1968) |>
  select(teamID, lgID) |>
  mutate(teamID = as.character(teamID)) |>
  group_by(lgID)

schedule <- teams_68 |>
  group_split() |>
  set_names(pull(group_keys(teams_68), "lgID")) |>
  map(~make_schedule(teams = .x$teamID, k = 9)) |>
  list_rbind(names_to = "lgID")
dim(schedule)
[1] 1620    3

Note that schedule has \(\frac{162 \cdot 20}{2}\) rows, since each game involves two teams.

9.3.3 Simulating talents and computing win probabilities

The next step is to compute the win probabilities for all of the games in the season schedule. The team talents are assumed to come from a normal distribution with mean 0 and standard deviation s_talent, which we assign s_talent = 0.20. (Recall that this value of the standard deviation is chosen so that the season team win percentages generated from the model resemble the actual team win percentages.) We simulate the talents using the function rnorm() that assigns the talents to the 20 teams. By use of two applications of the inner_join() function, we add the team talents to the schedule data frame; the new data frame is called schedule_talent.

s_talent <- 0.20
teams_68 <- teams_68 |>
  mutate(talent = rnorm(10, 0, s_talent))

schedule_talent <- schedule |>
  inner_join(teams_68, join_by(lgID, Home == teamID)) |>
  rename(talent_home = talent) |>
  inner_join(teams_68, join_by(lgID, Visitor == teamID)) |>
  rename(talent_visitor = talent)

Last, once we have the talents for the home and visiting teams for all games, we apply the Bradley-Terry model to compute home team winning probabilities for all games; these probabilities are stored in the variable prob_home.

schedule_talent <- schedule_talent |> 
  mutate(
    prob_home = exp(talent_home) /
      (exp(talent_home) + exp(talent_visitor))
  )

The first six rows of the data frame schedule_talent are displayed below, where one sees the games scheduled, the talents of the home and away teams, and the probability that the home team wins the matchup.

slice_head(schedule_talent, n = 6)
# A tibble: 6 × 6
  lgID  Home  Visitor talent_home talent_visitor prob_home
  <chr> <chr> <chr>         <dbl>          <dbl>     <dbl>
1 AL    BAL   BOS           0.197        0.269       0.482
2 AL    BAL   CAL           0.197       -0.230       0.605
3 AL    BAL   CHA           0.197       -0.00924     0.551
4 AL    BAL   CLE           0.197       -0.185       0.594
5 AL    BAL   DET           0.197        0.409       0.447
6 AL    BAL   MIN           0.197       -0.208       0.600

9.3.4 Simulating the regular season

To simulate an entire season of games, we perform a series of coin flips, where the probability the home team wins depends on the winning probability. The function rbinom() performs the coin flips for the 1620 scheduled games; the outcomes are a sequence of 0s and 1s. By use of the if_else() function, we define the winner variable to be the Home team if the outcome is 1, and the Visitor otherwise.

schedule_talent <- schedule_talent |>
  mutate(
    outcome = rbinom(nrow(schedule_talent), 1, prob_home),
    winner = if_else(outcome == 1, Home, Visitor)
)

The teams, home win probabilities, and outcomes of the first six games are displayed below.

schedule_talent |>
  select(Visitor, Home, prob_home, outcome, winner) |> 
  slice_head(n = 6)
# A tibble: 6 × 5
  Visitor Home  prob_home outcome winner
  <chr>   <chr>     <dbl>   <int> <chr> 
1 BOS     BAL       0.482       0 BOS   
2 CAL     BAL       0.605       0 CAL   
3 CHA     BAL       0.551       1 BAL   
4 CLE     BAL       0.594       1 BAL   
5 DET     BAL       0.447       0 DET   
6 MIN     BAL       0.600       0 MIN   

How did the teams perform during this particular simulated season? Using the group_by() and summarize() functions, we find the number of wins for all teams. We collect this information together with the team names in the data frame WIN, and use the inner_join() function to combine the season results with the team talents to create the data frame results.

results <- schedule_talent |> 
  group_by(winner) |>
  summarize(Wins = n()) |>
  inner_join(teams_68, by = c("winner" = "teamID"))

9.3.5 Simulating the post-season

After the regular season, one can simulate the post-season series. We write a function win_league() that simulates a league championship. The inputs are the data frame res of teams and win totals. By use of the min_rank() function, we identify the teams that have the largest number of wins in each league. If one team has the maximum number, then an indicator variable is_winner_lg is created, which is 1 for that particular team. In order to avoid a tie in win totals for two or more teams, we randomly add a random tiebreaker quantity (that is less than 1) to every teams win total using the runif() function.

win_league <- function(res) {
  res |>
    group_by(lgID) |>
    mutate(
      tiebreaker = runif(n = length(talent)),
      wins_total = Wins + tiebreaker,
      rank = min_rank(desc(wins_total)),
      is_winner_lg = wins_total == max(wins_total)
    )
}

To simulate the post-season, we populate a new variable is_winner_ws; this is an indicator for the World Series winner. By an application of win_league(), we find the winners of each league. We simulate the World Series by flipping a coin seven times (rmultinom()), where the win probabilities are proportional to \(\exp(talent)\). The is_winner_ws indicates the team winning a majority of the games.

sim_one <- win_league(results)

ws_winner <- sim_one |>
  filter(is_winner_lg) |>
  ungroup() |>
  mutate(
    outcome = as.numeric(rmultinom(1, 7, exp(talent))),
    is_winner_ws = outcome > 3
  ) |>
  filter(is_winner_ws) |>
  select(winner, is_winner_ws)

sim_one |>
  left_join(ws_winner, by = c("winner")) |>
  replace_na(list(is_winner_ws = 0))
# A tibble: 20 × 9
# Groups:   lgID [2]
   winner  Wins lgID    talent tiebreaker wins_total  rank
   <chr>  <int> <fct>    <dbl>      <dbl>      <dbl> <int>
 1 ATL       83 NL    -0.215       0.867        83.9     5
 2 BAL       86 AL     0.197       0.0260       86.0     3
 3 BOS       99 AL     0.269       0.354        99.4     2
 4 CAL       62 AL    -0.230       0.936        62.9    10
 5 CHA       85 AL    -0.00924     0.246        85.2     4
 6 CHN       85 NL    -0.107       0.829        85.8     4
 7 CIN       81 NL    -0.0612      0.264        81.3     6
 8 CLE       71 AL    -0.185       0.290        71.3     9
 9 DET      100 AL     0.409       0.0841      100.      1
10 HOU       90 NL    -0.0871      0.845        90.8     2
11 LAN       63 NL    -0.326       0.729        63.7    10
12 MIN       74 AL    -0.208       0.667        74.7     7
13 NYA       78 AL     0.0424      0.422        78.4     6
14 NYN       80 NL     0.100       0.569        80.6     7
15 OAK       82 AL     0.287       0.927        82.9     5
16 PHI       93 NL     0.265       0.819        93.8     1
17 PIT       71 NL    -0.146       0.0460       71.0     9
18 SFN       88 NL     0.249       0.119        88.1     3
19 SLN       76 NL    -0.348       0.425        76.4     8
20 WS2       73 AL     0.0842      0.730        73.7     8
# ℹ 2 more variables: is_winner_lg <lgl>, is_winner_ws <lgl>

9.3.6 Function to simulate one season

It is convenient to place all of these commands including the functions make_schedule() and win_league() in a single function one_simulation_68(), which you can find in the abdwr3edata package. The only input is the standard deviation s_talent that describes the spread of the normal talent distribution. The output is a data frame containing the teams, talents, number of season wins, and success in the post-season. We illustrate simulating one season and display the data frame results_1 that is returned.

library(abdwr3edata)
set.seed(111653)
results_1 <- one_simulation_68(0.20)
results_1
# A tibble: 20 × 6
   Team   Wins League   Talent Winner.Lg Winner.WS
   <chr> <int>  <dbl>    <dbl>     <dbl>     <dbl>
 1 SFN      93      1 -0.0591          1         0
 2 PHI      93      1 -0.00979         0         0
 3 LAN      87      1  0.00406         0         0
 4 HOU      84      1 -0.117           0         0
 5 SLN      80      1 -0.128           0         0
 6 ATL      79      1 -0.100           0         0
 7 CIN      79      1 -0.235           0         0
 8 NYN      76      1 -0.269           0         0
 9 CHN      76      1 -0.0199          0         0
10 PIT      63      1 -0.313           0         0
11 NYA     100      2  0.284           1         1
12 DET      93      2  0.379           0         0
13 CHA      87      2  0.139           0         0
14 BOS      86      2 -0.102           0         0
15 WS2      84      2  0.0915          0         0
16 OAK      82      2 -0.0622          0         0
17 CAL      78      2 -0.129           0         0
18 BAL      74      2 -0.0728          0         0
19 MIN      65      2 -0.207           0         0
20 CLE      61      2 -0.292           0         0

We write a new function display_standings() to put the season wins in a more familiar standings format. The inputs to this function are the results_1 data frame and the league indicator.

display_standings <- function(data, league) {
  data |> 
    filter(League == league) |>
    select(Team, Wins) |>
    mutate(Losses = 162 - Wins) |>
    arrange(desc(Wins))
}

We then apply this function twice (once for each league) using map() and then use the bind_cols() function to combine the two standings into a single data frame. The league champions and the World Series winner are also displayed below.

map(1:2, display_standings, data = results_1) |>
  bind_cols()
# A tibble: 10 × 6
   Team...1 Wins...2 Losses...3 Team...4 Wins...5 Losses...6
   <chr>       <int>      <dbl> <chr>       <int>      <dbl>
 1 SFN            93         69 NYA           100         62
 2 PHI            93         69 DET            93         69
 3 LAN            87         75 CHA            87         75
 4 HOU            84         78 BOS            86         76
 5 SLN            80         82 WS2            84         78
 6 ATL            79         83 OAK            82         80
 7 CIN            79         83 CAL            78         84
 8 NYN            76         86 BAL            74         88
 9 CHN            76         86 MIN            65         97
10 PIT            63         99 CLE            61        101
results_1 |>
  filter(Winner.Lg == 1) |> 
  select(Team, Winner.WS)
# A tibble: 2 × 2
  Team  Winner.WS
  <chr>     <dbl>
1 SFN           0
2 NYA           1

In this particular simulated season, the Philadelphia Phillies (PHI) and the San Francisco Giants (SFN) tied for the National League title with 93 wins and the New York Yankees (NYA) won the American League with 100 wins. The Yankees defeated the Giants in the World Series. The team with the best talent in this season was Detroit (talent equal to 0.379) and they lost in the ALCS. In other words the “best team in baseball” was not the most successful during this simulated season. We will shortly see if the best team typically wins the World Series.

9.3.7 Simulating many seasons

One can learn about the relationship between a team’s ability and its season performance by simulating many seasons of baseball. To simulate 1000 seasons, we use the rep() function to create a vector of length 1000, then use map() to repeatedly apply the one_simulation_68() function to this vector, storing the output in many_results.

set.seed(111653)
many_results <- rep(0.20, 1000) |>
  map(one_simulation_68) |>
  list_rbind()

The data frame many_results contains the talent number and number of wins for 1000 \(\times\) 20 = 20,000 teams. By use of the geom_point() function using the alpha = 0.05 argument, we construct a “smoothed” scatterplot of Talent and Wins in Figure 9.1.

ggplot(many_results, aes(Talent, Wins)) +
  geom_point(alpha = 0.05)
Figure 9.1: Scatterplot of team talents and wins for many season simulations.

As expected, there is a positive trend in the graph, indicating that better teams tend to win more games. But there is much vertical spread in the scatterplot, which says that the relationship between talent and wins is not strong.

To reinforce the last point, suppose we focus on “average” teams that have a talent number between \(-0.05\) and 0.05. Using the filter() function, we isolate the talent and wins data for these average teams. A histogram of the season wins for these teams is shown in Figure 9.2.

many_results |>
  filter(Talent > -0.05, Talent < 0.05) |>
  ggplot(aes(Wins)) + 
  geom_histogram(color = crcblue, fill = "white")
Figure 9.2: Histogram of number of wins of teams of average talents in the simulations.

One expects these average teams to win about 81 games. But what is surprising is the variability in the win totals—average teams can regularly have win totals between 70 and 90, and it is possible (but not likely) to have a win total close to 100.

What is the relationship between a team’s talent and its post-season success? Consider first the relationship between a team’s talent (variable Talent) and winning the league (the variable Winner.Lg). Since Winner.Lg is a binary (0 or 1) variable, a common approach for representing this relationship is a logistic model; this is a generalization of the usual regression model where the response variable is binary instead of continuous. We use the glm() function with the family argument set to binomial to fit a logistic model; the output is stored in the variable fit1. In a similar fashion, we use a logistic model to model the relationship between winning the World Series (variable Winner.WS) and the talent; the output is in the variable fit2.

fit1 <- glm(
  Winner.Lg ~ Talent,
  data = many_results, family = binomial
)
fit2 <- glm(
  Winner.WS ~ Talent,
  data = many_results, family = binomial
)

A logistic model has the form \[ p = \frac{\exp(a + b T)}{ 1 + \exp(a + b T)}, \] where \(T\) is a team’s talent, \((a, b)\) are the regression coefficients, and \(p\) is the probability of the event.

In the following code, we generate a vector of plausible values of a team’s talent and store them in the vector talent_values. We then compute fitted probabilities of winning the pennant and winning the World Series using the predict() function. (The type = "response" argument will map values of \(a + b T\) to the probability scale.) Then we construct a graph of talent against probability using the geom_line() function where the color of the line corresponds to the type of achievement. The completed graph is displayed in Figure 9.3.

tdf <- tibble(
  Talent =  seq(-0.4, 0.4, length.out = 100)
)
tdf |>
  mutate(
    Pennant = predict(fit1, newdata = tdf, type = "response"),
    `World Series` = predict(fit2, newdata = tdf, type = "response")
  ) |>
  pivot_longer(
    cols = -Talent,
    names_to = "Outcome", 
    values_to = "Probability"
  ) |>
  ggplot(aes(Talent, Probability, color = Outcome)) +
  geom_line() + ylim(0, 1) +
  scale_color_manual(values = crc_fc)
Figure 9.3: Probability of winning the pennant and winning the World Series for teams of different talents.

As expected, the chance of a team winning the pennant (blue line) increases as a function of the talent. An average team with \(T = 0\) has only a small chance of winning the pennant; an excellent team with a talent close to 0.4 has about a 60% chance of winning the pennant. The probabilities of winning the World Series (represented by a red line) are substantially smaller than the chances of winning the pennant. For example, this excellent (\(T = 0.4\)) team has only about a 35% chance of winning the World Series. In fact, it can be demonstrated that the team winning the World Series is likely not to be the team with the best talent (largest value of \(T\)).

9.4 Further Reading

A general description of the Markov chain probability model is contained in Kemeny and Snell (1960). Pankin (1987) and Bukiet, Harold, and Palacios (1997) illustrate the use of Markov chains to model baseball. Chapter 9 of Albert (2017) gives an introductory description of Markov chains and illustrates the construction and use of the transition matrix using 1987 season data. The Bradley-Terry model (Bradley and Terry 1952) is a popular statistical model for paired comparisons. Chapter 9 of Albert and Bennett (2003) describes the application of the Bradley-Terry model for baseball team competition. The use of R in simulation is introduced in Chapter 11 of Albert and Rizzo (2012). Lopez, Matthews, and Baumer (2018) use a Bradley-Terry state space model to address the question of how often the best teams win.

9.5 Exercises

1. A Simple Markov Chain

Suppose one is interested only in the number of outs in an inning. There are four possible states in an inning (0 outs, 1 out, 2 outs, and 3 outs) and you move between these states in each plate appearance. Suppose at each PA, the chance of not increasing the number of outs is 0.3, and the probability of increasing the outs by one is 0.7. The following R code puts the transition probabilities of this Markov chain in a matrix P.

P <- matrix(c(.3, .7,  0,  0, 
               0, .3, .7,  0,
               0,  0, .3, .7, 
               0,  0,  0,  1), 4, 4, byrow = TRUE)
  1. If one multiplies the matrix P by itself P to obtain the matrix P2:
P2 <- P %*% P

The first row of P2 gives the probabilities of moving from 0 outs to each of the four states after two plate appearances. Compute P2. Based on this computation, find the probability of moving from 0 outs to 1 out after two plate appearances.

  1. The fundamental matrix N is computed as
N <- solve(diag(c(1, 1, 1)) - P[-4, -4])

The first row gives the average number of PAs at 0 out, 1 out, and 2 outs in an inning. Compute N and find the average number of PAs in one inning in this model.

2. A Simple Markov Chain, Continued

The following function simulate_half_inning() will simulate the number of plate appearances in a single half-inning of the Markov chain model described in Exercise 1 where the input P is the transition probability matrix.

simulate_half_inning <- function(P) {
  s <- 1 
  path <- NULL 
  while(s < 4){
    s_new <- sample(1:4, 1, prob = P[s, ])
    path <- c(path, s_new)
    s <- s_new
  }
  length(path)
}
  1. Use the map() function to simulate 1000 half-innings of this Markov chain and store the lengths of these simulated innings in the vector lengths.

  2. Using this simulated output, find the probability that a half-inning contains exactly four plate appearances.

  3. Use the simulated output to find the average number of PAs in a half-inning. Compare your answer with the exact answer in Exercise 1, part (b).

3. Simulating a Half Inning

In Section 9.2.4, the expected number of runs as calculated for each one of the 24 possible runners-outs situations using data from the 2016 season. To see how these values can change across seasons, download play-by-play data from Retrosheet for the 1968 season, construct the probability transition matrix, simulate 10,000 half-innings from each of the 24 situations, and compute the run expectancy matrix. Compare this 1968 run expectancy matrix with the one computed using 2016 data.

4. Simulating the 1950 Season

Suppose you are interested in simulating the 1950 regular season for the National League. In this season, the team abbreviations were “PHI”, “BRO”, “NYG”, “BSN”, “STL”, “CIN”, “CHC”, and “PIT” and each team played every other team 22 games (11 games at each park).

  1. Using the function make_schedule(), construct the schedule of games for this NL season.

  2. Suppose the team talents follow a normal distribution with mean 0 and standard deviation 0.25. Using the Bradley-Terry model, assign home win probabilities for all games on the schedule.

  3. Use the rbinom() function to simulate the outcomes of all 616 games of the NL 1950 season.

  4. Compute the number of season wins for all teams in your simulation.

5. Simulating the 1950 Season, Continued

  1. Write a function to perform the simulation scheme described in Exercise 4. Have the function return the team with the largest talent and the team with the most wins. (If there is a tie for the league pennant, have the function return one of the best teams at random.)

  2. Repeat this simulation for 1000 seasons, collecting the most talented team and the most successful team for all seasons.

  3. Based on your simulations, what is the chance that the most talented team wins the pennant?

6. Simulating the World Series

  1. Write a function to simulate a World Series. The input is the probability p that the AL team will defeat the NL team in a single game.

  2. Suppose an AL team with talent \(0.40\) plays a NL team with talent \(0.25\). Using the Bradley-Terry model, determine the probability p that the AL wins a game.

  3. Using the value of p determined in part (b), simulate 1000 World Series and find the probability the AL team wins the World Series.

  4. Repeat parts (b) and (c) for AL and NL teams who have the same talents.


  1. This is often known as the memoryless property.↩︎