library(tidyverse)
library(Lahman)
Homework
Note
All homework assignments must be completed in Quarto and submitted as a PDF to Moodle by the corresponding Friday at 5 pm.
You must show your code!!
HW 1: Due 9/12
Tip
You may find the schema for the Lahman database helpful.
- Use the Lahman package to identify when MLB moved to a 154-game schedule. When did it move to a 162-game schedule? Since the advent of the 154-game schedule, in which years were there interruptions in play? Can you identify the cause of each?
- Since the advent of the 154-game schedule, use the Lahman package to rank the 10 best and 10 worst teams of all time, as measured by winning percentage.
- Use the Lahman package to identify all 8 players in Major League history with at least 300 home runs and at least 300 stolen bases in their careers.
- Use the Lahman package to identify all 10 pitchers in Major League history with at least 300 wins and at least 3000 strikeouts in their careers.
- Use the Teams table from the Lahman package to produce the following table for all teams with the words “New York” or “Brooklyn” in their names.
franchID | num_teamids | names | num_seasons | first | last | G | wpct | WS |
---|---|---|---|---|---|---|---|---|
NYY | 1 | New York Highlanders,New York Yankees | 122 | 1903 | 2024 | 19014 | 0.5694811 | 27 |
NYM | 1 | New York Mets | 63 | 1962 | 2024 | 9972 | 0.4833400 | 2 |
SFG | 1 | New York Gothams,New York Giants | 75 | 1883 | 1957 | 11115 | 0.5533060 | 7 |
LAD | 2 | Brooklyn Atlantics,Brooklyn Grays,Brooklyn Bridegrooms,Brooklyn Grooms,Brooklyn Superbas,Brooklyn Dodgers,Brooklyn Robins | 74 | 1884 | 1957 | 11031 | 0.5157740 | 1 |
BTT | 1 | Brooklyn Tip-Tops | 2 | 1914 | 1915 | 310 | 0.4803922 | 0 |
BWW | 1 | Brooklyn Ward’s Wonders | 1 | 1890 | 1890 | 133 | 0.5757576 | 0 |
NYI | 1 | New York Giants | 1 | 1890 | 1890 | 132 | 0.5648855 | 0 |
BRG | 1 | Brooklyn Gladiators | 1 | 1890 | 1890 | 100 | 0.2653061 | 0 |
NYP | 1 | New York Metropolitans | 5 | 1883 | 1887 | 592 | 0.4663212 | 0 |
NYU | 1 | New York Mutuals | 1 | 1876 | 1876 | 57 | 0.3750000 | 0 |
NNA | 1 | New York Mutuals | 5 | 1871 | 1875 | 278 | 0.5531136 | 0 |
BRA | 1 | Brooklyn Atlantics | 4 | 1872 | 1875 | 192 | 0.2631579 | 0 |
ECK | 1 | Brooklyn Eckfords | 1 | 1872 | 1872 | 29 | 0.1034483 | 0 |
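The grouped summaries asked for in HW 1 (winning percentage, number of seasons, first and last year) all follow the same group-then-summarize pattern. Here is a minimal sketch of that pattern, assuming dplyr is available. The toy_teams data frame and its values are made up for illustration; only the column names (franchID, teamID, name, yearID, W, L) follow the Lahman Teams table. Defining winning percentage as total wins over total decisions is an assumption, not necessarily how the table above was computed.

```r
library(dplyr)

# Toy stand-in for Lahman::Teams. The teams and records are made up.
toy_teams <- tibble(
  franchID = c("AAA", "AAA", "BBB", "BBB"),
  teamID   = c("AA1", "AA2", "BB1", "BB1"),
  name     = c("Alpha Aces", "Alpha Arrows", "Beta Bears", "Beta Bears"),
  yearID   = c(1901, 1902, 1901, 1902),
  W        = c(60, 70, 50, 40),
  L        = c(40, 30, 50, 60)
)

franchise_summary <- toy_teams |>
  group_by(franchID) |>
  summarize(
    num_teamids = n_distinct(teamID),                 # distinct team IDs per franchise
    names       = paste(unique(name), collapse = ","), # all names the franchise used
    num_seasons = n_distinct(yearID),
    wpct        = sum(W) / sum(W + L),                 # assumed definition: wins / decisions
    first       = min(yearID),
    last        = max(yearID)
  ) |>
  arrange(desc(wpct))

franchise_summary
```

With the real Teams table you would first filter to the rows of interest, for example by matching on the name column.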
HW 2: Due 9/19
- Fit a regression model to the Teams data since 1962 to estimate weights (i.e., run values) for the batting events included in eXtrapolated Runs Basic. Compare your results to the values shown. Make sure that you compute everything per game!
  - Report the Root Mean Squared Error.
  - Which run values are most similar?
  - Which run values are most different? How meaningful are these differences?
  - For those run values that are most different, hypothesize why they might be different.
- For the same data set that you used above, compute:
  - batting average (BAVG)
  - on-base percentage (OBP)
  - slugging percentage (SLG)
  - OPS
  - Runs Created (basic) (RC)
  - eXtrapolated Runs Basic (XR)

  Compute the Pearson correlation coefficient between each of these run estimators and (actual) runs per game. Which metrics correlate most closely with runs per game?
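The mechanics of HW 2 (per-game rates, a linear fit, RMSE, and correlations with runs per game) can be sketched on simulated data. Everything in toy below is made up, including the weights used to generate runs; only the column names follow the Lahman Teams table. The formulas for BAVG, OBP, SLG, OPS, and basic Runs Created are the standard definitions. XR is left out because its weights come from the values shown in the course materials.

```r
library(dplyr)

# Simulated stand-in for Lahman::Teams. All counts are made up.
set.seed(1)
n_seasons <- 200
toy <- tibble(
  G   = 162,
  AB  = round(runif(n_seasons, 5300, 5700)),
  X1B = round(runif(n_seasons, 850, 1050)),
  X2B = round(runif(n_seasons, 220, 320)),
  X3B = round(runif(n_seasons, 15, 50)),
  HR  = round(runif(n_seasons, 100, 250)),
  BB  = round(runif(n_seasons, 400, 650)),
  HBP = round(runif(n_seasons, 30, 80)),
  SF  = round(runif(n_seasons, 30, 60))
) |>
  mutate(
    H = X1B + X2B + X3B + HR,
    # Hypothetical weights, used only to generate toy run totals.
    R = round(0.5 * X1B + 0.7 * X2B + 1.0 * X3B + 1.4 * HR + 0.3 * BB -
      0.1 * (AB - H) + rnorm(n_seasons, sd = 15))
  )

# Convert every count to a per-game rate before fitting.
per_game <- toy |>
  mutate(outs = AB - H) |>
  mutate(across(c(R, X1B, X2B, X3B, HR, BB, outs), \(x) x / G))

fit <- lm(R ~ X1B + X2B + X3B + HR + BB + outs, data = per_game)
rmse <- sqrt(mean(residuals(fit)^2))

# Standard definitions of the rate statistics and basic Runs Created.
metrics <- toy |>
  mutate(
    TB    = H + X2B + 2 * X3B + 3 * HR,
    BAVG  = H / AB,
    OBP   = (H + BB + HBP) / (AB + BB + HBP + SF),
    SLG   = TB / AB,
    OPS   = OBP + SLG,
    RC_pg = (H + BB) * TB / (AB + BB) / G,
    R_pg  = R / G
  )

# Pearson correlation (the default for cor()) of each metric with runs per game.
cors <- sapply(metrics[c("BAVG", "OBP", "SLG", "OPS", "RC_pg")], cor, y = metrics$R_pg)

coef(fit)
rmse
sort(cors, decreasing = TRUE)
```

Because the toy runs are generated from a linear rule, the fitted coefficients land near the made-up weights. With real data the interesting part is how far your estimates fall from the published run values.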
HW 3: Due 9/26
Tip
You may use the following code to generate the expected run matrix for the 2016 season using Retrosheet play-by-play data.
library(tidyverse)
library(abdwr3edata)

# Total runs and outs for each half-inning
half_innings <- retro2016 |>
  retrosheet_add_states() |>
  group_by(game_id, inn_ct, bat_home_id) |>
  summarize(
    outs_inning = sum(event_outs_ct),
    runs_inning = sum(runs_scored),
    runs_start = first(away_score_ct + home_score_ct),
    max_runs = runs_inning + runs_start
  )

# Keep plays that change the state or score, in complete half-innings only
changes2016 <- retro2016 |>
  retrosheet_add_states() |>
  inner_join(half_innings, by = join_by(game_id, inn_ct, bat_home_id)) |>
  mutate(runs_roi = max_runs - (away_score_ct + home_score_ct)) |>
  filter(state != new_state | runs_scored > 0) |>
  filter(outs_inning == 3)

# Average runs scored in the remainder of the inning, by base-out state
erm2016 <- changes2016 |>
  group_by(bases, outs_ct) |>
  summarize(exp_run_value = mean(runs_roi))
`summarise()` has grouped output by 'bases'. You can override using the
`.groups` argument.
erm2016 |>
  pivot_wider(
    names_from = outs_ct,
    values_from = exp_run_value,
    names_prefix = "Outs="
  ) |>
  arrange(`Outs=0`) |>
  knitr::kable(digits = 3)
bases | Outs=0 | Outs=1 | Outs=2 |
---|---|---|---|
000 | 0.498 | 0.268 | 0.106 |
100 | 0.858 | 0.512 | 0.220 |
010 | 1.133 | 0.673 | 0.312 |
001 | 1.347 | 0.937 | 0.372 |
110 | 1.445 | 0.921 | 0.414 |
101 | 1.723 | 1.196 | 0.478 |
011 | 1.929 | 1.358 | 0.548 |
111 | 2.106 | 1.537 | 0.695 |
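One common use of an expected run matrix is to assign a run value to a single play: the run expectancy after the play, minus the run expectancy before it, plus any runs that scored. The sketch below hard-codes the 2016 values from the table above. The run_value() helper and its argument names are hypothetical, written for this illustration rather than taken from abdwr3edata.

```r
# Expected run matrix for 2016, copied from the table above.
erm <- matrix(
  c(0.498, 0.268, 0.106,
    0.858, 0.512, 0.220,
    1.133, 0.673, 0.312,
    1.347, 0.937, 0.372,
    1.445, 0.921, 0.414,
    1.723, 1.196, 0.478,
    1.929, 1.358, 0.548,
    2.106, 1.537, 0.695),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    bases = c("000", "100", "010", "001", "110", "101", "011", "111"),
    outs  = c("0", "1", "2")
  )
)

# Hypothetical helper: run value of one play.
run_value <- function(bases_before, outs_before, bases_after, outs_after, runs_scored = 0) {
  re_before <- erm[bases_before, as.character(outs_before)]
  # After the third out the half-inning is over, so no more runs are expected.
  re_after <- if (outs_after >= 3) 0 else erm[bases_after, as.character(outs_after)]
  unname(re_after - re_before + runs_scored)
}

# Leadoff single: bases empty with 0 outs -> runner on first with 0 outs.
leadoff_single <- run_value("000", 0, "100", 0)

# Inning-ending strikeout with the bases empty and 2 outs.
final_out <- run_value("000", 2, "000", 3)

# Solo home run with the bases empty and 1 out.
solo_hr <- run_value("000", 1, "000", 1, runs_scored = 1)
```

Under these values the leadoff single is worth 0.858 - 0.498 = 0.360 runs, and the inning-ending strikeout costs 0.106 runs.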
HW 4: Due 10/3
Tip
It’s just logistic regression. Use the glm() function with family = binomial, or ask a #question on Slack about generalized linear models.
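As a minimal sketch of the glm() call mentioned in the tip, the example below fits a logistic regression to simulated data. The data frame, the predictor x, and the coefficients used to simulate the outcome are all made up. In practice x might be something like a Statcast measurement and y a binary outcome, but that pairing is an assumption for illustration.

```r
# Simulated data: a made-up numeric predictor and a binary outcome.
set.seed(42)
n_obs <- 500
toy <- data.frame(x = runif(n_obs, 60, 110))
toy$y <- rbinom(n_obs, size = 1, prob = plogis(-8 + 0.09 * toy$x))

# Logistic regression: binary response, logit link (the binomial default).
fit <- glm(y ~ x, data = toy, family = binomial)

# Coefficients are on the log-odds scale.
coef(fit)

# type = "response" returns predicted probabilities instead of log-odds.
pred <- predict(fit, newdata = data.frame(x = c(80, 100)), type = "response")
pred
```

Exponentiating a slope coefficient gives the multiplicative change in the odds for a one-unit increase in the predictor.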
- Use your creativity to create one data graphic like the ones shown in Exploring Statcast data. Try to make this data graphic as compelling as possible.
HW 5: Due 10/10
- Use the hoopR (for the NBA) or the wehoop (for the WNBA) package to create a compelling data graphic. You could pursue something more statistical (like the scatterplots for the Four Factors) or something more visual (like the shot charts).
- Sports like field hockey, ice hockey, and soccer share important properties with basketball. Choose one of those sports (or another sport that you think would also fit), and describe how the concepts from the Four Factors would (or wouldn’t) translate to that sport.
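For reference when thinking about how the Four Factors translate, here is a small sketch that computes them from team box-score totals using the commonly cited definitions. The four_factors() function and the box-score numbers are made up for illustration and are not part of hoopR or wehoop. Free throw rate is computed here as made free throws per field goal attempt, while some sources use free throw attempts instead.

```r
# Hypothetical helper: the Four Factors from team totals.
four_factors <- function(fg, fg3, fga, ft, fta, tov, orb, opp_drb) {
  c(
    efg_pct = (fg + 0.5 * fg3) / fga,          # effective field goal percentage
    tov_pct = tov / (fga + 0.44 * fta + tov),  # turnovers per estimated possession
    orb_pct = orb / (orb + opp_drb),           # share of available offensive rebounds
    ft_rate = ft / fga                         # made free throws per field goal attempt
  )
}

# Made-up single-game team totals.
ff <- four_factors(
  fg = 40, fg3 = 10, fga = 80,
  ft = 20, fta = 25,
  tov = 12, orb = 10, opp_drb = 30
)
round(ff, 3)
```

Each factor is a ratio of an event to the opportunities for that event, which is a useful lens when asking whether a similar quantity exists in another sport.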