Homework

Note

All homework assignments must be completed in Quarto and submitted as a PDF to Moodle by the corresponding Friday at 5 pm.

You must show your code!!

library(tidyverse)
library(Lahman)

HW 1: Due 9/12

Tip

You may find the schema for the Lahman database helpful.

  1. Use the Lahman package to identify when MLB moved to a 154-game schedule. When did it move to a 162-game schedule? Since the advent of the 154-game schedule, in which years were there interruptions in play? Can you identify the cause of each?
  2. Since the advent of the 154-game schedule, use the Lahman package to rank the 10 best and 10 worst teams of all time, as measured by winning percentage.
  3. Use the Lahman package to identify all 8 players in Major League history with at least 300 home runs and at least 300 stolen bases in their careers.
  4. Use the Lahman package to identify all 10 pitchers in Major League history with at least 300 wins and at least 3000 strikeouts in their careers.
  5. Use the Teams table from the Lahman package to produce the following table for all teams with the words “New York” or “Brooklyn” in their names.

| franchID | num_teamids | names | num_seasons | first | last | G | wpct | WS |
|---|---|---|---|---|---|---|---|---|
| NYY | 1 | New York Highlanders,New York Yankees | 122 | 1903 | 2024 | 19014 | 0.5694811 | 27 |
| NYM | 1 | New York Mets | 63 | 1962 | 2024 | 9972 | 0.4833400 | 2 |
| SFG | 1 | New York Gothams,New York Giants | 75 | 1883 | 1957 | 11115 | 0.5533060 | 7 |
| LAD | 2 | Brooklyn Atlantics,Brooklyn Grays,Brooklyn Bridegrooms,Brooklyn Grooms,Brooklyn Superbas,Brooklyn Dodgers,Brooklyn Robins | 74 | 1884 | 1957 | 11031 | 0.5157740 | 1 |
| BTT | 1 | Brooklyn Tip-Tops | 2 | 1914 | 1915 | 310 | 0.4803922 | 0 |
| BWW | 1 | Brooklyn Ward’s Wonders | 1 | 1890 | 1890 | 133 | 0.5757576 | 0 |
| NYI | 1 | New York Giants | 1 | 1890 | 1890 | 132 | 0.5648855 | 0 |
| BRG | 1 | Brooklyn Gladiators | 1 | 1890 | 1890 | 100 | 0.2653061 | 0 |
| NYP | 1 | New York Metropolitans | 5 | 1883 | 1887 | 592 | 0.4663212 | 0 |
| NYU | 1 | New York Mutuals | 1 | 1876 | 1876 | 57 | 0.3750000 | 0 |
| NNA | 1 | New York Mutuals | 5 | 1871 | 1875 | 278 | 0.5531136 | 0 |
| BRA | 1 | Brooklyn Atlantics | 4 | 1872 | 1875 | 192 | 0.2631579 | 0 |
| ECK | 1 | Brooklyn Eckfords | 1 | 1872 | 1872 | 29 | 0.1034483 | 0 |

HW 2: Due 9/19

  1. ABDWR, Exercise 4.7.1
  2. ABDWR, Exercise 4.7.2
  3. Fit a regression model to the Teams data since 1962 to estimate weights (i.e., run values) for the batting events included in eXtrapolated Runs Basic. Compare your results to the values shown. Make sure that you compute everything per game!
    1. Report the Root Mean Squared Error.
    2. Which run values are most similar?
    3. Which run values are most different? How meaningful are these differences?
    4. For those run values that are most different, hypothesize why they might differ.
  4. For the same data set that you used above, compute:

HW 3: Due 9/26

Tip

You may use the following code to generate the expected run matrix for the 2016 season using Retrosheet play-by-play data.

library(tidyverse)
library(abdwr3edata)

half_innings <- retro2016 |>
  retrosheet_add_states() |>
  group_by(game_id, inn_ct, bat_home_id) |>
  summarize(
    outs_inning = sum(event_outs_ct), 
    runs_inning = sum(runs_scored),
    runs_start = first(away_score_ct + home_score_ct),
    max_runs = runs_inning + runs_start
  )

changes2016 <- retro2016 |>
  retrosheet_add_states() |>
  inner_join(half_innings, by = join_by(game_id, inn_ct, bat_home_id)) |>
  mutate(runs_roi = max_runs - (away_score_ct + home_score_ct)) |>
  filter(state != new_state | runs_scored > 0) |>
  filter(outs_inning == 3)

# Average the runs scored in the rest of the inning for each base-out state.
# .groups = "drop" returns an ungrouped tibble and silences the summarise() message.
erm2016 <- changes2016 |>
  group_by(bases, outs_ct) |>
  summarize(exp_run_value = mean(runs_roi), .groups = "drop")

erm2016 |>
  pivot_wider(
    names_from = outs_ct, 
    values_from = exp_run_value, 
    names_prefix = "Outs="
  ) |>
  arrange(`Outs=0`) |>
  knitr::kable(digits = 3)
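
| bases | Outs=0 | Outs=1 | Outs=2 |
|---|---|---|---|
| 000 | 0.498 | 0.268 | 0.106 |
| 100 | 0.858 | 0.512 | 0.220 |
| 010 | 1.133 | 0.673 | 0.312 |
| 001 | 1.347 | 0.937 | 0.372 |
| 110 | 1.445 | 0.921 | 0.414 |
| 101 | 1.723 | 1.196 | 0.478 |
| 011 | 1.929 | 1.358 | 0.548 |
| 111 | 2.106 | 1.537 | 0.695 |
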
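
As a rough illustration of how the matrix above is typically used, the sketch below computes the run value of a play as the expected runs after the play, minus the expected runs before it, plus any runs scored on the play. It uses base R only and hard-codes the 2016 values shown above. The names `erm` and `run_value()` are hypothetical helpers for this example and are not part of abdwr3edata.

```r
# Expected runs copied from the 2016 matrix above
# (rows: base state, columns: number of outs).
erm <- matrix(
  c(0.498, 0.268, 0.106,
    0.858, 0.512, 0.220,
    1.133, 0.673, 0.312,
    1.347, 0.937, 0.372,
    1.445, 0.921, 0.414,
    1.723, 1.196, 0.478,
    1.929, 1.358, 0.548,
    2.106, 1.537, 0.695),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("000", "100", "010", "001", "110", "101", "011", "111"),
    c("0", "1", "2")
  )
)

# Hypothetical helper: run value = RE(after) - RE(before) + runs scored.
# A state with three outs ends the half-inning, so its expected runs are 0.
run_value <- function(bases_before, outs_before,
                      bases_after, outs_after, runs = 0) {
  re_before <- erm[bases_before, as.character(outs_before)]
  re_after <- if (outs_after >= 3) 0 else erm[bases_after, as.character(outs_after)]
  unname(re_after - re_before + runs)
}

# Leadoff single: bases empty, 0 outs -> runner in the "100" state, 0 outs.
leadoff_single <- run_value("000", 0, "100", 0)

# Solo home run with the bases empty: the state is unchanged, one run scores.
solo_hr <- run_value("000", 0, "000", 0, runs = 1)

# Inning-ending out with the bases empty and two outs.
third_out <- run_value("000", 2, "000", 3)
```

With real play-by-play data you would join the matrix to each play's starting and ending state rather than look values up one at a time; the lookup here is only meant to make the arithmetic concrete.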
  1. ABDWR, Exercise 5.11.1
  2. ABDWR, Exercise 5.11.3
  3. ABDWR, Exercise 5.11.4
  4. ABDWR, Exercise 5.11.5

HW 4: Due 10/3

  1. ABDWR, Exercise 13.5.2
  2. ABDWR, Exercise 13.5.3
Tip

It’s just logistic regression. Use the glm() function with family = binomial, or ask a #question on Slack about generalized linear models.
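
For anyone who has not fit one before, here is a minimal sketch of the `glm()` call pattern on simulated data. It is not a solution to either exercise: the variables `x` and `y` and the coefficients used to simulate them are made up for illustration, and your model will use the variables named in the exercise.

```r
# Simulated data: a binary outcome y whose probability falls as x grows.
# All names and parameter values here are hypothetical.
set.seed(1)
n <- 500
x <- runif(n, min = 0, max = 30)
y <- rbinom(n, size = 1, prob = plogis(2 - 0.15 * x))
toy <- data.frame(x = x, y = y)

# Logistic regression: a GLM with a binomial family (logit link by default).
fit <- glm(y ~ x, data = toy, family = binomial)

# Coefficients are on the log-odds scale.
coef(fit)

# type = "response" converts predictions back to probabilities.
p_hat <- predict(fit, newdata = data.frame(x = c(5, 25)), type = "response")
p_hat
```

Calling `summary(fit)` gives standard errors and z-tests for each coefficient, and `exp(coef(fit))` expresses the coefficients as odds ratios.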

  1. Use your creativity to create one data graphic like the ones shown in Exploring Statcast data. Try to make this data graphic as compelling as possible.

HW 5: Due 10/10

  1. Use the hoopR (for the NBA) or the wehoop (for the WNBA) package to create a compelling data graphic. You could pursue something more statistical (like the scatterplots for the Four Factors) or something more visual (like the shot charts).
  2. Sports like field hockey, ice hockey, and soccer share important properties with basketball. Choose one of those sports (or another sport that you think would also fit), and describe how the concepts from the Four Factors would (or wouldn’t) translate to that sport.