Homework

Note

All homework assignments must be completed in Quarto and submitted as a PDF to Moodle by the corresponding Friday at 5 pm.

You must show your code!!

library(tidyverse)
library(Lahman)

HW 1: Due 9/12

Tip

You may find the schema for the Lahman database helpful.

Use the Lahman package to identify when MLB moved to a 154-game schedule. When did it move to a 162-game schedule? Since the advent of the 154-game schedule, in which years were there interruptions in play? Can you identify the cause of each?

Use the Lahman package to rank the 10 best and 10 worst teams since the advent of the 154-game schedule, as measured by winning percentage.
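As a rough sketch of the ranking step, winning percentage can be computed as wins divided by wins plus losses and then sorted. The data frame below is made up for illustration (the team names and records are hypothetical, not Lahman data), and base R is used so that it runs without any packages.

```r
# Hypothetical season records (made-up numbers, not Lahman data)
records <- data.frame(
  yearID = c(2001, 2001, 2002, 2002),
  name   = c("Team A", "Team B", "Team A", "Team B"),
  W      = c(100, 60, 81, 95),
  L      = c(62, 102, 81, 67)
)

# Winning percentage as wins divided by decisions (ties are ignored here)
records$wpct <- records$W / (records$W + records$L)

# Sort from best to worst; head() and tail() then give the two extremes
ranked <- records[order(-records$wpct), ]
head(ranked, 2)
tail(ranked, 2)
```

The same idea carries over to a dplyr pipeline with mutate() and arrange().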

Use the Lahman package to identify all 8 players in Major League history with at least 300 home runs and at least 300 stolen bases in their careers.

Use the Lahman package to identify all 10 pitchers in Major League history with at least 300 wins and at least 3000 strikeouts in their careers.
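Both career-threshold questions follow the same pattern: sum each statistic over a player's seasons, then keep the players who clear both cutoffs. A minimal sketch on made-up data (the player IDs, numbers, and the cutoff of 50 are all hypothetical) might look like this:

```r
# Hypothetical per-season batting lines (made-up players and numbers)
batting <- data.frame(
  playerID = c("p1", "p1", "p2", "p2", "p3"),
  HR = c(30, 25, 10, 5, 40),
  SB = c(20, 35, 40, 30, 2)
)

# Sum each statistic over all of a player's seasons
career <- aggregate(cbind(HR, SB) ~ playerID, data = batting, FUN = sum)

# Keep players who clear both thresholds (50 is purely illustrative)
both <- subset(career, HR >= 50 & SB >= 50)
both
```

With dplyr, group_by(), summarize(), and filter() express the same three steps.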

Use the Teams table from the Lahman package to produce the following table for all teams with the words “New York” or “Brooklyn” in their names.

franchID	num_teamids	names	num_seasons	first	last	G	wpct	WS
NYY	1	New York Highlanders,New York Yankees	122	1903	2024	19014	0.5694811	27
NYM	1	New York Mets	63	1962	2024	9972	0.4833400	2
SFG	1	New York Gothams,New York Giants	75	1883	1957	11115	0.5533060	7
LAD	2	Brooklyn Atlantics,Brooklyn Grays,Brooklyn Bridegrooms,Brooklyn Grooms,Brooklyn Superbas,Brooklyn Dodgers,Brooklyn Robins	74	1884	1957	11031	0.5157740	1
BTT	1	Brooklyn Tip-Tops	2	1914	1915	310	0.4803922	0
BWW	1	Brooklyn Ward’s Wonders	1	1890	1890	133	0.5757576	0
NYI	1	New York Giants	1	1890	1890	132	0.5648855	0
BRG	1	Brooklyn Gladiators	1	1890	1890	100	0.2653061	0
NYP	1	New York Metropolitans	5	1883	1887	592	0.4663212	0
NYU	1	New York Mutuals	1	1876	1876	57	0.3750000	0
NNA	1	New York Mutuals	5	1871	1875	278	0.5531136	0
BRA	1	Brooklyn Atlantics	4	1872	1875	192	0.2631579	0
ECK	1	Brooklyn Eckfords	1	1872	1872	29	0.1034483	0

HW 2: Due 9/19

ABDWR, Exercise 4.7.1

Tip

Hint: See https://beanumber.github.io/abdwr3e/02-intro.html#iterating-using-map

ABDWR, Exercise 4.7.2

Fit a regression model to the Teams data since 1962 to estimate weights (i.e., run values) for the batting events included in eXtrapolated Runs Basic. Compare your results to the values shown. Make sure that you compute everything per game!
1. Report the Root Mean Squared Error.
2. Which run values are most similar?
3. Which run values are most different? How meaningful are these differences?
4. For the run values that differ most, hypothesize why they might differ.
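The mechanics of fitting the model and reporting its RMSE can be sketched on simulated data. Everything below is an assumption made for illustration: the two predictors, the weights used to generate the response, and the choice to drop the intercept so that each coefficient reads as a per-event run value.

```r
set.seed(1)
n <- 200

# Simulated per-game event rates (hypothetical, not real team data)
games <- data.frame(
  x1b = rnorm(n, mean = 6, sd = 0.5),
  hr  = rnorm(n, mean = 1, sd = 0.2)
)

# Runs per game generated from assumed weights plus noise
games$r <- 0.5 * games$x1b + 1.4 * games$hr + rnorm(n, mean = 0, sd = 0.3)

# No-intercept linear model: each coefficient acts as a run value per event
fit <- lm(r ~ 0 + x1b + hr, data = games)
coef(fit)

# Root mean squared error of the fitted values
rmse <- sqrt(mean(residuals(fit)^2))
rmse
```

Whether to include an intercept is a modeling choice. Dropping it mirrors the form of a linear run estimator, but it is not the only defensible option.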

For the same data set that you used above, compute:
- batting average (BAVG)
- on-base percentage (OBP)
- slugging percentage (SLG)
- OPS
- Runs Created (basic) (RC)
- eXtrapolated Runs Basic (XR)

Compute the Pearson correlation coefficient between each of these run estimators and (actual) runs per game. Which metrics correlate most closely with runs per game?
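The standard definitions of the first five metrics can be sketched as follows. The team totals are made up. The column names mirror those in Lahman's Teams table, but that mapping is an assumption to check against your own data. XR is left out because its weights should come from the course text.

```r
# Four hypothetical team-seasons (made-up totals, not Lahman data)
totals <- data.frame(
  G   = c(162, 162, 162, 162),
  R   = c(800, 650, 750, 600),
  AB  = c(5500, 5400, 5600, 5450),
  H   = c(1500, 1350, 1450, 1300),
  X2B = c(300, 250, 280, 240),
  X3B = c(30, 20, 25, 15),
  HR  = c(200, 150, 180, 120),
  BB  = c(550, 450, 500, 420),
  HBP = c(50, 40, 45, 35),
  SF  = c(50, 40, 45, 35)
)

# Total bases: every hit counts once, extra-base hits add their extra bases
totals$TB   <- with(totals, H + X2B + 2 * X3B + 3 * HR)
totals$BAVG <- with(totals, H / AB)
totals$OBP  <- with(totals, (H + BB + HBP) / (AB + BB + HBP + SF))
totals$SLG  <- with(totals, TB / AB)
totals$OPS  <- totals$OBP + totals$SLG
# Basic Runs Created: (H + BB) * TB / (AB + BB)
totals$RC   <- with(totals, (H + BB) * TB / (AB + BB))

# Pearson correlation of one estimator with runs per game
r_ops <- cor(totals$OPS, totals$R / totals$G)
r_ops
```

Counting statistics such as RC need to be divided by G before they are compared with runs per game. Rate statistics such as OBP do not.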

HW 3: Due 9/26

Tip

You may use the following code to generate the expected run matrix for the 2016 season using Retrosheet play-by-play data.

library(tidyverse)
library(abdwr3edata)

half_innings <- retro2016 |>
  retrosheet_add_states() |>
  group_by(game_id, inn_ct, bat_home_id) |>
  summarize(
    outs_inning = sum(event_outs_ct), 
    runs_inning = sum(runs_scored),
    runs_start = first(away_score_ct + home_score_ct),
    max_runs = runs_inning + runs_start
  )

changes2016 <- retro2016 |>
  retrosheet_add_states() |>
  inner_join(half_innings, by = join_by(game_id, inn_ct, bat_home_id)) |>
  mutate(runs_roi = max_runs - (away_score_ct + home_score_ct)) |>
  filter(state != new_state | runs_scored > 0) |>
  filter(outs_inning == 3)

erm2016 <- changes2016 |> 
  group_by(bases, outs_ct) |>
  summarize(exp_run_value = mean(runs_roi))

`summarise()` has grouped output by 'bases'. You can override using the
`.groups` argument.

erm2016 |>
  pivot_wider(
    names_from = outs_ct, 
    values_from = exp_run_value, 
    names_prefix = "Outs="
  ) |>
  arrange(`Outs=0`) |>
  knitr::kable(digits = 3)

bases	Outs=0	Outs=1	Outs=2
000	0.498	0.268	0.106
100	0.858	0.512	0.220
010	1.133	0.673	0.312
001	1.347	0.937	0.372
110	1.445	0.921	0.414
101	1.723	1.196	0.478
011	1.929	1.358	0.548
111	2.106	1.537	0.695

HW 4: Due 10/3

Tip

It’s just logistic regression. Use the glm() function with family = binomial, or ask a #question on Slack about generalized linear models.
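For reference, a minimal sketch of a logistic regression fit with glm() is shown below. The predictor, the outcome, and the coefficients used to simulate them are all hypothetical. Only glm(), family = binomial, and predict() with type = "response" are standard R.

```r
set.seed(42)
n <- 500

# Simulated predictor and binary outcome (hypothetical, not Statcast data)
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)          # assumed true success probability
y <- rbinom(n, size = 1, prob = p)
d <- data.frame(x = x, y = y)

# Logistic regression: models the log-odds of y = 1 as linear in x
fit <- glm(y ~ x, data = d, family = binomial)
coef(fit)

# Predicted probabilities for new predictor values
probs <- predict(fit, newdata = data.frame(x = c(-1, 0, 1)), type = "response")
probs
```

Setting type = "response" returns probabilities. The default returns values on the log-odds scale.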

Use your creativity to create one data graphic like the ones shown in Exploring Statcast data. Try to make this data graphic as compelling as possible.

HW 5: Due 10/10

Use the hoopR (for the NBA) or the wehoop (for the WNBA) package to create a compelling data graphic. You could pursue something more statistical (like the scatterplots for the Four Factors) or something more visual (like the shot charts).
Sports like field hockey, ice hockey, and soccer share important properties with basketball. Choose one of those sports (or another sport that you think would also fit), and describe how the concepts from the Four Factors would (or wouldn’t) translate to that sport.