Reward Systems in Sports

Who’s the Fairest of Them All?

Benjamin Baumer and Sarah Susnea

Smith College

Sep 27, 2025

I wrote a thing…

On fairness in sports (Baumer 2024)

Equality vs. equity

Examples

Bartneck and Moltchanova (2024)
De Veaux, Plantinga, and Upton (2022)
Sauer, Cseh, and Lenzner (2024)
Pollard, Noble, and Pollard (2022)
Guyon (2018)
Arlegi and Dimitrov (2020)
Scelles, François, and Valenti (2024)
Martens and Starflinger (2022)
…

Fairness in Machine Learning

COMPAS: recidivism scores

race	two_year_recid	risk_score	n
African-American	0	High	345
African-American	0	Low	1169
African-American	1	High	843
African-American	1	Low	818
Caucasian	0	High	106
Caucasian	0	Low	1175
Caucasian	1	High	230
Caucasian	1	Low	592

\(Y\): Did they recidivate (within 2 years)?
\(\hat{Y}\): Did COMPAS label them high risk?
\(A\): Protected class (here, race)

ProPublica (Angwin et al. 2016)

Prediction fails differently for Black Defendants

Error Type	White (%)	African-American (%)
Labeled Higher Risk, But Didn't Re-Offend	23.5	44.9
Labeled Lower Risk, Yet Did Re-Offend	47.7	28.0

❌ COMPAS fails to demonstrate error rate parity
❌ COMPAS fails to demonstrate demographic parity

Northpointe

39-page rebuttal (Dieterich, Mendoza, and Brennan 2016)
ProPublica misrepresents calculations
✅ COMPAS demonstrates predictive parity

\[ \begin{aligned} \Pr(\neg Y | \hat{Y}, W) = 0.409 &\approx \Pr(\neg Y | \hat{Y}, B) = 0.370 \\ \Pr(Y | \neg \hat{Y}, W) = 0.288 &\approx \Pr(Y | \neg \hat{Y}, B) = 0.350 \end{aligned} \]

Statistical criteria for fairness

(Barocas, Hardt, and Narayanan 2023)

“Dozens” of statistical criteria for fairness boil down to 3:

Independence
Separation
Sufficiency

Independence

aka disparate impact

\[ \left| \Pr(\hat{Y} | W) - \Pr(\hat{Y} | B) \right| < \epsilon \,, \] where \(\epsilon\) is a small positive constant (typically \(\epsilon < 0.2\))

❌ COMPAS FAILS:

\[ \left| 0.348 - 0.588 \right| = 0.240 > \epsilon \]

Separation

aka error rate parity

\[ \begin{aligned} \left| \Pr(\hat{Y} | Y, W) - \Pr(\hat{Y} | Y, B) \right| &< \epsilon \\ \left| \Pr(\hat{Y} | \neg Y, W) - \Pr(\hat{Y} | \neg Y, B) \right| &< \epsilon \end{aligned} \]

❌ COMPAS FAILS:

\[ \begin{aligned} \left| 0.523 - 0.720 \right| = 0.197 &< \epsilon \\ \left| 0.235 - 0.449 \right| = 0.214 &> \epsilon \end{aligned} \]
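The separation rates can be recovered from the ProPublica error table shown earlier: the false positive rate is its first row, and the true positive rate is one minus its second row (the false negative rate). A minimal R sketch, taking \(\epsilon = 0.2\) as an assumed tolerance rather than a fixed standard:

```r
# Error rates (percent) copied from the ProPublica table
fpr <- c(White = 23.5, Black = 44.9) / 100  # Pr(Y-hat | not Y): flagged, did not re-offend
fnr <- c(White = 47.7, Black = 28.0) / 100  # Pr(not Y-hat | Y): not flagged, did re-offend
tpr <- 1 - fnr                              # Pr(Y-hat | Y)

eps <- 0.2  # assumed tolerance

# Absolute gap between the two groups for each conditional rate
gaps <- c(
  tpr = unname(abs(tpr["White"] - tpr["Black"])),
  fpr = unname(abs(fpr["White"] - fpr["Black"]))
)

# Separation requires BOTH gaps to fall below eps
separation_ok <- all(gaps < eps)

print(round(gaps, 3))
print(separation_ok)
```

The true positive gap (0.197) sits just under the tolerance, while the false positive gap (0.214) exceeds it, so separation fails on the second condition alone.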

Sufficiency

aka well-calibration

\[ \left| \Pr(Y | \hat{Y}, W) - \Pr(Y | \hat{Y}, B) \right| < \epsilon \]

✅ COMPAS PASSES:

\[ \left| 0.591 - 0.630 \right| = 0.039 < \epsilon \]

Kleinberg’s impossibility theorem

Unless the base rates are the same across the groups (or the classifier predicts perfectly), you can’t satisfy all three criteria simultaneously (Kleinberg, Mullainathan, and Raghavan 2016)
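One way to see the tension between sufficiency and separation, sketched here with Bayes' rule rather than taken from the paper's own proof. Let \(p = \Pr(Y)\) denote a group's base rate (a symbol introduced for this sketch), and assume \(0 < p < 1\) and \(\Pr(Y | \hat{Y}) > 0\):

\[ \begin{aligned} \Pr(Y | \hat{Y}) &= \frac{p \, \Pr(\hat{Y} | Y)}{p \, \Pr(\hat{Y} | Y) + (1 - p) \, \Pr(\hat{Y} | \neg Y)} \\ \Longrightarrow \quad \Pr(\hat{Y} | \neg Y) &= \frac{p}{1 - p} \cdot \frac{1 - \Pr(Y | \hat{Y})}{\Pr(Y | \hat{Y})} \cdot \Pr(\hat{Y} | Y) \end{aligned} \]

If two groups share the same \(\Pr(Y | \hat{Y})\) (sufficiency) and the same \(\Pr(\hat{Y} | Y)\) (one half of separation) but have different \(p\), the right-hand side differs between them unless it is zero. Their false positive rates then differ too, so the other half of separation must fail.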

COMPAS summary

ProPublica: COMPAS is biased because it lacks error rate parity
Northpointe: Yeah, but it’s well-calibrated!
Kleinberg: But since the base rates aren’t the same:
- …you’re both right
- …and you’re both wrong
So is the algorithm fair or not??

Introducing `faireR`

leverages yardstick and mlr3fairness
computes independence, separation, and sufficiency
- mean absolute difference across groups
visualizations
data sets
tidyverse-friendly

Using `faireR`

library(faireR)
library(dplyr)  # for mutate() and group_by()
compas_grp <- compas_binary |>
  mutate(
    y = two_year_recid,
    y_hat = factor(ifelse(decile_score >= 7, 1, 0))
  ) |>
  group_by(race)

# compute fairness for COMPAS
compas_grp |>
  fairness_cube()

# A tibble: 1 × 3
  independence separation sufficiency
         <dbl>      <dbl>       <dbl>
1        0.214      0.186      0.0509
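These three numbers can be checked by hand from the counts table on the COMPAS slide. The base R sketch below is not the `faireR` implementation; it assumes that independence is the gap in \(\Pr(\hat{Y})\), that separation averages the gaps in \(\Pr(\hat{Y} | Y)\) and \(\Pr(\hat{Y} | \neg Y)\), and that sufficiency averages the gaps in \(\Pr(Y | \hat{Y})\) and \(\Pr(Y | \neg \hat{Y})\). The helper `cond_prob()` is a name made up for this example.

```r
# Counts copied from the COMPAS table (risk_score "High" coded as y_hat = 1)
compas_counts <- data.frame(
  race  = rep(c("African-American", "Caucasian"), each = 4),
  y     = rep(c(0, 0, 1, 1), times = 2),
  y_hat = rep(c(1, 0, 1, 0), times = 2),
  n     = c(345, 1169, 843, 818, 106, 1175, 230, 592)
)

# Weighted Pr(event | given) within one group's rows (hypothetical helper)
cond_prob <- function(d, event, given = rep(TRUE, nrow(d))) {
  sum(d$n[event & given]) / sum(d$n[given])
}

# One column of conditional rates per group
rates <- sapply(split(compas_counts, compas_counts$race), function(d) {
  c(
    pos_rate = cond_prob(d, d$y_hat == 1),              # Pr(Y-hat)
    tpr      = cond_prob(d, d$y_hat == 1, d$y == 1),    # Pr(Y-hat | Y)
    fpr      = cond_prob(d, d$y_hat == 1, d$y == 0),    # Pr(Y-hat | not Y)
    ppv      = cond_prob(d, d$y == 1, d$y_hat == 1),    # Pr(Y | Y-hat)
    miss     = cond_prob(d, d$y == 1, d$y_hat == 0)     # Pr(Y | not Y-hat)
  )
})

# Absolute difference between the two groups, then average within each criterion
gap <- abs(rates[, 1] - rates[, 2])
cube <- c(
  independence = gap[["pos_rate"]],
  separation   = mean(gap[c("tpr", "fpr")]),
  sufficiency  = mean(gap[c("ppv", "miss")])
)
print(signif(cube, 3))
```

Under those assumptions the sketch gives 0.214, 0.186, and 0.0509, matching the `fairness_cube()` output.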

Fairness IRL

The narrow view

People with similar qualifications should be treated similarly

individual fairness
Ex: the meritocracy, race-blind admissions, etc.

The broad view

People of equal ability and ambition should have similar chances

fairness across groups
Ex: all schools should be equally well-funded

The middle view

Adjust for past injustice (that caused the differences in qualifications) at the time of opportunity

Until recently, the common interpretation in the US
Ex: race-conscious admissions at the University of Texas (affirmed in Fisher v. University of Texas, 2016); such policies were struck down in 2023 (SFFA v. Harvard)

What does this have to do with sports?

Providing vocabulary, Ex 1

Instead of:

two equal pairs should have as close to an equal chance of winning as possible. (Pollard, Noble, and Pollard 2022)

We have a broad view of fairness
We focus on sufficiency (aka well-calibration)

Providing vocabulary, Ex 2

Instead of:

A swimmer with no arms should be able to compete against a swimmer with one or two arms…If two athletes with the same disability compete, the more skilled and fitter should win. (Bartneck and Moltchanova 2024)

We have a middle view of fairness
Focus on separation and sufficiency

Providing vocabulary, Ex 3

Instead of:

We present two methods to distribute prize money across gender based on the individual performances w.r.t. gender-specific records. We suppose these “across gender distributions” to be fair, as they suitably respect that women generally are slower than men. (Martens and Starflinger 2022)

We have a middle view of fairness
Focus on independence

Called strikes in MLB

Are umpires fair to left-handed hitters (LHH)?

csas_grp <- csas25 |>
  mutate(
    y = factor(is_within_strike_zone),
    y_hat = factor(is_called_strike)
  ) |>
  group_by(stand)

csas_grp |>
  fairness_cube()

# A tibble: 1 × 3
  independence separation sufficiency
         <dbl>      <dbl>       <dbl>
1       0.0104     0.0496      0.0856

\(Y\): Was it a strike?
\(\hat{Y}\): Was it called a strike?
✅ Independence
✅ Separation
✅ Sufficiency

A simple Hall of Fame classifier

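library(tidymodels)
# Assumed specification (not shown on the slide): a parsnip decision tree
tree_spec <- decision_tree(mode = "classification")

hof_dt <- tree_spec |>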
  fit(
    as.factor(inducted) ~ tH + tHR + mvp + gg + tW + tSO + tSV + cy,
    data = candidates
  )

See Mills and Salaga (2011) for a more serious attempt…

Is the classifier fair w.r.t. batters and pitchers?

hof2025 |>
  mutate(is_pitcher = tSO > 100) |>
  group_by(is_pitcher) |>
  fairness_cube()

# A tibble: 1 × 3
  independence separation sufficiency
         <dbl>      <dbl>       <dbl>
1      0.00322    0.00520      0.0323

✅

Is the classifier fair w.r.t. starters and relievers?

hof2025 |>
  mutate(
    is_pitcher = tSO > 100,
    is_reliever = tSV > 50
  ) |>
  filter(is_pitcher) |>
  group_by(is_reliever) |>
  fairness_cube()

# A tibble: 1 × 3
  independence separation sufficiency
         <dbl>      <dbl>       <dbl>
1      0.00446      0.275      0.0821

❌

Ironman Texas

Award prizes to:

Naive: Fastest 20, regardless of gender
Status Quo: Fastest 10 in each gender division
Martens: Fastest 20 in relation to gender-specific world records

ironman_grp <- ironman |>
  mutate(
    fastest20 = factor(dense_rank(overall_time) <= 20),
    status_quo = factor(division_rank <= 10),
    martens = factor(dense_rank(quotient_model) <= ifelse(gender == "Male", 12, 8))
  ) |>
  group_by(gender)
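A toy illustration of the three rules in base R, scaled down to 4 prizes among 8 athletes. The finish times, the record values, and the reading of the quotient as finish time divided by a gender-specific record are all assumptions made for this sketch, not values from the Ironman Texas data. The third rule follows the verbal description ("fastest in relation to gender-specific records") rather than the `martens` column above.

```r
# Hypothetical finish times in minutes (not real results)
toy <- data.frame(
  gender       = rep(c("Male", "Female"), each = 4),
  overall_time = c(470, 475, 480, 490, 515, 535, 540, 560)
)

# Hypothetical gender-specific reference records in minutes
record <- c(Male = 450, Female = 500)
toy$quotient <- toy$overall_time / record[as.character(toy$gender)]

# Rank helper: 1 = best, ties share the lower rank
rank_min <- function(x) rank(x, ties.method = "min")

# Naive: fastest 4 overall, regardless of gender
toy$naive <- rank_min(toy$overall_time) <= 4

# Status quo: fastest 2 within each gender division
toy$status_quo <- ave(toy$overall_time, toy$gender, FUN = rank_min) <= 2

# Record-relative: best 4 quotients overall
toy$relative <- rank_min(toy$quotient) <= 4

# Number of prizes per gender under each rule
prize_counts <- sapply(
  toy[c("naive", "status_quo", "relative")],
  function(won) tapply(won, toy$gender, sum)
)
print(prize_counts)
```

With these made-up numbers the naive rule awards no prizes to women, the status quo splits them 2 and 2, and the record-relative rule lands in between with 3 and 1.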

Texas Ironman naive

Texas Ironman status quo

Texas Ironman proposed

Texas Ironman fairness

ironman_grp |>
  fairness_cube(truth = fastest20, estimate = status_quo)

# A tibble: 1 × 3
  independence separation sufficiency
         <dbl>      <dbl>       <dbl>
1        0.101         NA          NA
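One plausible reading of the `NA` for separation: if one group has no positives in `truth` (for example, no women among the overall fastest 20), then \(\Pr(\hat{Y} | Y)\) for that group is 0/0. Whether this is what happens inside `fairness_cube()` is an assumption; the toy vectors below are hypothetical and only show the arithmetic.

```r
# Hypothetical data: six athletes, no "F" athlete is positive in `truth`
group    <- c("M", "M", "M", "F", "F", "F")
truth    <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
estimate <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

idx <- split(seq_along(group), group)

# Pr(estimate) per group: always defined
pos_rate <- sapply(idx, function(i) mean(estimate[i]))

# Pr(estimate | truth) per group: mean over an empty set is NaN
tpr <- sapply(idx, function(i) mean(estimate[i][truth[i]]))

ind_gap <- abs(pos_rate[["M"]] - pos_rate[["F"]])
sep_gap <- abs(tpr[["M"]] - tpr[["F"]])

print(ind_gap)  # a finite number
print(sep_gap)  # NaN, which R treats as missing
```

The independence gap stays well defined because it conditions on nothing, while any criterion that conditions on an empty cell cannot be computed.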

ironman_grp |>
  fairness_cube(truth = status_quo, estimate = martens)

# A tibble: 1 × 3
  independence separation sufficiency
         <dbl>      <dbl>       <dbl>
1        0.126        0.3       0.118

Future work

Visualize fairness in 2D
- ROC curves
- Calibration plots
Visualize fairness in 3D
Distance metric for overall fairness?
Is mean absolute difference the best measure?
Find additional suitable applications

References

Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Arlegi, Ritxar, and Dinko Dimitrov. 2020. “Fair Elimination-Type Competitions.” European Journal of Operational Research 287 (2): 528–35. https://doi.org/10.1016/j.ejor.2020.03.025.

Barocas, Solon, Moritz Hardt, and Arvind Narayanan. 2023. Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://mitpress.mit.edu/9780262048613/fairness-and-machine-learning/.

Bartneck, Christoph, and Elena Moltchanova. 2024. “Fair World Para Masters Point System for Swimming.” Journal of Quantitative Analysis in Sports 20 (2): 147–77. https://doi.org/10.1515/jqas-2023-0051.

Baumer, Benjamin S. 2024. “Editor’s Note: On Fairness in Sports Analytics.” Journal of Quantitative Analysis in Sports. De Gruyter. https://doi.org/10.1515/jqas-2023-0103.

De Veaux, Richard, Anna Plantinga, and Elizabeth Upton. 2022. “Are the Handicaps Fair? Age and Participation Effects in the Dipsea Race.” Chance 35 (4): 40–49. https://doi.org/10.1080/09332480.2022.2145138.

Dieterich, William, Christina Mendoza, and Tim Brennan. 2016. “COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity Performance of the COMPAS Risk Scales in Broward County.” Northpointe Inc. Research Department. https://go.volarisgroup.com/rs/430-MBX-989/images/ProPublica_Commentary_Final_070616.pdf.

Guyon, Julien. 2018. “What a Fairer 24 Team UEFA Euro Could Look Like.” Journal of Sports Analytics 4 (4): 297–317. https://doi.org/10.3233/JSA-180219.

Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” arXiv Preprint arXiv:1609.05807. https://arxiv.org/abs/1609.05807.

Martens, Maren, and Verena Starflinger. 2022. “Alternative Prize Money Distributions for Higher Gender Equity in Sports.” In International Conference on Operations Research, 333–39. Springer. https://doi.org/10.1007/978-3-031-24907-5_40.

Mills, Brian M, and Steven Salaga. 2011. “Using Tree Ensembles to Analyze National Baseball Hall of Fame Voting Patterns: An Application to Discrimination in BBWAA Voting.” Journal of Quantitative Analysis in Sports 7 (4). https://doi.org/10.2202/1559-0410.1367.

Pollard, Graham, Ken Noble, and Geoff Pollard. 2022. “New Scoring System to Reduce Unfairness in Men’s Doubles.” Journal of Sports Analytics 8 (4): 291–98. https://doi.org/10.3233/JSA-220607.

Sauer, Pascal, Ágnes Cseh, and Pascal Lenzner. 2024. “Improving Ranking Quality and Fairness in Swiss-System Chess Tournaments.” Journal of Quantitative Analysis in Sports 20 (2): 127–46. https://doi.org/10.1515/jqas-2022-0090.

Scelles, Nicolas, Aurélien François, and Maurizio Valenti. 2024. “Impact of the UEFA Nations League on Competitive Balance, Competitive Intensity, and Fairness in European Men’s National Team Football.” International Journal of Sport Policy and Politics 16 (3): 519–37. https://doi.org/10.1080/19406940.2024.2323012.
