Inference for two-way tables

IMS, Ch. 18

Smith College

Apr 10, 2026

Inference for two-way tables

Recall: Inference for two proportions

Method	null dist.	sampling dist.
1: probability	?	?
2: simulation	randomization test (centered at \(0\))	two bootstraps (centered at \(\hat{p}_1 - \hat{p}_2\))
3: normal approx.	\(N(0, SE_{pool})\)	\(N \left( \hat{p}_1 - \hat{p}_2, SE_{\hat{p}_1 - \hat{p}_2} \right)\)

See IMS, Chapter 17

Beyond the binary

Difference in proportions:
- binary response
- binary explanatory
Two-way tables:
- categorical response
- categorical explanatory
- 🎊 either/both can have more than two levels!

Inference for two-way tables

Method	null dist.	sampling dist.
1: probability	hypergeometric (Fisher’s exact test)	`NA`
2: simulation	permutation test (starting at \(0\))	`NA`
3: \(\chi^2\) approx.	\(\chi^2 (k = d.f.)\)	`NA`

See IMS, Chapter 18

R.A. Fisher

“a genius who almost single-handedly created the foundations for modern statistical science”
“the single most important figure in 20th century statistics”
a racist, eugenicist, and Nazi sympathizer
renaming of COPSS Award

How Eugenics Shaped Statistics

Statistical thinking and eugenicist thinking are, in fact, deeply intertwined, and many of the theoretical problems with methods like significance testing—first developed to identify racial differences—are remnants of their original purpose, to support eugenics.

see also Kennedy-Shaffer (2024)

The DREAM Act

library(tidyverse)
library(openintro)
dream |>
  janitor::tabyl(ideology, stance) |>
  janitor::adorn_totals(where = c("row", "col"))

     ideology  No Not sure Yes Total
 Conservative 151       35 186   372
      Liberal  52        9 114   175
     Moderate 161       28 174   363
        Total 364       72 474   910

Setup

\(H_0\): stance independent of ideology
\(H_A\): stance not independent of ideology
\(\alpha = 0.05\)
What is the test statistic???

Logic

If \(H_0\) is true, then joint probabilities equal product of marginal probabilities
If \(A, B\) are independent, then \(\Pr(A \cap B) = \Pr(A) \cdot \Pr(B)\)

Consider this test statistic:

\[ X^2 = \sum_{i,j} \frac{(observed_{ij} - expected_{ij})^2}{expected_{ij}} \]
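As a minimal sketch of this formula in base R, the counts below are typed in by hand from the DREAM Act table shown earlier. The names `counts`, `expected`, and `X2_by_hand` are illustrative and not part of the original slides. Under independence, each expected count is (row total × column total) / n.

```r
# Observed counts, copied from the ideology-by-stance table in these slides
counts <- matrix(
  c(151, 35, 186,
     52,  9, 114,
    161, 28, 174),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    ideology = c("Conservative", "Liberal", "Moderate"),
    stance   = c("No", "Not sure", "Yes")
  )
)

n <- sum(counts)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(counts), colSums(counts)) / n

# Sum the scaled squared differences over all nine cells
X2_by_hand <- sum((counts - expected)^2 / expected)
X2_by_hand
```

The result should agree with the value that `observe()` reports for these data (about 16.39).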

Test statistic

library(infer)
X2 <- dream |>
  observe(stance ~ ideology, null = "independence", stat = "Chisq") |>
  pull(stat)
X2

X-squared 
 16.38749

How to construct the null dist?

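If \(H_0\) is true, then we expect \(X^2\) to be near 0 (observed counts close to expected counts)
But also \(X^2 \geq 0\), so its distribution cannot be symmetric around 0
So what is the distribution of \(X^2\) when \(H_0\) is true?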

Null distribution

library(infer)
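# simulate the null distribution by shuffling stance across respondents
dream_null <- dream |>
  specify(stance ~ ideology) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "Chisq")

# density of the simulated statistics, with reference lines at 0 and at the observed X2
dream_plot <- dream_null |>
  ggplot(aes(x = stat)) +
  geom_density(fill = "darkgray") +
  geom_vline(xintercept = 0, linetype = 3) +
  geom_vline(xintercept = X2, linetype = 2)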

Permutation test

dream_plot
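To see what a permutation test does under the hood, here is a rough base R sketch. It is not how `infer` is implemented. It assumes the individual responses can be rebuilt from the table counts in these slides, and the names `people` and `chisq_stat` and the seed are made up for illustration.

```r
# Observed counts, copied from the ideology-by-stance table in these slides
counts <- matrix(
  c(151, 35, 186,
     52,  9, 114,
    161, 28, 174),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    ideology = c("Conservative", "Liberal", "Moderate"),
    stance   = c("No", "Not sure", "Yes")
  )
)

# Rebuild one row per respondent from the cell counts
long <- as.data.frame(as.table(counts))
people <- long[rep(seq_len(nrow(long)), times = long$Freq), c("ideology", "stance")]

# Chi-squared statistic for two categorical vectors (hypothetical helper)
chisq_stat <- function(x, y) {
  obs <- table(x, y)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expd)^2 / expd)
}

observed_stat <- chisq_stat(people$ideology, people$stance)

# Shuffle stance to break any association with ideology, then recompute
set.seed(18)  # arbitrary seed, for reproducibility only
null_stats <- replicate(1000, chisq_stat(people$ideology, sample(people$stance)))

# Proportion of shuffled statistics at least as large as the observed one
p_value <- mean(null_stats >= observed_stat)
p_value
```

Shuffling keeps the row and column totals fixed, so only the pairing of stance with ideology changes from one replicate to the next.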

Chi-squared test

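Statisticians have shown that, when \(H_0\) is true and the expected counts are large enough, \(X^2\) approximately follows a \(\chi^2(k)\) distribution, where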
\(k\) is the number of degrees of freedom
\(k\) = (number of levels in response - 1) \(\cdot\) (number of levels in explanatory - 1)
In this case, \(k = (3-1) \cdot (3-1) = 4\)
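The degrees of freedom can also be read off the dimensions of the table. This is a small base R sketch, with the counts typed in from the DREAM Act table in these slides (`counts` and `k` are illustrative names).

```r
# Observed counts, copied from the ideology-by-stance table in these slides
counts <- matrix(
  c(151, 35, 186,
     52,  9, 114,
    161, 28, 174),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    ideology = c("Conservative", "Liberal", "Moderate"),
    stance   = c("No", "Not sure", "Yes")
  )
)

# (number of rows - 1) * (number of columns - 1)
k <- (nrow(counts) - 1) * (ncol(counts) - 1)
k
```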

Null approximation

dream_plot +
  stat_function(fun = dchisq, args = list(df = 4), color = "red")

p-values

# permutation test
dream_null |>
  get_p_value(obs_stat = X2, direction = "right")

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.002

# chi-squared test
pchisq(q = X2, df = 4, lower.tail = FALSE)

  X-squared 
0.002540935

Both p-values are below \(\alpha = 0.05\), so we reject \(H_0\): there is a statistically discernible association between stance on the DREAM Act and political ideology
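For comparison, base R's `chisq.test()` runs the same chi-squared approximation in one call. This is a different route from the `infer` pipeline used in these slides, sketched here with the counts typed in from the DREAM Act table (`counts` and `result` are illustrative names).

```r
# Observed counts, copied from the ideology-by-stance table in these slides
counts <- matrix(
  c(151, 35, 186,
     52,  9, 114,
    161, 28, 174),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    ideology = c("Conservative", "Liberal", "Moderate"),
    stance   = c("No", "Not sure", "Yes")
  )
)

result <- chisq.test(counts)

result$statistic  # X-squared test statistic
result$parameter  # degrees of freedom
result$p.value    # upper-tail area under the chi-squared curve

# Smallest expected count, useful for judging whether the approximation is reasonable
min(result$expected)
```

The `expected` component holds the expected counts under independence, so it is a quick way to check that no cell is too sparse for the approximation.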

Your turn

See handout