Inference for regression (testing)

IMS, Ch. 24

Smith College

Nov 30, 2022

Inference for regression

Method	null dist.	sampling dist.
1: simulation	randomization test (entered at \(\beta_1 = 0\))	bootstrap (centered at \(\hat{\beta}_1\))
2: probability	?	?
3: \(t\)-approx.	\(t(d.f.)\)	\(t \left(d.f. \right)\)

See IMS, Chapter 24

Why inference for regression?

Regression models have parameters (e.g., \(\beta_0, \beta_1\), etc.)

\[ Y = \beta_0 + \beta_1 \cdot X + \epsilon \, , \]

You know how to find \(\hat{\beta}_1\)
We can test hypotheses about \(\beta_1\) (i.e., \(\beta_1 = 0\))
We can find confidence intervals for \(\beta_1\)

Hypothesis testing

We are particularly interested in testing the hypothesis that \(\beta_1 = 0\)
We can do a randomization test
- Shuffle the response variable
Or, use a \(t\)-based approximation:
- test statistic is: \(t_{\beta_1} = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}}\)
- degrees of freedom is \((n-1) - (\text{# of predictors})\)

Randomization test

Setup

library(tidyverse)
library(openintro)
births14

# A tibble: 1,000 × 13
    fage  mage mature     weeks premie visits gained weight lowbirthweight sex  
   <int> <dbl> <chr>      <dbl> <chr>   <dbl>  <dbl>  <dbl> <chr>          <chr>
 1    34    34 younger m…    37 full …     14     28   6.96 not low        male 
 2    36    31 younger m…    41 full …     12     41   8.86 not low        fema…
 3    37    36 mature mom    37 full …     10     28   7.51 not low        fema…
 4    NA    16 younger m…    38 full …     NA     29   6.19 not low        male 
 5    32    31 younger m…    36 premie     12     48   6.75 not low        fema…
 6    32    26 younger m…    39 full …     14     45   6.69 not low        fema…
 7    37    36 mature mom    36 premie     10     20   6.13 not low        fema…
 8    29    24 younger m…    40 full …     13     65   6.74 not low        male 
 9    30    32 younger m…    39 full …     15     25   8.94 not low        fema…
10    29    26 younger m…    39 full …     11     22   9.12 not low        male 
# ℹ 990 more rows
# ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>

births_plot <- ggplot(births14, aes(x = weeks, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm")

Data space

births_plot

SLR model

mod <- lm(weight ~ weeks, data = births14)
coefs <- coef(mod)
coefs

(Intercept)       weeks 
 -3.5979701   0.2792151

Simulate the null distribution

library(infer)
slope_null_dist <- births14 |>
  specify(weight ~ weeks) |>
  hypothesize(null = "independence") |>
  generate(1000, type = "permute") |>
  calculate("slope")

# null distribution SE
slope_null_dist |>
  summarize(se = sd(stat))

# A tibble: 1 × 1
      se
   <dbl>
1 0.0161

slope_plot <- ggplot(slope_null_dist, aes(x = stat)) +
  geom_density(fill = "darkgray") +
  geom_vline(xintercept = 0, linetype = 3) +
  geom_vline(xintercept = coefs[2], linetype = 2)

Just one shuffled slope!

births_plot + aes(y = sample(weight))

And another one…

births_plot + aes(y = sample(weight))

And one more…

births_plot + aes(y = sample(weight))

Where is the test statistic?

slope_plot

Decision

We reject the null hypothesis that the true slope coefficient \(\beta_1 = 0\)
Gestational length has a statistically significant positive association with birthweight
Each additional week that a baby remains in the womb is associated with an increase in the expected birthweight of about one quarter of a pound (about the weight of 🍔)

\(t\)-based approximation

\(t\)-test

summary(mod)


Call:
lm(formula = weight ~ weeks, data = births14)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0175 -0.7187  0.0190  0.6886  3.6078 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.59797    0.52273  -6.883 1.03e-11 ***
weeks        0.27922    0.01349  20.699  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.094 on 998 degrees of freedom
Multiple R-squared:  0.3004,    Adjusted R-squared:  0.2997 
F-statistic: 428.4 on 1 and 998 DF,  p-value: < 2.2e-16

Using `broom`

mod_tidy <- broom::tidy(mod)
mod_tidy

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -3.60     0.523      -6.88 1.03e-11
2 weeks          0.279    0.0135     20.7  1.80e-79

How good is the approximation?

my_t <- function(x, df, se) { dt(x / se, df) / se }
SE <- mod_tidy |>
  filter(term == "weeks") |>
  pull(std.error)

slope_plot +
  stat_function(fun = my_t, args = list(df = 998, se = SE), color = "red")

Confidence intervals

Recall our bootstrap CI
Parametric confidence intervals are of the form: \[ \text{estimate} \pm \text{multiplier} \cdot \text{standard error} \]
\(t\)-confidence intervals for each coefficient have the form: \[ \hat{\beta_j} \pm \text{multiplier} \cdot \text{standard error}_{\hat{\beta}_j} \]

Computing confidence intervals

# using infer (randomization)
slope_null_dist |>
  get_ci() + 
  coefs["weeks"]

   lower_ci  upper_ci
1 0.2490562 0.3130666

# using base R
confint(mod)

                 2.5 %    97.5 %
(Intercept) -4.6237403 -2.572200
weeks        0.2527442  0.305686

# using broom
broom::tidy(mod, conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   -3.60     0.523      -6.88 1.03e-11   -4.62     -2.57 
2 weeks          0.279    0.0135     20.7  1.80e-79    0.253     0.306

Diagnostics

Checking conditions

We assumed the slope coefficient followed a \(t\)-distribution
When we fit the regression model: \[ Y = \beta_0 + \beta_1 \cdot X + \epsilon \, , \]
We assume that \(\epsilon \sim N(0, \sigma)\), for some constant \(\sigma\)

L-I-N-E

Our inferences will only be valid if the following assumptions are reasonable:
- Linearity
- Independence (of residuals)
- Normality (of residuals)
- Equal Variance (of residuals)

Linearity (and independence)

plot(mod, which = 1)

Normality of residuals

plot(mod, which = 2)

Equal variance

plot(mod, which = 3)

Influential points

plot(mod, which = 5)

Check all conditions

library(performance)
check_model(mod, check = c("linearity", "qq", "homogeneity", "outliers"))

In-class micro-project

Teacher salaries

Recall the teacher salaries data set openintro::teacher
Use years of experience as an explanatory variable for total salary
Fit, interpret, and assess the model

Inference for regression (testing)

Inference for regression

Inference for regression

Why inference for regression?

Hypothesis testing

Randomization test

Setup

Data space

SLR model

Simulate the null distribution

Just one shuffled slope!

And another one…

And one more…

Where is the test statistic?

Decision

\(t\)-based approximation

\(t\)-test

Using broom

How good is the approximation?

Confidence intervals

Computing confidence intervals

Diagnostics

Checking conditions

L-I-N-E

Linearity (and independence)

Normality of residuals

Equal variance

Influential points

Check all conditions

In-class micro-project

Teacher salaries

Using `broom`