Inference for regression (bootstrap)

IMS, Ch. 24

Smith College

Nov 21, 2022

Inference for regression

| Method | Null dist. | Sampling dist. |
|---|---|---|
| 1: simulation | randomization test (centered at \(\beta_1 = 0\)) | bootstrap (centered at \(\hat{\beta}_1\)) |
| 2: probability | ? | ? |
| 3: \(t\)-approx. | \(t(d.f.)\) | \(t(d.f.)\) |
  • See IMS, Chapter 24
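
To make the "null dist." column concrete, here is a minimal sketch of the randomization test for the slope, using the births14 data from openintro that the rest of these slides work with. The seed, the 500 reps, and the object names (obs_slope, null_dist) are illustrative assumptions, not part of the original slides.

```r
library(infer)
library(openintro)  # provides births14

set.seed(24)  # arbitrary seed, only so the sketch is reproducible

# Observed slope of weight on weeks in the original sample
obs_slope <- births14 |>
  specify(weight ~ weeks) |>
  calculate(stat = "slope")

# Null distribution: permuting the response breaks any link between
# weight and weeks, so the simulated slopes are centered near 0
null_dist <- births14 |>
  specify(weight ~ weeks) |>
  hypothesize(null = "independence") |>
  generate(reps = 500, type = "permute") |>
  calculate(stat = "slope")

# Share of permuted slopes at least as extreme as the observed one
null_dist |>
  get_p_value(obs_stat = obs_slope, direction = "two-sided")
```

The bootstrap distribution built later in these slides is centered at the observed slope instead, which is why it is used for confidence intervals rather than for testing \(\beta_1 = 0\).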

Recall the babies

library(tidyverse)
library(openintro)
births14 |>
  head()
# A tibble: 6 × 13
   fage  mage mature      weeks premie visits gained weight lowbirthweight sex  
  <int> <dbl> <chr>       <dbl> <chr>   <dbl>  <dbl>  <dbl> <chr>          <chr>
1    34    34 younger mom    37 full …     14     28   6.96 not low        male 
2    36    31 younger mom    41 full …     12     41   8.86 not low        fema…
3    37    36 mature mom     37 full …     10     28   7.51 not low        fema…
4    NA    16 younger mom    38 full …     NA     29   6.19 not low        male 
5    32    31 younger mom    36 premie     12     48   6.75 not low        fema…
6    32    26 younger mom    39 full …     14     45   6.69 not low        fema…
# ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>
  • What is the relationship between birthweight (weight) and length of gestation (weeks)?

Data space

births_plot <- ggplot(births14, aes(x = weeks, y = weight)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
births_plot

Recall: simple linear regression

mod <- lm(weight ~ weeks, data = births14)
coefs <- coef(mod)
coefs
(Intercept)       weeks 
 -3.5979701   0.2792151 
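
Written out with the coefficients above (rounded to three decimals), the fitted line is:

\[
\widehat{\text{weight}} = -3.598 + 0.279 \times \text{weeks}
\]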

Your turn

  • Interpret the slope coefficient in the context of the problem

  • Interpret the intercept in the context of the problem

Bootstrap for regression

One bootstrap resample

  • Take one bootstrap resample
one_resample <- births14 |>
  slice_sample(prop = 1, replace = TRUE)
  • Compute regression line
one_slr <- one_resample |>
  lm(formula = weight ~ weeks) |>
  coef() |>
  bind_rows()
one_slr
# A tibble: 1 × 2
  `(Intercept)` weeks
          <dbl> <dbl>
1         -2.77 0.257

Resample in the data space

births_plot +
  geom_point(data = one_resample, alpha = 0.5, color = "orange") +
  geom_abline(
    data = one_slr, 
    aes(slope = weeks, intercept = `(Intercept)`), 
    color = "orange"
  )

More resamples

resample_line <- function() {
  births14 |>
    slice_sample(prop = 1, replace = TRUE) |>
    lm(formula = weight ~ weeks) |>
    coef() |>
    bind_rows()
}
resample_line()
# A tibble: 1 × 2
  `(Intercept)` weeks
          <dbl> <dbl>
1         -2.37 0.247
resample_line()
# A tibble: 1 × 2
  `(Intercept)` weeks
          <dbl> <dbl>
1         -3.88 0.286
resample_line()
# A tibble: 1 × 2
  `(Intercept)` weeks
          <dbl> <dbl>
1         -2.84 0.259

Many, many resamples

resampled_lines <- map_dfr(1:10, ~resample_line())
resampled_lines
# A tibble: 10 × 2
   `(Intercept)` weeks
           <dbl> <dbl>
 1         -4.80 0.309
 2         -4.26 0.297
 3         -4.40 0.299
 4         -3.56 0.277
 5         -2.99 0.265
 6         -3.93 0.289
 7         -2.79 0.259
 8         -2.83 0.260
 9         -3.45 0.275
10         -2.97 0.262

Many resamples in the data space

births_plot +
  geom_abline(
    data = resampled_lines, 
    aes(slope = weeks, intercept = `(Intercept)`), 
    color = "orange"
  )
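
The ten lines above hint at how much the slope varies from resample to resample. A rough sketch of turning that variation into numbers: keep only the slope, take many more resamples, and summarize. The helper resample_slope(), the seed, and the choice of 500 resamples are assumptions made for illustration.

```r
library(tidyverse)
library(openintro)  # provides births14

set.seed(2022)  # arbitrary seed, only so the sketch is reproducible

# Same idea as resample_line(), but returning just the slope as a number
resample_slope <- function() {
  resample <- slice_sample(births14, prop = 1, replace = TRUE)
  coef(lm(weight ~ weeks, data = resample))[["weeks"]]
}

# 500 is an arbitrary choice; more resamples give a smoother distribution
boot_slopes <- map_dbl(1:500, ~ resample_slope())

# Bootstrap standard error of the slope
sd(boot_slopes)

# 95% percentile interval: middle 95% of the resampled slopes
quantile(boot_slopes, probs = c(0.025, 0.975))
```

This is the same computation the infer pipeline performs, just spelled out by hand.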

Using infer

library(infer)
sampling_dist <- births14 |>
  specify(weight ~ weeks) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")

ci <- sampling_dist |>
  get_ci()
ci
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.241    0.314
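
By default, get_ci() returns a 95% percentile interval. For comparison, the sketch below also computes a standard-error interval from the same bootstrap distribution and the \(t\)-based interval from the fitted model (method 3 in the table at the start). The object names (boot_dist, obs_slope, ci_pct, ci_se, ci_t) and the seed are assumptions for illustration, and the three intervals will typically be close but not identical.

```r
library(infer)
library(openintro)  # provides births14

set.seed(2022)  # arbitrary seed, only so the sketch is reproducible

# Bootstrap distribution of the slope
boot_dist <- births14 |>
  specify(weight ~ weeks) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")

# Observed slope in the original sample
obs_slope <- births14 |>
  specify(weight ~ weeks) |>
  calculate(stat = "slope")

# Percentile interval: middle 95% of the bootstrap slopes
ci_pct <- get_confidence_interval(boot_dist, level = 0.95, type = "percentile")

# Standard-error interval: observed slope plus or minus a multiple
# of the bootstrap standard error
ci_se <- get_confidence_interval(
  boot_dist,
  level = 0.95,
  type = "se",
  point_estimate = obs_slope
)

# t-based interval from the fitted linear model
ci_t <- confint(lm(weight ~ weeks, data = births14), parm = "weeks", level = 0.95)

ci_pct
ci_se
ci_t
```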

Sampling distribution of the slope

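ggplot(sampling_dist, aes(x = stat)) +
  geom_density(fill = "darkgray") +
  # dashed line: slope from the original sample
  geom_vline(xintercept = coefs["weeks"], linetype = 2) +
  # dotted lines: endpoints of the bootstrap confidence interval
  geom_vline(
    data = pivot_longer(ci, everything()),
    aes(xintercept = value),
    linetype = 3
  )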
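
infer also ships its own plotting helpers, which can stand in for the hand-built ggplot above. A minimal sketch, rebuilding the bootstrap distribution so the block runs on its own (the seed and the names boot_dist, boot_ci, and boot_plot are assumptions):

```r
library(infer)
library(openintro)  # provides births14

set.seed(2022)  # arbitrary seed, only so the sketch is reproducible

# Bootstrap distribution of the slope
boot_dist <- births14 |>
  specify(weight ~ weeks) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")

# 95% percentile interval
boot_ci <- get_confidence_interval(boot_dist, level = 0.95)

# Histogram of the bootstrap slopes with the interval shaded
boot_plot <- visualize(boot_dist) +
  shade_confidence_interval(endpoints = boot_ci)
boot_plot
```

visualize() draws a histogram rather than a density curve, so the shape will look slightly different from the plot above even for the same resamples.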