06: Simple linear regression

IMS, Ch. 7

Smith College

Feb 9, 2026

Warmup

Warmup 1

  • An article reported that there was a 0.42 correlation between alcohol consumption and income among adults with a four-year college degree.

  • Is it reasonable to conclude that increasing one’s alcohol consumption will increase one’s income? Explain why or why not.

Warmup 2

  • A college newspaper interviews a psychologist about student ratings of the teaching of faculty members.

  • The psychologist says, “The evidence indicates that the correlation between the research productivity and teaching rating of faculty members is close to zero.”

  • The paper reports this as “Prof. McDaniel said that good researchers tend to be poor teachers, and vice versa.”

  • Explain why the paper’s report is wrong.

  • Write a statement in plain language (don’t use the word correlation) to explain the psychologist’s meaning.

Simple linear regression

Regression

  • Goal: understand changes in a numerical response variable in terms of a numerical explanatory variable.

  • A simple linear regression model for \(y\) in terms of \(x\) takes the form: \[ y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i \,, \text{ for } i=1,\ldots,n \]

  • \(y_i\) and \(x_i\) for \(i = 1, \ldots, n\) are the observations
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the slope coefficient
  • \(\epsilon_i\)’s are the errors, or noise
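
One way to get a feel for the model is to generate data from it. The sketch below is only an illustration: the parameter values, the sample size, and the choice of normally distributed errors are assumptions made here, not part of the slides.

```r
# Simulate n observations from y_i = beta_0 + beta_1 * x_i + epsilon_i.
# All numeric values below are arbitrary choices for illustration.
set.seed(1)

n      <- 100
beta_0 <- 2      # true intercept (hypothetical)
beta_1 <- 0.5    # true slope (hypothetical)
sigma  <- 1      # standard deviation of the noise (hypothetical)

x       <- runif(n, min = 0, max = 10)     # explanatory variable
epsilon <- rnorm(n, mean = 0, sd = sigma)  # errors, or noise (assumed normal)
y       <- beta_0 + beta_1 * x + epsilon   # response

sim <- data.frame(x = x, y = y)
head(sim)
```

Each row of sim is one observation \((x_i, y_i)\); the noise term is the only reason the points do not fall exactly on the line.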

Computing

  • Use lm() to fit the model
  • Use coef() on the regression object to get the coefficients
library(tidyverse)
library(openintro)
mod <- lm(bwt ~ gestation, data = babies)
coef(mod)
(Intercept)   gestation 
-10.0641842   0.4642626 

View model in the data space

  • Use geom_smooth(method = "lm") to show on a plot
ggplot(babies, aes(x = gestation,  y = bwt)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Properties of SLR models

  • There is only one regression line that fits the data best under the least squares criterion
    • i.e., the ordinary least squares (OLS) regression line is unique
  • The true values of the unknown parameters \(\beta_0\) and \(\beta_1\) are estimated by \(b_0\) and \(b_1\)
    • (or if you prefer, \(\hat{\beta}_0\) and \(\hat{\beta}_1\))
  • The fitted values are given by \[ \hat{y}_i = b_0 + b_1 \cdot x_i \]
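
For a single explanatory variable, the least squares estimates have a standard closed form, \(b_1 = s_{xy} / s_x^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\). These formulas are not stated in the slides; the sketch below assumes them and checks that they agree with lm() on the babies data. The object name d is introduced here for illustration.

```r
library(openintro)

# Keep only rows where both variables are observed, as lm() does by default.
d <- na.omit(babies[, c("bwt", "gestation")])

# Closed-form least squares estimates for one explanatory variable.
b1 <- cov(d$gestation, d$bwt) / var(d$gestation)   # slope
b0 <- mean(d$bwt) - b1 * mean(d$gestation)         # intercept

# Compare with the coefficients reported by lm().
mod <- lm(bwt ~ gestation, data = babies)
all.equal(unname(coef(mod)), c(b0, b1))
```

Because the estimates come from a formula with a single answer, this also illustrates why the least squares line is unique.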

Population model vs. fitted model

  • Population model: a statistical model
    • with two parameters: \(\beta_0, \beta_1\)
    • values of \(\beta_0, \beta_1\) are unknowable!

\[ y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i \,, \text{ for } i=1,\ldots,n \]

  • Fitted model: a deterministic equation
    • with two fitted coefficients: \(b_0, b_1\) (alternatively, \(\hat{\beta}_0, \hat{\beta}_1\))
    • values of \(b_0, b_1\) are known and fixed once the model is fit to a given data set!

\[ \hat{y}_i = b_0 + b_1 \cdot x_i \,, \text{ for } i=1,\ldots,n \]
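
The distinction can be seen in a small simulation, where (unlike in practice) the population parameters are chosen by us. This is a minimal sketch: the parameter values, sample size, and the helper fit_one_sample() are hypothetical and exist only for this illustration.

```r
set.seed(2)

# Population model with known parameters (hypothetical values; with real
# data, beta_0 and beta_1 are never observed).
beta_0 <- 2
beta_1 <- 0.5

# Draw one sample of size n from the population model and fit it with lm().
fit_one_sample <- function(n = 50) {
  x <- runif(n, min = 0, max = 10)
  y <- beta_0 + beta_1 * x + rnorm(n)
  coef(lm(y ~ x))
}

b_sample_1 <- fit_one_sample()
b_sample_2 <- fit_one_sample()

# Each row holds (b0, b1); the first row is the truth, the others are fits.
rbind(truth = c(beta_0, beta_1), b_sample_1, b_sample_2)
```

The two fitted rows differ from each other and from the truth: the parameters stay the same, while the fitted coefficients depend on which sample was drawn.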

Residuals

  • The model (almost) never fits perfectly

  • What is left over are the residuals (\(e_i = y_i - \hat{y}_i\))

  • Many of the assumptions we’ll make later involve the residuals

  • Residual analysis is important! (…more later)
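
As a quick check of the definition, the residuals can be computed by hand and compared with what residuals() returns. The sketch below reuses the babies model from earlier; the names used and e_manual are introduced here for illustration.

```r
library(openintro)

mod <- lm(bwt ~ gestation, data = babies)

# Rows actually used in the fit (lm() drops rows with missing values).
used <- model.frame(mod)

# Residuals by hand: observed response minus fitted value.
e_manual <- used$bwt - fitted(mod)

# residuals() returns the same quantities.
all.equal(unname(e_manual), unname(residuals(mod)))

# With an intercept in the model, least squares residuals sum to
# (numerically) zero.
sum(residuals(mod))
```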

Example: RailTrail

RailTrail

The Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station. The data is captured in the RailTrail data set.

library(mosaic)
head(RailTrail)
  hightemp lowtemp avgtemp spring summer fall cloudcover precip volume weekday
1       83      50    66.5      0      1    0        7.6   0.00    501    TRUE
2       73      49    61.0      0      1    0        6.3   0.29    419    TRUE
3       74      52    63.0      1      0    0        7.5   0.32    397    TRUE
4       95      61    78.0      0      1    0        2.6   0.00    385   FALSE
5       44      52    48.0      1      0    0       10.0   0.14    200    TRUE
6       69      54    61.5      1      0    0        6.6   0.02    375    TRUE
  dayType
1 weekday
2 weekday
3 weekday
4 weekend
5 weekday
6 weekday

Your turn: RailTrail

  • Create a scatterplot of volume (response) against avgtemp (explanatory)
  • Describe the form, direction, and strength of the relationship
  • Use cor() to compute the correlation coefficient
  • Use geom_smooth() to view the model in the data space
  • Fit the linear regression model using lm()
  • Interpret the coefficients for the Intercept and avgtemp terms

Model visualization

Compare the OLS regression line (right) with the null model (left).
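
The comparison can also be made numerically. The sketch below assumes that "null model" means the intercept-only model, which predicts the mean of volume for every day; the slides do not define the term, so treat that reading as an assumption. The object names mod_null and mod_ols are introduced here.

```r
library(mosaic)  # makes the RailTrail data available

# Null model: intercept only, so every fitted value is the mean of volume.
mod_null <- lm(volume ~ 1, data = RailTrail)

# OLS regression line with avgtemp as the explanatory variable.
mod_ols <- lm(volume ~ avgtemp, data = RailTrail)

# The single coefficient of the null model is the sample mean.
all.equal(unname(coef(mod_null)), mean(RailTrail$volume))

# Sum of squared residuals for each model; the OLS line can never do worse,
# because the null model is the special case with slope zero.
c(null = sum(residuals(mod_null)^2), ols = sum(residuals(mod_ols)^2))
```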