05: Bivariate relationships

IMS, Ch. 7.1

Smith College

Feb 6, 2026

A correction

You were right!

Simulate player heights

library(tidyverse)
# number of players (teams × roster spots)
n_nba <- 30 * 12
n_wnba <- 15 * 12

players <- tibble(
  league = c(rep("nba", times = n_nba), rep("wnba", times = n_wnba)),
  height = c(
    rnorm(n_nba, mean = 78, sd = 3.5), 
    rnorm(n_wnba, mean = 72, sd = 3)
  )
)

players |> 
  head()
# A tibble: 6 × 2
  league height
  <chr>   <dbl>
1 nba      74.8
2 nba      79.5
3 nba      75.9
4 nba      78.4
5 nba      76.0
6 nba      74.0

Visualize distribution (bivariate)

ggplot(players, aes(x = height, fill = league)) +
  geom_density(alpha = 0.8)

Visualize distribution (univariate)

ggplot(players, aes(x = height)) +
  geom_density(alpha = 0.8)

Challenge

Thought experiment

Suppose that the government issued a $2,000 tax rebate to each American taxpayer.

  • How would the distribution of income change?
  • What would happen to your measures of center and spread?
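
One way to check your answer is with a small numerical sketch. The incomes below are made-up values chosen only for illustration, and `income`, `rebate`, and `income_after` are hypothetical names.

```r
# Hypothetical incomes (made-up values, for illustration only)
income <- c(18000, 32000, 41000, 55000, 73000, 120000, 450000)
rebate <- 2000
income_after <- income + rebate

# Change in measures of center after adding the same constant to everyone
mean(income_after) - mean(income)
median(income_after) - median(income)

# Change in measures of spread after adding the same constant to everyone
sd(income_after) - sd(income)
IQR(income_after) - IQR(income)
```

Compare the two pairs of differences: adding a constant slides the whole distribution along the axis without stretching it.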

Demo: stratified sampling
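
The demo itself is not reproduced in these slides. As a rough sketch of the idea, stratified sampling draws a separate random sample within each group, here within each league. The data frame below is a smaller stand-in for the simulated players (group sizes of 100 and 50 and the seed are arbitrary assumptions), and `players_small` and `strat_sample` are hypothetical names.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Smaller stand-in for the simulated players (group sizes are arbitrary)
players_small <- data.frame(
  league = rep(c("nba", "wnba"), times = c(100, 50)),
  height = c(
    rnorm(100, mean = 78, sd = 3.5),
    rnorm(50, mean = 72, sd = 3)
  )
)

# Stratified sample: draw the same number of players from each league
strat_sample <- players_small |>
  group_by(league) |>
  slice_sample(n = 20) |>
  ungroup()

# Each league contributes exactly 20 rows
strat_sample |>
  count(league)
```

A simple random sample of 40 players would instead contain a varying number from each league.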

Bivariate Relationships

Two variables

  • Response variable (aka dependent variable):
    • the variable that you are trying to understand/model
  • Explanatory variable (aka independent variable, aka predictor):
    • the variable that you can measure that you think might be related to the response variable

Graphics for two numerical variables

  • Scatter plot
  • response variable on \(y\)-axis and explanatory variable on \(x\)-axis
  • geom_point()
  • What are we looking for?
    • Overall patterns and deviations from those patterns
    • Form (e.g. linear, quadratic, etc.), direction (positive or negative), and strength (how much scatter?)
    • Outliers

Example: birthweight of babies

library(openintro)
ggplot(babies, aes(x = gestation, y = bwt)) +
  geom_point()

Your turn: scatterplot

  • Use a scatter plot to analyze the relationship between the height and mass of characters in the starwars data frame

  • Characterize the relationship

    • Form
    • Direction
    • Strength
    • Outliers
  • Now do the same for the penguins data frame

Graphics for numerical response and categorical explanatory

  • Side-by-side box plots: geom_boxplot()
  • Multiple density plots: geom_density()
  • Use color/fill aesthetic or facet_wrap()
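
As a sketch of the facet_wrap() option from the last bullet, the code below draws one density panel per smoking status instead of overlaying the curves. It assumes the babies data frame from openintro; `p_facet` is a hypothetical name, and `ncol = 1` is just one layout choice.

```r
library(ggplot2)
library(openintro)  # provides the babies data frame

# One density curve per level of smoke, stacked vertically so that
# the panels share a common x-axis and are easy to compare
p_facet <- ggplot(babies, aes(x = bwt)) +
  geom_density() +
  facet_wrap(~smoke, ncol = 1)

p_facet
```

Rows with a missing value of smoke, if any, get their own panel.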

Aside: “ridgeline plots”
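
A ridgeline plot stacks one density curve per group along the y-axis. A minimal sketch, assuming the ggridges package is installed and using the babies data frame from openintro (`p_ridge` is a hypothetical name):

```r
library(ggplot2)
library(ggridges)   # assumed to be installed; provides geom_density_ridges()
library(openintro)  # provides the babies data frame

# The numerical variable goes on x and the categorical variable on y
p_ridge <- ggplot(babies, aes(x = bwt, y = factor(smoke), fill = factor(smoke))) +
  geom_density_ridges(alpha = 0.8)

p_ridge
```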

Example: birthweight of babies

ggplot(babies, aes(x = bwt, fill = factor(smoke))) +
  geom_boxplot()

Example: birthweight of babies

ggplot(babies, aes(x = bwt, fill = factor(smoke))) +
  geom_density(alpha = 0.8)

Example: birthweight of babies

ggplot(babies, aes(x = factor(smoke), y = bwt)) +
  geom_jitter(width = 0.1)

Your turn: categorical

  • Visualize the relationship between birthweight (bwt) and first pregnancy (parity)

Correlation

(Pearson Product-Moment) correlation coefficient

  • measure of the strength and direction of the linear relationship between two numerical variables
  • denoted \(r\)
  • takes values in \([-1, 1]\)
  • cor()
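
To see what cor() computes, the sketch below builds \(r\) by hand as the sum of products of z-scores divided by \(n - 1\), then compares it to cor(). It uses the first x/y pair of R's built-in anscombe data; `z_x`, `z_y`, and `r_manual` are hypothetical names.

```r
# Built-in anscombe data (datasets package), first x/y pair
x <- anscombe$x1
y <- anscombe$y1
n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# r is the sum of the products of z-scores, divided by n - 1
r_manual <- sum(z_x * z_y) / (n - 1)

r_manual
cor(x, y)
```

The two values agree up to floating-point rounding.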

Anscombe

# A tibble: 4 × 5
  set       N `mean(x)` `mean(y)` `cor(x, y)`
  <chr> <int>     <dbl>     <dbl>       <dbl>
1 1        11         9      7.50       0.816
2 2        11         9      7.50       0.816
3 3        11         9      7.5        0.816
4 4        11         9      7.50       0.817

  • same mean
  • same standard deviation
  • same correlation coefficient (to two decimal places)!
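
The slides do not show how the summary table was produced. One possible way, assuming R's built-in anscombe data frame (columns x1 to x4 and y1 to y4) is the source, is to reshape it to long format and summarize by set. The name `anscombe_summary` is hypothetical.

```r
library(dplyr)
library(tidyr)

# Reshape from wide (x1..x4, y1..y4) to long (set, x, y):
# the first character of each column name becomes the new column (x or y),
# and the second character becomes the set label
ds <- anscombe |>
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  ) |>
  arrange(set)

# One row of summary statistics per set
anscombe_summary <- ds |>
  group_by(set) |>
  summarize(N = n(), mean(x), mean(y), cor(x, y))

anscombe_summary
```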

Anscombe plots

# ds: the four Anscombe sets in long format (columns set, x, y)
ggplot(data = ds, aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~set)

Datasaurus
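
The same kind of check can be run on the Datasaurus data. This sketch assumes the datasauRus package is installed and uses its datasaurus_dozen data frame (columns dataset, x, and y); `dino_summary` is a hypothetical name.

```r
library(dplyr)
library(datasauRus)  # assumed to be installed; provides datasaurus_dozen

# Summary statistics for each of the data sets
dino_summary <- datasaurus_dozen |>
  group_by(dataset) |>
  summarize(N = n(), mean(x), mean(y), sd(x), sd(y), cor(x, y))

dino_summary
```

Plotting the same data with geom_point() and facet_wrap(~dataset) shows how different the sets look despite the similar summaries.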

More examples

Beware

  • Note that correlation only measures the strength of a linear relationship

  • Always graph your data!
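
A small constructed example (not from the slides) makes the first point concrete: below, y is completely determined by x, yet the correlation is zero because the relationship is not linear. The values are made up for illustration.

```r
# A perfect, but non-linear, relationship
x <- -10:10
y <- x^2

# The correlation coefficient is zero (up to rounding error)
cor(x, y)
```

A scatter plot of y against x shows the parabola immediately.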

Spurious Correlations