04: Center, Shape, and Spread

IMS, Ch. 4–5

Smith College

Feb 4, 2026

Warmup

Lurking Variables

For each of the following pairs of variables, a statistically significant positive relationship has been observed. Identify a potential lurking variable (omitted confounder) that might cause the spurious correlation.

The amount of ice cream sold in New England and the number of deaths by drowning
The salary of U.S. ministers and the price of vodka
The number of doctors in a region and the number of crimes committed in that region
The amount of coffee consumed and the prevalence of lung cancer

Summarizing distributions

Shape, Center, and Spread

Histogram

Summarize the shape of the distribution of one variable

library(tidyverse)
library(palmerpenguins)
ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram()
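
The shape a histogram shows depends on how wide the bins are, and `geom_histogram()` falls back to 30 bins when none is specified. A minimal sketch of setting the width explicitly follows; the value of 200 grams is an illustrative assumption, not a recommended setting.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth is in the units of the x variable (grams here);
# 200 is an arbitrary illustrative choice
p_hist <- ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE)  # na.rm drops missing masses quietly
p_hist
```

Trying a few different widths is a quick way to check whether an apparent feature (such as a second peak) is real or an artifact of the binning.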

Density plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_density()

Box plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_boxplot()

Exercise 1

Use a data graphic to summarize the distribution of the height variable in the starwars data frame.

Histogram: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_histogram()

Density plot: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.8)

Boxplot: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_boxplot()
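
With only `x` and `fill` mapped, the boxes are dodged along an axis that has no meaningful scale. One possible alternative, sketched here as an assumption about what reads more clearly, also maps `species` to `y` so each box gets its own labeled row.

```r
library(ggplot2)
library(palmerpenguins)

# One horizontal box per species, labeled on the y axis
p_box <- ggplot(data = penguins, aes(x = body_mass_g, y = species, fill = species)) +
  geom_boxplot(na.rm = TRUE)  # na.rm drops missing masses quietly
p_box
```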

Measures of center

mean: mean()
median: median()

penguins |>
  summarize(
    number_of_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE)
  )

# A tibble: 1 × 3
  number_of_penguins mean_mass median_mass
               <int>     <dbl>       <dbl>
1                344     4202.        4050
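
Note that `n()` counts rows, while `na.rm = TRUE` drops missing values before the mean and median are computed, so the two can be based on different numbers of penguins. A small sketch to check how many body masses were actually recorded (the column names `n_rows`, `n_measured`, and `n_missing` are made up for illustration):

```r
library(dplyr)
library(palmerpenguins)

mass_counts <- penguins |>
  summarize(
    n_rows = n(),                           # every row, missing or not
    n_measured = sum(!is.na(body_mass_g)),  # rows with a recorded body mass
    n_missing = sum(is.na(body_mass_g))     # rows where body mass is NA
  )
mass_counts
```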

Measures of spread

standard deviation: sd()
variance: var()
range: range()
IQR: IQR()

penguins |>
  summarize(
    number_of_penguins = n(),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    var_mass = var(body_mass_g, na.rm = TRUE)
  )

# A tibble: 1 × 3
  number_of_penguins sd_mass var_mass
               <int>   <dbl>    <dbl>
1                344    802.  643131.
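
The range and IQR can be computed in the same way. Because `range()` returns two values (the minimum and the maximum), this sketch uses `min()` and `max()` so that each summary column holds a single number; the column names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

mass_spread <- penguins |>
  summarize(
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE),
    iqr_mass = IQR(body_mass_g, na.rm = TRUE)  # third quartile minus first quartile
  )
mass_spread
```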

Using groups

define groups: group_by()

penguins |>
  group_by(species) |>
  summarize(
    number_of_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )

# A tibble: 3 × 4
  species   number_of_penguins mean_mass sd_mass
  <fct>                  <int>     <dbl>   <dbl>
1 Adelie                   152     3701.    459.
2 Chinstrap                 68     3733.    384.
3 Gentoo                   124     5076.    504.
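
As an alternative sketch, assuming dplyr 1.1.0 or later is installed, the grouping can be supplied for a single operation through the `.by` argument of `summarize()` instead of a separate `group_by()` step. The result is an ungrouped tibble, and groups appear in order of first appearance in the data rather than sorted.

```r
library(dplyr)
library(palmerpenguins)

by_species <- penguins |>
  summarize(
    number_of_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    .by = species  # per-operation grouping; requires dplyr >= 1.1.0
  )
by_species
```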

Exercise 2

Summarize the distribution of the height variable in the starwars data frame by computing:
- the number of observations
- the mean
- the standard deviation
- Bonus: choose a categorical variable and compute these summaries separately for each group!

Exercise 3

Thought Experiment

Consider the following two variables:

The height of all professional basketball players
The annual income of all working adults in the United States

Sketch a density plot for each distribution. What features does it have?
Is it symmetric? Is it normal? Is it unimodal?
Summarize each distribution numerically. Which measures are most appropriate?

Exercise 4 (Challenge)

Suppose that the government issued a tax rebate in the amount of $2000 to each American taxpayer.

How would the distribution of income change?
What would happen to your measures of center and spread?

Demo: stratified sampling