+ - 0:00:00
Notes for current slide
Notes for next slide

Data wrangling

Tidy data

Ben Baumer

SDS 192
Oct 5, 2020
(http://beanumber.github.io/sds192/lectures/mdsr_wrangling_06-tidyr.html)

1 / 15

2 / 15

Second normal form

  • E.F. Codd (1971)

  • This table is about tournaments

  • Winner Date of Birth is a fact about a person

    • it's redundant
    • it belongs in a table about people
3 / 15

Third normal form

  • "Like is stored with like"

  • No redundant information

  • Tables tend to be:

    • Long (many rows)
    • Narrow (few columns)
    • Efficient in terms of both storage and retrieval speed
  • Not that useful if you don't know how to join!

  • What is the average age (in years) of a tournament winner?

4 / 15

Tidy data: Wickham (2014)

5 / 15

Why tidy data?

library(tidyverse)
library(fivethirtyeight)
drugs <- drug_use %>%
mutate(age_min = parse_number(levels(age))) %>%
select(age_min, alcohol_use, marijuana_use, cocaine_use)
drugs
## # A tibble: 17 x 4
## age_min alcohol_use marijuana_use cocaine_use
## <dbl> <dbl> <dbl> <dbl>
## 1 12 3.9 1.1 0.1
## 2 13 8.5 3.4 0.1
## 3 14 18.1 8.7 0.1
## 4 15 29.2 14.5 0.5
## 5 16 40.1 22.5 1
## 6 17 49.3 28 2
## 7 18 58.7 33.7 3.2
## 8 19 64.6 33.4 4.1
## 9 20 69.7 34 4.9
## 10 21 83.2 33 4.8
## 11 22 84.2 28.4 4.5
## 12 24 83.1 24.9 4
## 13 26 80.7 20.8 3.2
## 14 30 77.5 16.4 2.1
## 15 35 75 10.4 1.5
## 16 50 67.2 7.3 0.9
## 17 65 49.3 1.2 0
6 / 15

Cumbersome to plot

ggplot(data = drugs, aes(x = age_min)) +
geom_line(aes(y = alcohol_use), color = "red") +
geom_line(aes(y = marijuana_use), color = "green") +
geom_line(aes(y = cocaine_use), color = "blue")

7 / 15

Example: countries over time

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
  • Not tidy because:
    • What are the variables?
    • Do the values measure the same kind of thing?
8 / 15

pivot_longer(): make wide into narrow

table4a %>%
pivot_longer(-country, names_to = "year", values_to = "cases")

9 / 15

Demo: Frozen longer

frozen
## # A tibble: 3 x 3
## person sunday monday
## <chr> <chr> <chr>
## 1 Elsa happy sad
## 2 Anna happy happy
## 3 Kristoff afraid sad
10 / 15

Demo: Frozen longer

frozen
## # A tibble: 3 x 3
## person sunday monday
## <chr> <chr> <chr>
## 1 Elsa happy sad
## 2 Anna happy happy
## 3 Kristoff afraid sad
frozen %>%
pivot_longer(-person, names_to = "day", values_to = "emotion")
## # A tibble: 6 x 3
## person day emotion
## <chr> <chr> <chr>
## 1 Elsa sunday happy
## 2 Elsa monday sad
## 3 Anna sunday happy
## 4 Anna monday happy
## 5 Kristoff sunday afraid
## 6 Kristoff monday sad
10 / 15

Let's tidy up that approval data...

drugs_tidy <- drugs %>%
pivot_longer(
cols = contains("_use"),
names_to = "substance",
values_to = "pct"
)
drugs_tidy
## # A tibble: 51 x 3
## age_min substance pct
## <dbl> <chr> <dbl>
## 1 12 alcohol_use 3.9
## 2 12 marijuana_use 1.1
## 3 12 cocaine_use 0.1
## 4 13 alcohol_use 8.5
## 5 13 marijuana_use 3.4
## 6 13 cocaine_use 0.1
## 7 14 alcohol_use 18.1
## 8 14 marijuana_use 8.7
## 9 14 cocaine_use 0.1
## 10 15 alcohol_use 29.2
## # … with 41 more rows
11 / 15

...and now the plot is simple

ggplot(drugs_tidy, aes(x = age_min, y = pct, color = substance)) +
geom_line()

12 / 15

Example: countries over time

head(table2)
## # A tibble: 6 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
  • Not tidy because:
    • Are the counts comparable?
13 / 15

pivot_wider(): make narrow into wide

table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

https://github.com/hadley/r4ds/issues/544

14 / 15

Demo: Frozen wider

frozen_longer
## # A tibble: 6 x 3
## person day emotion
## <chr> <chr> <chr>
## 1 Elsa sunday happy
## 2 Elsa monday sad
## 3 Anna sunday happy
## 4 Anna monday happy
## 5 Kristoff sunday afraid
## 6 Kristoff monday sad
15 / 15

Demo: Frozen wider

frozen_longer
## # A tibble: 6 x 3
## person day emotion
## <chr> <chr> <chr>
## 1 Elsa sunday happy
## 2 Elsa monday sad
## 3 Anna sunday happy
## 4 Anna monday happy
## 5 Kristoff sunday afraid
## 6 Kristoff monday sad
frozen_longer %>%
pivot_wider(names_from = day, values_from = emotion)
## # A tibble: 3 x 3
## person sunday monday
## <chr> <chr> <chr>
## 1 Elsa happy sad
## 2 Anna happy happy
## 3 Kristoff afraid sad
15 / 15

2 / 15
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow