ETL for Medium Data

A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

Benjamin S. Baumer
https://bit.ly/2LzLZtm
Joint Statistical Meetings
July 31st, 2018

Four distinct ideas/challenges

A Grammar for
Reproducible and Painless
Extract-Transform-Load Operations
on Medium Data

An example: Citi Bike

Citi Bike: problems?

New York City’s municipal bike-sharing system
- 12,000 bikes, 472 stations
Problems:
- load balancing
- how to ensure enough bikes at each station?

Citi Bike: research

Data Analysis and optimization for (Citi)Bike sharing [O’Mahony, Shmoys (2015)]
Predicting bike usage for New York City’s bike sharing system [Singhvi, et al. (2015)]
Smarter tools for (Citi) bike sharing [O’Mahony (2015)]
Incorporating the impact of spatio-temporal interactions on bicycle sharing system demand: A case study of New York CitiBike system [Faghih-Imani, Eluru (2016)]

Citi Bike: data

We obtained bike usage statistics for April, May, June and July 2014 from Citi Bike’s website (https://www.citibikenyc.com/system-data). This dataset contains start station id, end station id, station latitude, station longitude and trip time for each bike trip. 332 bike stations have one or more originating bike trips. 253 of these are in Manhattan while 79 are in Brooklyn (left panel of Figure 1). We processed this raw data to get the number of bike trips between each station pair during morning rush hours.
–Singhvi, et al. (2015)

Could you reproduce this analysis?

Citi Bike: set up `citibike` database

Set up bikes object

library(citibike)

bikes <- etl("citibike", 
             dir = "~/dumps/citibike/",
             db = src_mysql_cnf("citibike"))

Populate a database

bikes %>%
  etl_update(years = 2014, months = 4:7)

Citi Bike: query database

trips <- bikes %>%
  tbl("trips")

trips %>%
  group_by(Start_Station_ID) %>%
  summarize(num_trips = n()) %>%
  filter(num_trips >= 1) %>%
  arrange(desc(num_trips)) %>%
  collect()

## # A tibble: 332 x 2
##    Start_Station_ID num_trips
##               <int>     <dbl>
##  1              519     50316
##  2              521     49511
##  3              293     45391
##  4              497     45154
##  5              426     40046
##  6              435     37542
##  7              285     36074
##  8              499     33849
##  9              151     33776
## 10              444     33663
## # ... with 322 more rows

Citi Bike: reproducibility?

How confident are you that we have a copy of the same data as these researchers?

same number of stations…
- same number of rows?
- rows contain the same information?
- integer overflows?
impossible to verify given the description
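One way to make these questions answerable is to publish a fingerprint of the raw files alongside the analysis. The following is a minimal sketch in base R, not something taken from the cited papers: `fingerprint_files()` is a hypothetical helper, and the demo directory and CSV contents are made up for illustration.

```r
# Hypothetical helper: summarize every file in a directory so that two
# analysts can compare what they actually downloaded.
fingerprint_files <- function(dir) {
  paths <- list.files(dir, full.names = TRUE)
  data.frame(
    file  = basename(paths),
    bytes = file.size(paths),
    md5   = unname(tools::md5sum(paths)),  # checksum of the file contents
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway directory holding one tiny, made-up CSV
demo_dir <- file.path(tempdir(), "fingerprint_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(
  data.frame(start_station_id = c(519L, 521L), trip_time = c(600L, 720L)),
  file.path(demo_dir, "trips.csv"),
  row.names = FALSE
)

fp <- fingerprint_files(demo_dir)
print(fp)
```

If two people report the same file names, sizes, and checksums, they are working from byte-identical raw data. Row counts and column types would still need to be compared after loading.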

Replicability Crisis

Why Most Published Research Findings Are False (Ioannidis, 2005)
Announcement: Reducing our irreproducibility (Nature, 2013)
ASA’s Statement on p-Values (Wasserstein & Lazar, 2016)

replicability:
- different people get the same results with different data
- entails data collection

reproducibility:
- same/different people get the same results with same data
- entails data analysis

Reproducible scholarship

An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.
–Donoho (2010) paraphrasing Claerbout (1994)

Must include the code you wrote to get the data!

Data size for the single user

“Size”	size	hardware	software
small	< several GB	RAM	R
medium	several GB – a few TB	hard disk	SQL
big	many TB or more	cluster	Spark?

ETL operations

Extract -> download
Transform -> wrangle
Load -> into database (typically SQL)

A grammar?

In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language.
–Wikipedia

The Grammar of Graphics (Wilkinson, 2006)
- ggplot2 (Wickham, 2009)
dplyr: a grammar of data manipulation (Wickham & Francois, 2016)
Covers many use cases with relatively few words
Once you learn the grammar, you can speak freely

A grammar for ETL

nouns:
- objects that inherit from class etl
verbs:
- etl_extract()
- etl_transform()
- etl_load()
adverbs:
- arguments to “verb” functions
predictable, pipeable syntax
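As a small illustration of the grammar, the sketch below runs the three verbs on the `mtcars` example source that ships with the etl package. It assumes that etl, dplyr, dbplyr, and RSQLite are installed. The summary at the end is only an illustrative query, not part of the package.

```r
library(etl)
library(dplyr)

# Noun: an etl object. With no `db` argument, a temporary SQLite
# database is created; with no `dir` argument, files go to a temp directory.
cars <- etl("mtcars")

# Verbs: each step returns the object, so the steps can be piped.
cars %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load()

# Once loaded, the table can be queried lazily with dplyr.
by_cyl <- cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarize(n_cars = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(by_cyl)
```

For sources that span many files, the adverbs (for example `years` and `months` in the Citi Bike call) would be passed as arguments to the same three verbs.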

tidyverse

ETL suite of packages

A CRAN package that provides a framework for ETL
- etl
An ecosystem of dependent packages for other data sources
- macleish
- airlines
- imdb
- nyctaxi
- nyc311
- fec
- citibike
- roll your own!

Example: `airlines`

Create an empty database in MySQL

system("mysql -e 'CREATE DATABASE IF NOT EXISTS airlines;'")

Set up the etl object

library(airlines)

src_db <- src_mysql_cnf("airlines")

ontime <- etl("airlines", db = src_db, dir = "~/dumps/airlines")

Example: `airlines` cont’d

Perform ETL operations

ontime %>%
  etl_extract(years = 1987:2017) %>%
  etl_transform(years = 2001:2010) %>%
  etl_load(years = 2009:2010, months = c(1:3, 5))

May take hours!

Example: `airlines` cont’d

Q: Which airline had the shortest average arrival delay into SEA (loaded months of 2010)?

ontime %>%
  tbl("flights") %>%
  filter(year == 2010, dest == "SEA") %>%
  group_by(carrier) %>%
  summarize(num_flights = n(), 
            begin = min(month), end = max(month), 
            avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(avg_delay)

## # Source:     lazy query [?? x 5]
## # Database:   mysql 5.7.22-0ubuntu0.16.04.1 [root@127.0.0.1:/airlines]
## # Ordered by: avg_delay
##    carrier num_flights begin   end avg_delay
##    <chr>         <dbl> <int> <int>     <dbl>
##  1 UA             2236     1     5  -11.4   
##  2 DL             2598     1     5  -10.3   
##  3 US              941     1     5   -7.61  
##  4 AS            13488     1     5   -5.83  
##  5 B6              498     1     5   -3.39  
##  6 FL              196     1     5   -3.17  
##  7 OO             2747     1     5   -2.08  
##  8 WN             3919     1     5   -2.05  
##  9 AA             1591     1     5   -1.09  
## 10 HA              295     1     5   -0.0949
## # ... with more rows

Noun: an `etl` object is

a dplyr::src_sql object
- provides a connection to a SQL database
- local or remote
- SQLite by default (tempfile)
- Any dbplyr backend supported
- uses DBI::dbWriteTable methods

cars <- etl("mtcars")  # built-in example source; SQLite tempfile by default
class(cars)

## [1] "etl_mtcars" "etl"        "src_dbi"    "src_sql"    "src"

Noun: an `etl` object has

a place to store files
- local storage for raw and processed data
- tempdir by default

summary(ontime)

## files:
##    n     size                              path
## 1 13 0.193 GB  /home/bbaumer/dumps/airlines/raw
## 2 11  0.49 GB /home/bbaumer/dumps/airlines/load

##       Length Class           Mode       
## con   1      MySQLConnection S4         
## disco 2      -none-          environment

Verbs: chaining operations

getS3method("etl_update", "default")

## function(obj, ...) {
##   obj <- obj %>%
##     etl_extract(...) %>%
##     etl_transform(...) %>%
##     etl_load(...)
##   invisible(obj)
## }
## <environment: namespace:etl>

Extending `etl`

Extending etl vignette
create_etl_package()
Sensible default methods
Write (at least one of) 3 methods:
- etl_extract.etl_foo()
- etl_transform.etl_foo()
- etl_load.etl_foo()
Document, push, etc.
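The extension mechanism is ordinary S3 dispatch: generics with default methods, plus class-specific methods that override only the steps that differ. The sketch below mimics that pattern in base R with toy generics. All the `toy_*` names and the `log_step()` helper are hypothetical stand-ins, not the etl package's own functions.

```r
# Toy generics standing in for etl_extract(), etl_transform(), etl_load()
toy_extract   <- function(obj, ...) UseMethod("toy_extract")
toy_transform <- function(obj, ...) UseMethod("toy_transform")
toy_load      <- function(obj, ...) UseMethod("toy_load")

# Helper: record which method ran and return the object invisibly
log_step <- function(obj, step) {
  obj$log <- c(obj$log, step)
  invisible(obj)
}

# Default methods: used whenever a source does not define its own
toy_extract.default   <- function(obj, ...) log_step(obj, "extract: default")
toy_transform.default <- function(obj, ...) log_step(obj, "transform: default")
toy_load.default      <- function(obj, ...) log_step(obj, "load: default")

# A hypothetical source "foo" overrides only the extract step
toy_extract.toy_foo <- function(obj, years = 2018, ...) {
  log_step(obj, paste("extract: foo, years =", paste(years, collapse = ",")))
}

# Chain the three verbs, passing the same arguments to each
toy_update <- function(obj, ...) {
  obj <- toy_extract(obj, ...)
  obj <- toy_transform(obj, ...)
  obj <- toy_load(obj, ...)
  invisible(obj)
}

foo <- structure(list(log = character()), class = c("toy_foo", "toy"))
foo <- toy_update(foo, years = 2014:2015)
print(foo$log)
```

Only the extract step is specific to `toy_foo` here. The other two steps fall through to the defaults, which is why a new source needs to supply only the methods that differ.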

The End

Thanks to the dplyr and rstats-db developers!!
Read the paper (https://arxiv.org/abs/1708.07073)
- Coming soon to Journal of Computational and Graphical Statistics!
Check out these slides: (https://bit.ly/2LzLZtm)

beanumber @BaumerBen

ETL for Medium Data

A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

Four distinct ideas/challenges

Motivation

An example: Citi Bike

Citi Bike: problems?

Citi Bike: research

Citi Bike: data

Citi Bike: set up `citibike` database

Citi Bike: query database

Citi Bike: reproducibility?

2. Reproducibility

Replicability Crisis

Reproducible scholarship

4. Medium data

Data size for the single user

3. ETL

ETL operations

1. Grammar

A grammar?

My solution

A grammar for ETL

ETL suite of packages

Example: `airlines`

Example: `airlines` cont’d

Example: `airlines` cont’d

Noun: an `etl` object is

Noun: an `etl` object has

Verbs: chaining operations

Extending `etl`

The End

Thank you!
