ETL for Medium Data

Ben Baumer

A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data





Benjamin S. Baumer
https://bit.ly/2rGrW0p
Symposium on Statistics and Data Science
May 17th, 2018

Four distinct ideas/challenges

  1. A Grammar for
  2. Reproducible and Painless
  3. Extract-Transform-Load Operations
  4. on Medium Data

Motivation

An example: Citi Bike

Citi Bike: problems?

Citi Bike: research

Citi Bike: data

We obtained bike usage statistics for April, May, June and July 2014 from Citi Bike’s website (https://www.citibikenyc.com/system-data). This dataset contains start station id, end station id, station latitude, station longitude and trip time for each bike trip. 332 bike stations have one or more originating bike trips. 253 of these are in Manhattan while 79 are in Brooklyn (left panel of Figure 1). We processed this raw data to get the number of bike trips between each station pair during morning rush hours.
–Singhvi, et al. (2015)

Citi Bike: set up citibike database

library(citibike)

bikes <- etl("citibike", 
             dir = "~/dumps/citibike/",
             db = src_mysql_cnf("citibike"))
bikes %>%
  etl_update(years = 2014, months = 4:7)

Citi Bike: query database

trips <- bikes %>%
  tbl("trips")

trips %>%
  group_by(Start_Station_ID) %>%
  summarize(num_trips = n()) %>%
  filter(num_trips >= 1) %>%
  arrange(desc(num_trips)) %>%
  collect()
## # A tibble: 332 x 2
##    Start_Station_ID num_trips
##               <int>     <dbl>
##  1              519     50316
##  2              521     49511
##  3              293     45391
##  4              497     45154
##  5              426     40046
##  6              435     37542
##  7              285     36074
##  8              499     33849
##  9              151     33776
## 10              444     33663
## # ... with 322 more rows

Citi Bike: reproducibility?

2. Reproducibility

Replicability Crisis

Reproducible scholarship

An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that pro- duced the result.
–Claerbout, 1994

4. Medium data

Data size for the single user

“Size” size hardware software
small < several GB RAM R
medium several GB – a few TB hard disk SQL
big many TB or more cluster Spark?

3. ETL

ETL operations

1. Grammar

A grammar?

In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language.
Wikipedia

My solution

A grammar for ETL

tidyverse

ETL suite of packages

Example: airlines

system("mysql -e 'CREATE DATABASE IF NOT EXISTS airlines;'")
library(airlines)

src_db <- src_mysql_cnf("airlines", groups = "aws")

ontime <- etl("airlines", db = src_db, dir = "~/dumps/airlines") 

Example: airlines cont’d

ontime %>%
  etl_extract(years = 1987:2017) %>%
  etl_transform(years = 2001:2010) %>%
  etl_load(years = 2009:2010, months = c(1:3, 5))

Example: airlines cont’d

ontime %>%
  tbl("flights") %>%
  filter(year == 2010, dest == "IAD") %>%
  group_by(carrier) %>%
  summarize(num_flights = n(), 
            begin = min(month), end = max(month), 
            avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(avg_delay)
## # Source:     lazy query [?? x 5]
## # Database:   mysql 5.7.22-0ubuntu0.16.04.1 [bbaumer@127.0.0.1:/airlines]
## # Ordered by: avg_delay
##    carrier num_flights begin   end avg_delay
##    <chr>         <dbl> <int> <int>     <dbl>
##  1 US               30     1     5    -7.27 
##  2 UA             7800     1     5    -1.30 
##  3 OO                6     1     1     0.667
##  4 EV             1931     1     5     3.16 
##  5 YV             4749     1     5     4.80 
##  6 XE             3169     1     5     5.58 
##  7 B6             1769     1     5     5.86 
##  8 OH              273     1     5     6.17 
##  9 WN             1190     1     5     7.13 
## 10 DL              853     1     5     7.53 
## # ... with more rows

Noun: an etl object is

class(cars)
## [1] "etl_mtcars" "etl"        "src_dbi"    "src_sql"    "src"

Noun: an etl object has

summary(ontime)
## files:
##    n     size                              path
## 1 13 0.193 GB  /home/bbaumer/dumps/airlines/raw
## 2 11  0.49 GB /home/bbaumer/dumps/airlines/load
##       Length Class           Mode       
## con   1      MySQLConnection S4         
## disco 2      -none-          environment

Verbs: chaining operations

getS3method("etl_update", "default")
## function(obj, ...) {
##   obj <- obj %>%
##     etl_extract(...) %>%
##     etl_transform(...) %>%
##     etl_load(...)
##   invisible(obj)
## }
## <environment: namespace:etl>

Extending etl






The End

beanumber beanumber     BaumerBen @BaumerBen

Thank you!