Ben Baumer

Benjamin S. Baumer https://bit.ly/2LzLZtm Joint Statistical Meetings July 31st, 2018

We obtained bike usage statistics for April, May, June and July 2014 from Citi Bike’s website (https://www.citibikenyc.com/system-data). This dataset contains start station id, end station id, station latitude, station longitude and trip time for each bike trip. 332 bike stations have one or more originating bike trips. 253 of these are in Manhattan while 79 are in Brooklyn (left panel of Figure 1). We processed this raw data to get the number of bike trips between each station pair during morning rush hours. –Singhvi, et al. (2015)
citibike databasebikes objectlibrary(citibike)
bikes <- etl("citibike",
dir = "~/dumps/citibike/",
db = src_mysql_cnf("citibike"))bikes %>%
etl_update(years = 2014, months = 4:7)trips <- bikes %>%
tbl("trips")
trips %>%
group_by(Start_Station_ID) %>%
summarize(num_trips = n()) %>%
filter(num_trips >= 1) %>%
arrange(desc(num_trips)) %>%
collect()## # A tibble: 332 x 2
## Start_Station_ID num_trips
## <int> <dbl>
## 1 519 50316
## 2 521 49511
## 3 293 45391
## 4 497 45154
## 5 426 40046
## 6 435 37542
## 7 285 36074
## 8 499 33849
## 9 151 33776
## 10 444 33663
## # ... with 322 more rows
An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that pro- duced the result. –Donoho (2010) paraphrasing Claerbout (1994)

![]()
![]()
| “Size” | size | hardware | software |
|---|---|---|---|
| small | < several GB | RAM | R |
| medium | several GB – a few TB | hard disk | SQL |
| big | many TB or more | cluster | Spark? |
In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language. –Wikipedia
ggplot2 (Wickham, 2009)dplyr: a grammar of data manipulation (Wickham & Francois, 2016)etletl_extract()etl_transform()etl_load()

airlinessystem("mysql -e 'CREATE DATABASE IF NOT EXISTS airlines;'")etl objectlibrary(airlines)
src_db <- src_mysql_cnf("airlines")
ontime <- etl("airlines", db = src_db, dir = "~/dumps/airlines") airlines cont’dontime %>%
etl_extract(years = 1987:2017) %>%
etl_transform(years = 2001:2010) %>%
etl_load(years = 2009:2010, months = c(1:3, 5))airlines cont’dontime %>%
tbl("flights") %>%
filter(year == 2010, dest == "SEA") %>%
group_by(carrier) %>%
summarize(num_flights = n(),
begin = min(month), end = max(month),
avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(avg_delay)## # Source: lazy query [?? x 5]
## # Database: mysql 5.7.22-0ubuntu0.16.04.1 [root@127.0.0.1:/airlines]
## # Ordered by: avg_delay
## carrier num_flights begin end avg_delay
## <chr> <dbl> <int> <int> <dbl>
## 1 UA 2236 1 5 -11.4
## 2 DL 2598 1 5 -10.3
## 3 US 941 1 5 -7.61
## 4 AS 13488 1 5 -5.83
## 5 B6 498 1 5 -3.39
## 6 FL 196 1 5 -3.17
## 7 OO 2747 1 5 -2.08
## 8 WN 3919 1 5 -2.05
## 9 AA 1591 1 5 -1.09
## 10 HA 295 1 5 -0.0949
## # ... with more rows
etl object isdplyr::src_sql object
tempfile)dbplyr backend supportedDBI::dbWriteTable methodsclass(cars)## [1] "etl_mtcars" "etl" "src_dbi" "src_sql" "src"
etl object hastempdir by defaultsummary(ontime)## files:
## n size path
## 1 13 0.193 GB /home/bbaumer/dumps/airlines/raw
## 2 11 0.49 GB /home/bbaumer/dumps/airlines/load
## Length Class Mode
## con 1 MySQLConnection S4
## disco 2 -none- environment
getS3method("etl_update", "default")## function(obj, ...) {
## obj <- obj %>%
## etl_extract(...) %>%
## etl_transform(...) %>%
## etl_load(...)
## invisible(obj)
## }
## <environment: namespace:etl>
etletl vignettecreate_etl_package()default methodsetl_extract.foo()etl_transform.foo()etl_load.foo()
dplyr and rstats-db developers!!