class: center, middle, inverse, title-slide # Database Querying with SQL ## SQL + dplyr integration ### Ben Baumer ### SDS 192April 24, 2020(
http://beanumber.github.io/sds192/lectures/mdsr_sql_07-dplyr.html
) --- class: center, middle, inverse # SQL + dplyr --- ## Ways to interact with MySQL 1. GUI SQL client: MySQL Workbench 2. command line SQL client (next slide) 3. generic SQL access in R: ~~`DBI::dbGetQuery()`~~ 4. `RMySQL` through R Markdown chunks (see labs & hw) 5. through `dplyr` with `dbplyr` .footnote[https://dbplyr.tidyverse.org/] --- ## `dplyr` + SQL connection - `dplyr` can access a SQL database directly - Instead of `tbl_df`, you have a `tbl_sql` -- - Data is stored and processed in SQL - Tiny memory footprint in R -- - **Lazy evaluation** - server-side processing -
: `dplyr` to SQL translation via `show_query()` --- ## Data size for a single user .pull-left[ ![](https://www.ricksdailytips.com/wp-content/uploads/2017/12/crucial-ram-sticks.jpg) ] -- .pull-right[ ![](https://thumbs.dreamstime.com/b/hand-open-hard-disk-drive-13760890.jpg) ] -- .pull-left[ ![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/IBM_Blue_Gene_P_supercomputer.jpg/1200px-IBM_Blue_Gene_P_supercomputer.jpg) ] --- ## Big Data? .footnote[https://en.wikipedia.org/wiki/Big_data] "Size" | size | hardware | software -------|------|----------|----- small | < several GB | RAM | R medium | several GB -- a few TB | hard disk | SQL big | many TB or more | cluster | Spark? --- ## Example: connect to IMDB ```r title <- db %>% tbl("title") class(title) ``` ``` ## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` ```r title ``` ``` ## # Source: table<title> [?? x 12] ## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb] ## id title imdb_index kind_id production_year imdb_id phonetic_code ## <int> <chr> <chr> <int> <int> <int> <chr> ## 1 78460 Adul… <NA> 7 2017 NA A3432 ## 2 70273 (201… <NA> 7 2016 NA <NA> ## 3 60105 (201… <NA> 7 2014 NA <NA> ## 4 32120 (200… <NA> 7 2008 NA <NA> ## 5 97554 Schm… <NA> 7 2001 NA S2543 ## 6 57966 (#1.… <NA> 7 2013 NA <NA> ## 7 76391 Anni… <NA> 7 1971 NA A5162 ## 8 11952 Angu… <NA> 7 2009 NA A5214 ## 9 1554 New … <NA> 7 2003 NA N6452 ## 10 58442 Kiss… <NA> 7 2011 NA K2523 ## # … with more rows, and 5 more variables: episode_of_id <int>, season_nr <int>, ## # episode_nr <int>, series_years <chr>, md5sum <chr> ``` --- ## `title` contains 4.6 million rows, but... - ...it takes up almost no space ```r print(object.size(title), units = "Kb") ``` ``` ## 4.3 Kb ``` -- - `title` looks like a `data.frame` but... ```r nrow(title) ``` ``` ## [1] NA ``` -- ```r class(title) ``` ``` ## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` -- - ...it's not actually a `data.frame` --- ## Example: `show_query()` ```r star_wars <- title %>% filter(title == "Star Wars", kind_id == 1) %>% select(production_year, title) ``` -- ```r star_wars ``` ``` ## # Source: lazy query [?? x 2] ## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb] ## production_year title ## <int> <chr> ## 1 1977 Star Wars ``` -- ```r show_query(star_wars) ``` ``` ## <SQL> ## SELECT `production_year`, `title` ## FROM `title` ## WHERE ((`title` = 'Star Wars') AND (`kind_id` = 1.0)) ``` --- ## Translation of basic functions .footnote[https://dbplyr.tidyverse.org/articles/translation-function.html#known-functions-1] ```r library(dbplyr) translate_sql(ceiling(mpg)) ``` ``` ## <SQL> CEIL(`mpg`) ``` -- ```r translate_sql(mean(mpg)) ``` ``` ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## This warning is displayed only once per session. ``` ``` ## <SQL> AVG(`mpg`) OVER () ``` -- ```r translate_sql(cyl == 4) ``` ``` ## <SQL> `cyl` = 4.0 ``` -- ```r translate_sql(cyl %in% c(4, 6, 8)) ``` ``` ## <SQL> `cyl` IN (4.0, 6.0, 8.0) ``` --- ## Code pass-thru of other functions .footnote[https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions-1] ```r # no PASTE() in SQL translate_sql(paste0("hp", "wt", "vs")) ``` ``` ## <SQL> CONCAT_WS('', 'hp', 'wt', 'vs') ``` -- ```r # works, but no CONCAT() in R translate_sql(CONCAT("hp", "wt", "vs")) ``` ``` ## <SQL> CONCAT('hp', 'wt', 'vs') ``` -- ```r # nonsense translate_sql(CRAZY_FUNCTION(mpg)) ``` ``` ## <SQL> CRAZY_FUNCTION(`mpg`) ``` --- ## Fine-looking R code ```r title %>% filter(title %like% '%Star Wars%', kind_id == 1, !is.na(production_year)) %>% select(title, production_year) %>% arrange(production_year) ``` ``` ## # Source: lazy query [?? x 2] ## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb] ## # Ordered by: production_year ## title production_year ## <chr> <int> ## 1 Star Wars 1977 ## 2 Star Wars: Episode V - The Empire Strikes Back 1980 ## 3 Star Wars Underoos 1980 ## 4 Star Wars: Episode VI - Return of the Jedi 1983 ## 5 Tezukuri no Star Wars 1990 ## 6 Star Wars: Episode I - The Phantom Menace 1999 ## 7 Star Wars Gangsta Rap 2000 ## 8 Star Wars Returns 2001 ## 9 Star Wars: Attack of the Clones - A Jigsaw Puzzle 2002 ## 10 Star Wars Episode V 1/2: The Han Solo Affair 2002 ## # … with more rows ``` --- ## Weird-looking hybrid code ```r title %>% filter(title %like% '%Star Wars%', kind_id == 1, !is.na(production_year)) %>% * mutate(before_dash = SUBSTRING_INDEX(title, '-', 1)) %>% select(before_dash, production_year) %>% arrange(production_year) ``` ``` ## # Source: lazy query [?? x 2] ## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb] ## # Ordered by: production_year ## before_dash production_year ## <chr> <int> ## 1 "Star Wars" 1977 ## 2 "Star Wars: Episode V " 1980 ## 3 "Star Wars Underoos" 1980 ## 4 "Star Wars: Episode VI " 1983 ## 5 "Tezukuri no Star Wars" 1990 ## 6 "Star Wars: Episode I " 1999 ## 7 "Star Wars Gangsta Rap" 2000 ## 8 "Star Wars Returns" 2001 ## 9 "Star Wars: Attack of the Clones " 2002 ## 10 "Star Wars Episode V 1/2: The Han Solo Affair" 2002 ## # … with more rows ``` --- ## Pipelines are evaluated lazily ```r old_movies <- title %>% filter( production_year < 1960, kind_id == 1 ) class(old_movies) ``` ``` ## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` ```r dim(old_movies) ``` ``` ## [1] NA 12 ``` ```r print(object.size(old_movies), units = "Kb") ``` ``` ## 7.4 Kb ``` --- ## Use `collect()` to bring into R ```r old_movies_local <- old_movies %>% * collect() class(old_movies_local) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` ```r dim(old_movies_local) ``` ``` ## [1] 215234 12 ``` ```r print(object.size(old_movies_local), units = "Mb") ``` ``` ## 48.8 Mb ``` --- ## Lazy evaluation - Pipelines that don't ask for data - evaluated fast - take up almost no space in memory - Processing triggered when data is needed - `print()`, `head()`, `glimpse()`, etc. - plotting - `collect()`