+ - 0:00:00
Notes for current slide
Notes for next slide

Database Querying with SQL

SQL + dplyr integration

Ben Baumer

SDS 192
April 24, 2020
(http://beanumber.github.io/sds192/lectures/mdsr_sql_07-dplyr.html)

1 / 16

SQL + dplyr

2 / 16

Ways to interact with MySQL

  1. GUI SQL client: MySQL Workbench

  2. command line SQL client (next slide)

  3. generic SQL access in R: DBI::dbGetQuery()

  4. RMySQL through R Markdown chunks (see labs & hw)

  5. through dplyr with dbplyr

https://dbplyr.tidyverse.org/

3 / 16

dplyr + SQL connection

  • dplyr can access a SQL database directly

  • Instead of tbl_df, you have a tbl_sql

4 / 16

dplyr + SQL connection

  • dplyr can access a SQL database directly

  • Instead of tbl_df, you have a tbl_sql

  • Data is stored and processed in SQL

    • Tiny memory footprint in R
4 / 16

dplyr + SQL connection

  • dplyr can access a SQL database directly

  • Instead of tbl_df, you have a tbl_sql

  • Data is stored and processed in SQL

    • Tiny memory footprint in R
  • Lazy evaluation
    • server-side processing
    • : dplyr to SQL translation via show_query()
4 / 16

Data size for a single user

5 / 16

Data size for a single user

5 / 16

Data size for a single user

5 / 16

Big Data?

https://en.wikipedia.org/wiki/Big_data

"Size" size hardware software
small < several GB RAM R
medium several GB -- a few TB hard disk SQL
big many TB or more cluster Spark?
6 / 16

Example: connect to IMDB

title <- db %>%
tbl("title")
class(title)
## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
title
## # Source: table<title> [?? x 12]
## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb]
## id title imdb_index kind_id production_year imdb_id phonetic_code
## <int> <chr> <chr> <int> <int> <int> <chr>
## 1 78460 Adul… <NA> 7 2017 NA A3432
## 2 70273 (201… <NA> 7 2016 NA <NA>
## 3 60105 (201… <NA> 7 2014 NA <NA>
## 4 32120 (200… <NA> 7 2008 NA <NA>
## 5 97554 Schm… <NA> 7 2001 NA S2543
## 6 57966 (#1.… <NA> 7 2013 NA <NA>
## 7 76391 Anni… <NA> 7 1971 NA A5162
## 8 11952 Angu… <NA> 7 2009 NA A5214
## 9 1554 New … <NA> 7 2003 NA N6452
## 10 58442 Kiss… <NA> 7 2011 NA K2523
## # … with more rows, and 5 more variables: episode_of_id <int>, season_nr <int>,
## # episode_nr <int>, series_years <chr>, md5sum <chr>
7 / 16

title contains 4.6 million rows, but...

  • ...it takes up almost no space
print(object.size(title), units = "Kb")
## 4.3 Kb
8 / 16

title contains 4.6 million rows, but...

  • ...it takes up almost no space
print(object.size(title), units = "Kb")
## 4.3 Kb
  • title looks like a data.frame but...
nrow(title)
## [1] NA
8 / 16

title contains 4.6 million rows, but...

  • ...it takes up almost no space
print(object.size(title), units = "Kb")
## 4.3 Kb
  • title looks like a data.frame but...
nrow(title)
## [1] NA
class(title)
## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
8 / 16

title contains 4.6 million rows, but...

  • ...it takes up almost no space
print(object.size(title), units = "Kb")
## 4.3 Kb
  • title looks like a data.frame but...
nrow(title)
## [1] NA
class(title)
## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
  • ...it's not actually a data.frame
8 / 16

Example: show_query()

star_wars <- title %>%
filter(title == "Star Wars", kind_id == 1) %>%
select(production_year, title)
9 / 16

Example: show_query()

star_wars <- title %>%
filter(title == "Star Wars", kind_id == 1) %>%
select(production_year, title)
star_wars
## # Source: lazy query [?? x 2]
## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb]
## production_year title
## <int> <chr>
## 1 1977 Star Wars
9 / 16

Example: show_query()

star_wars <- title %>%
filter(title == "Star Wars", kind_id == 1) %>%
select(production_year, title)
star_wars
## # Source: lazy query [?? x 2]
## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb]
## production_year title
## <int> <chr>
## 1 1977 Star Wars
show_query(star_wars)
## <SQL>
## SELECT `production_year`, `title`
## FROM `title`
## WHERE ((`title` = 'Star Wars') AND (`kind_id` = 1.0))
9 / 16

Translation of basic functions

https://dbplyr.tidyverse.org/articles/translation-function.html#known-functions-1

library(dbplyr)
translate_sql(ceiling(mpg))
## <SQL> CEIL(`mpg`)
10 / 16

Translation of basic functions

https://dbplyr.tidyverse.org/articles/translation-function.html#known-functions-1

library(dbplyr)
translate_sql(ceiling(mpg))
## <SQL> CEIL(`mpg`)
translate_sql(mean(mpg))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`mpg`) OVER ()
10 / 16

Translation of basic functions

https://dbplyr.tidyverse.org/articles/translation-function.html#known-functions-1

library(dbplyr)
translate_sql(ceiling(mpg))
## <SQL> CEIL(`mpg`)
translate_sql(mean(mpg))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`mpg`) OVER ()
translate_sql(cyl == 4)
## <SQL> `cyl` = 4.0
10 / 16

Translation of basic functions

https://dbplyr.tidyverse.org/articles/translation-function.html#known-functions-1

library(dbplyr)
translate_sql(ceiling(mpg))
## <SQL> CEIL(`mpg`)
translate_sql(mean(mpg))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`mpg`) OVER ()
translate_sql(cyl == 4)
## <SQL> `cyl` = 4.0
translate_sql(cyl %in% c(4, 6, 8))
## <SQL> `cyl` IN (4.0, 6.0, 8.0)
10 / 16

Code pass-thru of other functions

https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions-1

# no PASTE() in SQL
translate_sql(paste0("hp", "wt", "vs"))
## <SQL> CONCAT_WS('', 'hp', 'wt', 'vs')
11 / 16

Code pass-thru of other functions

https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions-1

# no PASTE() in SQL
translate_sql(paste0("hp", "wt", "vs"))
## <SQL> CONCAT_WS('', 'hp', 'wt', 'vs')
# works, but no CONCAT() in R
translate_sql(CONCAT("hp", "wt", "vs"))
## <SQL> CONCAT('hp', 'wt', 'vs')
11 / 16

Code pass-thru of other functions

https://dbplyr.tidyverse.org/articles/translation-function.html#unknown-functions-1

# no PASTE() in SQL
translate_sql(paste0("hp", "wt", "vs"))
## <SQL> CONCAT_WS('', 'hp', 'wt', 'vs')
# works, but no CONCAT() in R
translate_sql(CONCAT("hp", "wt", "vs"))
## <SQL> CONCAT('hp', 'wt', 'vs')
# nonsense
translate_sql(CRAZY_FUNCTION(mpg))
## <SQL> CRAZY_FUNCTION(`mpg`)
11 / 16

Fine-looking R code

title %>%
filter(title %like% '%Star Wars%',
kind_id == 1,
!is.na(production_year)) %>%
select(title, production_year) %>%
arrange(production_year)
## # Source: lazy query [?? x 2]
## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb]
## # Ordered by: production_year
## title production_year
## <chr> <int>
## 1 Star Wars 1977
## 2 Star Wars: Episode V - The Empire Strikes Back 1980
## 3 Star Wars Underoos 1980
## 4 Star Wars: Episode VI - Return of the Jedi 1983
## 5 Tezukuri no Star Wars 1990
## 6 Star Wars: Episode I - The Phantom Menace 1999
## 7 Star Wars Gangsta Rap 2000
## 8 Star Wars Returns 2001
## 9 Star Wars: Attack of the Clones - A Jigsaw Puzzle 2002
## 10 Star Wars Episode V 1/2: The Han Solo Affair 2002
## # … with more rows
12 / 16

Weird-looking hybrid code

title %>%
filter(title %like% '%Star Wars%',
kind_id == 1,
!is.na(production_year)) %>%
mutate(before_dash = SUBSTRING_INDEX(title, '-', 1)) %>%
select(before_dash, production_year) %>%
arrange(production_year)
## # Source: lazy query [?? x 2]
## # Database: mysql 5.5.58-0ubuntu0.14.04.1-log [sds192@scidb.smith.edu:/imdb]
## # Ordered by: production_year
## before_dash production_year
## <chr> <int>
## 1 "Star Wars" 1977
## 2 "Star Wars: Episode V " 1980
## 3 "Star Wars Underoos" 1980
## 4 "Star Wars: Episode VI " 1983
## 5 "Tezukuri no Star Wars" 1990
## 6 "Star Wars: Episode I " 1999
## 7 "Star Wars Gangsta Rap" 2000
## 8 "Star Wars Returns" 2001
## 9 "Star Wars: Attack of the Clones " 2002
## 10 "Star Wars Episode V 1/2: The Han Solo Affair" 2002
## # … with more rows
13 / 16

Pipelines are evaluated lazily

old_movies <- title %>%
filter(
production_year < 1960,
kind_id == 1
)
class(old_movies)
## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
dim(old_movies)
## [1] NA 12
print(object.size(old_movies), units = "Kb")
## 7.4 Kb
14 / 16

Use collect() to bring into R

old_movies_local <- old_movies %>%
collect()
class(old_movies_local)
## [1] "tbl_df" "tbl" "data.frame"
dim(old_movies_local)
## [1] 215234 12
print(object.size(old_movies_local), units = "Mb")
## 48.8 Mb
15 / 16

Lazy evaluation

  • Pipelines that don't ask for data

    • evaluated fast
    • take up almost no space in memory
  • Processing triggered when data is needed

    • print(), head(), glimpse(), etc.
    • plotting
    • collect()
16 / 16

SQL + dplyr

2 / 16
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow