class: center, middle, inverse, title-slide

# SQL Bootcamp
## short course
### Ben Baumer
### June 17, 2019

---
class: center, middle

# SQL

---

## A brief primer on SQL

.pull-left[

]

.pull-right[
- SQL is not just one thing
  - MySQL, PostgreSQL, SQLite
  - Oracle, BigQuery, Vertica
- Theory developed in 1970s (E.F. Codd)
- Ingres implemented mid-1970s
- robust, time-tested, well understood
]

---

## We'll be using PostgreSQL

.pull-left[

]

.pull-right[
- Arguably the most "advanced" RDBMS
- See also: MySQL, SQLite
- Big Data SQL variants:
  - Vertica, Google BigQuery, Apache Spark
]

---

## The thing about SQL is...

--

.pull-left[
Since you already know

.centered[]
]

--

.pull-right[
You **already** know SQL!

]

---

## Same concepts, different syntax

.pull-left[
dplyr

```r
table %>%
  filter(field == "value") %>%
  left_join(lkup, by = c("lkup_id" = "id")) %>%
  group_by(year) %>%
  summarize(N = sum(1)) %>%
  filter(N > 100) %>%
  arrange(desc(N)) %>%
  head(10)
```
]

.pull-right[
MySQL

```sql
SELECT year, sum(1) AS N
FROM table t
LEFT JOIN lkup l ON t.lkup_id = l.id
WHERE field = "value"
GROUP BY year
HAVING N > 100
ORDER BY N DESC
LIMIT 0, 10;
```
]

--

- You may have learned `dplyr` first, but...
- SQL came first
- and it's more widely used

---

## Why `dplyr` vs. SQL?

.pull-left[
R + `dplyr` good at:

- fitting models
- plotting
- wrangling data of all kinds
- working with small data
- being an *interface to SQL*
]

.pull-right[
SQL good at:

- storage and retrieval
- medium-to-big data
- multi-user, asynchronous access
- serving institutional needs
- web/mobile apps
]

---

# Connecting dplyr to SQL

---

## `dplyr` + SQL connection

- `dplyr` can access a SQL database directly
- Instead of a `tbl_df`, you have a `tbl_sql`
- Data is stored and processed in SQL
  - Tiny memory footprint in R
- **Lazy evaluation**
  - server-side processing
- `dplyr` to SQL translation via `show_query()`
- See [`dbplyr` package](https://github.com/tidyverse/dbplyr)

---

## Example: `tbl_sql`

```r
db <- src_postgres(
  dbname = "imdb", host = "localhost", port = 5432,
  user = "postgres", password = "postgres"
)
movies <- tbl(db, "movies")
class(movies)
```

```
## [1] "tbl_PostgreSQLConnection" "tbl_dbi"
## [3] "tbl_sql"                  "tbl_lazy"
## [5] "tbl"
```

```r
print(object.size(movies), units = "Kb")
```

```
## 3.5 Kb
```

---

## `tbl_sql` works just like a `tbl_df`

```r
movies
```

```
## # Source:   table<movies> [?? x 5]
## # Database: postgres 9.5.17 [postgres@localhost:5432/imdb]
##         id title                      production_year  votes rating
##      <int> <chr>                                <int>  <int>  <dbl>
##  1 1696072 (500) Days of Summer                  2009 237877    7.8
##  2 1697216 10 Things I Hate About You            1999 141525    7
##  3 1697262 10,000 BC                             2008  82065    4.9
##  4 1698059 12 Angry Men                          1957 251500    8.9
##  5 1698236 127 Hours                             2010 184672    7.7
##  6 1698327 13 Going on 30                        2004  75575    6
##  7 1698534 1408                                  2007 149826    6.8
##  8 1698721 16 Blocks                             2006  88119    6.6
##  9 1698796 17 Again                              2009  90147    6.3
## 10 1700353 2 Fast 2 Furious                      2003 107460    5.5
## # … with more rows
```

---

## Example: `show_query()`

```r
movies %>%
  filter(title == 'Star Wars', kind_id == 1) %>%
  select(production_year, title) %>%
  show_query()
```

```
## <SQL>
## SELECT "production_year", "title"
## FROM "movies"
## WHERE (("title" = 'Star Wars') AND ("kind_id" = 1.0))
```

---
class: center, middle

# Common questions

---
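## How do I get query results back into R?

A minimal sketch of forcing evaluation, assuming the `movies` table from the earlier connection. `collect()` is the `dplyr` verb that runs the query on the server and returns an ordinary local tibble:

```r
library(dplyr)

star_wars <- movies %>%
  filter(title == 'Star Wars', kind_id == 1) %>%
  select(production_year, title) %>%
  collect()        # executes on PostgreSQL, copies the result into R

class(star_wars)   # now a local tbl_df, no longer lazy
```

Until something forces evaluation (printing, `collect()`), the query isn't even sent to the server.

---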
## Do we need pgAdmin?

- Short answer: No
- Long answer: I find it helpful for:
  - exploring databases with a GUI
  - fiddling around with queries
  - more informative error messages
  - built-in help
  - adding keys and indexes (more on that later)
- then copy-and-paste working queries into R Markdown

---

## Ways to interact with PostgreSQL

1. GUI SQL client: pgAdmin
2. command line SQL client (next slide)
3. generic SQL access in R: ~~`DBI::dbGetQuery()`~~
4. `RPostgreSQL` through R Markdown chunks (see labs & hw)
5. Lots of other ways...

--

I prefer:

- (1) for SQL development
- (4) for SQL reporting

---

## What's the deal with table aliases?

Consider:

```sql
SELECT title.title, company_name.name, title.production_year
FROM movie_companies
JOIN company_name ON movie_companies.company_id = company_name.id
JOIN title ON title.id = movie_companies.movie_id
WHERE company_name.name LIKE '%All Rise Films%'
  AND title.kind_id = 1
  AND movie_companies.company_type_id = 2;
```

---

## What's the deal with table aliases?

Instead:

```sql
SELECT t.title, cn.name, t.production_year
FROM movie_companies AS mc
JOIN company_name AS cn ON mc.company_id = cn.id
JOIN title AS t ON t.id = mc.movie_id
WHERE cn.name LIKE '%All Rise Films%'
  AND t.kind_id = 1
  AND mc.company_type_id = 2;
```

---

## People in the IMDB

- `name`: people and their names
  - `id` column is primary key
- `person_id` column references `name.id` from:
  - `person_info`
  - `cast_info`
  - `aka_name`

---

## Movies in the IMDB

- `title`: movies and their titles
  - `id` column is primary key
- `movie_id` column references `title.id` from:
  - `cast_info`
  - `aka_title`
  - `movie_companies`
  - `movie_info`
  - `movie_info_idx`
  - `movie_keyword`
  - `movie_link`

---
background-image: url(https://i.imgur.com/pDq0n.png)
background-size: contain

---
class: center, middle

# Keys and Indexes

---
background-image: url(http://static.libsyn.com/p/assets/c/2/1/b/c21be1062e52949b/Unused-Indexes-vs-Foreign-Keys.png)
background-size: contain

---

## Keys and Indexes

- Keys:
  - `PRIMARY KEY`: unique, non-`NULL`, only one per table
  - `UNIQUE KEY`: unique, may include `NULL`
  - `FOREIGN KEY`: references primary key in another table
- Indexes:
  - No constraints, just about speed
  - Take up space on disk
  - Will they be used?
  - A `PRIMARY KEY` is always indexed

---
background-image: url(https://i.imgur.com/pDq0n.png)
background-size: contain

---

## A DB without indexes is like Dory

.centered[]

---

## An index is a lookup table

.centered[]

---

## So why not just build indexes on all the columns?

- Takes up space on disk
- Takes time to build
- Slows down inserts
- Not always that big of an improvement

---
background-image: url(http://static.libsyn.com/p/assets/c/2/1/b/c21be1062e52949b/Unused-Indexes-vs-Foreign-Keys.png)
background-size: contain
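
---

## Example: building an index from R

A minimal sketch of creating one index and then checking whether the planner actually uses it. It assumes the same `imdb` database and credentials as before; the column choice and the index name `idx_title_title` are made up for illustration:

```r
library(DBI)
library(RPostgreSQL)

# a plain DBI connection, same credentials as the dplyr example
con <- dbConnect(
  PostgreSQL(),
  dbname = "imdb", host = "localhost", port = 5432,
  user = "postgres", password = "postgres"
)

# index the column we filter on most often
dbExecute(con, "CREATE INDEX idx_title_title ON title (title);")

# ask PostgreSQL how it would run a query that could use the index
dbGetQuery(con, "EXPLAIN SELECT * FROM title WHERE title = 'Star Wars';")

dbDisconnect(con)
```

If the plan reports an *Index Scan* rather than a *Seq Scan*, the new index is being used.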