class: center, middle, inverse, title-slide # Mini-Lecture 29 ## Database querying with SQL ### Ben Baumer ### SDS 192April 17, 2019(
http://beanumber.github.io/sds192/lectures/29-sql.html
) --- ## Extra Credit talk .pull-left[ - Stephanie Hicks Johns Hopkins Bloomberg School of Public Health Thursday, April 18th, 6 pm Seelye 106 ] .pull-right[  ] --- ## Rookie data viz mistakes - From Di Cook - Professor of Business Analytics, Monash Univ. - Ph.D. in Statistics, Rutgers - Hadley Wickham's dissertation advisor - (http://www.dicook.org/2018/04/14/content/post/2018-04-14-rookie-mistakes/) --- class: center, middle # SQL --- ## A brief primer on SQL .pull-left[  ] .pull-right[ - SQL is not just one thing - MySQL, PostgreSQL, SQLite - Oracle, Big Query, Vertica - Theory developed in 1970s (E.F. Codd) - Ingres implemented mid-1970s - robust, time-tested, well understood ] --- ## We'll be using MySQL .pull-left[  ] .pull-right[ - Arguably the most popular RDBMS - open source - See also: PostgreSQL, SQLite - bought by Sun in 2008, now owned by Oracle - [MariaDB](https://en.wikipedia.org/wiki/MariaDB) is GPL-only fork ] --- ## The thing about SQL is... -- .pull-left[ Since you already know .centered[] ] -- .pull-right[ You **already** know SQL!  ] --- ## Same concepts, different syntax .pull-left[ dplyr ```r table %>% filter(field == "value") %>% left_join(lkup, by = c("lkup_id" = "id") %>% group_by(year) %>% summarize(N = sum(1)) %>% filter(N > 100) %>% arrange(desc(N)) %>% head(10) ``` ] .pull-right[ MySQL ```sql SELECT year, sum(1) as N FROM table t LEFT JOIN lkup l ON t.lkup_id = l.id WHERE field = "value" GROUP BY year HAVING N > 100 ORDER BY N desc LIMIT 0, 10; ``` ] -- - You may have learned `dplyr` first, but... - SQL came first - and it's more widely-used --- ## Why `dplyr` vs. SQL? .pull-left[ R + `dplyr` good at: - fitting models - plotting - wrangling data of all kinds - working with small data - being an *interface to SQL* ] .pull-right[ SQL good at: - storage and retrieval - medium-to-big data - multi-user, asynchronous access - serving institutional needs - web/mobile apps ] --- class: center, middle # An example --- ## *Henry & June (1990)* .pull-left[  ] .pull-right[ - The first film to earn an [NC-17](https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system#X_replaced_by_NC-17) rating from the MPAA was [Henry & June (1990)](https://en.wikipedia.org/wiki/Henry_%26_June). However, that is not the NC-17 film with the earliest `production_year` in the IMDB. - Write a query that returns the list of *feature-length* NC-17 movies. What are the main reasons given for NC-17 ratings? ] --- ## Connecting to MySQL .pull-left[  ] .pull-right[ - Client GUI - [MySQL Workbench](https://dev.mysql.com/downloads/workbench/) - many other GUIs - Command line client - `mysql -h scidb.smith.edu` - R - `DBI`, `dbplyr` - [db.rstudio.com](http://db.smith.edu) ] --- ## Connect (pw: `RememberPi`)  --- ## Solution > Hint: Note that both `info_type = LD certification` and `info_type = mpaa` can contain MPAA ratings. ```sql SELECT t.title, t.production_year, mi.info FROM movie_info mi LEFT JOIN title t ON mi.movie_id = t.id WHERE info_type_id IN (67, 97) AND info LIKE '%NC-17%' AND kind_id = 1 ORDER BY production_year; ``` - Most of the NC-17 ratings seem to be given for explicit sexual content. --- ## English-language films only ```sql SELECT t.title, t.production_year, mi.info as rating, mi2.info as language FROM title t JOIN movie_info mi ON mi.movie_id = t.id *JOIN movie_info mi2 * ON mi2.movie_id = t.id WHERE mi.info_type_id IN (67, 97) AND mi.info LIKE '%NC-17%' AND kind_id = 1 * AND mi2.info_type_id = 4 * AND mi2.info = 'English' ORDER BY production_year; ``` --- ## Practice makes perfect - [Lab 19](../lab-sql.html) - [HW #6](../hw/hw_feedback.html) is posted (due tonight!) - [HW #7](../hw/hw_sql.html) is posted (due Sunday 4/28)