To facilitate our connection to the MySQL database server, we will need to install the
# do NOT install RMySQL if you are on the RStudio Server install.packages("RMySQL")
Goal: by the end of this lab, you will be able to write basic
SELECT queries in SQL and retrieve the results into R.
The data we will be using is stored on a server in Bass Hall. It’s called
scidb.smith.edu. We can connect through the
dbConnect() function provided by the
DBI package (which is loaded automatically when you load
RMySQL). You will also need the
RMySQL package installed.
library(tidyverse) library(RMySQL) <- dbConnect( db MySQL(), host = "scidb.smith.edu", user = "sds192", password = "DSismfc@S", dbname = "imdb" )::opts_chunk$set(connection = db, max.print = 20)knitr
This chunk of code will allow you to connect to
scidb. Note that this creates a database connection object named
db, which has the class
Also, we set the
connection parameter for all future chunks in this R Markdown file. Note also that the
max.print argument sets the maximum number of results printed by each query.
Each of the following chunks makes use of the SQL engine functionality in
knitr. You may want to read about this. Each of the following chunks is an
sql chunk – as opposed to an
To retrieve the results from a query in R, use the
dbGetQuery() function from the
DBI package (which is automatically loaded when you load
RMySQL). Its first argument is a database connection object, and the second argument is an SQL query as a character vector.
This query returns the list of kinds of “movies” stored in the IMDB.
SELECT * FROM kind_type;
Of course, you will often want to store the result of your query as a data frame. This can be achieved by setting the
output.var argument in the chunk. Here we retrieve the list of different types of information stored in the database, save it as a data frame in R, and show the first few rows.
-- output.var="info_types" in this chunk SELECT * FROM info_type;
We now have a
info_types in our environment.
That’s all you need to know about how to get data from MySQL into R. The rest of this lab consists of practicing writing SQL queries. It may be useful to reference the full documentation for
Tip: Practice writing the SQL in the MySQL Workbench first. Once you’ve got the query you want, then copy-and-paste it back into your R Markdown file.
For example, let’s say we wanted to find Ben’s favorite movie, which is (obviously) The Empire Strikes Back.
The titles are stored in the
title field (i.e. column) in the
title table. [Note: your professors are not responsible for naming these tables and fields!] Each row in the
title table corresponds to a single movie, but of course, we need to restrict the rows we retrieve to only those where the
title field equals
The Empire Strikes Back. The following query achieves this.
Note: SQL does not require the
==for testing equality, since you aren’t ever changing the data.
Note: You have to use
'single quotes since you are working within a
SELECT * FROM title WHERE title = 'The Empire Strikes Back';
That retrieved a lot of movies! Let’s see if we can refine our query. First, movies (as opposed to TV episodes, etc.) have the
kind_id value of
SELECT * FROM title WHERE title = 'The Empire Strikes Back' AND kind_id = 1;
Huh. Now we don’t get any results at all. We’ve made our query too specific. It turns out that the actual title of the movie according to IMDB is
Star Wars: Episode V - The Empire Strikes Back. Let’s soften our query by searching for the phrase
The Empire Strikes Back within the title. We can do this using the
LIKE function along with some wildcards (
% in SQL).
SELECT * FROM title WHERE title LIKE '%The Empire Strikes Back%' AND kind_id = 1;
Finally, let’s put those in chronological order.
SELECT t.title, t.production_year FROM title t WHERE t.title LIKE '%The Empire Strikes Back%' AND t.kind_id = 1 ORDER BY production_year;
Find your favorite movie in the
Find Viola Davis’s
id in the
-- SAMPLE SOLUTION SELECT * FROM name WHERE name LIKE '%Davis, Viola%';
Note that she is Viola Davis (I).
In the IMDB, the
title table contains information about movies, the
name table contains the names of people, the
char_name table contains information about the names of characters, and the
cast_info table contains information about which people played which roles in which movies. Linking the tables together is essential in order to extract information from the database.
Since we already know that the ID of The Empire Strikes Back is
4260164, we can use that to find all of the cast assignments.
SELECT * FROM cast_info WHERE movie_id = 4260164;
Note that this returns a list of person-role pairs.
cast_infothat correspond to Viola Davis as an actress.
-- SAMPLE SOLUTION SELECT n.name, ci.role_id FROM cast_info ci JOIN name n ON n.id = ci.person_id WHERE ci.person_id = 2977372;
Next, we can join the
cast_info table on the
name table to recover the names of the people.
SELECT n.name, ci.role_id FROM cast_info ci JOIN name n ON n.id = ci.person_id WHERE movie_id = 4260164;
Note how we have used table aliases to save some typing.
-- SAMPLE SOLUTION SELECT n.name, ci.role_id, cn.name FROM cast_info ci JOIN name n ON n.id = ci.person_id JOIN char_name cn ON cn.id = ci.person_role_id WHERE ci.person_id = 2977372;
Of course, we can join more than two tables together in a single query.
production_year, and the name of the character that she played.
-- SAMPLE SOLUTION SELECT t.title, t.production_year, cn.name FROM cast_info ci JOIN title t ON ci.movie_id = t.id JOIN char_name cn ON cn.id = ci.person_role_id WHERE ci.person_id = 2977372 AND t.kind_id = 1 AND ci.role_id = 2 ORDER BY production_year;
Please respond to the following prompt on Slack in the `#mod-sql channel.
Prompt: Does your understanding of
dplyrhelp with understanding SQL? What questions to you still have about SQL?