Database querying (and spatial data)

Learning Goal

To write database queries in SQL
To use spatial data to make informative maps in R

Readings

Ch. 12–14 of Modern Data Science with R

Mini-Project

You may work with a partner or two (post groups to #mp3) to address a data science question of interest. You will use SQL to query a database and possible sf to plot spatial data.

Your deliverable will be a readable blog post written in R Markdown. The final submission should be production-quality, replete with hyperlinks, images, data tables and/or graphics, and correct spelling and grammar. Please use the template provided by the sds package!

Your project must use SQL to query data from scidb.

The argument you make in your article must be supported by data stored on scidb. You may do your data wrangling work in R, but you should use SQL first to retrieve data as efficiently as possible. Your article must include at least one data table or data graphic.

Your piece should adhere to some guidelines of good journalism for an online platform. The first key is to have a good subject that people want to read about. Ideally, find a news “hook” or “peg”—a recent news event or theme under current discussion to which you can peg your piece (think #metoo, #blacklivesmatter, or equivalent current discourse). A current or upcoming movie is also a good hook. Also, your piece should:

Employ language that is readable and clear without jargon, and with any technical terms either rephrased or explained.
Keep paragraphs short.
Have a strong headline or title that will pull people in.
Include hyperlinks to any relevant recent pieces on the web.

Your project will use data stored on scidb in at least one of the six databases listed below. You will write at least one query in SQL that will pull in data relevant to your question. Examination of that data will inform your response to the question, and you will then formulate your arguments accordingly. Recall that (as always) communication is a critical component of data science, so details like axis labels, figure captions, spelling, and grammar, are just as important as writing your queries correctly and making a logical argument.

Be extra careful when writing your queries! Just because the query executes without an error does not mean that it will return exactly what you want. The computer is dumb—it just carries out instructions. You are smart—it’s your job to translate your ideas into a syntax that the computer can understand. Know your data!!!

Data

Data are available on scidb.smith.edu.

library(tidyverse)
library(RMySQL)
db <- dbConnect(MySQL(), 
                host = "scidb.smith.edu", 
                user = "sds192", 
                password = "DSismfc@S", 
                dbname = "imdb")
knitr::opts_chunk$set(connection = db, max.print = 20)

SHOW DATABASES;

8 records
Database
information_schema
airlines
citibike
fec
imdb
lahman
nyctaxi
yelp

Databases

Your project should focus on one of the following databases:

airlines: on-time flight data from the Bureau of Transportation Statistics
citibike: trip-level data from New York City’s municipal bike rental system
fec: campaign finance data from the Federal Election Commission
imdb: a copy of the Internet Movie Database
lahman: historical season-level baseball statistics
nyctaxi: ride-level data from New York City’s Taxi & Limousine Commission

Most of these databases contain data that has spatial features. For example, airlines, citibike, and nyctaxi contain lat-long coordinates that can be coverted into sf objects. fec, imdb, and lahman contain addresses, countries, or states that can be plotted spatially. Your project need not make use of spatial data, but of course it might be richer if it does.

Rubric

There are 20 points available on this Mini-Project.

Baseline (C level work)

+2 for having an .Rmd that compiles without errors
+1 unnecessary messages from R are hidden from being displayed in the HTML
-1 for installing packages in your Markdown file
+1 for using code folding to keep your code hidden, but still make it available.
+1 for readable, well-documented code (i.e. appropriate comments, using line breaks, breaking code up into logical chunks, following a style guide)
+1 for including a catchy and/or engaging title
+1 for including a problem statement (what are you trying to figure out?)
+2 for using data from more than one SQL table in your analysis
+2 for creating and referencing a data table or graphic to support your findings
+1 for having at least 800 words and no more than 1200 words

Average (B level work)

+1 for using a GitHub repository and including a link to it in the footnotes
+1 for using appropriate HTML features (e.g., hyperlinks, images, embedded videos, footnotes, etc.)
+1 for joining multiple tables using SQL
+1 for including relevant external information to your analysis (e.g. important cultural context that the reader might not have, historical events that could be related to what you found, etc.)

Advanced (A level work)

+1 for creating an appropriate spatial data graphic (e.g., leaflet or ggplot2 graphic that uses an sf spatial object)
+1 for including a (well-supported!) alternate hypothesis / counter-argument
+0–2 reflects the professor’s judgment of the overall quality of your submission
- 0: graphics are relatively simple, confusing, sloppily executed, or not well thought out; analysis is muddled or uninformed; queries are inefficient
- 1: a solid all-around effort, graphics are appropriate, labeled, and contextualized; analysis is reasonable, considered, and well-researched; queries are reasonably prescise and efficient
- 2: graphics are special, showing creativity, an original design, and/or exceptional attention to detail; analysis is exceptionally thorough and well thought out; queries are cleverly precise and efficient