Database querying

Learning Goal

To write database queries in SQL

Readings

Ch. 15–16 of Modern Data Science with R

Mini-Project

In groups of three, you will craft a data-driven article of roughly 300–500 words. Your deliverable will be a readable blog post written in R Markdown. The final submission should be production-quality, replete with hyperlinks, images, data tables and/or graphics, and correct spelling and grammar.

You project must use SQL to query data from scidb.

The argument you make in your article must be supported by data stored on scidb. You may do some of your data wrangling work in R, but you should use SQL first to retrieve data as efficiently as possible. Your article must include at least one data table or data graphic.

Your piece should:

Employ language that is readable and clear without jargon, and with any technical terms either rephrased or explained.
Keep paragraphs short.
Include a counterargument or other/opposite perspective, often about 2/3rds of the way through.
Have a strong headline or title that will pull people in.
Include hyperlinks, if applicable, to any relevant recent pieces on the web.

Your project will use data stored on scidb in at least one of the six databases listed below. You will write one or more queries in SQL that will pull in data relevant to your question. Examination of that data will inform your response to the question, and you will then formulate your arguments accordingly. Recall that (as always) communication is a critical component of data science, so details like axis labels, figure captions, spelling, and grammar, are just as important as writing your queries correctly and making a logical argument.

Warning

Be extra careful when writing your queries! Just because the query executes without an error does not mean that it will return exactly what you want. The computer is dumb—it just carries out instructions. You are smart—it’s your job to translate your ideas into a syntax that the computer can understand. Know your data!!!

Data

Data are available on scidb.smith.edu.

library(tidyverse)
library(RMySQL)
db <- dbConnect(
    MySQL(), 
    host = "scidb.smith.edu", 
    user = "sds192", 
    password = "DSismfc@S", 
    dbname = "imdb"
)
knitr::opts_chunk$set(connection = db, max.print = 20)

SHOW DATABASES;

8 records
Database
information_schema
airlines
citibike
fec
imdb
lahman
nyctaxi
yelp

Databases

Your project should focus on one of the following databases:

airlines: on-time flight data from the Bureau of Transportation Statistics (raw data documentation)
citibike: trip-level data from New York City’s municipal bike rental system (raw data documentation)
fec: campaign finance data from the Federal Election Commission (raw data documentation)
imdb: a copy of the Internet Movie Database (raw data documentation)
lahman: historical season-level baseball statistics (raw data documentation)
nyctaxi: ride-level data from New York City’s Taxi & Limousine Commission (raw data documentation)
yelp: restaurant reviews in the Phoenix, AZ area (raw data documentation)

Scope

Due to the compressed time during the Interterm, you may limit the scope of your analysis to results that can be achieved via a single query. In many cases, a simple table of results will be sufficient—you should not feel obligated to create a fancy data graphic.

Submission

Push all commits to the appropriate repository in our private GitHub organization.
Submit rendered .html file to Moodle

Standards

In this assignment:

You must attempt the following standards:

You should attempt the following standards:

You may attempt the following standards: