Learning Goal

  • To write database queries in SQL

Mini-Project

In groups of three, you will craft a data-driven article of roughly 300–500 words. Your deliverable will be a readable blog post written in R Markdown. The final submission should be production-quality, replete with hyperlinks, images, data tables and/or graphics, and correct spelling and grammar.

You project must use SQL to query data from scidb.

The argument you make in your article must be supported by data stored on scidb. You may do some of your data wrangling work in R, but you should use SQL first to retrieve data as efficiently as possible. Your article must include at least one data table or data graphic.

Your piece should:

  • Employ language that is readable and clear without jargon, and with any technical terms either rephrased or explained.
  • Keep paragraphs short.
  • Include a counterargument or other/opposite perspective, often about 2/3rds of the way through.
  • Have a strong headline or title that will pull people in.
  • Include hyperlinks, if applicable, to any relevant recent pieces on the web.

Your project will use data stored on scidb in at least one of the six databases listed below. You will write one or more queries in SQL that will pull in data relevant to your question. Examination of that data will inform your response to the question, and you will then formulate your arguments accordingly. Recall that (as always) communication is a critical component of data science, so details like axis labels, figure captions, spelling, and grammar, are just as important as writing your queries correctly and making a logical argument.

Warning

Be extra careful when writing your queries! Just because the query executes without an error does not mean that it will return exactly what you want. The computer is dumb—it just carries out instructions. You are smart—it’s your job to translate your ideas into a syntax that the computer can understand. Know your data!!!

Data

Data are available on scidb.smith.edu.

library(tidyverse)
library(RMySQL)
db <- dbConnect(
    MySQL(), 
    host = "scidb.smith.edu", 
    user = "sds192", 
    password = "DSismfc@S", 
    dbname = "imdb"
)
knitr::opts_chunk$set(connection = db, max.print = 20)
SHOW DATABASES;
8 records
Database
information_schema
airlines
citibike
fec
imdb
lahman
nyctaxi
yelp

Databases

Your project should focus on one of the following databases:

Scope

Due to the compressed time during the Interterm, you may limit the scope of your analysis to results that can be achieved via a single query. In many cases, a simple table of results will be sufficient—you should not feel obligated to create a fancy data graphic.

Submission

  1. Push all commits to the appropriate repository in our private GitHub organization.
  2. Submit rendered .html file to Moodle

Standards

In this assignment:

  • You should attempt the following standards:


References