In groups of three, you will craft a data-driven article of roughly 300–500 words. Your deliverable will be a readable blog post written in R Markdown. The final submission should be production-quality, replete with hyperlinks, images, data tables and/or graphics, and correct spelling and grammar.
Your project must use SQL to query data from scidb.
The argument you make in your article must be supported by data stored on scidb. You may do some of your data wrangling work in R, but you should use SQL first to retrieve data as efficiently as possible. Your article must include at least one data table or data graphic.
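One way to read "SQL first, R second" is sketched below: the server filters and counts, so only a few dozen rows cross the network, and R handles a small final reshaping step. This is an illustration rather than a required approach. The table name `title` and the column `production_year` are assumptions about the imdb schema, and both helper functions are hypothetical names introduced here, so check the real tables before borrowing any of it.

```r
library(DBI)
library(dplyr)

# ASSUMPTION: `title` and `production_year` are illustrative names;
# confirm them against the actual imdb schema before running this.
titles_per_year_sql <- "
  SELECT production_year, COUNT(*) AS n_titles
  FROM title
  WHERE production_year BETWEEN 1980 AND 2019
  GROUP BY production_year
  ORDER BY production_year
"

# The server does the filtering and counting; R receives one row per year.
fetch_titles_per_year <- function(con) {
  dbGetQuery(con, titles_per_year_sql)
}

# Light wrangling in R on the small result: roll years up into decades.
summarize_by_decade <- function(per_year) {
  per_year %>%
    mutate(decade = 10 * (production_year %/% 10)) %>%
    group_by(decade) %>%
    summarize(n_titles = sum(n_titles), .groups = "drop")
}
```

With a live connection named `db`, the two steps would chain as `summarize_by_decade(fetch_titles_per_year(db))`.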
Your piece should:
Your project will use data stored on scidb in at least one of the databases listed below. You will write one or more queries in SQL that will pull in data relevant to your question. Examination of that data will inform your response to the question, and you will then formulate your arguments accordingly. Recall that (as always) communication is a critical component of data science, so details like axis labels, figure captions, spelling, and grammar are just as important as writing your queries correctly and making a logical argument.
Be extra careful when writing your queries! Just because the query executes without an error does not mean that it will return exactly what you want. The computer is dumb—it just carries out instructions. You are smart—it’s your job to translate your ideas into a syntax that the computer can understand. Know your data!!!
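A few cheap checks can catch a query that runs cleanly but returns the wrong thing. The sketch below is one possible habit, not a prescribed workflow: `inspect_table()` and `check_result()` are hypothetical helpers introduced here, and `con` is assumed to be a live MySQL connection. `DESCRIBE` and `LIMIT` are standard MySQL; everything else is plain R.

```r
library(DBI)

# Look at a table's columns and first few rows before querying it.
# ASSUMPTION: `table` is a name you have already confirmed exists;
# it is pasted into the SQL as-is, so never pass untrusted input.
inspect_table <- function(con, table) {
  list(
    columns = dbGetQuery(con, paste("DESCRIBE", table)),
    preview = dbGetQuery(con, paste("SELECT * FROM", table, "LIMIT 5"))
  )
}

# Compare a query result against what you expected to get back.
check_result <- function(result, expected_cols, max_rows = Inf) {
  stopifnot(
    all(expected_cols %in% names(result)),  # the columns you asked for exist
    nrow(result) > 0,                       # the filter did not drop everything
    nrow(result) <= max_rows,               # a join did not multiply the rows
    anyDuplicated(result) == 0              # no accidental duplicate rows
  )
  invisible(result)
}
```

For example, `dbListTables(db)` lists the tables in the connected database, after which `inspect_table(db, "some_table")` shows what a given table actually contains. A failed check in `check_result()` stops with an error naming the condition that did not hold.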
Data are available on scidb:
library(tidyverse)
library(RMySQL)
db <- dbConnect(
  MySQL(),
  host = "scidb.smith.edu",
  user = "sds192",
  password = "DSismfc@S",
  dbname = "imdb"
)
knitr::opts_chunk$set(connection = db, max.print = 20)
Your project should focus on one of the following databases:
airlines: on-time flight data from the Bureau of Transportation Statistics (raw data documentation)
citibike: trip-level data from New York City’s municipal bike rental system (raw data documentation)
fec: campaign finance data from the Federal Election Commission (raw data documentation)
imdb: a copy of the Internet Movie Database (raw data documentation)
lahman: historical season-level baseball statistics (raw data documentation)
nyctaxi: ride-level data from New York City’s Taxi & Limousine Commission (raw data documentation)
yelp: restaurant reviews in the Phoenix, AZ area (raw data documentation)
Due to the compressed time during the Interterm, you may limit the scope of your analysis to results that can be achieved via a single query. In many cases, a simple table of results will be sufficient—you should not feel obligated to create a fancy data graphic.
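If a simple table is all the article needs, a query result can go straight into the knitted post. The sketch below wraps `knitr::kable()` to add readable column headers and a caption. `result_table()` is a hypothetical helper name, and `result` is assumed to be a small data frame such as one returned by `dbGetQuery()`.

```r
library(knitr)

# Render a small query result as a captioned table in the knitted post.
# ASSUMPTION: `result` is a data frame with one row per line of the table.
result_table <- function(result, caption, col_names = names(result)) {
  kable(result, col.names = col_names, caption = caption)
}
```

Called inside an R chunk, for instance as `result_table(my_result, "Titles released per year", c("Year", "Titles"))`, this gives the table human-readable headers in place of raw column names.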
Submit your .html file to Moodle.