You may work with a partner or two (post groups to #mp3) to address a data science question of interest. You will use SQL to query a database and possibly sf to plot spatial data.
Your deliverable will be a readable blog post written in R Markdown. The final submission should be production-quality, replete with hyperlinks, images, data tables and/or graphics, and correct spelling and grammar. Please use the template provided by the sds package!
Your project must use SQL to query data from scidb.
The argument you make in your article must be supported by data stored on scidb. You may do your data wrangling work in R, but you should use SQL first to retrieve data as efficiently as possible. Your article must include at least one data table or data graphic.
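One way to read "use SQL first" is to let the database do the filtering and aggregating, so that only a small summary table travels back to R. The sketch below illustrates this under loud assumptions: it uses an in-memory SQLite database (via the RSQLite package) as a stand-in for scidb, and the `flights` table and its `carrier` and `dep_delay` columns are made up for illustration, not taken from the real schema.

```r
library(DBI)

# Stand-in for scidb: an in-memory SQLite database holding a toy table.
# (Assumption: the table and column names are hypothetical.)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "DL", "DL", "DL"),
  dep_delay = c(10, 20, 0, 5, 25)
)
dbWriteTable(con, "flights", toy_flights)

# The database aggregates, so only one row per carrier comes back to R.
by_carrier <- dbGetQuery(con, "
  SELECT carrier,
         COUNT(*)       AS n_flights,
         AVG(dep_delay) AS mean_delay
  FROM flights
  GROUP BY carrier
  ORDER BY carrier;
")

dbDisconnect(con)
by_carrier
```

Because RMySQL also implements the DBI interface, the same `dbGetQuery()` pattern should carry over to a connection to scidb once the real table and column names are substituted. Any further wrangling can then happen in R on the much smaller result.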
Your piece should adhere to some guidelines of good journalism for an online platform. The first key is to have a good subject that people want to read about. Ideally, find a news “hook” or “peg”—a recent news event or theme under current discussion to which you can peg your piece (think #metoo, #blacklivesmatter, or equivalent current discourse). A current or upcoming movie is also a good hook. Also, your piece should:
Your project will use data stored on scidb in at least one of the six databases listed below. You will write at least one query in SQL that will pull in data relevant to your question. Examination of that data will inform your response to the question, and you will then formulate your arguments accordingly. Recall that (as always) communication is a critical component of data science, so details like axis labels, figure captions, spelling, and grammar are just as important as writing your queries correctly and making a logical argument.
Be extra careful when writing your queries! Just because the query executes without an error does not mean that it will return exactly what you want. The computer is dumb—it just carries out instructions. You are smart—it’s your job to translate your ideas into a syntax that the computer can understand. Know your data!!!
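As a small illustration of a query that runs cleanly but may not mean what you intended, consider missing values. In SQL, `COUNT(*)` counts every row, while `COUNT(column)` and `AVG(column)` silently skip `NULL`s. The sketch below again assumes an in-memory SQLite stand-in (RSQLite) and a hypothetical `flights` table; it is meant only to show the kind of sanity check worth running, not the contents of any scidb table.

```r
library(DBI)

# Stand-in database with one missing delay (assumption: toy data only).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "DL", "DL"),
  dep_delay = c(10, NA, 0, 30)  # NA in R is stored as NULL in the database
)
dbWriteTable(con, "flights", toy_flights)

# Compare the row count with the count of non-missing delays before
# trusting the average: AVG() is computed over non-NULL values only.
checks <- dbGetQuery(con, "
  SELECT COUNT(*)         AS n_rows,
         COUNT(dep_delay) AS n_with_delay,
         AVG(dep_delay)   AS mean_delay
  FROM flights;
")

dbDisconnect(con)
checks
```

If `n_rows` and `n_with_delay` disagree, the average describes only part of the data, and the article should say so.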
Data are available on scidb.smith.edu.
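library(tidyverse)
library(RMySQL)
# Open a connection to the imdb database on scidb
db <- dbConnect(MySQL(),
  host = "scidb.smith.edu",
  user = "sds192",
  password = "DSismfc@S",
  dbname = "imdb")
# Make db the default connection for SQL chunks and cap printed rows at 20
knitr::opts_chunk$set(connection = db, max.print = 20)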
SHOW DATABASES;
Database |
---|
information_schema |
airlines |
citibike |
fec |
imdb |
lahman |
nyctaxi |
yelp |
Your project should focus on one of the following databases:
- airlines: on-time flight data from the Bureau of Transportation Statistics
- citibike: trip-level data from New York City’s municipal bike rental system
- fec: campaign finance data from the Federal Election Commission
- imdb: a copy of the Internet Movie Database
- lahman: historical season-level baseball statistics
- nyctaxi: ride-level data from New York City’s Taxi & Limousine Commission

Most of these databases contain data that has spatial features. For example, airlines, citibike, and nyctaxi contain lat-long coordinates that can be converted into sf objects. fec, imdb, and lahman contain addresses, countries, or states that can be plotted spatially. Your project need not make use of spatial data, but of course it might be richer if it does.
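For the lat-long case, the conversion to an sf object can be sketched as follows. The station names and coordinates are invented for illustration; in a real project they would come from a SQL query, and the column names (`lon`, `lat`) are assumptions rather than the actual column names on scidb.

```r
library(sf)

# Hypothetical coordinates standing in for the result of a SQL query.
stations <- data.frame(
  name = c("Station A", "Station B", "Station C"),
  lon  = c(-73.99, -73.98, -73.97),
  lat  = c(40.73, 40.75, 40.76)
)

# coords is given in (x, y) order, i.e. (longitude, latitude).
# EPSG:4326 is the usual WGS 84 lat-long coordinate reference system.
stations_sf <- st_as_sf(stations, coords = c("lon", "lat"), crs = 4326)

stations_sf
```

An object like `stations_sf` can then be passed to `ggplot2::geom_sf()` or to leaflet for mapping. Note the coordinate order: swapping longitude and latitude is an easy mistake that will not raise an error.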
There are 20 points available on this Mini-Project.
.Rmd that compiles without errors
leaflet or ggplot2 graphic that uses an sf spatial object)