| Checkpoint | Due Date | Credit |
|---|---|---|
| Proposal | Mar 11 | 5 |
| Exploratory Data Analysis | Apr 8 | 15 |
| Statistical Analysis Plan | Apr 20 | 15 |
| Oral Review | Finals Week | 30 |
| Technical Report | May 8 @ 3pm | 30 |
| Reflection | May 8 @ 3pm | 5 |
Overview
Working in groups of three students, you will conduct a statistical study on a topic of your choice. The project is an opportunity to show off what you’ve learned about data analysis, visualization, and statistical inference. It is a major component of the class, and successful completion is required to pass.
Big picture
Many interesting quantitative questions involve the relationships among several variables. You will be given a choice of several rich datasets to explore, available for download via this Google Drive folder:
- Police Scorecard Data: data from the Police Scorecard Project examining levels of police violence, accountability, racial bias and other policing outcomes for municipal and county law enforcement
- Birthing Parent Smoking and Infant Health: data on infant gestational period and birth weight, as well as parental demographic and health information
- Lego Prices: data on Lego characteristics (e.g., set theme, size, and number of pieces) and the retail price on Amazon.com
- Goodreads Book Ratings: data on book characteristics (e.g., genre, publication year, language, and length) and Goodreads rating
- Airline Delays: data on airline delay type and length for flights in December 2019 and December 2020
- Environmental Sustainability Index (ESI): data on 146 countries assessing both their overall sustainability and their performance in a variety of different environmental areas circa 2005
You may also select another dataset provided through the following R packages:
If your chosen dataset is not listed above, you will need instructor approval to use it. From these datasets, you will pose, as precisely as possible, the problem that you wish to solve or the relationships that you wish to better understand. What kinds of summary measures and visuals would best inform these questions? What kinds of statistical tests or models might be appropriate in this context?
You will then:
- Form a hypothesis, a priori (before you analyze the data!), about the results you expect to see.
- Conduct exploratory data analysis and visualization.
- Draft a statistical analysis plan outlining the formal statistical approach you will take to address your hypothesis (and then conduct the analysis!).
- Present your results to the professors in an oral presentation.
- Write a final technical report describing your study and its findings.
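As a rough illustration of how the exploratory and modeling steps above fit together, here is a minimal sketch in Python using only the standard library. The data values, variable names, and helper functions (`pearson`, `solve`, `fit_ols`) are all hypothetical and invented for this example; they do not come from any of the course datasets, and the course's own tooling (for instance R) may differ. The same steps carry over to whatever software you use.

```python
import statistics

# Hypothetical toy data (NOT from any course dataset): each row is
# (number of pieces, number of minifigures, price in dollars).
rows = [
    (100, 1, 17.0),
    (250, 2, 34.5),
    (400, 2, 48.0),
    (550, 4, 69.5),
    (700, 3, 81.0),
    (900, 5, 106.0),
]
pieces = [r[0] for r in rows]
minifigs = [r[1] for r in rows]
price = [r[2] for r in rows]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(predictors, response):
    """Least-squares fit of response ~ intercept + predictors (normal equations)."""
    design = [[1.0, *vals] for vals in zip(*predictors)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, response)) for i in range(k)]
    return solve(xtx, xty)


# Exploratory step: numerical summaries of the response and one predictor.
print("mean price:", statistics.mean(price))
print("sd price:", statistics.stdev(price))
print("cor(price, pieces):", pearson(price, pieces))

# Modeling step: multiple regression of price on pieces and minifigures.
coef = fit_ols([pieces, minifigs], price)
fitted = [coef[0] + coef[1] * p + coef[2] * m for p, m in zip(pieces, minifigs)]
residuals = [y - f for y, f in zip(price, fitted)]
print("intercept, pieces, minifigs coefficients:", coef)
print("residuals:", residuals)
```

In a real analysis you would state your hypothesis before running any of this, and you would also check the model's assumptions (for example, by plotting the residuals) before interpreting the coefficients.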
Group Formation
You will work in a group of three. Tag your group members in a post on #proj-groups-sec3, or I will assign you to a random group.
Key Project Deliverables
Each checkpoint in the table above is a project deliverable, listed with its due date and credit. Deadline and submission details are given under Submission.
Submission
All deliverables described above must be delivered electronically via Moodle by 11:55pm (five minutes before midnight) on the dates above (unless otherwise noted). Only one person from the group should submit the group’s product for each checkpoint (with the exception of the Reflection, which is individual).
Assessment Criteria
Your project will be evaluated based on the following criteria:
- General: Is the topic original, interesting, and substantial – or is it trite, pedantic, and trivial? How much creativity, initiative, and ambition did the group demonstrate? Is the basic question driving the project worth investigating, or is it obviously answerable without a data-based study?
- Design: Are the variables chosen appropriately and defined clearly, and is it clear how they were measured/observed? Can the effects of lurking variables be controlled for? Is there sufficient data to make meaningful conclusions?
- Analysis: Are the chosen analyses appropriate for the variables/relationships under investigation, and are the assumptions underlying these analyses met? Do the analyses involve fitting and interpreting a multiple regression model? Are the analyses carried out correctly? Is there an effective mix of graphical, numerical, and inferential analyses? Did the group make appropriate conclusions from the analyses, and are these conclusions justified?
- Technical Report: How effectively does the written report communicate the goals, procedures, and results of the study? Are the claims adequately supported? How well is the report structured and organized? Does the writing style enhance what the group is trying to communicate? How well is the report edited? Are the statistical claims justified? Are text and analyses effectively interwoven in the technical report? Clear writing, correct spelling, and good grammar are important.
- Oral Review: Does the group have a good grasp of the research they have done? Can they respond to reasonable questions in real time? Is it clear that they have a working mental model of both the statistical aspects of the project and of the data itself?