About the Course

Ben Baumer


Student hours:

  • Tuesdays from 2:45 pm – 4:00 pm ET in McConnell 214 or via Zoom

  • Fridays from 10:00 am – 11:30 am ET via Zoom

  • by appointment (either in McConnell 214 or via Zoom)



This one-semester course leverages students’ previous coursework to address a real-world data analysis problem. Students collaborate in teams on research projects sponsored by academia, government, and/or industry. These projects reinforce skills developed in the course of the SDS major, including statistical modeling; programming with data; written, oral, and graphical communication about data; and the ability to bring data science skills to bear on a problem in a specific domain of application. In addition, students develop professional skills in ethics, project management, collaborative software development, documentation, and consulting. Regular team meetings, weekly progress reports, interim and final reports, and multiple presentations are required. Open only to majors. Topics will vary from year to year.

This course satisfies the capstone requirement for the major in Statistical & Data Sciences. In most years, it addresses all of the major’s learning goals:

  • Identify and work with a wide variety of data types (including, but not limited to, categorical, numerical, text, spatial and temporal) and formats (e.g. CSV, XML, JSON, relational databases, audio, video, etc.).
  • Extract meaningful information from data sets that have a variety of sizes and formats.
  • Fit and interpret statistical models, including but not limited to linear regression models. Use models to make predictions, and evaluate the efficacy of those models and the accuracy of those predictions.
  • Understand the strengths and limits of different research methods for the collection, analysis and interpretation of data. Be able to design studies for various purposes.
  • Attend to and explain the role of uncertainty in inferential statistical procedures.
  • Read and understand data analyses used in research reports. Contribute to the data analysis portion of a research project in at least one applied discipline.
  • Compute with data in at least one high-level programming language, as evidenced by the ability to analyze a complex data set.
  • Work in multiple languages and computational environments.
  • Convey quantitative information in written, oral and graphical forms of communication to both technical and non-technical audiences.
  • Assess the ethical implications to society of data-based research, analyses, and technology in an informed manner. Use resources, such as professional guidelines, institutional review boards, and published research, to inform ethical responsibilities.


  • Weapons of Math Destruction, Cathy O’Neil, Broadway Books, 2017.
  • Supplemental reading:
    • Algorithms of Oppression, Safiya Noble, NYU Press, 2018.
    • Race After Technology, Ruha Benjamin, Wiley, 2019.
    • Automating Inequality, Virginia Eubanks, Picador, 2019.
    • DataCamp, online programming courses for data science. Available for free through GitHub Classroom.
    • Shiny is an interactive web application framework for R. Available for free via our RStudio Server.


    Classes meet Tuesdays and Thursdays from 1:20–2:35 pm ET in Bass 002.


    Smith is committed to providing support services and reasonable accommodations to all students with disabilities. To request an accommodation, please register with the Disability Services Office at the beginning of the semester. To do so, call (413) 585-2071 to arrange an appointment with Laura Rauscher, Director of Disability Services.


    Masking (college-wide)

    For the health and safety of all members of our campus community, students are expected to follow all COVID-related policies on campus. At the start of the Fall 2021 semester, there are two policies in effect that deserve special mention. Students who are ill must not attend class, and they will be offered reasonable accommodations for missed work. Students must follow the college’s masking policy while it remains in effect. If you are unwilling to mask, you will be asked to leave the class. If you do not leave the class, the instructor will end the class, and the Dean of Students office will be notified that you have disrupted class.


    I am committed to fostering a classroom environment where all students thrive. I am committed to affirming the identities, realities and voices of all students, especially those from historically marginalized or underrepresented backgrounds. I am dedicated to creating a space where everyone in the class is respected, is free from discrimination based on race, ethnicity, sexual orientation, religion, gender identity, disability status, and other identities, and feels welcome and ready to learn at their highest potential.

    If you have any concerns or suggestions for how to make this class more inclusive, please reach out to me. I am here to support your learning and growth as data scientists and people!


    You choose whether you will attend class. If you choose to attend, I expect your full attention. If you choose not to attend, you accept responsibility for any lost educational value.

    If you need to attend class remotely, please use the Zoom link posted on Moodle.

    We hope it goes without saying that during class, you should not use your computer or cell phone for personal email, web browsing, Facebook, or any activity that’s not related to the class.


    Much of this course will operate on a collaborative basis, and you are expected and encouraged to work together with a partner or in small groups to study, complete assignments, and work on projects. However, all work that you submit for credit must be your own. Copying and pasting sentences, paragraphs, or blocks of code from another student or from online sources is not acceptable and will receive no credit. No interaction with anyone but the instructors is allowed on any exams or quizzes. All students, staff and faculty are bound by the Smith College Honor Code, which Smith has had since 1944.

    Academic Honor Code Statement

    Smith College expects all students to be honest and committed to the principles of academic and intellectual integrity in their preparation and submission of course work and examinations. Students and faculty at Smith are part of an academic community defined by its commitment to scholarship, which depends on scrupulous and attentive acknowledgement of all sources of information, and honest and respectful use of college resources.

    Code of Conduct

    As the instructor and assistants for this course, we are committed to making participation in this course a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. Examples of unacceptable behavior by participants in this course include the use of sexual language or imagery, derogatory comments or personal attacks, deliberate misgendering or use of “dead” names, trolling, public or private harassment, insults, or other unprofessional conduct.

    As the instructor and assistants we have the right and responsibility to point out and stop behavior that is not aligned to this Code of Conduct. Participants who do not follow the Code of Conduct may be reprimanded for such behavior. Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the instructor.

    All students, the instructor, the lab instructor, and all assistants are expected to adhere to this Code of Conduct in all settings for this course: lectures, labs, office hours, tutoring hours, and over Slack.

    This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0.


    Class time will be approximately divided as follows: 25% discussions about data science ethics, and 75% group work on projects. Tuesdays will typically focus on project work, while Thursdays will typically focus on ethics discussions.

    This is a 4-credit course, meaning that by federal guidelines, it should consume about 12 hours per week of your time. We meet for 2.5 hours per week. That means you should be spending about 9.5 hours per week, or nearly 2 hours per weekday, on this course outside of class.


    The following breakdown of assignments and their point values is subject to change. Please see the schedule for the most current due dates.

    Ethics (17%)

    You will write several short papers about ethics, and one longer essay.

    • 6 short responses about data science ethics (1 pt each = 6 pts)
    • ethics essay topic statement (2 pts)
    • ~1500-word essay on data science ethics, technology, and society (9 pts)
    Project (66%)

    You will all collaborate on group projects over the course of the semester. These projects will be very open-ended.

    • thrice weekly standup meetings
    • 4 sprint demos (3 pts each = 12 pts | rubric)
    • 4 sprint retrospectives (3 pts each = 12 pts | rubric)
    • 1 in-person or recorded presentation (4 pts | rubric)
    • 1st draft of final paper (6 pts)
    • 2nd draft of final paper (12 pts)
    • 10-page MDPI formatted final paper as PDF via R Markdown (20 pts | rubric)
    • schedule a final presentation with your client!
    Participation (17%)

    You will learn on your own outside of class. You and I will come up with a plan to formalize and assess that learning. Active participation in class discussions, engagement with group work, chatter on Slack and GitHub, and regular “attendance” are also required.

    • independent learning plan (2 pts | rubric)
    • independent learning assessment (5 pts | rubric)
    • my perception of your level of engagement, as measured by contributions in class discussions, activity on GitHub, posts on Slack, focus during lab meetings, etc. (10 pts)
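The point values listed above can be tallied to confirm that they match the stated category weights. The following is a minimal Python sketch; the dictionary layout and variable names are illustrative assumptions, and only the point values come from the breakdown itself.

```python
# Point values copied from the breakdown above.
# The dictionary layout and names are illustrative, not part of the course materials.
points = {
    "Ethics": [6 * 1, 2, 9],                   # short responses, topic statement, essay
    "Project": [4 * 3, 4 * 3, 4, 6, 12, 20],   # demos, retrospectives, presentation, drafts, final paper
    "Participation": [2, 5, 10],               # learning plan, learning assessment, engagement
}

# Sum the points within each category, then across all categories.
totals = {category: sum(values) for category, values in points.items()}
grand_total = sum(totals.values())

for category, total in totals.items():
    share = 100 * total / grand_total
    print(f"{category}: {total} pts ({share:.0f}% of {grand_total} pts)")
```

Because the points sum to 100, each point is worth one percentage point of the final grade.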


    When grading your written work, we are looking for solutions that are technically correct and reasoning that is clearly explained. Neatness and organization are valued, with brief, clear answers that explain your thinking. If we cannot read or follow your work, we cannot give you full credit for it. Rubrics will be published in advance for all assignments.


    Extensions of up to 48 hours will typically be granted when requested at least 48 hours in advance. Longer extensions, or those requested within 48 hours of a deadline, will typically not be granted. Please plan accordingly. Please note that because many of the assignments in this class are collaborative, individual extensions for group assignments will be problematic. All extended deadlines will appear on Moodle.

    Late assignments will be penalized at the rate of 20% per day, down to a minimum grade of 20% of the assigned value.
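One possible reading of the late policy above, sketched in Python. The function name `late_grade` and its parameters are hypothetical. The sketch assumes that the 20% penalty is taken from the assignment's full point value for each day late, and that a score already below the 20% floor is left unchanged. The syllabus does not spell out these details, so treat this as an illustration rather than the official calculation.

```python
def late_grade(earned: float, value: float, days_late: int) -> float:
    """Grade after the late penalty, under the assumptions stated above.

    earned:    points the work would have received if submitted on time
    value:     total points the assignment is worth
    days_late: number of days past the deadline (0 if on time)
    """
    if days_late <= 0:
        return earned
    penalty = value * 20 * days_late / 100   # assumed: 20% of the assigned value per day
    floor = value * 20 / 100                 # grade does not fall below 20% of the value
    # min(...) keeps the floor from raising a score that was already below it.
    return min(earned, max(earned - penalty, floor))


# Example: a 20-point final paper that would have earned 18 points.
for days in (0, 1, 2, 5):
    print(days, late_grade(18, 20, days))
```

Under these assumptions the floor is reached within a few days, after which further lateness does not lower the grade any more.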


    You will complete all of your assignments in R Markdown. Unless otherwise noted, you should expect to submit your assignment by uploading an HTML file (with a .html file extension) to Moodle.


    Moodle and course website

    The course website and Moodle will be updated regularly with lecture handouts, project information, assignments, and other course resources. Homework will be submitted to Moodle, and grades will be posted there. Please check both regularly.


    The use of the R statistical computing environment with the RStudio interface is thoroughly integrated into the course. You have two options for using RStudio:

    • The server version of RStudio on the web. The advantage of using the server version is that all of your work will be stored in the cloud, where it is automatically saved and backed up. This means that you can access your work from any computer with a web browser (Firefox is recommended) and an Internet connection.
    • A desktop version of RStudio installed on your machine. The downside to this approach is that your work is only stored locally, and you will have to manage your own installation.

    Note that you do not have to choose one or the other – you may use both. However, it is important that you understand the distinction so that you can keep track of your work. Both R and RStudio are free and open-source, and are installed in most computer labs on campus. Please see the Resources page for help with R.

    Unless otherwise noted, you should assume that it will be helpful to bring a laptop to class. It is not required, but since there are no workstations in the classroom, we will need a critical mass of computers (i.e., at least 12) in the classroom pretty much every day.


    • Slack is the primary forum for course-related discussions of all kinds. Please do not email me with course-related questions! Instead, post them in the #questions channel on Slack. If discretion is absolutely necessary, send me a private message on Slack.
    • GitHub will host all of the code for projects associated with this course. All repositories are private by default.
    • It is very important that all project-related communication take place in Slack or GitHub channels that all group members can see! Private texts and side conversations will very quickly lead to other group members feeling excluded.


    Your ability to communicate results (which may be technical in nature) to your audience (which is likely to be non-technical) is critical to your success as a data scientist. The assignments in this class will place an emphasis on the clarity of your writing.

    Tentative Schedule

    The following outline gives a basic description of the course. Please see the detailed schedule for more specific information about readings and assignments.

    Week  Topic                    Reading         Assignments
    1     setup, GitHub, sponsors
    2     sprint 1                 Coded Bias      ethics 1
    3     sprint 1                 WMD, DF, Ch. 1  ethics 2
    4     sprint 1                 TBD             demo & retrospective
    5     sprint 2                 TBD             ethics 3, presentation
    6     sprint 2                 TBD             ethics 4, final paper (1st draft)
    7     sprint 2                 TBD             demo & retrospective
    8     sprint 3                 TBD             ethics 5
    9     sprint 3                 TBD             ethics essay (draft)
    10    sprint 3                                 demo & retrospective
    11    sprint 4                                 final paper (2nd draft)
    12    sprint 4                                 ethics essay
    13    presentations                            Final Presentations, Final paper