Projects for Spring 2025
Project 1: Chicago Cubs
Contact Person: Jasmine Horan
Preferred Contact Method: Email, Videoconference (e.g., Zoom, Google Hangouts)
Description of Organization
NA
Project Description
In baseball, it is crucial for players and coaches to prepare for games. They must study their opponents and gather as much information as possible on their game strategy and the tendencies/profiles of opposing players. This usually takes the form of Advance Scouting Reports, which consist of basic player information (like handedness, how hard they throw, etc.) as well as a more detailed explanation of the player’s profile (for pitchers, what types of pitches they throw and how often they throw them).
This project has two main objectives. Train an AI model to:
1) Generate a short, high-level written scouting report of a pitcher’s recent performance. It should summarize key takeaways from the pitcher’s past several games and highlight any notable differences relative to the rest of their season’s performance. (This shouldn’t be batter-specific; instead, it should serve as a general overview of the pitcher that anyone could find useful.)
2) Generate more detailed written scouting reports of that pitcher that are batter-specific, highlighting certain strengths/weaknesses with a particular batter in mind. (For example, if I’m a hitter and I struggle against high fastballs, this report might mention the pitcher’s FB% in the upper half of the strike zone.)
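As a rough illustration of the kind of statistic such a report might draw on, the sketch below compares a pitcher's high-fastball rate over the most recent game with the season as a whole. The record layout (`game_date`, `pitch_type`, `plate_z`), the `FF` fastball code, and the 2.5-foot zone midpoint are all assumptions made for illustration; the data the sponsor provides may be organized differently.

```python
from datetime import date

# Hypothetical pitch records; the real data set's columns may differ.
pitches = [
    {"game_date": date(2024, 4, 1), "pitch_type": "FF", "plate_z": 3.1},
    {"game_date": date(2024, 4, 1), "pitch_type": "SL", "plate_z": 1.8},
    {"game_date": date(2024, 4, 1), "pitch_type": "FF", "plate_z": 2.0},
    {"game_date": date(2024, 4, 1), "pitch_type": "CH", "plate_z": 1.5},
    {"game_date": date(2024, 9, 1), "pitch_type": "FF", "plate_z": 3.3},
    {"game_date": date(2024, 9, 1), "pitch_type": "FF", "plate_z": 2.9},
    {"game_date": date(2024, 9, 1), "pitch_type": "SL", "plate_z": 2.2},
    {"game_date": date(2024, 9, 1), "pitch_type": "FF", "plate_z": 1.9},
]

ZONE_MIDPOINT_FT = 2.5  # assumed height separating upper and lower zone


def high_fastball_rate(rows):
    """Share of all pitches that are fastballs above the zone midpoint."""
    if not rows:
        return 0.0
    high_fb = sum(
        1 for r in rows
        if r["pitch_type"] == "FF" and r["plate_z"] > ZONE_MIDPOINT_FT
    )
    return high_fb / len(rows)


def recent_vs_season(rows, n_recent_games=1):
    """Compare the most recent games with the full season."""
    game_dates = sorted({r["game_date"] for r in rows})
    recent_dates = set(game_dates[-n_recent_games:])
    recent = [r for r in rows if r["game_date"] in recent_dates]
    return high_fastball_rate(recent), high_fastball_rate(rows)


recent_rate, season_rate = recent_vs_season(pitches)
print(f"High-fastball rate: recent {recent_rate:.0%}, season {season_rate:.0%}")
```

A gap between the two rates is the sort of "notable difference" a generated report could then put into words.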
You can get creative with these reports; if desired, you can also create visualizations to complement the written material. Your intended audience is players, coaches, and Front Office staff.
More info
How many rows of data do you imagine will be involved in total?:
- hundreds of thousands
What kind of data will the project involve?:
- rectangular (i.e., rows and columns, spreadsheets, CSVs)
Tags:
- Statistical modeling, Predictive modeling (aka, machine learning), Data analysis, Data visualization, Text analysis / regular expressions / natural language processing, Artificial Intelligence
Is there anything else you’d like to share with the students?:
- NA
Repository
https://github.com/sds-capstone/chicubs-s25
Video
https://drive.google.com/drive/folders/1YlQ12oHs0c-eBQy0bvG1xR2IhIA0hl_x
Project 2: Federal Reserve Bank of Cleveland
Contact Person: Thealexa (Lex) Becker
Preferred Contact Method: Email
Description of Organization
The Federal Reserve Bank of Cleveland serves the Fourth Federal Reserve District, comprising Ohio, western Pennsylvania, eastern Kentucky, and the northern panhandle of West Virginia. The Fourth District has 169 counties, 68 of which are located within the District’s metropolitan statistical areas (MSAs). Cleveland, Cincinnati, and Columbus, Ohio, and Pittsburgh, Pennsylvania, are the most densely populated MSAs in the Fourth District and are home to roughly half the District’s population of nearly 17 million people.
As a bankers’ bank, the Cleveland Fed provides cash services to financial institutions, accepting deposits of cash from and supplying currency and coin to banks in and around the Fourth District and crediting and debiting accounts accordingly. The Cleveland Fed supervises four of the nation’s largest bank and savings and loan holding companies and several hundred additional financial holding companies and state-chartered member banks.
Project Description
Title of proposal: Retrieval Augmented Generation (RAG) Model Development
Software recommended: Students will use Python (and possibly R) to refine a RAG model built in the previous semester by Smith capstone students. Previous experience with Python is useful, but this is an excellent introductory Python project for students who want to gain some experience before graduation.
Summary of the project: The main focus of the project is to use a RAG model to query a corpus of regulatory documents. RAG models improve on searches that rely on exact phrases or keywords. These models have become an area of increased focus in the industry because they have shown improvements over other search methods. The paper describing this method can be found here: https://arxiv.org/abs/2005.11401
The Hugging Face blog post and model documentation can be found here: https://huggingface.co/blog/ray-rag https://huggingface.co/docs/transformers/en/model_doc/rag
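To make the retrieve-then-generate idea concrete, here is a minimal sketch of the retrieval and prompt-building steps. It is not the previous team's model and does not use the Hugging Face RAG classes: it swaps in simple bag-of-words cosine similarity where a real RAG system would use learned dense embeddings, and it stops short of calling a generator. The passages, function names, and prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter

# Placeholder passages; the real corpus is the set of regulatory
# documents the sponsors provide.
passages = [
    "Banks must hold capital in proportion to their risk-weighted assets.",
    "Liquidity rules require banks to hold enough high-quality liquid assets.",
    "Consumer protection regulations cover disclosures for mortgage loans.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=1):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]


def build_prompt(query, docs, k=1):
    """Prepend the retrieved passages to the question (the augmentation step)."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = build_prompt("How much capital must banks hold?", passages)
print(prompt)
```

The resulting prompt would be passed to a language model, which answers using the retrieved context rather than relying only on what it memorized during training.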
This project will make use of the RAG model developed by students in the previous semester, add data to the model and perform enhancements, and then create a front-end app in Dash to display results.
More info
How many rows of data do you imagine will be involved in total?:
- The data are documents hosted on a publicly available website. The exact list of documents the students will work with will be provided by sponsors at the start of the project.
What kind of data will the project involve?:
- The data are documents hosted on this website: https://www.federalreserve.gov/supervisionreg/topics/topics.htm
Tags:
- Statistical modeling, Data analysis, Data wrangling, Text analysis / regular expressions / natural language processing, transformers, retrieval augmented generation
Is there anything else you’d like to share with the students?:
- Thealexa (Lex) Becker is a 2013 math and economics graduate of Smith College. Lex, Catherine Chen, and Eric Jones are data scientists at FRB Cleveland. Together they have a range of backgrounds from economics to math to engineering.
Repository
https://github.com/sds-capstone/cleveland-s25
Video
Project 3: Dance Data Project®
Contact Person: Elizabeth Yntema
Preferred Contact Method: Email
Description of Organization
Dance Data Project® is a metrics-based advocacy and education nonprofit focused on increasing leadership opportunities and fair pay for women in the dance industry.
Project Description
We have two potential projects. The first is to start the work (it will take several teams to bring to completion) of creating an online “hiring agency”: an interactive map that lists the talent in a particular region or city, from choreographers to composers to set, costume, and lighting designers, stage managers, and other back-of-curtain jobs. The idea, very similar to Maestra for Broadway, is to overcome the excuse of “we don’t know any good female…” The website would use DDP’s existing Leader Board, which now has over 1,400 names. We would need to weed out those women who are no longer active and build out a website within DDP’s current URL, or an adjacent one, that can also incorporate uploaded resumes, YouTube clips, etc. But we would have to screen for appropriateness and guard against trolling and spam. Some form of automatic updating (a request for new work) may be necessary. Ancillary activities would include outreach across the dance and music sectors to inform potential users of this new service.
The second potential project is twofold: to track what happens to women versus men retiring from dance (primarily ballet/classical dance careers). This would require, at a minimum, scraping dance company and dance magazine websites for retirement announcements and then following up, either by web searches or direct outreach, to determine the career trajectory of former dancers. With dancers divided by rank (corps de ballet, soloist, and principal), the working premise is that men are systematically moved into leadership positions through a variety of roles: curating a dance festival, becoming a rehearsal director or head of school, etc. DDP has already begun to track Leadership Transitions, but we are unable to follow what happens to the average “line” dancer. This might require cross-departmental work to create and then distribute a “retirement” survey, which would be managed by departmental students and faculty at Smith College. It’s not a small undertaking.
More info
How many rows of data do you imagine will be involved in total?:
- thousands
What kind of data will the project involve?:
- Don’t know
Tags:
- Statistical modeling, Web app development, Web scraping, Data analysis, Record linkage, Data wrangling, Data visualization, Experimental design, Text analysis / regular expressions / natural language processing
Is there anything else you’d like to share with the students?:
- We have demonstrated that working with DDP is an opportunity to create real-world change through data analysis. In the arts, so much of the work (particularly at the lower, less well-compensated levels) is done by women, and it is underappreciated and unrecognized. To put it in more radical, economic terms: by refusing to measure the systematic exclusion of women from better-compensated and more prestigious leadership positions, the arts continue to perpetuate a retrograde system whose bedrock is the underpaid or unpaid labor of women.
Repository
https://github.com/sds-capstone/dancedata-s25
Video
https://drive.google.com/file/d/1CmlVC3Kn4eacyLtD01Q3snwiy8vUXLnS/view?ts=6796ce94
Project 4: Elmhurst Middle School
Contact Person: Jessica Majerus
Preferred Contact Method: Email, Videoconference (e.g., Zoom, Google Hangouts)
Description of Organization
We are a large, comprehensive, traditional middle school that serves a population of students who are multiply marginalized.
Project Description
We would like to explore our students’ math data and see if we can try to find trends that will help us to make better choices about instruction and placement in intervention math classes.
More info
How many rows of data do you imagine will be involved in total?:
- hundreds
What kind of data will the project involve?:
- Don’t know
Tags:
- Data analysis, probably need help knowing this :)
Is there anything else you’d like to share with the students?:
- Elmhurst students are awesome and brilliant, and they are not accessing grade-level math now. This project could help improve their access and their math futures!
Repository
https://github.com/sds-capstone/ems-s25
Video
Project 5: The Future of Heat Initiative
Contact Person: Mike Bloomberg
Preferred Contact Method: Email
Description of Organization
The Future of Heat Initiative is a nonprofit dedicated to helping regulators, legislators, and energy advocates shape a clean, safe, and affordable future for heat.
Project Description
Over 77 million U.S. households rely on utility-delivered gas for home heating, making it a critical part of the current energy system. However, methane, the primary component of natural gas, has a global warming potential 84–86 times higher than CO₂ over a 20-year timeframe. Methane leaks from the gas system pose significant environmental and safety hazards, particularly as aging pipes become more prone to failure. While utilities invest approximately $25 billion annually in gas distribution infrastructure, primarily for pipe replacement programs, these investments raise energy costs for customers and can perpetuate reliance on fossil fuel infrastructure.
This project will leverage data collected by the Pipeline and Hazardous Materials Safety Administration (PHMSA), a federal agency responsible for ensuring the safe transportation of energy and hazardous materials. PHMSA has tracked pipeline incidents since 1970, offering a rich dataset for analysis. By identifying patterns and correlations in gas pipeline incidents and developing user-friendly tools to interact with the data, this project will provide actionable insights to help regulators prioritize safety investments, reduce emissions, and protect consumers from rising costs.
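A first pass at "identifying patterns" in incident data could be as simple as counting incidents by cause and by decade. The sketch below does that on a handful of made-up records; the field names (`year`, `cause`, `pipe_material`) and the values are hypothetical and are not taken from PHMSA's actual files.

```python
from collections import Counter

# Hypothetical incident records; field names are assumptions, not PHMSA's.
incidents = [
    {"year": 1985, "cause": "corrosion", "pipe_material": "cast iron"},
    {"year": 1992, "cause": "excavation damage", "pipe_material": "steel"},
    {"year": 1998, "cause": "corrosion", "pipe_material": "steel"},
    {"year": 2004, "cause": "excavation damage", "pipe_material": "plastic"},
    {"year": 2011, "cause": "corrosion", "pipe_material": "cast iron"},
    {"year": 2015, "cause": "equipment failure", "pipe_material": "steel"},
]


def count_by(rows, *keys):
    """Count incidents for each combination of the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in rows)


def with_decade(rows):
    """Add a 'decade' field (e.g. 1990) derived from the incident year."""
    return [{**r, "decade": r["year"] // 10 * 10} for r in rows]


by_cause = count_by(incidents, "cause")
by_decade_cause = count_by(with_decade(incidents), "decade", "cause")

# Print causes from most to least frequent.
for (cause,), n in by_cause.most_common():
    print(f"{cause}: {n}")
```

The same `count_by` helper works for any combination of fields, so cross-tabulations such as cause by pipe material need no extra code.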
More info
How many rows of data do you imagine will be involved in total?:
- tens of thousands
What kind of data will the project involve?:
- rectangular (i.e., rows and columns, spreadsheets, CSVs), potentially geospatial, and potentially transforming CSVs to other data types
Tags:
- Statistical modeling, Predictive modeling (aka, machine learning), Web app development, Data analysis, Data wrangling, Data visualization, Geospatial analysis (i.e., shapefiles, mapping), Artificial Intelligence
Is there anything else you’d like to share with the students?:
- Full proposal here: https://docs.google.com/document/d/1YxDp67tsj4iyVxaPJ1eZNHejT23pRVitglVZn0S9D1k/edit?usp=sharing
Repository
https://github.com/sds-capstone/fhi-s25
Video
https://drive.google.com/file/d/1Upi98SS2nOv4ZDtpPMj621gL-jyKEMns/view?usp=drive_link
Project 6: 99P Labs / Honda Research Institute US
Contact Person: Ryan Lingo
Preferred Contact Method: Email, Videoconference (e.g., Zoom, Google Hangouts), Slack, GitHub
Description of Organization
Research and Innovation Lab
Project Description
Problem to Solve: We want to automatically understand the content of an image by identifying objects and their relationships and generating concise, accurate descriptions. This addresses the challenge of scene comprehension at scale, reducing the manual effort in labeling or interpreting images.
Goals/Objectives:
1. Object Detection & Relationship Mapping: Use a vision model to detect objects and how they relate to one another in a scene.
2. Caption Generation: Leverage a language model to produce clear, coherent captions based on the detected objects and relationships.
3. Evaluation: Determine the most effective metrics and user feedback methods to assess caption quality, accuracy, and clarity.
4. Usability & Feedback: Provide a simple interface where users can upload images, see results, and offer feedback on caption correctness.
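The first two goals can be sketched without any model at all, which may help clarify the data flow. In the example below, hand-written bounding boxes stand in for a vision model's detections, a simple center-comparison rule stands in for relationship mapping, and a fixed template stands in for the language model. The `Detection` class, the relation rule, and the caption template are all hypothetical choices for illustration, not the sponsor's design.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    x_min: float
    y_min: float  # image coordinates: y grows downward
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation(a, b):
    """Coarse spatial relation of a to b, based on box centers."""
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def relationships(detections):
    """Relate every pair of detections once."""
    return [
        (a.label, relation(a, b), b.label)
        for i, a in enumerate(detections)
        for b in detections[i + 1:]
    ]


def caption(detections):
    """Template caption; a language model would replace this step."""
    triples = relationships(detections)
    if not triples:
        return "No relationships detected."
    body = "; ".join(f"a {s} is {r} a {o}" for s, r, o in triples)
    return body.capitalize() + "."


# Hand-written detections standing in for a vision model's output.
scene = [
    Detection("dog", 10, 60, 40, 90),
    Detection("bench", 60, 55, 120, 95),
    Detection("tree", 70, 0, 110, 50),
]
print(relationships(scene))
print(caption(scene))
```

Keeping the relationships as explicit (subject, relation, object) triples also gives the evaluation step something to check a generated caption against.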
Envisioned Final Product: A web-based tool where users upload an image and see a list of detected objects, a map of their relationships, and a concise caption describing the overall scene. The interface would highlight any errors or “hallucinated” details, and users could rate or comment on the captions, helping refine and improve the system.
More info
How many rows of data do you imagine will be involved in total?:
- not sure; the team will create the dataset
What kind of data will the project involve?:
- Don’t know
Tags:
- Predictive modeling (aka, machine learning), Web app development, Data visualization, Experimental design, Artificial Intelligence
Is there anything else you’d like to share with the students?:
- NA