Preface
What’s New in the Third Edition?
In the ten years since the publication of the first edition of this book, there have been many new developments in R, including the introduction of many new packages. (As we write this preface, there are 20,122 packages available on the CRAN package repository.) In particular, the tidyverse collection of packages has streamlined workflows for data visualization and data manipulation. In this third edition, we have revised the R code to embrace the new functions and paradigms available through the tidyverse. We have also ported the book’s source code from LaTeX to Quarto, enabling us to simultaneously maintain an online web version of the book at https://beanumber.github.io/abdwr3e/.
This third edition introduces three chapters with new material. One important aspect of working as a baseball analyst is communicating one’s findings to other people in the organization. Chapter 14 (Making a Scientific Presentation using Quarto) and Chapter 15 (Using Shiny for Baseball Applications) focus on communicating results via presentations built with the Quarto publishing system and web applications built with the Shiny package. Given the availability of large quantities of baseball data, one challenge is how to work with these data efficiently. Chapter 12 (Working with Large Data) explores methods for downloading, storing, retrieving, and analyzing large Statcast datasets. The “Batted Ball Data from Statcast” chapter from the previous edition of the book has been rewritten as Chapter 13 (Home Run Hitting) to focus on the interesting pattern of home run hitting during the Statcast era. Appendix A (Retrosheet Files Reference), Appendix B (Historical Notes on PITCHf/x Data), and Appendix C (Statcast Data Reference) have been revised to reflect new realities, most notably new functionality in the baseballr package and the disappearance of the PITCHf/x data source.
Preface from the First Edition
Baseball has always had a fascination with statistics. Schwarz (2004) documents the quantitative measurements of teams and players since the beginning of professional baseball history in the 19th century. Since the foundation of the Society for American Baseball Research (SABR) in 1971, there has been an explosion of new measures for understanding the offensive and defensive contributions of players. One can learn much about the current developments in sabermetrics by viewing articles at websites such as https://www.baseballprospectus.com, https://www.hardballtimes.com, and https://www.fangraphs.com.
The quantity and granularity of baseball data have exhibited remarkable growth since the birth of the Internet. The first data were collected for players and teams for individual seasons; these data were what would be displayed on the back side of a Topps baseball card. The volunteer-run Project Scoresheet organized the collection of play-by-play game data, and these data are currently freely available from the Retrosheet organization at https://www.retrosheet.org. Since 2006, the PITCHf/x system has measured the speed and trajectory of every pitched ball, and since 2015, Statcast has collected the speeds and locations of batted balls and the locations and movements of baserunners and fielders at fractions of a second.
The ready availability of these large baseball datasets has led to challenges for baseball enthusiasts interested in answering baseball questions with these data. It can be problematic to download and organize the data. Standard statistical software packages may be well-suited for working with small datasets of a specific format, but they are less helpful in merging datasets of different types or performing particular types of analyses, such as contour graphs of pitch locations, that are helpful for PITCHf/x data.
Fortunately, a new open-source statistical computing environment, R, has experienced increasing popularity within the statistics, data science, and computer science communities. R is a system for statistical computation and graphics, and it is a computer language designed for typical and possibly specialized statistical and graphical applications. The software is available for Linux, Windows, and Macintosh platforms from http://www.r-project.org.
The public availability of baseball data and the open-source R software make an attractive marriage. R provides a large range of tools for importing, arranging, and organizing large datasets. Through the use of built-in functions and collections of packages from the R user community, one can perform various data and graphical analyses, and communicate this work easily to other baseball enthusiasts over the Internet. In 2014, one of us asked a number of MLB team analytics groups about their use of R, and here are some responses:
- “We use: R, MySQL / Oracle, Perl, PHP”.
- “We do use R extensively, and it is our primary statistical package. The only other major tool we use is probably Excel”.
- “We do use R here. It is our primary statistical package for projects that need something more than the statistical functions in Excel”.
- “With the occasional exception of Python+NumPy, R is the only statistical programming language or package we use”.
- “We do use R. It’s used in conjunction with Excel for analysis”.
It is clear that R is a major tool for the analytical work of MLB teams.
The purpose of this book is to introduce R to sabermetricians, baseball enthusiasts, and students interested in exploring baseball data.
Overview of Chapters
The contents of this book can be divided into three themes: chapters devoted to popular topics within sabermetrics, chapters focusing on particular datasets, and chapters that illustrate R tools.
Sabermetrics
- Chapter 4: The Relation Between Runs and Wins
- Chapter 5: Value of Plays Using Run Expectancy
- Chapter 6: Balls and Strikes Effects
- Chapter 7: Catcher Framing
- Chapter 8: Career Trajectories
- Chapter 9: Simulation
- Chapter 10: Exploring Streaky Performances
Baseball Datasets
- Chapter 1: The Baseball Datasets
- Chapter 13: Home Run Hitting
- Appendix A: Retrosheet Files Reference
- Appendix B: Historical Notes on PITCHf/x Data
- Appendix C: Statcast Data Reference
R Tools
- Chapter 2: Introduction to R
- Chapter 3: Graphics
- Chapter 11: Using a Database to Compute Park Factors
- Chapter 12: Working with Large Data
- Chapter 14: Making a Scientific Presentation Using Quarto
- Chapter 15: Using Shiny for Baseball Applications
Two fundamental ideas in sabermetrics are the relationship between runs and wins, and the measurement of the value of baseball events by runs. Chapter 4 (The Relation Between Runs and Wins) explores the famous Pythagorean formula derived by Bill James, while Chapter 5 (Value of Plays Using Run Expectancy) and Chapter 6 (Balls and Strikes Effects) describe the value of plays and pitch sequences using run expectancy. It is fascinating to explore the career performance trajectories of ballplayers, and Chapter 8 (Career Trajectories) illustrates the use of R to fit quadratic models to player trajectories. Chapter 9 (Simulation) illustrates the use of R simulation functions to simulate a game of baseball with a Markov chain model and to simulate a season of baseball competition. Baseball fans are interested in streaky patterns of performance by teams and players, and Chapter 10 (Exploring Streaky Performances) explores methods of describing and understanding the significance of streaky patterns of hitting.
Chapter 1 (The Baseball Datasets) provides an overview of the publicly available baseball datasets, and Chapter 13 (Home Run Hitting) describes many of the new variables available in the Statcast system. The data files available through Retrosheet (Appendix A: Retrosheet Files Reference), MLBAM Gameday (Appendix B: Historical Notes on PITCHf/x Data), and Statcast (Appendix C: Statcast Data Reference) are relatively sophisticated, so we provide detailed descriptions of how to download these data and read them into R.
Chapter 2 (Introduction to R) gives a gentle introduction to the types of data structures and the exploratory and data management capabilities of R. One of the strongest features of R is its graphics capabilities, and Chapter 3 (Graphics) provides an overview of the ggplot2 graphics package. Given the large size of baseball datasets, it may be more convenient to work with a relational database, and Chapter 11 (Using a Database to Compute Park Factors) illustrates the application of several R packages to interface with a MySQL database. This material motivates a discussion in Chapter 12 (Working with Large Data) of the issues that arise when working with large datasets and of additional technologies for handling them. The book concludes in Chapter 14 (Making a Scientific Presentation using Quarto) and Chapter 15 (Using Shiny for Baseball Applications) by describing tools for communicating the results of baseball work.
How to Use this Book
We encourage readers to work with the book’s datasets and try out the presented R code as they read each chapter. All of the small data files and R code used in the book are available at the GitHub repository for the associated R package abdwr3edata (http://github.com/beanumber/abdwr3edata). In addition, at the “Exploring Baseball Data with R” book blog at https://baseballwithr.wordpress.com, the authors and others provide advice on using R in sabermetrics research and keep the reader informed of new developments in R software and baseball datasets.
There is an active academic research community in baseball, as demonstrated by refereed articles published in journals, particularly The Journal of Quantitative Analysis in Sports (JQAS) and the Journal of Sports Analytics. The recently published articles Brill, Deshpande, and Wyner (2023), Gerber and Craig (2021), Bouzarth et al. (2021), Hirotsu and Bickel (2019), and Healey (2019) in JQAS describe work on pitcher fatigue, prediction of future performance, proper defensive positioning, measuring the value of the sacrifice bunt, and measuring the value of a pitch. Reading these articles and attending sports analytics conferences (e.g., the New England Symposium on Statistics in Sports or the Carnegie Mellon Sports Analytics Conference) are great ways to deepen your knowledge. Recent work in sports analytics more broadly includes a new CRAN Task View for Sports Analytics that includes many of the R packages used in this book, a systematic review of these packages and their properties (Casals et al. 2023), and an attempt to connect big ideas (many of which originated in baseball and are described in this book) across various sports (Baumer, Matthews, and Nguyen 2023).
We imagine this book as a first step towards a professional career in baseball analytics. Other stops along the path to professionalization might include the SABR Analytics Certification courses. Three levels are offered, with the highest level presenting R programming material consistent with what appears in this book.
Acknowledgements
We have appreciated all of the positive comments and suggestions on the first two editions. We’re especially grateful to Jason Osborne and a slew of GitHub users for catching errors in the previous editions that we were able to correct in this edition. We believe the book is useful for quantitatively oriented baseball fans who would like to learn R to perform their own analyses. We agree with Donoho (2017) that a careful study of Tango, Lichtman, and Dolphin (2007) and this book would be an excellent introduction to data preparation and exploration within the context of baseball.
The authors are very grateful for the efforts of our editors, John Kimmel and David Grubbs, who played an important role in our collaboration and provided us with timely reviews that led to significant improvements of the manuscript. We wish to thank our partners, Anne, Ramona, and Cory, and our children, Lynne, Bethany, Steven, Alice, and Arlo, for their encouragement and inspiration. Although the three of us live thousands of miles apart, we share a passion for statistics, baseball, and the knowledge that one can learn about the game through the exploration of data.
Northampton, MA (and Medellín, Colombia) and Findlay, OH
December 2023