Goal: by the end of this lab, you will be able to use
ggplot2 to build different data graphics.
Remember: before we can use a library like
ggplot2, we have to load it. In this case, we load the
tidyverse package, which automatically loads
ggplot2 for us.
theme system for polishing plot appearance (more on this later)
The big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
In ggplot2, we can specify different parts of the plot and combine them using the + operator. [Note that the + operator is similar to the %>% pipe operator, but the two are not interchangeable!]
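To make the difference between the two operators concrete, here is a minimal sketch. The `toy` data frame and its columns are made up for illustration and are not part of the housing data. The pipe hands a data frame to the next function, while + adds a layer to a plot.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# %>% passes the filtered data frame to ggplot() as its first argument;
# + then adds a layer to the resulting plot object
p <- toy %>%
  filter(x > 1) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point()
p
```

Replacing the + before geom_point() with %>% raises an error, because geom_point() does not take a plot as its input.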
Let’s start by taking a look at some data on housing prices:
housing <- read.csv("http://www.science.smith.edu/~jcrouser/SDS192/landdata-states.csv")
glimpse(housing)
## Rows: 7,803
## Columns: 11
## $ State            <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK"…
## $ region           <chr> "West", "West", "West", "West", "West", "West", "Wes…
## $ Date             <dbl> 2010.25, 2010.50, 2009.75, 2010.00, 2008.00, 2008.25…
## $ Home.Value       <int> 224952, 225511, 225820, 224994, 234590, 233714, 2329…
## $ Structure.Cost   <int> 160599, 160252, 163791, 161787, 155400, 157458, 1600…
## $ Land.Value       <int> 64352, 65259, 62029, 63207, 79190, 76256, 72906, 694…
## $ Land.Share..Pct. <dbl> 28.6, 28.9, 27.5, 28.1, 33.8, 32.6, 31.3, 29.9, 28.7…
## $ Home.Price.Index <dbl> 1.481, 1.484, 1.486, 1.481, 1.544, 1.538, 1.534, 1.5…
## $ Land.Price.Index <dbl> 1.552, 1.576, 1.494, 1.524, 1.885, 1.817, 1.740, 1.6…
## $ Year             <int> 2010, 2010, 2009, 2009, 2007, 2008, 2008, 2008, 2008…
## $ Qrtr             <int> 1, 2, 3, 4, 4, 1, 2, 3, 4, 1, 2, 2, 3, 4, 1, 2, 3, 4…
Geometric objects or
geoms are the actual marks we put on a plot. Examples include:
points (geom_point, for scatter plots, dot plots, etc.)
lines (geom_line, for time series, trend lines, etc.)
boxplots (geom_boxplot, for, well, boxplots!)
A plot should have at least one
geom, but there is no upper limit. You can add a
geom to a plot using the + operator.
You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")
or simply type
geom_<tab> in RStudio to see a list of functions starting with geom_.
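Outside RStudio, the same list can be produced programmatically. The sketch below is one way to do it, assuming ggplot2 is attached; the variable name `geoms` is only illustrative.

```r
library(ggplot2)

# List every exported ggplot2 function whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")
head(geoms)
length(geoms)
```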
In ggplot2, aesthetic means “something you can see”. Each aesthetic is a mapping between a visual cue and a variable. Examples used in this lab include position (x and y), color, shape, size, and label.
Each type of geom accepts only a subset of all aesthetics; refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
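The help pages are the readable reference, but the same information can also be inspected from the console. The sketch below relies on the exported Geom objects (GeomPoint, GeomText) that sit behind the geom_ functions; treat it as a quick way to peek, not a replacement for the documentation. The variable names are illustrative.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot draw without
point_required <- GeomPoint$required_aes
point_required

# Optional aesthetics for points that have a default value
point_optional <- names(GeomPoint$default_aes)
point_optional

# geom_text() additionally requires a label
text_required <- GeomText$required_aes
text_required
```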
Now that we know about geometric objects and aesthetic mapping, we’re ready to make our first
ggplot: a scatterplot. We’ll use
geom_point to do this, which requires
aes mappings for x and y; all others are optional.
hp2013Q1 <- housing %>%
  filter(Date == 2013.25)

ggplot(hp2013Q1, aes(y = Structure.Cost, x = Land.Value)) +
  geom_point()
# sample solution
ggplot(hp2013Q1, aes(y = Home.Value, x = Land.Value)) +
  geom_point()
The output of the
ggplot() function is an object. Since we want to modify the plot that we created above, it’s helpful to store the plot as an object.
base_plot <- ggplot(hp2013Q1, aes(y = Structure.Cost, x = Land.Value))
To actually show the plot, we just print it. Note that this plot doesn’t show anything because we haven’t added any
geoms yet! Still, the aesthetic mappings are defined, and any subsequent
geoms that we add will use those mappings.
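A small sketch of what "no geoms yet" means in practice. The `toy` data frame is invented for illustration; the plot object stores its layers in a list, which is empty until a geom is added.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(cost = c(10, 20, 30), value = c(1, 3, 2))

empty_plot <- ggplot(toy, aes(x = value, y = cost))
length(empty_plot$layers)    # no geoms yet

with_points <- empty_plot + geom_point()
length(with_points$layers)   # one layer after adding geom_point()
```

Adding a geom returns a new plot object; `empty_plot` itself is left unchanged, which is why a stored base plot can be reused with different geoms.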
base_plot +
  geom_point()
# sample solution
home_value_plot <- ggplot(hp2013Q1, aes(y = Home.Value, x = Land.Value)) +
  geom_point()
A plot constructed with
ggplot can have more than one
geom. In that case, the mappings established in the
ggplot() call are plot defaults that can be added to or overridden. For example, we could connect all of the points using
geom_line(). Note that now we see both points and lines!
base_plot +
  geom_point() +
  geom_line()
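The phrase "added to or overridden" can be sketched with a made-up data frame (`toy` and its columns are assumptions for illustration). One layer adds a mapping to the plot defaults, and another replaces one of them.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2, 4, 3, 5),
                  z = c(5, 3, 4, 2),
                  g = c("a", "a", "b", "b"))

p <- ggplot(toy, aes(x = x, y = y)) +   # plot defaults: x and y
  geom_point(aes(color = g)) +          # adds a color mapping for this layer only
  geom_line(aes(y = z))                 # overrides the default y for this layer only
p
```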
geom_line() in this case? Do the lines help us understand the connections between the observations better? What do the lines represent?
Not all geometric objects are simple shapes—
geom_smooth() includes both a line and a ribbon.
base_plot +
  geom_point() +
  geom_smooth()
Each geom accepts a particular set of aesthetics (i.e., mappings); for example, geom_text() accepts a label mapping.
base_plot +
  geom_text(aes(label = State), size = 3)
Note that variables are mapped to aesthetics with the
aes() function, while fixed visual cues are set outside the
aes() call. This sometimes leads to confusion, as in this example:
base_plot +
  geom_point(aes(size = 2),  # not what you want because 2 is not a variable
             color = "red")  # this is fine -- turns all points red
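To see the two cases side by side, here is a minimal sketch on a made-up data frame (`toy` is an assumption for illustration). In the first plot the constant is treated as data, so ggplot2 builds a size scale and a legend for it; in the second, size and color are fixed properties of the layer.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))

# Mapping: 2 is handled like a variable, which produces a size legend
mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(size = 2))

# Setting: size and color are fixed for every point, with no legend
fixed <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 2, color = "red")

mapped
fixed
```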
The aes() function can also be used outside of a call to a
geom. Here, we update the
base_plot to map color to home value.
base_plot <- base_plot +
  aes(color = Home.Value)
Using home_value_plot, map color to the cost of the structure and show your scatterplot.
# sample solution
home_value_plot +
  aes(color = Structure.Cost) +
  geom_point()
Other aesthetics are mapped in the same way as x and y in the previous example.
base_plot +
  geom_point(aes(shape = region))
## Warning: Removed 1 rows containing missing values (geom_point).
Aesthetic mapping (i.e., with
aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with
aes(shape = z) you don’t say what shapes should be used. Similarly,
aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale.
Scales are modified with a series of functions using a
scale_<aesthetic>_<type> naming scheme. Try typing
scale_<tab> to see a list of scale modification functions.
The following arguments are common to most scales:
name: the first argument specifies the axis or legend title
limits: the minimum and maximum of the scale
breaks: the points along the scale where labels should appear
labels: the text that appears at each break
Specific scale functions may have additional arguments; for example, the scale_color_continuous() function has arguments low and high for setting the colors at the low and high end of the scale.
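As a minimal sketch of these arguments working together, the example below uses a made-up data frame (`toy`, with hypothetical `year` and `price` columns) and not the housing data. The axis title, limits, breaks, and labels are all set on the x scale, and low and high are set on the color scale.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(year = c(2000, 2005, 2010, 2015),
                  price = c(1.0, 1.4, 1.2, 1.6))

p <- ggplot(toy, aes(x = year, y = price, color = price)) +
  geom_point() +
  scale_x_continuous(name = "Year",               # axis title
                     limits = c(1995, 2020),      # minimum and maximum of the scale
                     breaks = c(2000, 2010),      # where tick labels appear
                     labels = c("'00", "'10")) +  # text shown at each break
  scale_color_continuous(name = "Price index", low = "blue", high = "red")
p
```

The breaks and labels vectors must have the same length, since each label is drawn at the corresponding break.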
Start by constructing a dotplot showing the distribution of home values by State and Date:
home_plot <- ggplot(housing, aes(y = State, x = Home.Price.Index)) +
  geom_point(aes(color = Date), alpha = 0.3, size = 1.5,
             position = position_jitter(width = 0, height = 0.25))
First, we will change the label on the vertical axis.
home_plot <- home_plot +
  scale_y_discrete(name = "State Abbreviation")
Now let’s modify the breaks and labels for the color scale:
home_plot +
  scale_color_continuous(breaks = c(1975.25, 1994.25, 2013.25),
                         labels = c(1975, 1994, 2013))
Next change the low and high values to blue and red:
home_plot <- home_plot +
  scale_color_continuous(breaks = c(1975.25, 1994.25, 2013.25),
                         labels = c(1975, 1994, 2013),
                         low = "blue", high = "red")
home_plot
ggplot2 has a wide variety of color
scales; here is an example using
scale_color_gradient2 to interpolate between three different colors:
home_plot +
  scale_color_gradient2(breaks = c(1975.25, 1994.25, 2013.25),
                        labels = c(1975, 1994, 2013),
                        low = "blue", high = "red", mid = "gray60",
                        midpoint = 1994.25)
Use geom_vline() to add a dotted, black, vertical line to the plot we created above.
# sample solution
home_plot +
  geom_vline(aes(xintercept = 1), linetype = 3, color = "black") +
  scale_color_gradient2(breaks = c(1975.25, 1994.25, 2013.25),
                        labels = c(1975, 1994, 2013),
                        low = "blue", high = "red", mid = "gray60",
                        midpoint = 1994.25)
Layers in ggplot2 are added sequentially. How would you put the dotted vertical line you created in the previous exercise behind the data values?
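One possible approach, sketched on a made-up data frame and not on home_plot itself (`toy` and the name `line_behind` are assumptions for illustration): because layers are drawn in the order they are added, a reference line added before the points ends up underneath them.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(x = c(0.8, 1.0, 1.2), y = c("a", "b", "c"))

# The line is added first, so the points added second are drawn on top of it
line_behind <- ggplot(toy, aes(x = x, y = y)) +
  geom_vline(xintercept = 1, linetype = 3, color = "black") +
  geom_point(size = 4, color = "red")
line_behind
```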
Here’s a (partial) combination matrix of available scales:
Note: in RStudio, you can type
scale_ followed by TAB to get the whole list of available scales.
Faceting is ggplot2 parlance for small multiples
ggplot2 offers two functions for creating small multiples:
facet_wrap(): define subsets as the levels of a single grouping variable
facet_grid(): define subsets as the crossing of two grouping variables
Faceting facilitates comparison among plots, not just of geoms within a plot
Let’s start by using a technique we already know: map State to color.
state_plot <- ggplot(housing, aes(x = Date, y = Home.Value))
state_plot +
  geom_line(aes(color = State))
There are two problems here: there are too many states to distinguish each one by
color, and the lines obscure one another.
We can fix the previous plot by faceting by State rather than mapping State to color.
state_plot +
  geom_line() +
  facet_wrap(~State, ncol = 10)
There is also a
facet_grid() function for faceting in two dimensions.
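A minimal sketch of two-dimensional faceting. To keep it self-contained, it uses the mpg data set that ships with ggplot2 as a stand-in for the housing data; the choice of variables is only illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the housing data here
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ year)   # rows: drive train, columns: model year
p
```

The formula reads rows ~ columns, so each panel shows one combination of the two grouping variables.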
Use facet_wrap() to create a data graphic of your choice that illustrates something interesting about home prices.
# sample solution
ggplot(housing, aes(x = Date, y = Home.Price.Index, color = State)) +
  geom_hline(aes(yintercept = 1), linetype = 3, color = "black") +
  geom_point(alpha = 0.1) +
  geom_line() +
  facet_wrap(~region)
Please respond to the following prompt on Slack in the
Prompt: Post an image of the data graphic you created in Exercise 7
This lab is based on the “Introduction to R Graphics with
ggplot2” workshop, which is a product of the Data Science Services team at Harvard University. The original source is released under a Creative Commons Attribution-ShareAlike 4.0 Unported license. This lab was adapted for SDS192: Introduction to Data Science in Spring 2017 by R. Jordan Crouser at Smith College.