In this lab, we will learn how to use rvest and lubridate to scrape tabular data from web pages and to work with dates and times, respectively.
library(tidyverse)
Goal: by the end of this lab, you will be able to pull data from the web directly into R and work sensibly with date/time variables.
rvest
The Internet is a great place to get data. We can use rvest to scrape data in HTML tables from the web, but it will often require extensive cleaning before it can be used appropriately.
Consider the following list of largest box office opening weekends:
(http://www.boxofficemojo.com/alltime/weekends/)
We'll use rvest to bring this table into R.
library(rvest)
url <- "http://www.boxofficemojo.com/alltime/weekends/"
First, we'll need to read the contents of the page in HTML. The read_html() function provided by rvest ingests HTML:
html_bom <- read_html(url)
class(html_bom)
html_bom
Unfortunately, this isn't very readable. What we want is to extract the data that is embedded in the HTML tables. Let's start by just grabbing those tables, which are inside HTML table elements. We can use html_nodes() to do this:
tables <- html_bom %>%
  html_nodes("table")
tables
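To see what html_nodes() returns without depending on a live web page, here is a minimal sketch that parses a small HTML string instead. The markup, the class names, and the demo_ object names are all made up for illustration; the structure of the real page may differ.

```r
library(rvest)

# A made-up HTML snippet with two tables: one for layout, one with data
demo_html <- read_html('<html><body>
  <table class="layout"><tr><td>border</td></tr></table>
  <table class="data">
    <tr><th>Release</th><th>Opening</th></tr>
    <tr><td>Movie A</td><td>$100</td></tr>
  </table>
</body></html>')

# Select every <table> element with a CSS selector
demo_nodes <- html_nodes(demo_html, "table")
length(demo_nodes)

# A more specific CSS selector narrows the match to a single table
demo_data_node <- html_nodes(demo_html, "table.data")
length(demo_data_node)
```

When a page mixes layout tables with data tables, a more specific selector like this can replace some of the trial and error.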
In this case, there is only one table element on that page, so the table with all the data is the 1st (and only) element in the list. On pages with several table elements, finding the right one may take some trial and error.
tables[[1]]
The html_table() function will pull the data out of this table and convert it into a data frame. The header = TRUE option tells R that we want to use the first row as our variable names.
movies <- tables[[1]] %>%
  html_table(header = TRUE)
glimpse(movies)
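The conversion step can also be tried offline. The following is a rough sketch on a hand-written table; the column names mimic the ones used in this lab, but the rows are invented placeholders and not real box office figures.

```r
library(rvest)

# A made-up table with a header row and two data rows
demo_page <- read_html('<html><body>
  <table>
    <tr><th>Release</th><th>Opening</th><th>% of Total</th></tr>
    <tr><td>Movie A</td><td>$1,200</td><td>40.0%</td></tr>
    <tr><td>Movie B</td><td>$950</td><td>25.5%</td></tr>
  </table>
</body></html>')

# Grab the first table node and convert it to a data frame,
# using the first row as the variable names
demo_movies <- html_table(html_nodes(demo_page, "table")[[1]], header = TRUE)

names(demo_movies)

# Values with dollar signs, commas, or percent signs stay as text
class(demo_movies$Opening)
```

Note that the Opening column is still a character vector after conversion, which is the kind of cleaning problem addressed later in this lab.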
In this case we only had 1 table, so it was not too hard to use trial-and-error to figure out which was the one we wanted. But we could also be a bit more systematic.
Let's use map() to extract every table on the page:
list_of_tables <- map(tables, html_table, fill = TRUE)
Note that list_of_tables is a list of length 1.
class(list_of_tables)
length(list_of_tables)
str(list_of_tables)
Since html_table() maps HTML tables to data.frames in R, the single element in the list list_of_tables is a data.frame.
map(list_of_tables, class)
However, when a page contains several tables, some of them are bigger than others. Use map() and dim() to determine the size of each table in list_of_tables.
map(list_of_tables, dim)
It's obvious from the web page itself that the table we want has 9 variables and 214 rows. Only the 1st element of our list meets those criteria.
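When a page does return several tables, the dimension check can be automated rather than read off by eye. This is a small sketch on a hypothetical list of two data frames (demo_list and its contents are invented), assuming we already know the expected number of rows and columns.

```r
library(purrr)

# Two made-up data frames standing in for scraped tables
demo_list <- list(
  data.frame(note = c("header", "footer")),
  data.frame(title = c("A", "B", "C"), opening = c("$3", "$2", "$1"))
)

# dim() returns c(rows, columns) for each table
demo_dims <- map(demo_list, dim)
demo_dims

# Flag the tables whose dimensions match what we expect (3 rows, 2 columns)
is_match <- map_lgl(demo_dims, ~ identical(.x, c(3L, 2L)))

# Position of the matching table within the list
which(is_match)
```

For the table in this lab, the expected dimensions would be the 214 rows and 9 variables mentioned above.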
While we now have the data, note that it is very messy: several columns are stored as character vectors, even though they contain quantitative information. In particular, there are columns for dollars, percentages, and dates that are all in the wrong format. Because of this mismatch, plotting the data will not work as expected.
ggplot(
  data = movies,
  aes(x = Date, y = Opening)
) +
  geom_point(aes(size = `% of Total`))
Not even close to what we want! The parse_number() function from the readr package is extremely useful for cleaning up dollar signs, commas, and percentage signs. We'll use this in conjunction with the mutate() verb to rename the columns at the same time.
movies <- movies %>%
  mutate(
    opening = parse_number(Opening),
    percent_total = parse_number(`% of Total`) / 100
  )
glimpse(movies)
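To make the behavior of parse_number() concrete, here is a short sketch applied to a few hand-typed strings. The values are invented placeholders in the same style as the scraped columns, not figures taken from the page.

```r
library(readr)

# Made-up strings formatted like the scraped dollar and percent columns
demo_dollars <- c("$1,234,567", "$89,000")
demo_percents <- c("41.6%", "38.0%")

# parse_number() drops the non-numeric characters around the number
# and returns a numeric (double) vector
demo_dollars_num <- parse_number(demo_dollars)
demo_dollars_num

# Percentages come back on a 0-100 scale, so divide to get proportions
demo_props <- parse_number(demo_percents) / 100
demo_props
```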
Now when we plot the quantitative data, we get something that makes more sense.
ggplot(data = movies, aes(x = Date, y = opening)) +
geom_point(aes(size = percent_total))
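Create a new variable called num_theaters that stores the number of theaters as an integer. Note that parse_number() returns a double, so wrap it in as.integer() if a true integer type is required.
movies <- movies %>%
  mutate(num_theaters = parse_number(Theaters))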
Create new variables called avg_gross and total_gross, based on Average and Total Gross, respectively, that are integers and have names that follow the style guide.
movies <- movies %>%
  mutate(
    avg_gross = parse_number(Average),
    total_gross = parse_number(`Total Gross`)
  )
lubridate
Unfortunately, the dates are still a problem. Let’s take a closer look at those dates:
movies %>%
  select(Date) %>%
  glimpse()
We see that the dates are in month/day/year format. The lubridate package provides functionality for working with dates, and we can use the mdy() function to convert the character vector into a vector of class Date.
library(lubridate)
movies <- movies %>%
  mutate(release_date = mdy(Date))
glimpse(movies)
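As a quick illustration of what mdy() does, the sketch below parses two hand-typed month/day/year strings. The strings are placeholders chosen for illustration, and the exact format of the scraped Date column may differ.

```r
library(lubridate)

# Made-up date strings in month/day/year order
demo_strings <- c("4/26/2019", "12/18/2015")

# mdy() parses them into a vector of class Date
demo_dates <- mdy(demo_strings)
class(demo_dates)

# Once parsed, components can be extracted
year(demo_dates)
month(demo_dates)

# Date arithmetic also works: time elapsed between the two dates
demo_dates[1] - demo_dates[2]
```

Because the result is a true Date vector, ggplot2 can place the values on a continuous time axis instead of treating them as unordered labels.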
Now our plot makes sense, especially if we use the scales package to format our axes.
ggplot(data = movies, aes(x = release_date, y = opening)) +
  # We want a scatterplot, and we'll use both color and size to show percent_total
  geom_point(aes(color = percent_total, size = percent_total)) +
  # Clever trick to combine color and size into a single legend
  guides(color = guide_legend("Percent Total"),
         size = guide_legend("Percent Total")) +
  # Format the y-axis to show $ amount
  scale_y_continuous("Opening Weekend Gross", labels = scales::dollar) +
  # Label our axes
  scale_x_date("Release Date") +
  scale_color_viridis_c()
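The labels argument above accepts a function that turns numeric axis breaks into strings. As a minimal sketch of what scales::dollar does on its own, here it is applied to a made-up number:

```r
library(scales)

# dollar() formats a number as a currency string with comma separators
demo_label <- dollar(1234567)
demo_label

# Passing the function itself (no parentheses) to labels = lets ggplot2
# apply it to whatever breaks it chooses for the axis
demo_formatter <- dollar
demo_formatter(c(1000, 2500000))
```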