In this lab, we will learn how to ingest data from a variety of external sources.

library(tidyverse)

Goal: by the end of this lab, you will be able to read data into R from different sources.

Data from R packages

Many R packages contain data.frames. Sometimes, delivering these data sets is the main purpose of the package. Other times, these data sets are used in the examples that illustrate the functions that the package provides. Either way, they are available once you have the package loaded.

To see what data frames are available to your R session, use the data() command. This will cause a list of data objects organized by package to pop up in your RStudio window.

You should NOT include this command in an R Markdown document, since that window won’t pop up!

data()

If you want to see only those data objects provided by a specific package, you can specify the package argument.

data(package = "dplyr")
  1. List all of the data objects provided by the ggplot2 package.

Note that not all data objects listed by data() are data.frames! Some of them are lists and some have class matrix.
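One way to list data objects without the pop-up window is to inspect the value that data() returns. The sketch below is only an illustration: it assumes the datasets package that ships with R, which is not otherwise used in this lab, and checks the type of three of its objects.

```r
# data() returns an object whose `results` component is a character matrix
# with one row per data object (columns include "Item" and "Title")
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Not every data object is a data.frame
is.data.frame(mtcars)       # TRUE: a data frame
is.matrix(volcano)          # TRUE: a numeric matrix
is.list(ability.cov)        # TRUE: a list ...
is.data.frame(ability.cov)  # FALSE: ... but not a data frame
```

Because this version prints to the console rather than opening a window, it is safe to include in an R Markdown document.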

Lazy-loading

Typically, data objects provided by R packages are loaded lazily. This means that you don’t have to explicitly load the data with the data() function. Once the package is loaded, the data is made available to you. However, not all packages do this.
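The difference between lazy and explicit loading can be sketched as follows. The example again assumes the datasets package, which lazy-loads its data, so the explicit data() call here is only a stand-in for what you would do with a package that does not.

```r
# datasets is attached by default and lazy-loads its data,
# so mtcars is usable without any explicit call to data()
nrow(mtcars)

# For a package that does not lazy-load, request the object explicitly;
# this copies it into your workspace (the global environment by default)
data("mtcars", package = "datasets")
exists("mtcars", envir = globalenv(), inherits = FALSE)
```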

Data from a CSV

The simplest and most common way to get data into R is to have it stored as a CSV. CSV stands for comma-separated values, and it is a non-proprietary, plain-text format for sharing tabular data in rows and columns.

The read.csv() command from base R has largely been superseded by the read_csv() command from readr. readr is part of the tidyverse; read_csv() is typically faster than read.csv(), returns a tibble, and never converts your character vectors to factors (read.csv() did so by default before R 4.0.0). I can’t think of any reason to use read.csv() when you could use read_csv().
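To compare the two functions without a network connection, here is a minimal sketch that writes a small CSV to a temporary file and reads it back both ways. The file contents and the names toy and toy_base are made up for illustration.

```r
library(readr)

# Hypothetical two-column CSV written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("title,pages", "Alpha,120", "Beta,85"), tmp)

# read_csv() returns a tibble; stating col_types makes the parsing explicit
toy <- read_csv(
  tmp,
  col_types = cols(title = col_character(), pages = col_double())
)
sapply(toy, class)

# Base R equivalent, with the factor conversion switched off explicitly
toy_base <- read.csv(tmp, stringsAsFactors = FALSE)
sapply(toy_base, class)
```

If you leave out col_types, read_csv() guesses the column types and prints a message describing its guesses.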

You can read a CSV file straight from the Internet via a URL. CSVs should have the file extension .csv.

url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"
nyc <- read_csv(url)
glimpse(nyc)
  1. Use read_csv() to load data from the URL http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv into a data frame in R. How many rows and columns does it have?
# sample solution
magazines <- read_csv(url)
dim(magazines)

Writing CSVs

To write a data frame to a CSV, use the write_csv() function. Note that by default, the file will be written to the working directory (see below).

  1. Use write_csv() to write the magazines data frame to a CSV.
# sample solution
# the file argument was called path in readr versions before 1.4.0
write_csv(magazines, file = "magazines.csv")
  1. Try to write the CSV to happytimes/magazines.csv. What does the error message say? What does it mean?
# sample solution
write_csv(magazines, file = "happytimes/magazines.csv")
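If you do want to write into a subdirectory, one possible approach is to create the directory first. The sketch below uses R's temporary directory and a made-up data frame so that it does not touch your project. The names out_dir and out_file are hypothetical.

```r
library(readr)

# Hypothetical output location inside R's temporary directory
out_dir <- file.path(tempdir(), "happytimes")
out_file <- file.path(out_dir, "toy.csv")

# write_csv() does not create missing directories, so create it first
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Passing the file name by position works in old and new readr versions
write_csv(data.frame(x = 1:3, y = c("a", "b", "c")), out_file)
file.exists(out_file)
```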

Paths

You can also load CSVs from your local computer. To do so, you need to use the correct path, which should be relative and not absolute.

Please read the linked articles if this is new to you!

The key to getting paths to work is to understand R’s working directory. If you are unsure, use getwd().

getwd()

NEVER USE THE setwd() COMMAND! It’s almost as bad as attach().

Special paths

There are a few special paths that are useful:

  • . is the working directory
  • ~ is the user’s home directory. On my computer, ~ is equivalent to /home/bbaumer.
  • .. is the parent directory. So ../magazines.csv refers to a file called magazines.csv in the directory that is one level above the working directory.

You can always use list.files() to see what files are available in the working directory (or any other directory).

list.files()
list.files("~")
  1. Use list.files() to see what files are available in the parent directory of your working directory.
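To see what the special paths resolve to on your own machine, you can ask R to expand them. This is a small sketch using base R functions only; the output depends on your computer, so none is shown.

```r
# The working directory, written two equivalent ways
getwd()
normalizePath(".")

# The parent of the working directory, as an absolute path
normalizePath("..")

# The home directory that ~ stands for
path.expand("~")

# Build a relative path from its components
file.path("..", "magazines.csv")
```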

There are several problems that people commonly run into when trying to specify the right path:

R session vs. R Markdown

By default, the working directory for your R Markdown document is the directory in which that document is stored. That may differ from the working directory of your current R session.

R project root directory vs. R Markdown

Suppose that you stored your R Markdown files in a directory called rmd that lives within the root directory of your project. Then the working directory of the R Markdown files will be different from your project root directory.

Paths don’t match on other people’s computers

A file that exists in a certain place on your computer may not exist in that same place on someone else’s computer. Since it is unworkable to force people to put files in the same places, we often use relative paths. If you and your partner both check out the same project from GitHub and use relative paths, then these relative paths should work on both of your computers, regardless of where the project root directory is on either of your computers.

For example, this lab file is called lab-import.Rmd and it is stored in the www directory of this project. The magazines data file we created earlier is in the root directory of this project. The appropriate relative path depends on my working directory.

getwd()
## [1] "/home/bbaumer/Dropbox/git/sds192/www"
# works in console, but not when knitting
mags <- read_csv("magazines.csv")
## Error: 'magazines.csv' does not exist in current working directory ('/home/bbaumer/Dropbox/git/sds192/www').
# works when knitting, but not in console
mags <- read_csv("../magazines.csv")
glimpse(mags)

The here package is designed to work around this problem. Using the here() function results in paths that are always relative to the project root directory.

# works in both places
mags <- read_csv(here::here("magazines.csv"))
glimpse(mags)
  1. Read the previous example carefully and make sure you understand what is going on.

  2. Create a new directory in your current project called happytimes. Open a new R Markdown document and save it in happytimes. Include a chunk with the getwd() command in the Markdown file, and knit it. Do you get the same result as when you run getwd() in the console of your R session?
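When a path built with here() does not point where you expect, it can help to print what here() thinks the project root is. The following is a minimal sketch, assuming the here package is installed; the data subdirectory is hypothetical.

```r
library(here)

# The directory that here has identified as the project root
here()

# Build a path from the root, one component per argument
# ("data" is a hypothetical subdirectory)
here("data", "magazines.csv")

# Check whether a file exists before trying to read it
file.exists(here("magazines.csv"))
```

Running the same chunk in the console and in a knitted document should print the same root, which is the point of using here().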

Data from an Excel file

R can read a worksheet from an Excel file into a data frame using the read_excel() function provided by the readxl package. Excel files should have the file extension .xls or .xlsx.

Note that a worksheet in an Excel file may not be just rows and columns of tabular data. In this event, read_excel() will get confused and you may have to specify which cells you want to import. You can do this by specifying the range argument.
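As a rough illustration of the sheet and range arguments, the sketch below reads a block of cells from an example workbook that is bundled with readxl. This is not the file used in this lab; the sheet name and cell range are specific to that example workbook.

```r
library(readxl)

# Example workbook bundled with readxl (not the file used in this lab)
path <- readxl_example("datasets.xlsx")

# List the worksheets in the workbook
excel_sheets(path)

# Read only cells A1:C6 of the "mtcars" sheet:
# one header row plus five data rows, three columns
cars_subset <- read_excel(path, sheet = "mtcars", range = "A1:C6")
dim(cars_subset)
```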

Unlike read_csv(), read_excel() can’t read directly from a URL, so you need to have the file stored locally. Here, we use the download.file() function to download it.

src <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/GreatestGivers.xls"
lcl <- basename(src)
# mode = "wb" downloads in binary mode, which keeps the file intact on Windows
download.file(url = src, destfile = lcl, mode = "wb")

Then we import it.

library(readxl)
philanthropists <- read_excel(lcl)

It is not advisable to store your data in an .xls format. Use .csv or a database.

Data from a Google spreadsheet

The googlesheets4 package (the successor to the now-retired googlesheets package) enables you to import a Google Sheet directly into R.

This is just a bit more complicated because of the permissions that Google puts on these files, but it works well. Please see the package documentation to use it.

Data in other formats

R can read data from just about any format. See the haven package (or the older foreign package) for SPSS, Stata, or SAS files. See the jsonlite package for JSON and xml2 for XML. Google or consult me for any other formats!
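As one example of another format, here is a minimal sketch of reading JSON with jsonlite. The JSON string is made up for illustration; in practice you would pass fromJSON() a file path or URL instead.

```r
library(jsonlite)

# Hypothetical JSON: an array of records with the same fields
json <- '[{"title": "Alpha", "pages": 120}, {"title": "Beta", "pages": 85}]'

# fromJSON() simplifies an array of records into a data frame by default
toy <- fromJSON(json)
str(toy)
```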

Saving and loading data

Like other data analysis programs, R has its own data format. These files should have the .rda or .RData file extensions. If you are sure you only want to use some data in R, it’s a good solution because it can be read quickly and compressed so that it takes up less space on disk.

You can save any object using the save() command. Use the .rda file extension and the compress argument.

save(magazines, file = "magazines.rda", compress = "xz")
  1. Save your magazines object to your home directory (i.e., use ~).

Load .rda files back into your R session with load().

load("~/magazines.rda")

The object stored in the file will automatically appear in your workspace.
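The round trip can be sketched with a temporary file and a made-up data frame (toy is a hypothetical name). Note that load() restores the object under the name it had when it was saved, and silently overwrites any existing object with that name.

```r
# Hypothetical object to save
toy <- data.frame(x = 1:3, y = c("a", "b", "c"))
tmp <- tempfile(fileext = ".rda")

save(toy, file = tmp, compress = "xz")

# Remove the object, then restore it from disk
rm(toy)
loaded <- load(tmp)

# load() invisibly returns the names of the objects it restored
loaded
exists("toy")
```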

Your learning

Please respond to the following prompt on Slack in the #mod-programming channel.

Prompt: What questions do you still have about importing data into R? Are there other sources of data that you would like to get into R?