In this lab, we will learn how to ingest data from a variety of external sources.
library(tidyverse)
Goal: by the end of this lab, you will be able to read data into R from different sources.
Many R packages contain data.frames. Sometimes, delivering these data sets is the main purpose of the package. Other times, these data sets are used in the examples that illustrate the functions that the package provides. Either way, they are available once you have the package loaded.
To see what data frames are available to your R session, use the data() command. This will cause a list of data objects organized by package to pop up in your RStudio window.
You should NOT include this command in an R Markdown document, since that window won’t pop up!
data()
If you want to see only those data objects provided by a specific package, you can specify the package argument.
data(package = "dplyr")
Use data() to list the data objects provided by the ggplot2 package.
Note that not all data objects listed by data() are data.frames! Some of them are lists and some have class matrix.
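As a rough illustration of the note above, the sketch below inspects the data objects in the base datasets package, which is attached by default. The choice of mtcars, volcano, and state.center is only for illustration; any other objects would do.

```r
# data() returns an object whose $results matrix has one row per data object.
# Assigning the result (instead of printing it) avoids the pop-up window.
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Not every data object is a data frame:
is.data.frame(mtcars)        # mtcars is a data frame
is.matrix(volcano)           # volcano is a matrix
is.list(state.center)        # state.center is a plain list...
is.data.frame(state.center)  # ...and not a data frame
```

Because the result is assigned rather than printed, this version is also safe to include in an R Markdown document.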
Typically, data objects provided by R packages are loaded lazily. This means that you don’t have to explicitly load the data with the data() function. Once the package is loaded, the data is made available to you. However, not all packages do this.
The simplest and most common way to get data into R is to have it stored as a CSV. CSV means Comma Separated Values, and it is a non-proprietary format for sharing tabular data in rows and columns.
The read.csv() command from base R has been superseded by the read_csv() command from readr. readr is part of the tidyverse, and read_csv() is faster than read.csv() and will not automatically convert your character vectors to factors (which read.csv() did by default before R 4.0.0). I can’t think of any reason to use read.csv() when you could use read_csv().
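To make the comparison concrete, here is a minimal sketch that reads the same small file both ways. The file contents and the column names (title, pages, price) are invented for illustration, and stringsAsFactors = TRUE is set explicitly to reproduce the older read.csv() behavior.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("title,pages,price",
             "Alpha,120,4.99",
             "Beta,96,3.50"), csv_path)

# read_csv() keeps text columns as character vectors.
toy <- read_csv(csv_path)
class(toy$title)

# read.csv() converts them to factors when stringsAsFactors = TRUE.
base_toy <- read.csv(csv_path, stringsAsFactors = TRUE)
class(base_toy$title)
```

The first call reports "character" and the second reports "factor" for the same column.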
You can read a CSV file straight from the Internet via a URL. CSVs should have the file extension .csv.
url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"
nyc <- read_csv(url)
glimpse(nyc)
Use read_csv() to load data from the URL http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv into a data frame in R. How many rows and columns does it have?
# sample solution
magazines <- read_csv(url)
dim(magazines)
To write a data frame to a CSV, use the write_csv() function. Note that by default, the file will be written to the working directory (see below).
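A quick way to check that a file was written as expected is to read it straight back in. The sketch below does this with a small made-up data frame (toy) and writes to a temporary directory rather than the working directory, so it leaves nothing behind in your project.

```r
library(readr)

# A small, invented data frame to write out.
toy <- data.frame(title = c("Alpha", "Beta"), pages = c(120, 96))

# Build an explicit path instead of relying on the working directory.
out_path <- file.path(tempdir(), "toy-magazines.csv")
write_csv(toy, out_path)
file.exists(out_path)

# Read the file back and confirm the shape survived the round trip.
round_trip <- read_csv(out_path)
dim(round_trip)
```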
Use write_csv() to write the magazines data frame to a CSV.
# sample solution
write_csv(magazines, "magazines.csv")
Try to write the magazines data frame to happytimes/magazines.csv. What does the error message say? What does it mean?
# sample solution
write_csv(magazines, "happytimes/magazines.csv")
You can also load CSVs from your local computer. To do so, you need to use the correct path, which should be relative and not absolute.
Please read the linked articles if this is new to you!
The key to getting paths to work is to understand R’s working directory. If you are unsure, use getwd().
getwd()
NEVER USE THE setwd() COMMAND! It’s almost as bad as attach().
There are a few special paths that are useful:
. is the working directory.
~ is the user’s home directory. On my computer, ~ is equivalent to /home/bbaumer.
.. is the parent directory. So ../magazines.csv refers to a file called magazines.csv in the directory that is one level above the working directory.
You can always use list.files() to see what files are available in the working directory (or any other directory).
list.files()
list.files("~")
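If you are unsure what one of these special paths refers to on your machine, you can ask R to expand it. The following is a minimal sketch using base R only; the output will differ from computer to computer.

```r
# "." resolves to the working directory.
here_dir <- normalizePath(".", winslash = "/")
here_dir
normalizePath(getwd(), winslash = "/")

# ".." resolves to the directory one level above it.
parent_dir <- normalizePath("..", winslash = "/")
parent_dir
dirname(here_dir)

# "~" expands to the user's home directory.
path.expand("~")

# list.files() accepts any of these paths.
list.files("..")
```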
Use list.files() to see what files are available in the parent directory of your working directory.
There are several problems that people commonly run into when trying to specify the right path:
By default, the working directory for your R Markdown document is the directory in which that document is stored. That may differ from the working directory of your current R session.
Suppose that you stored your R Markdown files in a directory called rmd that lives within the root directory of your project. Then the working directory of the R Markdown files will be different from your project root directory.
A file that exists in a certain place on your computer may not exist in that same place on someone else’s computer. Since it is unworkable to force people to put files in the same places, we often use relative paths. If you and your partner both check out the same project from GitHub and use relative paths, then these relative paths should work on both of your computers, regardless of where the project root directory is on either of your computers.
For example, this lab file is called lab-import.Rmd and it is stored in the www directory of this project. The magazines data file we created earlier is in the root directory of this project. The appropriate relative path depends on my working directory.
getwd()
## [1] "/home/bbaumer/Dropbox/git/sds192/www"
# works in console, but not when knitting
mags <- read_csv("magazines.csv")
## Error: 'magazines.csv' does not exist in current working directory ('/home/bbaumer/Dropbox/git/sds192/www').
# works when knitting, but not in console
mags <- read_csv("../magazines.csv")
## Error: '../magazines.csv' does not exist in current working directory ('/home/bbaumer/Dropbox/git/sds192/www').
glimpse(mags)
## Error in glimpse(mags): object 'mags' not found
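The situation above can be reproduced without touching your own working directory. The sketch below builds a throwaway project in a temporary directory (the name demo-project and the helper resolves() are hypothetical) and checks which relative paths find the file from each starting point.

```r
# Build a throwaway project: <root>/magazines.csv and <root>/www/
root <- file.path(tempdir(), "demo-project")
dir.create(file.path(root, "www"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("title,pages", "Alpha,120"), file.path(root, "magazines.csv"))

# Hypothetical helper: does rel_path exist when interpreted relative to wd?
resolves <- function(wd, rel_path) file.exists(file.path(wd, rel_path))

# Console session whose working directory is the project root
resolves(root, "magazines.csv")

# Knitting a document stored in www/
resolves(file.path(root, "www"), "magazines.csv")
resolves(file.path(root, "www"), "../magazines.csv")
```

The same relative path succeeds from one working directory and fails from the other, which is the mismatch between the console and knitting described above.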
The here package is designed to work around this problem. Using the here() function results in paths that are always relative to the project root directory.
# works in both places
mags <- read_csv(here::here("magazines.csv"))
## Error: '/home/bbaumer/Dropbox/git/sds192/magazines.csv' does not exist.
glimpse(mags)
## Error in glimpse(mags): object 'mags' not found
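It can help to look at the path that here() builds before passing it to read_csv(). This is a small sketch that assumes the here package is installed (it is skipped otherwise); the printed paths depend entirely on where your project lives.

```r
# Assumption: the 'here' package is installed. If it is not, nothing runs.
if (requireNamespace("here", quietly = TRUE)) {
  # With no arguments, here() reports the directory it treats as the project root.
  project_root <- here::here()
  print(project_root)

  # With arguments, here() builds a path underneath that root.
  data_path <- here::here("magazines.csv")
  print(data_path)

  # Check whether the file is actually there before trying to read it.
  print(file.exists(data_path))
}
```

If file.exists() reports FALSE, the problem is the location of the file rather than the working directory.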
Read the previous example carefully and make sure you understand what is going on.
Create a new directory in your current project called happytimes. Open a new R Markdown document and save it in happytimes. Include a chunk with the getwd() command in the Markdown file, and knit it. Do you get the same result as when you run getwd() in the console of your R session?
R can read a worksheet from an Excel file into a data frame using the read_excel() function provided by the readxl package. Excel files should have the file extension .xls or .xlsx.
Note that a worksheet in an Excel file may not be just rows and columns of tabular data. In this event, read_excel() will get confused and you may have to specify which cells you want to import. You can do this by specifying the range argument.
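As a sketch of how the range argument works, the example below uses one of the example workbooks that ships with readxl, so no download is needed. The sheet name and the cell range A1:C4 are chosen only for illustration.

```r
library(readxl)

# readxl_example() returns the path to an example workbook bundled with readxl.
xlsx_path <- readxl_example("datasets.xlsx")

# See which worksheets the file contains.
excel_sheets(xlsx_path)

# Import only cells A1 through C4 of the "mtcars" sheet.
# The first row of the range is used for the column names.
corner <- read_excel(xlsx_path, sheet = "mtcars", range = "A1:C4")
dim(corner)
corner
```

The result has three rows and three columns: one header row plus three rows of data from the selected rectangle.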
Unlike read_csv(), read_excel() can’t read directly from a URL, so you will have to have the file stored locally. Here, we use the download.file() function to download it.
src <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/GreatestGivers.xls"
lcl <- basename(src)
# mode = "wb" keeps the binary .xls file intact on Windows
download.file(url = src, destfile = lcl, mode = "wb")
Then we import it.
library(readxl)
philanthropists <- read_excel(lcl)
It is not advisable to store your data in an .xls format. Use .csv or a database.
The googlesheets4 package (the successor to the retired googlesheets package) enables you to import a Google Sheet directly into R.
This is a bit more complicated because of the permissions that Google puts on these files, but it works well. Please see the package documentation to use it.
R can read data from just about any format. See the foreign package for SPSS, Stata, or SAS files. See the jsonlite package for JSON and xml2 for XML. Google or consult me for any other formats!
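For instance, JSON that consists of an array of records maps naturally onto a data frame. The sketch below assumes the jsonlite package is installed and uses a made-up JSON string in place of a real file or URL.

```r
library(jsonlite)

# A made-up JSON array of records; in practice this would be a file or URL.
json_text <- '[{"title": "Alpha", "pages": 120}, {"title": "Beta", "pages": 96}]'

# fromJSON() simplifies an array of records into a data frame.
toy <- fromJSON(json_text)
class(toy)
dim(toy)
toy
```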
Like other data analysis programs, R has its own data format. These files should have the .rda or .RData file extensions. If you are sure you only want to use some data in R, it’s a good solution because it can be read quickly and compressed so that it takes up less space on disk.
You can save any object using the save() command. Use the .rda file extension and the compress argument.
save(magazines, file = "magazines.rda", compress = "xz")
Save the magazines object to your home directory (i.e., use ~).
Load .rda files back into your R session with load().
load("~/magazines.rda")
The object stored in the file will automatically appear in your workspace.
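Because load() restores objects under the names they were saved with, it can silently overwrite an object of the same name in your workspace. One way to see exactly what a file contains is to load it into a separate environment first. This is a minimal sketch with a made-up data frame (toy) saved to a temporary directory.

```r
# A small, invented data frame to save.
toy <- data.frame(title = c("Alpha", "Beta"), pages = c(120, 96))
rda_path <- file.path(tempdir(), "toy.rda")
save(toy, file = rda_path, compress = "xz")

# Load into a fresh environment so nothing in the workspace is overwritten.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

# load() returns the names of the objects it restored.
loaded_names

# The restored object matches the one that was saved.
identical(restored$toy, toy)
```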
Please respond to the following prompt on Slack in the #mod-programming channel.
Prompt: What questions do you still have about importing data into R? Are there other sources of data that you would like to get into R?