In this lab, we will learn how to write user-defined functions.
library(tidyverse)
library(babynames)
Goal: by the end of this lab, you will be able to write a function in R and execute it.
We already know how to filter for a particular name:
babynames %>%
  filter(name == "Benjamin")
Suppose that we want to find the year in which that name was most popular (see Exercise 2 from Lab 4). To do this we need a pipeline that consists of several verbs chained together.
babynames %>%
  filter(name == "Benjamin") %>%
  group_by(year) %>%
  summarize(total = sum(prop)) %>%
  arrange(desc(total)) %>%
  head(1) %>%
  select(year)
But we might want to do this for many names, and it would be tedious to have to re-type, or even just re-run, the same code over and over again. An elegant solution is to write a function. For example, here we write a function called most_popular_year() that will return the year in which a specific name was most popular.
most_popular_year <- function(name_arg) {
  babynames %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}
Now we can run our function on several different names without having to re-type all of that code. Here we find the most popular year for the first names of the actors and actresses who won at the 91st Academy Awards.
most_popular_year(name_arg = "Olivia")
most_popular_year(name_arg = "Regina")
most_popular_year(name_arg = "Rami")
most_popular_year(name_arg = "Mahershala")
R doesn’t have formal type signatures for its functions the way that some other programming languages do. However, being aware of what kind of objects your functions take, and what kind of objects your function returns, is usually very important.
You can always show the arguments that a given function takes by using the formals() function.
formals(most_popular_year)
In this case, the most_popular_year() function takes a single argument called name_arg, which should be a character vector, and returns a tbl_df.
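As a minimal sketch of inspecting a function's inputs and outputs, the base R example below uses a hypothetical function add() (not part of this lab) with one required argument and one argument that has a default value.

```r
# Hypothetical function for illustration: x is required, y defaults to 1.
add <- function(x, y = 1) {
  x + y
}

# formals() returns the arguments as a named list-like object;
# names() extracts just the argument names.
names(formals(add))

# The stored default value for y.
formals(add)$y

# class() reports what kind of object a call returns.
class(add(2))
```

Running class() on the result of most_popular_year() in the same way is a quick check on what kind of object it returns.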
More details about functions that exist within packages are available via help(name_of_function).
By default, an R function returns the result of the last command that is executed by the function. For most_popular_year(), there is only one “line” of code (i.e., the whole pipeline), and the result of that will be a tbl_df.
Alternatively, you can use return(blah) to explicitly return objects. Every R function returns something, even if that something is only NULL (i.e., there is no such thing as a “void” function).
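The two ways of returning a value can be sketched in base R as follows. The functions sign_label() and do_nothing() are hypothetical and exist only to illustrate the point.

```r
# Hypothetical function: return() exits early with an explicit value.
sign_label <- function(x) {
  if (x < 0) {
    return("negative")
  }
  # The last expression evaluated is returned implicitly.
  "non-negative"
}

# A function with an empty body still returns a value: NULL.
do_nothing <- function() {}

sign_label(-3)          # "negative"
sign_label(5)           # "non-negative"
is.null(do_nothing())   # TRUE
```

An explicit return() is mostly useful for exiting early. At the end of a function, the implicit form is the more common style.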
If you want an argument to your function to have a default value, specify it in the function definition.
The way that we have defined most_popular_year(), there is no default value for name_arg. Thus, if we call the function with no arguments, it will throw an error.
most_popular_year()
In this case, this is probably the desired behavior, since it doesn’t make sense to call this function without specifying a name. However, we could have defined it with a default value, say "Benjamin".
most_popular_year_ben <- function(name_arg = "Benjamin") {
  babynames %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}
Now we can call the function without specifying the name_arg argument, but in that case we’ll get the results for "Benjamin".
most_popular_year_ben()
Of course, we can still override the default value of name_arg:
most_popular_year_ben(name_arg = "Jordan")
How did our function know about the babynames table? Why wasn’t that an input to the function? The answer to the first question involves the notion of variable scoping, while the answer to the second question is a design choice.
The rules for variable scoping in R are complicated. What is important for you to understand is that when R can’t find an object inside a function, it looks in the environment where the function was defined (here, the global environment) and then along the search path, which includes every attached package. So when we run most_popular_year(), R looks for an object called babynames, and it finds one because library(babynames) attached the package that provides it. If such an object exists, then the function should work, but if not, it won’t. Thus, whether a user-defined function in R works as expected depends on what is in the global environment and on which packages are attached. This behavior is different than most compiled programming languages (e.g., C++, Java), but it is designed to make it easy to script with functions on-the-fly.
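A minimal base R sketch of this lookup behavior, using a hypothetical function add_tax() and a hypothetical variable rate that are not part of the lab:

```r
# Hypothetical function: `rate` is not an argument and is not defined
# inside the body, so R must find it outside the function.
add_tax <- function(price) {
  price * (1 + rate)
}

rate <- 0.5
add_tax(100)   # uses rate = 0.5, giving 150

rate <- 0.25
add_tax(100)   # same call, but rate is now 0.25, giving 125
```

The same call gives different answers depending on what rate is at the time the function runs, which is exactly why relying on objects outside the function can be fragile.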
Note that if we unload the babynames package, thus removing the babynames table from the search path, our function no longer works.
detach("package:babynames", unload = TRUE)
# should throw an error
most_popular_year("Benjamin")
Don’t forget to bring babynames back.
library(babynames)
To be more explicit, we could pass the table that we want to search to the function. We can achieve this by re-writing the function to take a data argument:
most_popular_year2 <- function(data, name_arg) {
  data %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}
# will throw error because we didn't specify "data"
most_popular_year2(name_arg = "Casey")
# works
most_popular_year2(data = babynames, name_arg = "Casey")
This also enables us to apply our function to subsets of the original data. So we can search for the most popular year for Casey among boys and girls separately.
babynames %>%
  filter(sex == "F") %>%
  most_popular_year2(name_arg = "Casey")
babynames %>%
  filter(sex == "M") %>%
  most_popular_year2(name_arg = "Casey")
Note that the order of the arguments matters only if they are not named.
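# works: unnamed arguments are matched by position
most_popular_year2(babynames, "Emma")
# will throw error: "Emma" is matched to data and babynames to name_arg
most_popular_year2("Emma", babynames)
# works: data is matched by name, so "Emma" fills name_arg
most_popular_year2("Emma", data = babynames)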
To be safe (and explicit), name your arguments unless you have a good reason not to.
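The matching rules can be seen in isolation with a small base R sketch. The function divide() is hypothetical and is used here only because swapping its arguments visibly changes the answer.

```r
# Hypothetical function with two arguments whose order matters.
divide <- function(numerator, denominator) {
  numerator / denominator
}

divide(10, 2)                           # positional: 10 / 2 = 5
divide(denominator = 2, numerator = 10) # named, any order: 5
divide(2, numerator = 10)               # numerator named; 2 fills denominator: 5
divide(2, 10)                           # positional, swapped: 2 / 10 = 0.2
```

Named arguments are matched first, and any remaining unnamed arguments then fill the unmatched parameters in order.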
These exercises use the nycflights13 data package.
library(nycflights13)
Write a function that, for a given carrier identifier (e.g., DL), will retrieve the five most common airport destinations from NYC in 2013, and how often the carrier flew there.
# SAMPLE SOLUTION
top_dests <- function(carrier_arg) {
  flights %>%
    filter(carrier == carrier_arg) %>%
    group_by(dest) %>%
    summarize(num_flights = n()) %>%
    arrange(desc(num_flights)) %>%
    head(5)
}
Use your function to find the top five destinations for Delta Airlines (DL).
# SAMPLE SOLUTION
top_dests("DL")
Use your function to find the top five destinations for American Airlines (AA). How many of these destinations are shared with Delta?
# SAMPLE SOLUTION
top_dests("AA")
Write a function that, for a given airport code (e.g., SFO), will retrieve the five most common carriers that service that airport from NYC in 2013, and what their average arrival delay time was.
# SAMPLE SOLUTION
common_carriers <- function(faa) {
  flights %>%
    filter(dest == faa) %>%
    group_by(carrier) %>%
    summarize(
      num_flights = n(),
      # cancelled flights have a missing arr_delay, so drop NAs
      avg_arr_delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
    arrange(desc(num_flights)) %>%
    head(5)
}
common_carriers("SFO")
Please respond to the following prompt on Slack in the #mod-programming channel.
Prompt: After completing Lab 10, what questions do you still have about how functions work in R? Feel free to compare your experience with another programming language if you know one.