In this lab, we will continue to develop our data transformation skills by learning how to use the `group_by()` function in conjunction with the `summarize()` verb that we learned previously.

```
library(tidyverse)
library(babynames)
```

**Goal**: by the end of this lab, you will be able to use `group_by()` to perform summary operations on groups.

`group_by()`

Consider Exercise 4 from the Single Table Analysis Lab:

- In which year was that name given to M and F babies most equally (i.e. **closest to a 50/50 split**)?

How would you do this? You could, of course, scan the data visually to estimate the percentages in each year:

```
babynames %>%
  filter(name == "Jackie")
```

But this is very inefficient and does not provide an exact solution.

The key to solving this problem is to recognize that we need to collapse the **two** rows corresponding to each assigned sex in each year into a single row that contains the information for both sexes (i.e. `summarize()`). Unfortunately, there is no way for R to know what to compute on its own; we have to tell it.

The `group_by()` function specifies a variable on which the data frame will be collapsed. Each row in the result set will correspond to one unique value of that variable. In this case, we want to group by `year`. [This is sometimes called “rolling up” a data set.]

```
babynames %>%
  filter(name == "Jackie") %>%
  group_by(year)
```

This doesn’t actually do much, since we haven’t told R what to compute. `summarize()` takes a list of definitions for columns you want to see in the result set. The key to understanding `summarize()` is to note that it operates on vectors (which may contain many values), but each variable defined within `summarize()` **must return a single value** (per group). [Why?] Thus, the variables defined by the arguments to `summarize()` are usually *aggregate* functions like `sum()`, `mean()`, `length()`, `max()`, `n()`, etc.

```
babynames %>%
  filter(name == "Jackie") %>%
  group_by(year) %>%
  summarize(
    num_sexes = n(),
    total = sum(n),
    boys = sum(ifelse(sex == "M", n, 0)),
    girls = total - boys,
    girl_pct = girls / total
  )
```
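To see how each group collapses to one row, here is a minimal sketch that runs the same pipeline on a tiny hand-made table. The `toy` data, the `gap` column, and the final sorting step are assumptions added for illustration; the counts are made up and are not taken from `babynames`.

```r
library(dplyr)

# Made-up counts for illustration only (not real babynames data)
toy <- tibble(
  year = c(2000, 2000, 2001, 2001, 2002),
  sex  = c("F", "M", "F", "M", "F"),
  n    = c(60, 40, 45, 55, 10)
)

by_year <- toy %>%
  group_by(year) %>%
  summarize(
    num_sexes = n(),                       # rows in this year's group
    total = sum(n),                        # births across both sexes
    boys = sum(ifelse(sex == "M", n, 0)),
    girls = total - boys,
    girl_pct = girls / total
  ) %>%
  # gap (hypothetical column): distance from an even split
  mutate(gap = abs(girl_pct - 0.5)) %>%
  arrange(gap)

by_year
```

Each of the three years ends up as a single row, and sorting by `gap` is one possible way to bring the most balanced year to the top. Note that 2002 has only one row in its group, so `num_sexes` is 1 and `boys` is 0.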

- Which year had the greatest number of births?

```
# sample solution
babynames %>%
  group_by(year) %>%
  summarize(num_births = sum(n)) %>%
  arrange(desc(num_births))
```

- In a single pipeline, compute the earliest and latest year that each name appears.

```
# sample solution
babynames %>%
  group_by(name) %>%
  summarize(earliest = min(year), latest = max(year))
```

- There are 15 names that have been assigned **to both sexes** in all 138 years. List them.

```
# sample solution
babynames %>%
  group_by(name) %>%
  summarize(num_appearances = n()) %>%
  filter(num_appearances == 276)
```

- Among popular names (let’s say at least 1% of the births in a given year), which name is the *youngest*, meaning that its first appearance as a popular name is the most recent?

```
# sample solution
babynames %>%
  mutate(is_popular = prop >= 0.01) %>%
  filter(is_popular == TRUE) %>%
  group_by(name) %>%
  summarize(earliest = min(year)) %>%
  arrange(desc(earliest))
```

It seems like there is more diversity of names now than in the past. How has the number of names used changed over time? Has it been the same for boys and girls?
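One possible starting point, sketched on made-up data rather than `babynames`, is to group by two variables at once and count distinct names with `n_distinct()`. The `toy` table and the column name `num_names` are assumptions for illustration.

```r
library(dplyr)

# Made-up rows for illustration only (not real babynames data)
toy <- tibble(
  year = c(1900, 1900, 1900, 2000, 2000, 2000, 2000, 2000),
  sex  = c("F", "F", "M", "F", "F", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Emily", "Hannah", "Jordan", "Jacob", "Jordan")
)

names_per_year <- toy %>%
  group_by(year, sex) %>%   # one group per year/sex combination
  summarize(num_names = n_distinct(name), .groups = "drop")

names_per_year
```

The result has one row per year and sex, which could then be passed to `ggplot2` to compare the two trends over time.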

Find the most popular names of the 1990s.

```
# sample solution
babynames %>%
  filter(year >= 1990 & year < 2000) %>%
  group_by(name) %>%
  summarize(num_births = sum(n)) %>%
  arrange(desc(num_births))
```

- Use `ggplot2` and `group_by()` to create an interesting and informative data graphic. It need not be about `babynames`. Post your graphic and a short description of it to Slack.

If you are looking for some more practice, try these, using the `nycflights13` package.

- What was the daily average number of flights leaving each of the three NYC airports in 2013?

```
# sample solution
library(nycflights13)
flights %>%
  group_by(origin, month, day) %>%
  summarize(num_flights = n()) %>%
  group_by(origin) %>%
  summarize(num_days = n(), avg_daily = mean(num_flights))
```
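The two-stage pattern above (count per day, then average per airport) can be checked by hand on a small table. This is a minimal sketch on made-up records; `toy_flights` is a hypothetical data frame, and the explicit `.groups = "drop"` is a stylistic choice here, not something the sample solution uses.

```r
library(dplyr)

# Made-up flight records for illustration (not real nycflights13 data)
toy_flights <- tibble(
  origin = c("JFK", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  month  = c(1, 1, 1, 1, 1, 1, 1),
  day    = c(1, 1, 2, 1, 2, 2, 2)
)

# Stage 1: one row per airport per day
daily <- toy_flights %>%
  group_by(origin, month, day) %>%
  summarize(num_flights = n(), .groups = "drop")

# Stage 2: one row per airport, averaging the daily counts
avg_daily <- daily %>%
  group_by(origin) %>%
  summarize(num_days = n(), avg_daily = mean(num_flights))

avg_daily
```

Here JFK has daily counts of 2 and 1, and LGA has 1 and 3, so the averages work out to 1.5 and 2 flights per day.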

- For each carrier, compute the number of total flights, the average departure delay, the number of unique destinations serviced, and the number of unique planes used.

```
# sample solution
flights %>%
  group_by(carrier) %>%
  summarize(
    total_flights = n(),
    mean_delay = mean(dep_delay, na.rm = TRUE),
    num_dests = n_distinct(dest),
    num_planes = n_distinct(tailnum)
  ) %>%
  arrange(desc(total_flights))
```

Plot the distribution of average daily delay time across the entire year for each of the three airports.

Challenge: Plot the average arrival delay time as a function of the distance flown *to the nearest 100 miles* for each of the three airports.
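As a hint for the rounding step only, one way to bin distances to the nearest 100 miles is to divide by 100, round, and multiply back. The sketch below uses made-up rows; `toy` and the column names `dist_100` and `mean_arr_delay` are assumptions for illustration, and the plotting itself is left to you.

```r
library(dplyr)

# Made-up rows for illustration only (not real nycflights13 data)
toy <- tibble(
  origin    = c("EWR", "EWR", "EWR", "JFK", "JFK"),
  distance  = c(140, 160, 230, 1049, 1051),
  arr_delay = c(10, NA, 20, 5, 15)
)

by_dist <- toy %>%
  # dist_100 (hypothetical column): distance rounded to the nearest 100 miles
  mutate(dist_100 = round(distance / 100) * 100) %>%
  group_by(origin, dist_100) %>%
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE), .groups = "drop")

by_dist
```

The missing delay is skipped by `na.rm = TRUE`, just as in the carrier summary above.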

Please respond to the following prompt on Slack in the `#mod-wrangling` channel.

Prompt: Use `ggplot2` and `group_by()` to create an interesting and informative data graphic. It need not be about baby names. Post your graphic and a short description of it to Slack.