Module 8 Groupings

In some instances data from individual observations need to be summarized by groupings of those observations. For example, you may want to compute the total number of COVID cases by county (across all months) for the data frame shown below.³⁴

#R>  # A tibble: 48 x 4
#R>     County  Month Year  Cases
#R>     <chr>   <chr> <chr> <dbl>
#R>   1 Ashland Mar   2020      1
#R>   2 Ashland Apr   2020      1
#R>   3 Ashland May   2020      0
#R>   4 Ashland Jun   2020      1
#R>   5 Ashland Jul   2020     16
#R>   6 Ashland Aug   2020     16
#R>   7 Ashland Sep   2020     94
#R>   8 Ashland Oct   2020    167
#R>   9 Ashland Nov   2020    378
#R>  10 Ashland Dec   2020    331
#R>  # ... with 38 more rows

How to summarize and wrangle data by groups is introduced in this module.

8.1 Declaring Groupings

Groups may be declared by including the grouping variable or variables in group_by(). For example, the code below declares groupings based on the levels of County.

covid_byCO <- covABD %>%
  group_by(County)
covid_byCO

#R>  # A tibble: 48 x 4
#R>  # Groups:   County [3]
#R>     County  Month Year  Cases
#R>     <chr>   <chr> <chr> <dbl>
#R>   1 Ashland Mar   2020      1
#R>   2 Ashland Apr   2020      1
#R>   3 Ashland May   2020      0
#R>   4 Ashland Jun   2020      1
#R>   5 Ashland Jul   2020     16
#R>   6 Ashland Aug   2020     16
#R>   7 Ashland Sep   2020     94
#R>   8 Ashland Oct   2020    167
#R>   9 Ashland Nov   2020    378
#R>  10 Ashland Dec   2020    331
#R>  # ... with 38 more rows

Simply using group_by() will not produce any immediately noticeable difference in the data frame. For example, the only perceptible difference above is the addition of the “Groups: County [3]” line in the output, which indicates that County defines three groups. Using group_by() only adds a grouping declaration to a data frame. How this is useful is demonstrated in the next sections.
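The grouping declaration can also be inspected directly rather than read from the printed header. The following is a minimal sketch that assumes a small made-up data frame (toy is hypothetical and is not the covABD data frame used in the text); is_grouped_df(), group_vars(), and n_groups() are dplyr helper functions.

```r
library(dplyr)

# Hypothetical toy data (NOT the covABD data frame used in the text)
toy <- tibble(County=c("Ashland","Ashland","Bayfield","Douglas"),
              Cases=c(1,16,4,20))

toy_byCO <- toy %>%
  group_by(County)

is_grouped_df(toy)       # FALSE; no grouping has been declared
is_grouped_df(toy_byCO)  # TRUE; a grouping has been declared
group_vars(toy_byCO)     # name(s) of the grouping variable(s)
n_groups(toy_byCO)       # number of groups
```

These helpers are handy for checking whether a data frame still carries a grouping before it is passed to other dplyr verbs.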

There can be multiple levels of groupings. For example, the code below will group Year within County.

covid_byCOYR <- covABD %>%
  group_by(County,Year)
covid_byCOYR

#R>  # A tibble: 48 x 4
#R>  # Groups:   County, Year [6]
#R>     County  Month Year  Cases
#R>     <chr>   <chr> <chr> <dbl>
#R>   1 Ashland Mar   2020      1
#R>   2 Ashland Apr   2020      1
#R>   3 Ashland May   2020      0
#R>   4 Ashland Jun   2020      1
#R>   5 Ashland Jul   2020     16
#R>   6 Ashland Aug   2020     16
#R>   7 Ashland Sep   2020     94
#R>   8 Ashland Oct   2020    167
#R>   9 Ashland Nov   2020    378
#R>  10 Ashland Dec   2020    331
#R>  # ... with 38 more rows

When grouping by more than one variable, subsequent variables are always nested within groups of previous variables.

8.2 Summarizing by Groups

Adding groupings to a data frame becomes most useful when that data frame is submitted to summarize() to summarize results by groups. Each argument to summarize() is a name for the summary set equal to a function call that computes that summary. The summary function can be any function that returns a single value for each group (Table 8.1).

Table 8.1: Common summary functions used in `summarize()`, especially with `group_by()`. Note that `x` generically represents a variable in the data frame and would be replaced with a specific variable name (see examples in main text).
Function	Summary value returned
`n()`	Number of observations³⁵
`sum(!is.na(x))`	Count of non-missing values in `x`
`sum(x)`	Sum of values in `x`
`mean(x)`	Mean (average) of values in `x`
`median(x)`	Median of values in `x`
`sd(x)`	Standard deviation of values in `x`
`IQR(x)`	Inter-quartile range of values in `x`
`max(x)`	Maximum value of `x`
`min(x)`	Minimum value of `x`
`quantile(x,p)`	100\(\times\)`p`% quantile of values in `x`
`first(x)`	Value of first observation of `x`
`last(x)`	Value of last observation of `x`
`n_distinct(x)`	Number of distinct (i.e., unique) values of `x`
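As a quick illustration of the name-equals-function pattern with several functions from Table 8.1, the sketch below uses a small made-up data frame (toy, grp, and x are hypothetical names chosen only for this illustration).

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(grp=c("a","a","a","b","b"),
              x=c(2,4,4,10,20))

toy_sum <- toy %>%
  group_by(grp) %>%
  summarize(n=n(),                     # number of observations
            ttl=sum(x),                # sum of x
            q1=quantile(x,0.25),       # 25% quantile of x
            first_x=first(x),          # first value of x
            num_unique=n_distinct(x))  # number of distinct values of x
toy_sum
```

Each name on the left of an equals sign becomes a column in the result, with one row per group.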

For example, the code below finds the sample size (i.e., number of months) and total number of cases of COVID by county (across all months) using the first “grouped” data frame created in Section 8.1.

sum_covid_byCO <- covid_byCO %>%
  summarize(num_mons=n(),
            num_cases=sum(Cases))
sum_covid_byCO

#R>  # A tibble: 3 x 3
#R>    County   num_mons num_cases
#R>    <chr>       <int>     <dbl>
#R>  1 Ashland        16      1291
#R>  2 Bayfield       16      1169
#R>  3 Douglas        16      4166

The result from applying summarize() to a “grouped” data frame is itself a data frame with the last level of grouping removed. In the example above, there was only one level of grouping (i.e., County) so the returned result was simply a data frame with the grouping removed. However, applying the same summaries to the data frame that had groupings by both County and Year returns a data frame with summaries by year within each county, with the returned data frame retaining the first grouping (i.e., by County) but not the last (i.e., by Year).

sum_covid_byCOYR <- covid_byCOYR %>%
  summarize(num_mons=n(),
            num_cases=sum(Cases))
sum_covid_byCOYR

#R>  # A tibble: 6 x 4
#R>  # Groups:   County [3]
#R>    County   Year  num_mons num_cases
#R>    <chr>    <chr>    <int>     <dbl>
#R>  1 Ashland  2020        10      1005
#R>  2 Ashland  2021         6       286
#R>  3 Bayfield 2020        10       903
#R>  4 Bayfield 2021         6       266
#R>  5 Douglas  2020        10      3081
#R>  6 Douglas  2021         6      1085
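Which groupings remain after summarizing can also be stated explicitly. The sketch below assumes dplyr 1.0.0 or later, where summarize() has a .groups argument, and uses a small made-up data frame (toy is hypothetical). It is offered only as an illustration of the behavior described above, not as the approach used in this module.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(County=c("Ashland","Ashland","Ashland","Bayfield","Bayfield"),
              Year=c("2020","2020","2021","2020","2021"),
              Cases=c(1,16,30,4,9))

# The default behavior stated explicitly: drop only the last grouping (Year)
toy_last <- toy %>%
  group_by(County,Year) %>%
  summarize(num_cases=sum(Cases),.groups="drop_last")
group_vars(toy_last)   # County is still a grouping variable

# Drop all groupings in the same step
toy_none <- toy %>%
  group_by(County,Year) %>%
  summarize(num_cases=sum(Cases),.groups="drop")
group_vars(toy_none)   # no grouping variables remain
```

Setting .groups also silences the message that summarize() otherwise prints about the grouping of its output when there is more than one grouping variable.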

Summarizing this summarized but still grouped data frame will then summarize the summarized data across the remaining groupings (i.e., by County).³⁶ Note that these results are the same as when the summarization was just by County (i.e., compare sum_covid_byCO from above to sum_covid_byCO1 below).

sum_covid_byCO1 <- sum_covid_byCOYR %>%
  summarize(num_mons=sum(num_mons),
            num_cases=sum(num_cases))
sum_covid_byCO1

#R>  # A tibble: 3 x 3
#R>    County   num_mons num_cases
#R>    <chr>       <int>     <dbl>
#R>  1 Ashland        16      1291
#R>  2 Bayfield       16      1169
#R>  3 Douglas        16      4166

Nested levels of groupings can be very powerful, but they should be used carefully. As a general rule, multi-level summarizations on multiple grouping variables only work properly for counts and sums. Multi-level summarizations are unlikely to give the desired results when using other summaries, such as the mean or standard deviation.

For example, consider this simple data frame called trouble with two grouping variables and a single measurement variable.

#R>  # A tibble: 11 x 3
#R>     group1 group2 value
#R>     <chr>  <chr>  <dbl>
#R>   1 A      z         10
#R>   2 A      z          9
#R>   3 A      y         10
#R>   4 A      y         12
#R>   5 A      y         13
#R>   6 A      y         14
#R>   7 A      y         55
#R>   8 B      z         10
#R>   9 B      z          9
#R>  10 B      y         11
#R>  11 B      y         55

The code below computes the sample size, sum, and mean of value for the two groups defined by group1.

sum_trouble_1 <- trouble %>%
  group_by(group1) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value))
sum_trouble_1

#R>  # A tibble: 2 x 4
#R>    group1     n   sum    mn
#R>    <chr>  <int> <dbl> <dbl>
#R>  1 A          7   123  17.6
#R>  2 B          4    85  21.2

The code below computes the same summaries for the four groups defined by group2 nested within group1.

sum_trouble_2 <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value))
sum_trouble_2

#R>  # A tibble: 4 x 5
#R>  # Groups:   group1 [2]
#R>    group1 group2     n   sum    mn
#R>    <chr>  <chr>  <int> <dbl> <dbl>
#R>  1 A      y          5   104  20.8
#R>  2 A      z          2    19   9.5
#R>  3 B      y          2    66  33  
#R>  4 B      z          2    19   9.5

This last data frame is still grouped by group1 so it is possible to use it to get summaries for the two groups defined by group1.

sum_trouble_1A <- sum_trouble_2 %>%
  summarize(n=sum(n),
            sum=sum(sum),
            mn=mean(mn))
sum_trouble_1A

#R>  # A tibble: 2 x 4
#R>    group1     n   sum    mn
#R>    <chr>  <int> <dbl> <dbl>
#R>  1 A          7   123  15.2
#R>  2 B          4    85  21.2

Comparing sum_trouble_1 from further above with sum_trouble_1A here reveals identical counts and sums of the values for the two groups and the same mean for the “B” group. However, the means differ for the “A” group. They differ between the two methods of summarization because the sample sizes differed among the groups of group2 nested within the “A” group of group1. In other words, the mean for the “A” group was calculated as the mean of 20.8 and 9.5 without accounting for the fact that 20.8 came from five observations in the y group and 9.5 came from only two observations in the z group.³⁷ The means agree for the “B” group only because its y and z groups happen to have the same sample size (two observations each).

Do NOT use multi-level summarizations for other than counts and sums.
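To see the arithmetic behind this warning, the sketch below re-enters the trouble data from the values printed above and contrasts the unweighted mean of the group means with two calculations that account for sample size. The names sum_trouble_1B, mn_naive, mn_wtd, and mn_ratio are hypothetical and used only for this illustration; weighted.mean() is a base R function. This is meant to show why the naive result is wrong, not to recommend multi-level summarization of means.

```r
library(dplyr)

# The trouble data frame, re-entered from the values printed above
trouble <- tibble(group1=c(rep("A",7),rep("B",4)),
                  group2=c("z","z","y","y","y","y","y","z","z","y","y"),
                  value=c(10,9,10,12,13,14,55,10,9,11,55))

sum_trouble_1B <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value),
            .groups="drop_last") %>%          # still grouped by group1
  summarize(mn_naive=mean(mn),                # unweighted mean of the group means
            mn_wtd=weighted.mean(mn,n),       # group means weighted by sample size
            mn_ratio=sum(sum)/sum(n))         # total of the sums over total of the counts
sum_trouble_1B
```

Only mn_wtd and mn_ratio reproduce the means in sum_trouble_1, and both rely on counts or sums carried through from the first level of summarization.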

8.2.1 Handling Missing Values

Missing values are coded in R with NA. For example, this simple data frame called trouble2 has three missing values in the value variable.

#R>  # A tibble: 11 x 3
#R>     group1 group2 value
#R>     <chr>  <chr>  <dbl>
#R>   1 A      z         10
#R>   2 A      z         NA
#R>   3 A      y         10
#R>   4 A      y         12
#R>   5 A      y         13
#R>   6 A      y         14
#R>   7 A      y         55
#R>   8 B      z         NA
#R>   9 B      z          9
#R>  10 B      y         NA
#R>  11 B      y         55

Most of the summary functions shown in Table 8.1 will return NA if the variable being summarized contains any NAs. For example, the code below attempts to count the number of values in value and compute the mean and standard deviation of value for each group in group1.

tmp <- trouble2 %>%
  group_by(group1) %>%
  summarize(n=n(),
            mn=mean(value),
            sd=sd(value))
tmp

#R>  # A tibble: 2 x 4
#R>    group1     n    mn    sd
#R>    <chr>  <int> <dbl> <dbl>
#R>  1 A          7    NA    NA
#R>  2 B          4    NA    NA

There are at least two issues here. First, the count variable (n) suggests that there were 7 and 4 valid observations in the two groups, when in reality there were only 6 and 2. Second, the means and standard deviations could not be properly calculated because of the NAs in value.

The first issue of counting valid observations is addressed by using the sum(!is.na(x)) code shown in Table 8.1. This code is a combination of two functions. The is.na() function returns TRUE if an element of x is NA (and FALSE otherwise). The exclamation point in front of is.na() takes the complement of these values (i.e., TRUE becomes FALSE and vice versa) such that !is.na() returns TRUE if the element is not an NA. When logical values are given to sum() the TRUEs are converted to 1s and the FALSEs to 0s. Thus, the sum() of these logicals will return the number of TRUEs or, in this case, the number of elements that are not NA; i.e., the number of valid observations.

The second issue of the summary function returning NA if an NA exists in the variable is addressed by including na.rm=TRUE within the summary function. This argument serves to remove the NAs from the calculations and will, thus, return the summary of all non-missing elements.
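Both ideas can be seen step by step on a plain vector before they are used inside summarize(). The vector x below is hypothetical and exists only for this illustration.

```r
# Hypothetical vector with two missing values
x <- c(10,NA,12,NA,55)

is.na(x)             # TRUE where an element is NA
!is.na(x)            # TRUE where an element is NOT NA
sum(!is.na(x))       # TRUEs count as 1s, so this is the number of valid values
mean(x)              # NA because x contains NAs
mean(x,na.rm=TRUE)   # mean of only the non-missing values
```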

Thus, the following code provides a better summary of the count, mean, and standard deviation of the value variable.

tmp <- trouble2 %>%
  group_by(group1) %>%
  summarize(n=n(),
            n_valid=sum(!is.na(value)),
            mn=mean(value,na.rm=TRUE),
            sd=sd(value,na.rm=TRUE))
tmp

#R>  # A tibble: 2 x 5
#R>    group1     n n_valid    mn    sd
#R>    <chr>  <int>   <int> <dbl> <dbl>
#R>  1 A          7       6    19  17.7
#R>  2 B          4       2    32  32.5

8.3 Wrangling by Group

Groupings can also be used with other dplyr verbs. For example, consider this simple data frame called grades that has hypothetical exam scores for students in two sections of a course.

#R>  # A tibble: 11 x 3
#R>     last       section grade
#R>     <chr>        <dbl> <dbl>
#R>   1 Boshwitz         1  87.2
#R>   2 Lepal            1  56.9
#R>   3 Smith            1  74.4
#R>   4 Felix            1  92.5
#R>   5 Seidel           1  88.2
#R>   6 Phelps           2  71.2
#R>   7 McLaughlin       2  88.4
#R>   8 Robertson        2  56.5
#R>   9 Jak              2  78.3
#R>  10 Abel             2  67.6
#R>  11 Bonham           2  80.3

The code below uses rank() and desc() to create a new variable that is the rank of each student in the course based on their grade. The desc() function is used here to ensure that the student with the highest grade is given a rank of 1 (because rank() ranks in ascending order by default).

tmp <- grades %>%
  mutate(rnk=rank(desc(grade))) %>%
  arrange(rnk)
tmp

#R>  # A tibble: 11 x 4
#R>     last       section grade   rnk
#R>     <chr>        <dbl> <dbl> <dbl>
#R>   1 Felix            1  92.5     1
#R>   2 McLaughlin       2  88.4     2
#R>   3 Seidel           1  88.2     3
#R>   4 Boshwitz         1  87.2     4
#R>   5 Bonham           2  80.3     5
#R>   6 Jak              2  78.3     6
#R>   7 Smith            1  74.4     7
#R>   8 Phelps           2  71.2     8
#R>   9 Abel             2  67.6     9
#R>  10 Lepal            1  56.9    10
#R>  11 Robertson        2  56.5    11

However, suppose that interest is in the rank WITHIN each section. Here group_by() can be used prior to mutate() so that the methods in mutate() are applied separately to each group.³⁸

grades %<>%
  group_by(section) %>%
  mutate(rnk=rank(desc(grade))) %>%
  arrange(section,rnk)
grades

#R>  # A tibble: 11 x 4
#R>  # Groups:   section [2]
#R>     last       section grade   rnk
#R>     <chr>        <dbl> <dbl> <dbl>
#R>   1 Felix            1  92.5     1
#R>   2 Seidel           1  88.2     2
#R>   3 Boshwitz         1  87.2     3
#R>   4 Smith            1  74.4     4
#R>   5 Lepal            1  56.9     5
#R>   6 McLaughlin       2  88.4     1
#R>   7 Bonham           2  80.3     2
#R>   8 Jak              2  78.3     3
#R>   9 Phelps           2  71.2     4
#R>  10 Abel             2  67.6     5
#R>  11 Robertson        2  56.5     6

Note that in contrast to summarize() the grouping is not removed from the data frame when mutate() is used. Because the grouping variable is still intact, filter() can be used to, for example, return the three students with the highest grades in EACH section.³⁹

top3 <- grades %>%
  filter(rnk<=3)
top3

#R>  # A tibble: 6 x 4
#R>  # Groups:   section [2]
#R>    last       section grade   rnk
#R>    <chr>        <dbl> <dbl> <dbl>
#R>  1 Felix            1  92.5     1
#R>  2 Seidel           1  88.2     2
#R>  3 Boshwitz         1  87.2     3
#R>  4 McLaughlin       2  88.4     1
#R>  5 Bonham           2  80.3     2
#R>  6 Jak              2  78.3     3

Once again, note that the grouping is not removed from the data frame when using filter(). Thus, one could immediately calculate the mean grade for the highest three grades in each section.

top3 %>% summarize(n=n(),
                   mn=mean(grade))

#R>  # A tibble: 2 x 3
#R>    section     n    mn
#R>      <dbl> <int> <dbl>
#R>  1       1     3  89.3
#R>  2       2     3  82.3
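As an alternative sketch, the three highest grades in each section can be selected without first creating ranks by using slice_max(), which is available in dplyr 1.0.0 and later. This is not the approach used in the text; the grades data are re-entered here from the values printed above, and top3_alt is a hypothetical name.

```r
library(dplyr)

# The grades data frame, re-entered from the values printed above
grades <- tibble(last=c("Boshwitz","Lepal","Smith","Felix","Seidel","Phelps",
                        "McLaughlin","Robertson","Jak","Abel","Bonham"),
                 section=c(1,1,1,1,1,2,2,2,2,2,2),
                 grade=c(87.2,56.9,74.4,92.5,88.2,71.2,88.4,56.5,78.3,67.6,80.3))

top3_alt <- grades %>%
  group_by(section) %>%
  slice_max(grade,n=3) %>%   # rows with the three largest grades in each section
  ungroup()
top3_alt
```

Note that slice_max() keeps tied values by default, so more than three rows per group could be returned if grades were tied at the cutoff.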

The use of mutate() with group_by() is less common but can be very powerful. As a simple example, suppose that one wanted to find the difference between each observation and the mean of its group. In the code below, because a grouping is declared, mean() within mutate() finds the mean for each group. Because this is within mutate() rather than summarize(), that group mean is repeated for each observation in its group.

tmp <- trouble2 %>%
  group_by(group1) %>%
  mutate(mn=mean(value,na.rm=TRUE),
         diff=value-mn)
tmp

#R>  # A tibble: 11 x 5
#R>  # Groups:   group1 [2]
#R>     group1 group2 value    mn  diff
#R>     <chr>  <chr>  <dbl> <dbl> <dbl>
#R>   1 A      z         10    19    -9
#R>   2 A      z         NA    19    NA
#R>   3 A      y         10    19    -9
#R>   4 A      y         12    19    -7
#R>   5 A      y         13    19    -6
#R>   6 A      y         14    19    -5
#R>   7 A      y         55    19    36
#R>   8 B      z         NA    32    NA
#R>   9 B      z          9    32   -23
#R>  10 B      y         NA    32    NA
#R>  11 B      y         55    32    23

dplyr verbs other than summarize() will not remove a level of groupings.
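This behavior can be checked with group_vars(), which returns the names of the grouping variables. The sketch below uses a small made-up grouped data frame (toy is hypothetical).

```r
library(dplyr)

# Hypothetical toy data with one grouping variable
toy <- tibble(grp=c("a","a","b","b"),
              x=c(1,3,5,9)) %>%
  group_by(grp)

group_vars(mutate(toy,dev=x-mean(x)))   # grp is retained
group_vars(filter(toy,x>mean(x)))       # grp is retained
group_vars(arrange(toy,desc(x)))        # grp is retained
group_vars(summarize(toy,mn=mean(x)))   # empty; the only grouping was removed
```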

8.4 Ungrouping

As a general rule-of-thumb, it is best to remove the groupings from your data frame once you know you are done summarizing, filtering, etc. based on groups. The main reason is that, as noted above, many dplyr verbs work on groupings. Thus, if your data frame maintains groupings after you are done (in your mind) with groupings, then you may get unintended results.

As a very simple example, suppose that you want to use slice() to retain ONLY the first row of a data frame. However, if that data frame has groupings (e.g., after a first level of summarizing) then slice() will return rows from each group. For example, suppose that you want only the first row of sum_trouble_2 created above (note below that it retained a grouping variable).

sum_trouble_2

#R>  # A tibble: 4 x 5
#R>  # Groups:   group1 [2]
#R>    group1 group2     n   sum    mn
#R>    <chr>  <chr>  <int> <dbl> <dbl>
#R>  1 A      y          5   104  20.8
#R>  2 A      z          2    19   9.5
#R>  3 B      y          2    66  33  
#R>  4 B      z          2    19   9.5

tmp <- sum_trouble_2 %>%
  slice(1)
tmp

#R>  # A tibble: 2 x 5
#R>  # Groups:   group1 [2]
#R>    group1 group2     n   sum    mn
#R>    <chr>  <chr>  <int> <dbl> <dbl>
#R>  1 A      y          5   104  20.8
#R>  2 B      y          2    66  33

As you can see, slice() was applied to both groups of group1 such that the first row of each group was returned, which was not the intended outcome.

As another example, suppose that you wanted to change the names of the groups in group1 in sum_trouble_2.

tmp <- sum_trouble_2 %>%
  mutate(group1=plyr::mapvalues(group1,from=c("A","B"),to=c("Alex","Bart")))

#R>  The following `from` values were not present in `x`: B

#R>  The following `from` values were not present in `x`: A

tmp

#R>  # A tibble: 4 x 5
#R>  # Groups:   group1 [2]
#R>    group1 group2     n   sum    mn
#R>    <chr>  <chr>  <int> <dbl> <dbl>
#R>  1 Alex   y          5   104  20.8
#R>  2 Alex   z          2    19   9.5
#R>  3 Bart   y          2    66  33  
#R>  4 Bart   z          2    19   9.5

While this ultimately worked, the messages shown in the output suggest an issue. Again, mutate() is applied by groups, so when working with group “A” there is no “B” value to replace, which leads to the first message (and the second message comes from the opposite problem when working with group “B”).

Both of these issues can be corrected by using ungroup() to remove the groupings from the data frame.

tmp <- sum_trouble_2 %>%
  ungroup() %>%
  mutate(group1=plyr::mapvalues(group1,from=c("A","B"),to=c("Alex","Bart"))) %>%
  slice(1)
tmp

#R>  # A tibble: 1 x 5
#R>    group1 group2     n   sum    mn
#R>    <chr>  <chr>  <int> <dbl> <dbl>
#R>  1 Alex   y          5   104  20.8

As a general rule-of-thumb, I suggest using ungroup() at the end of a piping chain where you know you are done with the groupings. For example, instead of using ungroup() as in the previous code, I would have created sum_trouble_2 as such.

sum_trouble_2 <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value)) %>%
  ungroup()
sum_trouble_2

#R>  # A tibble: 4 x 5
#R>    group1 group2     n   sum    mn
#R>    <chr>  <chr>  <int> <dbl> <dbl>
#R>  1 A      y          5   104  20.8
#R>  2 A      z          2    19   9.5
#R>  3 B      y          2    66  33  
#R>  4 B      z          2    19   9.5

Notice how the tibble does not show any grouping structure.

To avoid unforeseen behavior, groupings should be removed from the data frame with ungroup() if you are done summarizing or wrangling by group.

8.5 Examples in Context

8.5.1 Student Data

In Section 4.4.1 a data frame called schedules2 was constructed that contained a student’s ID number with each course they were enrolled in along with the course’s credits and instructor.

schedules2

#R>  # A tibble: 20 x 4
#R>     studentID course credits instructor
#R>         <dbl> <chr>    <dbl> <chr>     
#R>   1     34535 MTH107       4 Ogle      
#R>   2     34535 BIO115       4 Johnson   
#R>   3     34535 CHM110       4 Carlson   
#R>   4     34535 IDS101       3 Goyke     
#R>   5     45423 SCD110       3 Tochterman
#R>   6     45423 PSY110       4 Sneyd     
#R>   7     45423 MTH140       4 Jensen    
#R>   8     45423 OED212       3 Andre     
#R>   9     45423 IDS101       3 Goyke     
#R>  10     73424 BIO234       4 Goyke     
#R>  11     73424 CHM220       4 Robertson 
#R>  12     73424 BIO370       4 Anich     
#R>  13     73424 SCD110       3 Tochterman
#R>  14     89874 SCD440       4 Foster    
#R>  15     89874 PSY370       4 Sneyd     
#R>  16     89874 IDS490       4 Hannickel 
#R>  17     98222 SCD440       4 Foster    
#R>  18     98222 SCD330       3 Tochterman
#R>  19     98222 SOC480       4 Schanning 
#R>  20     98222 ART220       3 Duffy

Additionally, recall that there was a data frame called personal that contained personal information about each student (along with the ID).

personal

#R>  # A tibble: 5 x 5
#R>    studentID first_nm  last_nm    hometown     homestate
#R>        <dbl> <chr>     <chr>      <chr>        <chr>    
#R>  1     34535 Rolando   Blackman   Windsor      MI       
#R>  2     45423 Catherine Johnson    Eden Prairie MN       
#R>  3     73424 James     Carmichael Marion       IA       
#R>  4     89874 Rachel    Brown      Milwaukee    WI       
#R>  5     98222 Esteban   Perez      El Paso      TX

In this example, suppose that the registrar wants to create a report that has the number of courses and the total number of credits taken appended to the personal information for each student. Construction of this report begins by summarizing schedules2 for each student.

sum_crs <- schedules2 %>%
  group_by(studentID) %>%
  summarize(num_courses=n(),
            num_credits=sum(credits))
sum_crs

#R>  # A tibble: 5 x 3
#R>    studentID num_courses num_credits
#R>        <dbl>       <int>       <dbl>
#R>  1     34535           4          15
#R>  2     45423           5          17
#R>  3     73424           4          15
#R>  4     89874           3          12
#R>  5     98222           4          14

These results can then be left_join()ed with personal to create the desired report.

personal2 <- personal %>%
  left_join(sum_crs,by="studentID")
personal2

#R>  # A tibble: 5 x 7
#R>    studentID first_nm  last_nm    hometown     homestate num_courses num_credits
#R>        <dbl> <chr>     <chr>      <chr>        <chr>           <int>       <dbl>
#R>  1     34535 Rolando   Blackman   Windsor      MI                  4          15
#R>  2     45423 Catherine Johnson    Eden Prairie MN                  5          17
#R>  3     73424 James     Carmichael Marion       IA                  4          15
#R>  4     89874 Rachel    Brown      Milwaukee    WI                  3          12
#R>  5     98222 Esteban   Perez      El Paso      TX                  4          14

8.5.2 Resource Sampling Data

In Section 4.4.2 a data frame called fishcatch was created that had the species and number of that species caught in each of five nets. The date and lake where the net was set were also recorded.

fishcatch

#R>    net_num      lake     date          species number
#R>  1       1     Eagle 3-Jul-21         Bluegill      7
#R>  2       1     Eagle 3-Jul-21  Largemouth Bass      3
#R>  3       2      Hart 3-Jul-21         Bluegill     19
#R>  4       2      Hart 3-Jul-21  Largemouth Bass      2
#R>  5       2      Hart 3-Jul-21 Bluntnose Minnow     56
#R>  6       3      Hart 5-Jul-21             <NA>     NA
#R>  7       4     Eagle 6-Jul-21         Bluegill      3
#R>  8       4     Eagle 6-Jul-21  Largemouth Bass      6
#R>  9       5 Millicent 6-Jul-21  Largemouth Bass      3

Suppose a technician wants to summarize the number of species caught and the total catch (regardless of species) in each net. An examination of the data frame above reveals NAs for species and number for one of the nets that did not catch any fish. Because of this we cannot simply count the number of rows for each net_num to get the number of species. Instead this calculation will have to be treated as described for finding the valid number of observations. The total catch in each net_num can be found with sum() but it must include na.rm=TRUE to account for the missing data.

tmp <- fishcatch %>%
  group_by(net_num) %>%
  summarize(num_spec=sum(!is.na(species)),
            ttl_catch=sum(number,na.rm=TRUE))
tmp

#R>  # A tibble: 5 x 3
#R>    net_num num_spec ttl_catch
#R>      <dbl>    <int>     <dbl>
#R>  1       1        2        10
#R>  2       2        3        77
#R>  3       3        0         0
#R>  4       4        2         9
#R>  5       5        1         3

The resulting data frame is missing the specific information (date and lake) for each net_num. A trick for including information that is specific (and thus repeated) to the grouping variable is to include those variables as grouping variables prior to the main grouping variable. For example, each net_num is associated with only one lake and date combination, so including lake and date prior to net_num will not alter the results but will retain the lake and date values. If you use this trick, make sure to ungroup() after the summarization so there are no unintended consequences of adding the extra grouping variables.

tmp <- fishcatch %>%
  group_by(lake,date,net_num) %>%
  summarize(num_spec=sum(!is.na(species)),
            ttl_catch=sum(number,na.rm=TRUE)) %>%
  ungroup() %>%
  arrange(net_num) %>%
  relocate(net_num)
tmp

#R>  # A tibble: 5 x 5
#R>    net_num lake      date     num_spec ttl_catch
#R>      <dbl> <chr>     <chr>       <int>     <dbl>
#R>  1       1 Eagle     3-Jul-21        2        10
#R>  2       2 Hart      3-Jul-21        3        77
#R>  3       3 Hart      5-Jul-21        0         0
#R>  4       4 Eagle     6-Jul-21        2         9
#R>  5       5 Millicent 6-Jul-21        1         3
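An alternative sketch, which is not the trick described above, keeps net_num as the only grouping variable and carries lake and date along with first() from Table 8.1. It relies on the same assumption that lake and date are constant within each net_num. The fishcatch data are re-entered here from the values printed above, and tmp_alt is a hypothetical name.

```r
library(dplyr)

# The fishcatch data frame, re-entered from the values printed above
fishcatch <- data.frame(
  net_num=c(1,1,2,2,2,3,4,4,5),
  lake=c("Eagle","Eagle","Hart","Hart","Hart","Hart","Eagle","Eagle","Millicent"),
  date=c("3-Jul-21","3-Jul-21","3-Jul-21","3-Jul-21","3-Jul-21",
         "5-Jul-21","6-Jul-21","6-Jul-21","6-Jul-21"),
  species=c("Bluegill","Largemouth Bass","Bluegill","Largemouth Bass",
            "Bluntnose Minnow",NA,"Bluegill","Largemouth Bass","Largemouth Bass"),
  number=c(7,3,19,2,56,NA,3,6,3))

tmp_alt <- fishcatch %>%
  group_by(net_num) %>%
  summarize(lake=first(lake),    # lake is constant within a net, so take the first
            date=first(date),    # same for date
            num_spec=sum(!is.na(species)),
            ttl_catch=sum(number,na.rm=TRUE))
tmp_alt
```

Because there is only one grouping variable, summarize() returns an ungrouped data frame and no ungroup() is needed.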

8.5.3 Wolves and Moose of Isle Royale

In Section 6.7.2 a data frame called irmw2 was created that contained the number of wolves and moose, the winter air temperature, and whether or not an ice bridge to the mainland formed for each year from 1959-2012. In that module, an era variable was also created that categorized the years into “early,” “middle,” and “recent” time periods.

Suppose that the researchers want to compute summary statistics for the number of moose separated by era, and by era and whether an ice bridge formed. The latter is accomplished below.

tmp <- irmw2 %>%
  group_by(era,ice_bridges) %>%
  summarize(n=n(),
            n_valid=sum(!is.na(moose)),
            mean=mean(moose,na.rm=TRUE),
            sd=sd(moose,na.rm=TRUE),
            min=min(moose,na.rm=TRUE),
            Q1=quantile(moose,0.25,na.rm=TRUE),
            median=median(moose,na.rm=TRUE),
            Q3=quantile(moose,0.75,na.rm=TRUE),
            max=max(moose,na.rm=TRUE)) %>%
  ungroup()
tmp

#R>  # A tibble: 6 x 11
#R>    era    ice_bridges     n n_valid  mean    sd   min    Q1 median    Q3   max
#R>    <chr>  <chr>       <int>   <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
#R>  1 early  no              4       4  734.  322.  538.  557.   592.  769. 1215.
#R>  2 early  yes            12      12  864.  264.  572.  624.   807. 1079. 1243.
#R>  3 middle no             17      17 1150.  381.  767.  925.  1031. 1260. 2117.
#R>  4 middle yes             9       9 1277.  575.  780.  900.   976. 1496. 2398.
#R>  5 recent no             14      14  816.  364.  385   519.   750  1069. 1600 
#R>  6 recent yes             5       5 1297   523.  650  1050   1250  1475  2060

Here I ungroup()ed the data frame so that I am not tempted to further summarize the returned data frame, which otherwise would still have been grouped by era. As explained in Section 8.2, it is inappropriate to compute most summaries on a second level of groupings after summarizing by the first level of groupings.

Thus, the first goal of the researchers is then accomplished below.

tmp <- irmw2 %>%
  group_by(era) %>%
  summarize(n=n(),
            n_valid=sum(!is.na(moose)),
            mean=mean(moose,na.rm=TRUE),
            sd=sd(moose,na.rm=TRUE),
            min=min(moose,na.rm=TRUE),
            Q1=quantile(moose,0.25,na.rm=TRUE),
            median=median(moose,na.rm=TRUE),
            Q3=quantile(moose,0.75,na.rm=TRUE),
            max=max(moose,na.rm=TRUE)) %>%
  ungroup()
tmp

#R>  # A tibble: 3 x 10
#R>    era        n n_valid  mean    sd   min    Q1 median    Q3   max
#R>    <chr>  <int>   <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
#R>  1 early     16      16  832.  274.  538.  592.   713. 1079. 1243.
#R>  2 middle    26      26 1194.  450.  767.  906.  1023. 1301. 2398.
#R>  3 recent    19      19  943.  452.  385   535    900  1185. 2060
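The two summaries above repeat the same set of summary lines and differ only in the groupings. One way to avoid that repetition, sketched here under stated assumptions, is to wrap the summaries in a small function. The function sum_moose() and the data frame toy are hypothetical (the values are invented and are not the irmw2 data), only four of the summaries are shown for brevity, and .groups="drop" (dplyr 1.0.0 or later) is used in place of a separate ungroup().

```r
library(dplyr)

# Hypothetical stand-in for irmw2 (values invented for illustration only)
toy <- tibble(era=c("early","early","early","recent","recent","recent"),
              moose=c(500,700,NA,900,1100,1300))

# Hypothetical helper that bundles the repeated summaries of moose
sum_moose <- function(d) {
  d %>%
    summarize(n=n(),
              n_valid=sum(!is.na(moose)),
              mean=mean(moose,na.rm=TRUE),
              sd=sd(moose,na.rm=TRUE),
              .groups="drop")   # return an ungrouped data frame
}

toy_sum <- toy %>%
  group_by(era) %>%
  sum_moose()
toy_sum
```

The same function could then be applied after a different group_by() without retyping the summaries.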

³⁴ This data frame was introduced in Module 5.
³⁵ There are no arguments to n().
³⁶ Note here the use of the summarized but still grouped data frame and that the computation of numbers of months and cases had to be adjusted for the new variables in the summarized data frame.
³⁷ In the two-level summarize the mean of the “A” group is calculated as \(\frac{20.8+9.5}{1+1}\) rather than \(\frac{104+19}{5+2}\).
³⁸ This same ordering also could have been accomplished without creating the ranks and just using arrange(section,desc(grade)).
³⁹ This could also have been accomplished with grades %>% slice_head(n=3).