Module 8 Groupings
In some instances data from individual observations need to be summarized by groupings of those observations. For example, you may want to compute the total number of COVID cases by county (across all months) for the data frame shown below.34
#R> # A tibble: 48 x 4
#R> County Month Year Cases
#R> <chr> <chr> <chr> <dbl>
#R> 1 Ashland Mar 2020 1
#R> 2 Ashland Apr 2020 1
#R> 3 Ashland May 2020 0
#R> 4 Ashland Jun 2020 1
#R> 5 Ashland Jul 2020 16
#R> 6 Ashland Aug 2020 16
#R> 7 Ashland Sep 2020 94
#R> 8 Ashland Oct 2020 167
#R> 9 Ashland Nov 2020 378
#R> 10 Ashland Dec 2020 331
#R> # ... with 38 more rows
How to summarize and wrangle data by groups is introduced in this module.
8.1 Declaring Groupings
Groups may be declared by including the grouping variable or variables in group_by(). For example, the code below declares groupings based on the levels in County.
covid_byCO <- covABD %>%
  group_by(County)
covid_byCO
#R> # A tibble: 48 x 4
#R> # Groups: County [3]
#R> County Month Year Cases
#R> <chr> <chr> <chr> <dbl>
#R> 1 Ashland Mar 2020 1
#R> 2 Ashland Apr 2020 1
#R> 3 Ashland May 2020 0
#R> 4 Ashland Jun 2020 1
#R> 5 Ashland Jul 2020 16
#R> 6 Ashland Aug 2020 16
#R> 7 Ashland Sep 2020 94
#R> 8 Ashland Oct 2020 167
#R> 9 Ashland Nov 2020 378
#R> 10 Ashland Dec 2020 331
#R> # ... with 38 more rows
Simply using group_by() will not produce any immediately noticeable difference in the data frame. For example, the only perceptible difference above is the addition of the “Groups: County [3]” line in the output. Using group_by() only adds a grouping declaration to a data frame. How this is useful is demonstrated in the next sections.
There can be multiple levels of groupings. For example, the code below will group Year within County.
covid_byCOYR <- covABD %>%
  group_by(County,Year)
covid_byCOYR
#R> # A tibble: 48 x 4
#R> # Groups: County, Year [6]
#R> County Month Year Cases
#R> <chr> <chr> <chr> <dbl>
#R> 1 Ashland Mar 2020 1
#R> 2 Ashland Apr 2020 1
#R> 3 Ashland May 2020 0
#R> 4 Ashland Jun 2020 1
#R> 5 Ashland Jul 2020 16
#R> 6 Ashland Aug 2020 16
#R> 7 Ashland Sep 2020 94
#R> 8 Ashland Oct 2020 167
#R> 9 Ashland Nov 2020 378
#R> 10 Ashland Dec 2020 331
#R> # ... with 38 more rows
When grouping by more than one variable, subsequent variables are always nested within groups of previous variables.
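As a minimal sketch of this nesting, assuming a small made-up data frame called toy (it is not part of the COVID data), dplyr's group_vars(), n_groups(), and group_keys() can be used to inspect the declared groupings. Only the County and Year combinations that actually occur in the data form groups.

```r
library(dplyr)

# Hypothetical data: county "B" has observations in only one year
toy <- tibble(County=c("A","A","A","B","B"),
              Year=c("2020","2020","2021","2020","2020"),
              Cases=c(1,2,3,4,5))

toy_byCOYR <- toy %>%
  group_by(County,Year)

group_vars(toy_byCOYR)  # names of the grouping variables, in nesting order
n_groups(toy_byCOYR)    # number of County-by-Year combinations present
group_keys(toy_byCOYR)  # one row per group
```

Here there are three groups rather than four because no "B" observations exist for 2021.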
8.2 Summarizing by Groups
Adding groupings to a data frame becomes most useful when that data frame is submitted to summarize() to summarize results by groups. Each argument to summarize() is a name for the summary set equal to a function that creates the summary. The summary function can be any function that returns a single result, usually numeric, for each group (Table 8.1).
Function | Summary value returned |
---|---|
n() | Number of observations35 |
sum(!is.na(x)) | Count of non-missing values in x |
sum(x) | Sum of values in x |
mean(x) | Mean (average) of values in x |
median(x) | Median of values in x |
sd(x) | Standard deviation of values in x |
IQR(x) | Inter-quartile range of values in x |
max(x) | Maximum value of x |
min(x) | Minimum value of x |
quantile(x,p) | 100\(\times\)p% quantile of values in x |
first(x) | Value of first observation of x |
last(x) | Value of last observation of x |
n_distinct(x) | Number of distinct (i.e., unique) values of x |
For example, the code below finds the sample size (i.e., number of months) and total number of cases of COVID by county (across all months) using the first “grouped” data frame created in Section 8.1.
sum_covid_byCO <- covid_byCO %>%
  summarize(num_mons=n(),
            num_cases=sum(Cases))
sum_covid_byCO
#R> # A tibble: 3 x 3
#R> County num_mons num_cases
#R> <chr> <int> <dbl>
#R> 1 Ashland 16 1291
#R> 2 Bayfield 16 1169
#R> 3 Douglas 16 4166
The result of applying summarize() to a “grouped” data frame is itself a data frame with the last level of grouping removed. In the example above, there was only one level of grouping (i.e., County) so the returned result was simply a data frame with the grouping removed. However, applying the same summaries to the data frame that had groupings by both County and Year returns a data frame with summaries by year within each county, with the returned data frame retaining the first grouping (i.e., by County) but not the last (i.e., by Year).
sum_covid_byCOYR <- covid_byCOYR %>%
  summarize(num_mons=n(),
            num_cases=sum(Cases))
sum_covid_byCOYR
#R> # A tibble: 6 x 4
#R> # Groups: County [3]
#R> County Year num_mons num_cases
#R> <chr> <chr> <int> <dbl>
#R> 1 Ashland 2020 10 1005
#R> 2 Ashland 2021 6 286
#R> 3 Bayfield 2020 10 903
#R> 4 Bayfield 2021 6 266
#R> 5 Douglas 2020 10 3081
#R> 6 Douglas 2021 6 1085
Applying summarize() again to this summarized but still grouped data frame summarizes across the remaining grouping (i.e., by County).36 Note that these results are the same as when the summarization was just by County (i.e., compare sum_covid_byCO from above to sum_covid_byCO1 below).
sum_covid_byCO1 <- sum_covid_byCOYR %>%
  summarize(num_mons=sum(num_mons),
            num_cases=sum(num_cases))
sum_covid_byCO1
#R> # A tibble: 3 x 3
#R> County num_mons num_cases
#R> <chr> <int> <dbl>
#R> 1 Ashland 16 1291
#R> 2 Bayfield 16 1169
#R> 3 Douglas 16 4166
Nested levels of groupings can be very powerful, but they should be used carefully. As a general rule, multi-level summarizations on multiple grouping variables only work properly for counts and sums. Multi-level summarizations are unlikely to give the desired results when using other summaries, such as the mean or standard deviation.
For example, consider this simple data frame called trouble with two grouping variables and a single measurement variable.
#R> # A tibble: 11 x 3
#R> group1 group2 value
#R> <chr> <chr> <dbl>
#R> 1 A z 10
#R> 2 A z 9
#R> 3 A y 10
#R> 4 A y 12
#R> 5 A y 13
#R> 6 A y 14
#R> 7 A y 55
#R> 8 B z 10
#R> 9 B z 9
#R> 10 B y 11
#R> 11 B y 55
The code below computes the sample size, sum, and mean of value for the two groups defined by group1.
sum_trouble_1 <- trouble %>%
  group_by(group1) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value))
sum_trouble_1
#R> # A tibble: 2 x 4
#R> group1 n sum mn
#R> <chr> <int> <dbl> <dbl>
#R> 1 A 7 123 17.6
#R> 2 B 4 85 21.2
The code below computes the same summaries for the four groups defined by group2 nested within group1.
sum_trouble_2 <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value))
sum_trouble_2
#R> # A tibble: 4 x 5
#R> # Groups: group1 [2]
#R> group1 group2 n sum mn
#R> <chr> <chr> <int> <dbl> <dbl>
#R> 1 A y 5 104 20.8
#R> 2 A z 2 19 9.5
#R> 3 B y 2 66 33
#R> 4 B z 2 19 9.5
This last data frame is still grouped by group1 so it is possible to use it to get summaries for the two groups defined by group1.
sum_trouble_1A <- sum_trouble_2 %>%
  summarize(n=sum(n),
            sum=sum(sum),
            mn=mean(mn))
sum_trouble_1A
#R> # A tibble: 2 x 4
#R> group1 n sum mn
#R> <chr> <int> <dbl> <dbl>
#R> 1 A 7 123 15.2
#R> 2 B 4 85 21.2
Comparing sum_trouble_1 from further above with sum_trouble_1A from here reveals that both have identical counts and sums of the values for the two groups and the same mean for the “B” group. However, the means are different for the “A” group. The means for the “A” group differ between the two methods of summarization because there were different sample sizes among the groups of group2 nested within the “A” group of group1. In other words, the mean for the “A” group was calculated as the mean of 20.8 and 9.5 without realizing that 20.8 came from five observations in the y group and 9.5 came from only two observations in the z group.37
Do NOT use multi-level summarizations for summaries other than counts and sums.
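If a group1-level mean is needed from the two-level summary anyway, one possible workaround is to weight each second-level mean by its sample size or, equivalently, to divide the summed sums by the summed counts. The sketch below rebuilds trouble from the values printed above. The names total, mn_naive, mn_wtd, and mn_from_sums are illustrative only, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# trouble re-created from the printed data frame above
trouble <- tibble(group1=c(rep("A",7),rep("B",4)),
                  group2=c("z","z","y","y","y","y","y","z","z","y","y"),
                  value=c(10,9,10,12,13,14,55,10,9,11,55))

chk <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            total=sum(value),
            mn=mean(value),
            .groups="drop_last") %>%         # keep the group1 grouping
  summarize(mn_naive=mean(mn),               # ignores unequal sample sizes
            mn_wtd=weighted.mean(mn,n),      # weights each mean by its n
            mn_from_sums=sum(total)/sum(n))  # same result from sums and counts
chk
```

The two corrected columns should agree with the means in sum_trouble_1, whereas mn_naive reproduces the problematic value for the “A” group.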
8.2.1 Handling Missing Values
Missing values are coded in R with NA. For example, this simple data frame called trouble2 has three missing values in the value variable.
#R> # A tibble: 11 x 3
#R> group1 group2 value
#R> <chr> <chr> <dbl>
#R> 1 A z 10
#R> 2 A z NA
#R> 3 A y 10
#R> 4 A y 12
#R> 5 A y 13
#R> 6 A y 14
#R> 7 A y 55
#R> 8 B z NA
#R> 9 B z 9
#R> 10 B y NA
#R> 11 B y 55
Most of the summary functions shown in Table 8.1 will return NA if the variable being summarized contains any NAs. For example, the code below attempts to count the number of values in value and compute the mean and standard deviation of value for each group in group1.
tmp <- trouble2 %>%
  group_by(group1) %>%
  summarize(n=n(),
            mn=mean(value),
            sd=sd(value))
tmp
#R> # A tibble: 2 x 4
#R> group1 n mn sd
#R> <chr> <int> <dbl> <dbl>
#R> 1 A 7 NA NA
#R> 2 B 4 NA NA
There are at least two issues here. First, the count variable (n) suggests that there were 7 and 4 valid observations in the two groups, when in reality there were only 6 and 2. Second, the means and standard deviations could not be properly calculated because of the NAs in value.
The first issue of counting valid observations is addressed by using the sum(!is.na(x)) code shown in Table 8.1. This code is a combination of two functions. The is.na() function returns TRUE if an element of x is NA (and FALSE otherwise). The exclamation point in front of is.na() takes the complement of these values (i.e., TRUE becomes FALSE and vice versa) such that !is.na() returns TRUE if the element is not an NA. When logical values are given to sum() the TRUEs are converted to 1s and the FALSEs to 0s. Thus, the sum() of these logicals will return the number of TRUEs or, in this case, the number of elements that are not NA; i.e., the number of valid observations.
The second issue of the summary function returning NA if an NA exists in the variable is addressed by including na.rm=TRUE within the summary function. This argument serves to remove the NAs from the calculations and will, thus, return the summary of all non-missing elements.
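The pieces described above can be tried on a single vector before using them inside summarize(). This is only a small sketch with a made-up vector x; it uses base R functions alone.

```r
x <- c(10,NA,12,55)   # hypothetical vector with one missing value

is.na(x)              # TRUE only for the missing element
!is.na(x)             # TRUE for the valid elements
sum(!is.na(x))        # number of valid observations

mean(x)               # NA because of the missing value
mean(x,na.rm=TRUE)    # mean of the three non-missing values
```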
Thus, the following code provides a better summary of the count, mean, and standard deviation of the value variable.
tmp <- trouble2 %>%
  group_by(group1) %>%
  summarize(n=n(),
            n_valid=sum(!is.na(value)),
            mn=mean(value,na.rm=TRUE),
            sd=sd(value,na.rm=TRUE))
tmp
#R> # A tibble: 2 x 5
#R> group1 n n_valid mn sd
#R> <chr> <int> <int> <dbl> <dbl>
#R> 1 A 7 6 19 17.7
#R> 2 B 4 2 32 32.5
8.3 Wrangling by Group
Groupings can also be used with other dplyr verbs. For example, consider this simple data frame called grades that has hypothetical exam scores for students in two sections of a course.
#R> # A tibble: 11 x 3
#R> last section grade
#R> <chr> <dbl> <dbl>
#R> 1 Boshwitz 1 87.2
#R> 2 Lepal 1 56.9
#R> 3 Smith 1 74.4
#R> 4 Felix 1 92.5
#R> 5 Seidel 1 88.2
#R> 6 Phelps 2 71.2
#R> 7 McLaughlin 2 88.4
#R> 8 Robertson 2 56.5
#R> 9 Jak 2 78.3
#R> 10 Abel 2 67.6
#R> 11 Bonham 2 80.3
The code below uses rank() and desc() to create a new variable that is the rank of each student in the course based on their grade. The desc() function is used here to ensure that the student with the highest grade is given a rank of 1 (because rank() ranks in ascending order by default).
tmp <- grades %>%
  mutate(rnk=rank(desc(grade))) %>%
  arrange(rnk)
tmp
#R> # A tibble: 11 x 4
#R> last section grade rnk
#R> <chr> <dbl> <dbl> <dbl>
#R> 1 Felix 1 92.5 1
#R> 2 McLaughlin 2 88.4 2
#R> 3 Seidel 1 88.2 3
#R> 4 Boshwitz 1 87.2 4
#R> 5 Bonham 2 80.3 5
#R> 6 Jak 2 78.3 6
#R> 7 Smith 1 74.4 7
#R> 8 Phelps 2 71.2 8
#R> 9 Abel 2 67.6 9
#R> 10 Lepal 1 56.9 10
#R> 11 Robertson 2 56.5 11
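The rnk variable is shown as <dbl> because rank() assigns tied values the average of their ranks by default. The sketch below, which uses a made-up vector of scores that contains a tie, contrasts this with dplyr's min_rank(), which assigns tied values the smallest rank. It is offered as an alternative to consider rather than as what the code above does; the grades data have no ties, so both approaches would give the same ranks there.

```r
library(dplyr)

scores <- c(90,90,80,70)   # hypothetical scores; the first two are tied

rank(desc(scores))         # tied values share the average rank (1.5)
min_rank(desc(scores))     # tied values share the smallest rank (1)
```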
However, suppose that interest is in the rank WITHIN each section. Here group_by() can be used prior to mutate() so that the methods in mutate() are applied separately to each group.38
grades %<>%
  group_by(section) %>%
  mutate(rnk=rank(desc(grade))) %>%
  arrange(section,rnk)
grades
#R> # A tibble: 11 x 4
#R> # Groups: section [2]
#R> last section grade rnk
#R> <chr> <dbl> <dbl> <dbl>
#R> 1 Felix 1 92.5 1
#R> 2 Seidel 1 88.2 2
#R> 3 Boshwitz 1 87.2 3
#R> 4 Smith 1 74.4 4
#R> 5 Lepal 1 56.9 5
#R> 6 McLaughlin 2 88.4 1
#R> 7 Bonham 2 80.3 2
#R> 8 Jak 2 78.3 3
#R> 9 Phelps 2 71.2 4
#R> 10 Abel 2 67.6 5
#R> 11 Robertson 2 56.5 6
Note that in contrast to summarize() the grouping is not removed from the data frame when mutate() is used. Because the grouping variable is still intact, filter() can be used to, for example, return the three students with the highest grades in EACH section.39
top3 <- grades %>%
  filter(rnk<=3)
top3
#R> # A tibble: 6 x 4
#R> # Groups: section [2]
#R> last section grade rnk
#R> <chr> <dbl> <dbl> <dbl>
#R> 1 Felix 1 92.5 1
#R> 2 Seidel 1 88.2 2
#R> 3 Boshwitz 1 87.2 3
#R> 4 McLaughlin 2 88.4 1
#R> 5 Bonham 2 80.3 2
#R> 6 Jak 2 78.3 3
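Another possible route to the same six students, sketched here under the assumption that dplyr 1.0.0 or later is installed, is slice_max(), which selects the rows with the largest values within each group without first creating ranks. The grades data frame is rebuilt from the printed values so that the block stands alone, and top3_alt is an illustrative name. By default slice_max() keeps ties, so it could return more than three rows per section if grades were tied at the cutoff.

```r
library(dplyr)

# grades re-created from the printed data frame above
grades <- tibble(last=c("Boshwitz","Lepal","Smith","Felix","Seidel","Phelps",
                        "McLaughlin","Robertson","Jak","Abel","Bonham"),
                 section=c(1,1,1,1,1,2,2,2,2,2,2),
                 grade=c(87.2,56.9,74.4,92.5,88.2,71.2,88.4,56.5,78.3,67.6,80.3))

top3_alt <- grades %>%
  group_by(section) %>%
  slice_max(grade,n=3) %>%   # three highest grades in each section
  ungroup()
top3_alt
```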
Once again, note that the grouping is not removed from the data frame when using filter(). Thus, one could immediately calculate the mean grade for the highest three grades in each section.
top3 %>% summarize(n=n(),
                   mn=mean(grade))
#R> # A tibble: 2 x 3
#R> section n mn
#R> <dbl> <int> <dbl>
#R> 1 1 3 89.3
#R> 2 2 3 82.3
The use of mutate() with group_by() is less common but can be very powerful. As a simple example, suppose that one wanted to find the difference between each observation and the mean of its group. In the code below, because a grouping is declared, mean() within mutate() finds the mean for each group. Because this is within mutate() rather than summarize(), the group mean is repeated for each observation in the group.
tmp <- trouble2 %>%
  group_by(group1) %>%
  mutate(mn=mean(value,na.rm=TRUE),
         diff=value-mn)
tmp
#R> # A tibble: 11 x 5
#R> # Groups: group1 [2]
#R> group1 group2 value mn diff
#R> <chr> <chr> <dbl> <dbl> <dbl>
#R> 1 A z 10 19 -9
#R> 2 A z NA 19 NA
#R> 3 A y 10 19 -9
#R> 4 A y 12 19 -7
#R> 5 A y 13 19 -6
#R> 6 A y 14 19 -5
#R> 7 A y 55 19 36
#R> 8 B z NA 32 NA
#R> 9 B z 9 32 -23
#R> 10 B y NA 32 NA
#R> 11 B y 55 32 23
dplyr verbs other than summarize() will not remove a level of groupings.
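One way to check this, sketched with a small made-up grouped data frame g, is to ask for the grouping variables with group_vars() after each verb. Only the summarize() call should come back with no grouping variables, because g has just one level of grouping to remove.

```r
library(dplyr)

# Hypothetical grouped data frame with one grouping variable
g <- tibble(section=c(1,1,2,2),
            grade=c(90,80,70,60)) %>%
  group_by(section)

group_vars(g %>% mutate(diff=grade-mean(grade)))   # still "section"
group_vars(g %>% filter(grade>65))                 # still "section"
group_vars(g %>% arrange(grade))                   # still "section"
group_vars(g %>% summarize(mn=mean(grade)))        # no grouping variables left
```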
8.4 Ungrouping
As a general rule-of-thumb it is best to remove the groupings from your data frame once you know you are done summarizing, filtering, etc. based on groups. There are two main reasons for this. First, as noted above, many dplyr verbs work on groupings. Thus, if your data frame maintains groupings after you are done (in your mind) with groupings then you may get unintended results.
As a very simple example, suppose that you want to use slice() to retain ONLY the first row of a data frame. However, if that data frame has groupings (e.g., after a first level of summarizing) then slice() will return rows from each group. For example, suppose that you want only the first row of sum_trouble_2 created above (note below that it retained a grouping variable).
sum_trouble_2
#R> # A tibble: 4 x 5
#R> # Groups: group1 [2]
#R> group1 group2 n sum mn
#R> <chr> <chr> <int> <dbl> <dbl>
#R> 1 A y 5 104 20.8
#R> 2 A z 2 19 9.5
#R> 3 B y 2 66 33
#R> 4 B z 2 19 9.5
tmp <- sum_trouble_2 %>%
  slice(1)
tmp
#R> # A tibble: 2 x 5
#R> # Groups: group1 [2]
#R> group1 group2 n sum mn
#R> <chr> <chr> <int> <dbl> <dbl>
#R> 1 A y 5 104 20.8
#R> 2 B y 2 66 33
As you can see, slice() was applied to both groups of group1 such that the first row of each group was returned, which was not the intended outcome.
As another example, suppose that you wanted to change the names of the groups in group1 in sum_trouble_2.
tmp <- sum_trouble_2 %>%
  mutate(group1=plyr::mapvalues(group1,from=c("A","B"),to=c("Alex","Bart")))
#R> The following `from` values were not present in `x`: B
#R> The following `from` values were not present in `x`: A
tmp
#R> # A tibble: 4 x 5
#R> # Groups: group1 [2]
#R> group1 group2 n sum mn
#R> <chr> <chr> <int> <dbl> <dbl>
#R> 1 Alex y 5 104 20.8
#R> 2 Alex z 2 19 9.5
#R> 3 Bart y 2 66 33
#R> 4 Bart z 2 19 9.5
While this ultimately worked, the messages shown in the output suggest an issue. Again, mutate() is applied by group, and when working with group “A” there is no “B” value, which leads to the first message (and the second message comes from the opposite problem when working with group “B”).
Both of these issues can be corrected by using ungroup() to remove the groupings from the data frame.
tmp <- sum_trouble_2 %>%
  ungroup() %>%
  mutate(group1=plyr::mapvalues(group1,from=c("A","B"),to=c("Alex","Bart"))) %>%
  slice(1)
tmp
#R> # A tibble: 1 x 5
#R> group1 group2 n sum mn
#R> <chr> <chr> <int> <dbl> <dbl>
#R> 1 Alex y 5 104 20.8
As a general rule-of-thumb, I suggest using ungroup() at the end of a piping chain where you know you are done with the groupings. For example, instead of using ungroup() as in the previous code, I would have created sum_trouble_2 as follows.
sum_trouble_2 <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            sum=sum(value),
            mn=mean(value)) %>%
  ungroup()
sum_trouble_2
#R> # A tibble: 4 x 5
#R> group1 group2 n sum mn
#R> <chr> <chr> <int> <dbl> <dbl>
#R> 1 A y 5 104 20.8
#R> 2 A z 2 19 9.5
#R> 3 B y 2 66 33
#R> 4 B z 2 19 9.5
Notice how the tibble does not show any grouping structure.
To avoid unforeseen behavior, grouping variable(s) should be removed from the data frame with ungroup() if you are done summarizing or wrangling by group.
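In dplyr 1.0.0 and later, summarize() also accepts a .groups argument, and setting .groups="drop" appears to give the same ungrouped result as following summarize() with ungroup(). The sketch below compares the two on trouble, rebuilt here from the printed values; the object names with_ungroup and with_drop are illustrative only.

```r
library(dplyr)

# trouble re-created from the printed data frame shown earlier
trouble <- tibble(group1=c(rep("A",7),rep("B",4)),
                  group2=c("z","z","y","y","y","y","y","z","z","y","y"),
                  value=c(10,9,10,12,13,14,55,10,9,11,55))

# Approach used in this module: summarize() followed by ungroup()
with_ungroup <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            mn=mean(value)) %>%
  ungroup()

# Assumed alternative: drop all groupings inside summarize()
with_drop <- trouble %>%
  group_by(group1,group2) %>%
  summarize(n=n(),
            mn=mean(value),
            .groups="drop")

group_vars(with_drop)               # no grouping variables remain
all.equal(with_ungroup,with_drop)   # the two results should match
```

Either form leaves a plain tibble, so the choice is mostly a matter of which reads more clearly at the end of the chain.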
8.5 Examples in Context
8.5.1 Student Data
In Section 4.4.1 a data frame called schedules2 was constructed that contained a student’s ID number with each course they were enrolled in along with the course’s credits and instructor.
schedules2
#R> # A tibble: 20 x 4
#R> studentID course credits instructor
#R> <dbl> <chr> <dbl> <chr>
#R> 1 34535 MTH107 4 Ogle
#R> 2 34535 BIO115 4 Johnson
#R> 3 34535 CHM110 4 Carlson
#R> 4 34535 IDS101 3 Goyke
#R> 5 45423 SCD110 3 Tochterman
#R> 6 45423 PSY110 4 Sneyd
#R> 7 45423 MTH140 4 Jensen
#R> 8 45423 OED212 3 Andre
#R> 9 45423 IDS101 3 Goyke
#R> 10 73424 BIO234 4 Goyke
#R> 11 73424 CHM220 4 Robertson
#R> 12 73424 BIO370 4 Anich
#R> 13 73424 SCD110 3 Tochterman
#R> 14 89874 SCD440 4 Foster
#R> 15 89874 PSY370 4 Sneyd
#R> 16 89874 IDS490 4 Hannickel
#R> 17 98222 SCD440 4 Foster
#R> 18 98222 SCD330 3 Tochterman
#R> 19 98222 SOC480 4 Schanning
#R> 20 98222 ART220 3 Duffy
Additionally, recall that there was a data frame called personal that contained personal information about each student (along with the ID).
personal
#R> # A tibble: 5 x 5
#R> studentID first_nm last_nm hometown homestate
#R> <dbl> <chr> <chr> <chr> <chr>
#R> 1 34535 Rolando Blackman Windsor MI
#R> 2 45423 Catherine Johnson Eden Prairie MN
#R> 3 73424 James Carmichael Marion IA
#R> 4 89874 Rachel Brown Milwaukee WI
#R> 5 98222 Esteban Perez El Paso TX
In this example, suppose that the registrar wants to create a report that has the number of courses and the total number of credits taken appended to the personal information for each student. Construction of this report begins by summarizing schedules2 for each student.
sum_crs <- schedules2 %>%
  group_by(studentID) %>%
  summarize(num_courses=n(),
            num_credits=sum(credits))
sum_crs
#R> # A tibble: 5 x 3
#R> studentID num_courses num_credits
#R> <dbl> <int> <dbl>
#R> 1 34535 4 15
#R> 2 45423 5 17
#R> 3 73424 4 15
#R> 4 89874 3 12
#R> 5 98222 4 14
These results can then be left_join()ed with personal to create the desired report.
personal2 <- personal %>%
  left_join(sum_crs,by="studentID")
personal2
#R> # A tibble: 5 x 7
#R> studentID first_nm last_nm hometown homestate num_courses num_credits
#R> <dbl> <chr> <chr> <chr> <chr> <int> <dbl>
#R> 1 34535 Rolando Blackman Windsor MI 4 15
#R> 2 45423 Catherine Johnson Eden Prairie MN 5 17
#R> 3 73424 James Carmichael Marion IA 4 15
#R> 4 89874 Rachel Brown Milwaukee WI 3 12
#R> 5 98222 Esteban Perez El Paso TX 4 14
8.5.2 Resource Sampling Data
In Section 4.4.2 a data frame called fishcatch was created that had the species and number of that species caught in each of five nets. The date and lake where the net was set were also recorded.
fishcatch
#R> net_num lake date species number
#R> 1 1 Eagle 3-Jul-21 Bluegill 7
#R> 2 1 Eagle 3-Jul-21 Largemouth Bass 3
#R> 3 2 Hart 3-Jul-21 Bluegill 19
#R> 4 2 Hart 3-Jul-21 Largemouth Bass 2
#R> 5 2 Hart 3-Jul-21 Bluntnose Minnow 56
#R> 6 3 Hart 5-Jul-21 <NA> NA
#R> 7 4 Eagle 6-Jul-21 Bluegill 3
#R> 8 4 Eagle 6-Jul-21 Largemouth Bass 6
#R> 9 5 Millicent 6-Jul-21 Largemouth Bass 3
Suppose a technician wants to summarize the number of species caught and the total catch (regardless of species) in each net. An examination of the data frame above reveals NAs for species and number for one of the nets that did not catch any fish. Because of this we cannot simply count the number of rows for each net_num to get the number of species. Instead this calculation will have to be treated as described for finding the valid number of observations. The total catch in each net_num can be found with sum() but it must include na.rm=TRUE to account for the missing data.
tmp <- fishcatch %>%
  group_by(net_num) %>%
  summarize(num_spec=sum(!is.na(species)),
            ttl_catch=sum(number,na.rm=TRUE))
tmp
#R> # A tibble: 5 x 3
#R> net_num num_spec ttl_catch
#R> <dbl> <int> <dbl>
#R> 1 1 2 10
#R> 2 2 3 77
#R> 3 3 0 0
#R> 4 4 2 9
#R> 5 5 1 3
The resulting data frame is missing the specific information (date and lake) for each net_num. A trick for including information that is specific (and thus repeated) to the grouping variable is to include those variables as grouping variables prior to the main grouping variable. For example, there is only one net_num per lake and date combination so including lake and date prior to net_num will not alter the results but will retain the lake and date values. If you use this trick, make sure to ungroup() after the summarization so there are no unintended consequences of adding the extra grouping variables.
tmp <- fishcatch %>%
  group_by(lake,date,net_num) %>%
  summarize(num_spec=sum(!is.na(species)),
            ttl_catch=sum(number,na.rm=TRUE)) %>%
  ungroup() %>%
  arrange(net_num) %>%
  relocate(net_num)
tmp
#R> # A tibble: 5 x 5
#R> net_num lake date num_spec ttl_catch
#R> <dbl> <chr> <chr> <int> <dbl>
#R> 1 1 Eagle 3-Jul-21 2 10
#R> 2 2 Hart 3-Jul-21 3 77
#R> 3 3 Hart 5-Jul-21 0 0
#R> 4 4 Eagle 6-Jul-21 2 9
#R> 5 5 Millicent 6-Jul-21 1 3
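A different way to carry along values that are constant within each net, sketched here as an alternative rather than as the approach used above, is to group only by net_num and pull the lake and date with first() from Table 8.1. This relies on the assumption that lake and date never vary within a net_num; if they did, first() would silently keep only the first value. The fishcatch data frame is rebuilt from the printed values, and alt is an illustrative name.

```r
library(dplyr)

# fishcatch re-created from the printed data frame above
fishcatch <- data.frame(
  net_num=c(1,1,2,2,2,3,4,4,5),
  lake=c("Eagle","Eagle","Hart","Hart","Hart","Hart","Eagle","Eagle","Millicent"),
  date=c("3-Jul-21","3-Jul-21","3-Jul-21","3-Jul-21","3-Jul-21",
         "5-Jul-21","6-Jul-21","6-Jul-21","6-Jul-21"),
  species=c("Bluegill","Largemouth Bass","Bluegill","Largemouth Bass",
            "Bluntnose Minnow",NA,"Bluegill","Largemouth Bass","Largemouth Bass"),
  number=c(7,3,19,2,56,NA,3,6,3),
  stringsAsFactors=FALSE)

alt <- fishcatch %>%
  group_by(net_num) %>%
  summarize(lake=first(lake),                    # constant within a net
            date=first(date),                    # constant within a net
            num_spec=sum(!is.na(species)),       # valid species records
            ttl_catch=sum(number,na.rm=TRUE))    # total catch, ignoring NA
alt
```

Because there is only one grouping variable, the result is already ungrouped and already ordered by net_num.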
8.5.3 Wolves and Moose of Isle Royale
In Section 6.7.2 a data frame called irmw2 was created that contained the number of wolves and moose, the winter air temperature, and whether or not an ice bridge to the mainland formed for each year from 1959-2012. In that module, an era variable was also created that categorized the years into “early,” “middle,” and “recent” time periods.
Suppose that the researchers want to compute summary statistics for the number of moose separated by era, and by era and whether an ice bridge formed. The latter is accomplished below.
tmp <- irmw2 %>%
  group_by(era,ice_bridges) %>%
  summarize(n=n(),
            n_valid=sum(!is.na(moose)),
            mean=mean(moose,na.rm=TRUE),
            sd=sd(moose,na.rm=TRUE),
            min=min(moose,na.rm=TRUE),
            Q1=quantile(moose,0.25,na.rm=TRUE),
            median=median(moose,na.rm=TRUE),
            Q3=quantile(moose,0.75,na.rm=TRUE),
            max=max(moose,na.rm=TRUE)) %>%
  ungroup()
tmp
#R> # A tibble: 6 x 11
#R> era ice_bridges n n_valid mean sd min Q1 median Q3 max
#R> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#R> 1 early no 4 4 734. 322. 538. 557. 592. 769. 1215.
#R> 2 early yes 12 12 864. 264. 572. 624. 807. 1079. 1243.
#R> 3 middle no 17 17 1150. 381. 767. 925. 1031. 1260. 2117.
#R> 4 middle yes 9 9 1277. 575. 780. 900. 976. 1496. 2398.
#R> 5 recent no 14 14 816. 364. 385 519. 750 1069. 1600
#R> 6 recent yes 5 5 1297 523. 650 1050 1250 1475 2060
Here I ungroup()ed the data frame because I want to make sure that I am not tempted to summarize the returned data frame that would have still had groupings by era. As mentioned in the main text it is inappropriate to compute most summaries on a second level of groupings after summarizing by the first level of groupings.
Thus, the first goal of the researchers is then accomplished below.
tmp <- irmw2 %>%
  group_by(era) %>%
  summarize(n=n(),
            n_valid=sum(!is.na(moose)),
            mean=mean(moose,na.rm=TRUE),
            sd=sd(moose,na.rm=TRUE),
            min=min(moose,na.rm=TRUE),
            Q1=quantile(moose,0.25,na.rm=TRUE),
            median=median(moose,na.rm=TRUE),
            Q3=quantile(moose,0.75,na.rm=TRUE),
            max=max(moose,na.rm=TRUE)) %>%
  ungroup()
tmp
#R> # A tibble: 3 x 10
#R> era n n_valid mean sd min Q1 median Q3 max
#R> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#R> 1 early 16 16 832. 274. 538. 592. 713. 1079. 1243.
#R> 2 middle 26 26 1194. 450. 767. 906. 1023. 1301. 2398.
#R> 3 recent 19 19 943. 452. 385 535 900 1185. 2060
35. There are no arguments to n().
36. Note here the use of the summarized but still grouped data frame and that the computation of numbers of months and cases had to be adjusted for the new variables in the summarized data frame.
37. In the two-level summarize the mean of the “A” group is calculated as \(\frac{20.8+9.5}{1+1}\) rather than \(\frac{104+19}{5+2}\).
38. This same ordering also could have been accomplished without creating the ranks and just using arrange(section,desc(grade)).
39. This could also have been accomplished with grades %>% slice_head(n=3).