Much of my work is with undergraduates who are first learning to analyze fisheries data. A common “learning opportunity” occurs when students are asked to compute the mean catch (or CPE), along with a standard deviation (SD), across multiple gear sets for each species. The learning opportunity occurs because some species will invariably not be caught in some gear sets. When the students summarize the number of fish caught for each species in each gear set, those species not caught in a particular gear set will not “appear” in their data. Thus, when calculating the mean, the student will get the correct numerator (sum of catch across all gear sets) but the wrong denominator (the number of gear sets in which the species was caught rather than the total number of gear sets), which inflates (over-estimates) the mean catch and (usually) deflates (under-estimates) the SD of catches. Once confronted with this issue, they easily realize how to correct the mean calculation, but calculating the standard deviation is still an issue. These problems are exacerbated when using software to compute these summary statistics across many individual gear sets.
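The arithmetic behind this problem can be sketched with a few made-up numbers. The catches below are hypothetical (they are not from the data used later in this post) and represent one species across five gear sets, two of which caught none of that species.

```r
# Hypothetical catches of one species in five gear sets;
# the species was not caught in the second and fourth sets.
catch_with_zeroes <- c(4, 0, 2, 0, 3)

# What a student sees if the zero catches never "appear" in the data
catch_no_zeroes <- catch_with_zeroes[catch_with_zeroes > 0]

mean(catch_no_zeroes)    # 9/3 = 3   (inflated)
mean(catch_with_zeroes)  # 9/5 = 1.8 (correct)

sd(catch_no_zeroes)      # 1          (deflated)
sd(catch_with_zeroes)    # about 1.79 (correct)
```

The numerator (9 fish) is the same in both means; only the denominator changes.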
In software, the “trick” is to add a zero for each species that was caught in at least one gear set but not in a specific gear set. For example, if Bluegill were caught in at least one gear set but not in the third gear set, then a zero must be added as the catch of Bluegill in the third gear set. The addZeroCatch() function in the FSA package was an attempt to efficiently add these zeroes. This function has proven useful over the years, but I have become dissatisfied with its clunkiness. Additionally, I recently became aware of the complete() function in the tidyr package, which holds promise for handling the same task. In this post, I explore the use of complete() for handling this issue.
This post requires the dplyr and tidyr packages. It also uses FSA behind the scenes.
Example 1 - Very Simple Data
In this first example, the data consist of the species and length recorded for each captured fish, organized by the gear set identification number (ID) and held in a single data.frame.
The catch of each species in each gear set may be found by grouping the data by ID and species and then counting the rows in each group with n(). Note that as.data.frame() is used simply to remove the tibble structure that dplyr returns.
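A minimal sketch of this counting step follows. The data.frame name (fishdat), the result name (catch), and all values are my own stand-ins, invented so that the missing combinations match those discussed in this example. They are not the original data.

```r
library(dplyr)

# Hypothetical individual-fish data (names and values are assumptions):
# one row per captured fish from five gear sets (ID) and three species.
fishdat <- data.frame(
  ID      = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5),
  species = c("BLG", "BLG", "LMB", "YEP", "LMB", "YEP",
              "BLG", "YEP", "YEP", "BLG", "LMB", "YEP", "BLG"),
  length  = c(110, 125, 310, 180, 285, 175, 130, 190, 165, 120, 330, 170, 115)
)

# Count the fish of each species in each gear set
catch <- fishdat %>%
  group_by(ID, species) %>%
  summarize(num = n(), .groups = "drop") %>%
  as.data.frame()   # drop the tibble structure

catch
```

Only the combinations of ID and species that were actually observed get a row, so there are fewer rows than five gear sets times three species.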
From this it is seen that three species (“BLG”, “LMB”, and “YEP”) were captured across all nets, but that “BLG” were not captured in “ID=2”, “LMB” were not captured in “ID=3”, and “LMB” and “YEP” were not captured in “ID=5”. The sample size, mean, and SD of catches per species from these data may be found by again using summarize(). However, note that these calculations are INCORRECT because they do not include the zero catches of “BLG” in “ID=2”, “LMB” in “ID=3”, and “LMB” and “YEP” in “ID=5”. The problem is most evident in the sample sizes, which should be five (gear sets) for each species.
The complete() function can be used to add rows to a data.frame for variables (or combinations of variables) that should be present in the data.frame (relative to other values that are present) but are not. The complete() function takes a data.frame as its first argument (but it will be “piped” in below with %>%) followed by the variable or variables that will be used to identify which items are missing. For example, with these data, a zero should be added to num for missing combinations defined by ID and species.
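A sketch of this call, using a small hypothetical catch summary (the name catch and its values are assumptions chosen to have the same missing combinations described above):

```r
library(dplyr)
library(tidyr)

# Hypothetical catch per gear set (ID) and species, WITHOUT zero catches
catch <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5),
  species = c("BLG", "LMB", "YEP", "LMB", "YEP", "BLG",
              "YEP", "BLG", "LMB", "YEP", "BLG"),
  num     = c(2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1)
)

# Add a row for every combination of ID and species that is not present
catch_all <- catch %>%
  complete(ID, species) %>%
  as.data.frame()

catch_all
```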
From this result, it is seen that complete() added a row for “BLG” in “ID=2”, “LMB” in “ID=3”, and “LMB” and “YEP” in “ID=5”, as we had hoped. However, complete() fills the added rows with NAs by default. The value to add can be changed with fill=, which takes a list that includes the name of the variable to which the NAs were added (num in this case) set equal to the value to be added (0 in this case).
These correct catch data can then be summarized as above to show the correct sample size, mean, and SD of catches per species.
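The following sketch puts the fill= step and the summary together, and contrasts the summaries computed with and without the zeroes. The data and the names catch, catch0, and sum_catch() are hypothetical stand-ins rather than the objects from the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical catch per gear set (ID) and species, WITHOUT zero catches
catch <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5),
  species = c("BLG", "LMB", "YEP", "LMB", "YEP", "BLG",
              "YEP", "BLG", "LMB", "YEP", "BLG"),
  num     = c(2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1)
)

# Add the missing combinations, filling num with 0 rather than NA
catch0 <- catch %>%
  complete(ID, species, fill = list(num = 0)) %>%
  as.data.frame()

# Helper (hypothetical) to summarize catches per species
sum_catch <- function(d) {
  d %>%
    group_by(species) %>%
    summarize(n_sets = n(), mean_catch = mean(num), sd_catch = sd(num)) %>%
    as.data.frame()
}

sum_catch(catch)    # without zeroes: n_sets is 4, 3, and 4 (incorrect)
sum_catch(catch0)   # with zeroes: n_sets is 5 for every species
```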
Example 2 - Multiple Values to Receive Zeroes
Suppose that the fish data included a column that indicates whether the fish was marked and returned to the waterbody or not.
The catch and number of fish marked and returned per gear set ID and species may again be computed with summarize(). Note, however, the use of ifelse() to code a 1 if the fish was marked and a 0 if it was not. Summing these values returns the number of fish that were marked. Giving this data.frame to complete() as before will add zeroes for both the num and nmarked variables as long as both are included in the list given to fill=.
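A minimal sketch of this two-variable case follows. The data.frame name (fish_marked), the marked column with its "YES"/"NO" coding, and all values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Hypothetical individual-fish data with a marked indicator
fish_marked <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3),
  species = c("BLG", "BLG", "LMB", "LMB", "LMB", "BLG"),
  marked  = c("YES", "NO", "YES", "NO", "NO", "YES")
)

catch_marked <- fish_marked %>%
  group_by(ID, species) %>%
  summarize(num     = n(),                                  # total catch
            nmarked = sum(ifelse(marked == "YES", 1, 0)),   # number marked
            .groups = "drop") %>%
  # both summarized variables are listed in fill= so that both receive zeroes
  complete(ID, species, fill = list(num = 0, nmarked = 0)) %>%
  as.data.frame()

catch_marked
```

If a variable were left out of the list given to fill=, it would still receive NAs in the added rows.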
Example 3 - More Information that Does Not Get Zeroes
Suppose that a data.frame called geardat contains information specific to each gear set.
And, for the purposes of this example, let’s suppose that we have summarized catch data WITHOUT the zeroes having been added.
Finally, suppose that these summarized catch data are joined with the gear data such that the gear set specific information is shown with each catch.
These data simulate what might be seen from a flat database.
With these data, zeroes still need to be added as defined by missing combinations of ID and species. However, if only these two variables are included in complete(), then NAs, zeroes, or something else (if we define that) will be added to the gear-specific variables such as effort, which is not desired. These five variables are “nested” within the ID variable (i.e., if you know ID then you know these variables) and should be treated as a group. Nesting of variables can be handled in complete() by including these variables within nesting().
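The sketch below illustrates nesting with a deliberately small, hypothetical version of these data. Only two gear-specific variables are used (lake is an invented name and effort is taken from the text), and the names catch, flat, and flat0 are my own.

```r
library(dplyr)
library(tidyr)

# Hypothetical gear-set information (one row per gear set)
geardat <- data.frame(
  ID     = c(1, 2, 3),
  lake   = c("A", "A", "B"),
  effort = c(24, 24, 12)
)

# Hypothetical summarized catches WITHOUT zeroes
catch <- data.frame(
  ID      = c(1, 1, 2, 3),
  species = c("BLG", "LMB", "LMB", "BLG"),
  num     = c(2, 1, 2, 1)
)

# Join so that gear-set information appears with each catch (a "flat" table)
flat <- left_join(catch, geardat, by = "ID")

# Treat ID and its gear-specific variables as one group when completing,
# so that only num receives zeroes
flat0 <- flat %>%
  complete(nesting(ID, lake, effort), species, fill = list(num = 0)) %>%
  as.data.frame()

flat0
```

Because only combinations of ID, lake, and effort that occur in the data are used, the added rows carry the correct lake and effort for their gear set.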
It is possible to have nesting with species as well. Suppose, for example, that the scientific name for the species was included in the original fishdata2, which was then summarized (using a combination of the examples from above, but not shown here) to catch per gear set and species. The zeroes are then added to this data.frame, making sure to note the nesting within both ID and species.
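A sketch with nesting on both sides is shown below. The data.frame name (catch2), the sciname column name, and the values are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical summarized catches WITHOUT zeroes, carrying a gear-specific
# variable (effort) and a species-specific variable (sciname)
catch2 <- data.frame(
  ID      = c(1, 1, 2, 3),
  effort  = c(24, 24, 24, 12),
  species = c("BLG", "YEP", "YEP", "BLG"),
  sciname = c("Lepomis macrochirus", "Perca flavescens",
              "Perca flavescens", "Lepomis macrochirus"),
  num     = c(2, 1, 2, 1)
)

# effort is nested within ID and sciname is nested within species
catch2_0 <- catch2 %>%
  complete(nesting(ID, effort), nesting(species, sciname),
           fill = list(num = 0)) %>%
  as.data.frame()

catch2_0
```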
This is my first exploration of complete(), and it looks promising for this task of adding zeroes to data frames of catch by gear set for gear sets in which a species was not caught. I will be curious to hear what others think of this function and how it might fit in their workflow.
I find the tibble structure to be annoying with simple data.frames like this. Thus, I usually use as.data.frame() to remove it.