How to make a data frame with only some individuals from a larger data frame.
Many exercises in this course require a subset of a data frame. Creating a subset from a larger data frame is called “filtering” and uses filter() from the dplyr package, which is loaded automatically with NCStats.
Example data used here are from Mirex.csv [data, meta]; the structure of the loaded data frame is shown below.
#R> 'data.frame': 122 obs. of 4 variables:
#R> $ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
#R> $ weight : num 0.41 0.45 1.04 1.09 1.24 1.25 1.3 1.34 1.37 1.49 ...
#R> $ mirex : num 0.16 0.19 0.19 0.1 0.13 0.19 0.28 0.16 0.17 0.2 ...
#R> $ species: chr "chinook" "chinook" "chinook" "coho" ...
The “levels” of species and year are shown below to aid in understanding how they are used in later examples.
unique(Mirex$species)
#R> [1] "chinook" "coho"
unique(Mirex$year)
#R> [1] 1977 1982 1986 1992 1996 1999
filter() Function
The filter() function requires two arguments. The first argument is the data frame from which the subset should be drawn. The second argument is a condition statement that describes which rows to keep from the original data frame. For example, the condition statement will be R code that says to select all cohos from the species variable, all fish that had a weight greater than 5 kg, or all years between 1982 and 1992. The condition statement will usually start with the name of a variable, followed by an “operator,” and then one or more “values” of the variable (Table 1). Specific types of condition statements are explained in the following sections.
Comparison Operator | Rows Returned from Original Data Frame |
---|---|
var==value | All rows where var IS equal to value |
var!=value | All rows where var is NOT equal to value |
var %in% c(value1,value2) | All rows where var IS IN vector of values¹ |
var>value | All rows where var is greater than value² |
var>=value | All rows where var is greater than or equal to value³ |
var<value | All rows where var is less than value⁴ |
var<=value | All rows where var is less than or equal to value⁵ |
condition1,condition2 | All rows where BOTH conditions are true |
condition1 \| condition2 | All rows where ONE or BOTH conditions are true⁶ |
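As a sketch of the two-argument form described above, the example below applies two of the Table 1 conditions to a small made-up data frame. The data frame fish and its values are assumptions for illustration only (not the Mirex data), and dplyr is loaded directly here rather than through NCStats.

```r
library(dplyr)

# Hypothetical toy data (NOT the Mirex data), used only for illustration
fish <- data.frame(
  year    = c(1977, 1982, 1986, 1992, 1996, 1999),
  weight  = c(0.41, 6.20, 0.34, 7.90, 14.00, 5.09),
  species = c("chinook", "coho", "coho", "coho", "chinook", "chinook")
)

# First argument is the data frame, second argument is the condition
heavy <- filter(fish, weight > 5)

# "Between 1982 and 1992" is two conditions that must both be true
mid <- filter(fish, year >= 1982, year <= 1992)

nrow(heavy)   # number of rows kept by the first condition
mid$year      # years that satisfied both conditions
```

The original data frame is not changed by filter(); each result is saved to a new object so that it can be checked and used later.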
One group can be isolated from the original data frame using the == operator. For example, the code below selects all cohos from the species variable.
#R> [1] "coho"
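A minimal sketch of isolating one group with ==, using a small made-up data frame. The names fish and justcoho and the data values are assumptions for illustration, not the objects used with the Mirex data.

```r
library(dplyr)

# Hypothetical toy data (NOT the Mirex data)
fish <- data.frame(
  weight  = c(0.41, 6.20, 0.34, 7.90),
  species = c("chinook", "coho", "coho", "chinook")
)

# Keep only rows where species is exactly "coho" (note the double ==)
justcoho <- filter(fish, species == "coho")
unique(justcoho$species)   ## CHECK - should be just "coho"
```

Note that the value is in quotes because species is a text variable; a single = would not perform a comparison.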
Two (or more) groups can be isolated from the original data frame using the %in% operator. For example, the code below selects all individuals that were from either 1986 or 1996.
yr86n96 <- filter(Mirex,year %in% c(1986,1996))
unique(yr86n96$year) ## CHECK - should be just 1986 and 1996
#R> [1] 1986 1996
A group can be eliminated (thus, isolating all of the other groups) with the != operator. For example, the code below eliminates all cohos (leaving just chinook in this example).
#R> [1] "chinook"
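A minimal sketch of eliminating one group with !=, again using a made-up data frame. The names fish and nocoho and the data values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data (NOT the Mirex data)
fish <- data.frame(
  weight  = c(0.41, 6.20, 0.34, 7.90),
  species = c("chinook", "coho", "coho", "chinook")
)

# Keep every row where species is NOT "coho"
nocoho <- filter(fish, species != "coho")
unique(nocoho$species)   ## CHECK - "coho" should not appear
```

With more than two groups in the variable, all groups other than the one named would remain.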
Individuals may be selected based on the relative value of a quantitative variable using the usual comparison operators (>, >=, <, and <=). For example, the code below selects all fish that weighed more than 5 kg.
#R> [1] 5.09
Similarly, the code below selects all fish with a mirex concentration less than or equal to 0.2.
#R> [1] 0.2
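A minimal sketch of both kinds of quantitative filter, with the minimum or maximum of the result used as a check. The data frame fish, the object names big and lowmirex, and all values are assumptions for illustration, not the Mirex data.

```r
library(dplyr)

# Hypothetical toy data (NOT the Mirex data)
fish <- data.frame(
  weight = c(0.41, 6.20, 0.34, 7.90),
  mirex  = c(0.16, 0.30, 0.02, 0.20)
)

# Fish that weighed more than 5 kg
big <- filter(fish, weight > 5)
min(big$weight)       ## CHECK - smallest weight kept should exceed 5

# Fish with a mirex concentration of 0.2 or less
lowmirex <- filter(fish, mirex <= 0.2)
max(lowmirex$mirex)   ## CHECK - largest mirex kept should not exceed 0.2
```

Checking the smallest value after a "greater than" filter, or the largest value after a "less than or equal to" filter, is a quick way to confirm that the condition did what was intended.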
Individuals that meet both of two conditions (i.e., must have this and that) are selected by separating the two conditions with a comma in filter(). For example, the code below selects cohos with a weight less than 1 kg.
smallcoho <- filter(Mirex,species=="coho",weight<1)
smallcoho ## CHECK -- should be all coho and all weights less than 1
#R> year weight mirex species
#R> 1 1982 0.46 0.10 coho
#R> 2 1982 0.63 0.09 coho
#R> 3 1982 0.70 0.10 coho
#R> 4 1986 0.34 0.02 coho
#R> 5 1986 0.34 0.12 coho
#R> 6 1986 0.68 0.03 coho
#R> 7 1996 0.56 0.06 coho
#R> 8 1996 0.80 0.07 coho
#R> 9 1996 0.90 0.07 coho
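In dplyr, conditions separated by commas in filter() are combined as a logical "and", so the & operator can be used in place of the comma. The sketch below compares the two forms on a made-up data frame; the names fish, a, and b and the data values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data (NOT the Mirex data)
fish <- data.frame(
  weight  = c(0.41, 0.63, 0.34, 7.90),
  species = c("chinook", "coho", "coho", "coho")
)

# Comma form: both conditions must be true
a <- filter(fish, species == "coho", weight < 1)

# Equivalent form using & between the conditions
b <- filter(fish, species == "coho" & weight < 1)

all.equal(a, b)   # compares the two results
```

The comma form is used in this document; the & form may be easier to read when "and" and "or" conditions are mixed in one statement.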
Individuals that meet one or both of two conditions (i.e., can have this or that) are selected by separating the two conditions with a | in filter(). For example, the code below selects fish that were from 1992 or that had a weight greater than or equal to 13 kg.
weird <- filter(Mirex,year==1992 | weight>=13)
weird
#R> year weight mirex species
#R> 1 1992 1.9 0.10 coho
#R> 2 1992 2.0 0.09 coho
#R> 3 1992 2.4 0.12 chinook
#R> 4 1992 2.6 0.15 coho
#R> 5 1992 7.5 0.13 chinook
#R> 6 1992 7.9 0.18 coho
#R> 7 1992 8.6 0.34 coho
#R> 8 1992 9.1 0.27 chinook
#R> 9 1992 10.0 0.48 chinook
#R> 10 1992 10.3 0.25 chinook
#R> 11 1992 10.8 0.45 chinook
#R> 12 1992 12.3 0.28 chinook
#R> 13 1996 14.0 0.21 chinook