Filter or Subset Data in R

How to make a data frame with only some individuals from a larger data frame.

There are many examples during this course where a subset of a data frame will be required for an exercise. Creating a subset from a larger data frame is called “filtering” and uses filter() from the dplyr package, which is loaded automatically with NCStats.

Example data used here is from Mirex.csv [data, meta], which is loaded below.

Mirex <- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/Mirex.csv")
str(Mirex)
#R>   'data.frame': 122 obs. of  4 variables:
#R>    $ year   : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
#R>    $ weight : num  0.41 0.45 1.04 1.09 1.24 1.25 1.3 1.34 1.37 1.49 ...
#R>    $ mirex  : num  0.16 0.19 0.19 0.1 0.13 0.19 0.28 0.16 0.17 0.2 ...
#R>    $ species: chr  "chinook" "chinook" "chinook" "coho" ...

The “levels” of species and year are shown below because they are used in the examples that follow.

unique(Mirex$species)
#R>   [1] "chinook" "coho"
unique(Mirex$year)
#R>   [1] 1977 1982 1986 1992 1996 1999

 

The filter() Function

The filter() function requires two arguments. The first argument is the data frame from which the subset should be drawn. The second argument is a condition statement that describes how the subset should be drawn from the original data frame. For example, the condition statement will be R code that says to select all cohos from the species variable, all fish that had a weight greater than 5 kg, or all years between 1982 and 1992. The condition statement will usually start with the name of a variable, followed by an “operator,” and then one or more “values” of the variable (Table 1). Specific types of condition statements are explained in the following sections.
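
As a minimal sketch of the two arguments described above, the code below uses a small made-up data frame (toy is hypothetical; its values are not from Mirex.csv). It shows that a condition statement by itself is simply a TRUE/FALSE value for each row and that filter() keeps the rows where that value is TRUE.

```r
library(dplyr)

# Hypothetical toy data with the same columns as Mirex (values are made up)
toy <- data.frame(
  year    = c(1977, 1982, 1986, 1992, 1996, 1999),
  weight  = c(0.8, 6.2, 3.1, 7.5, 14.0, 2.4),
  mirex   = c(0.16, 0.10, 0.02, 0.13, 0.21, 0.05),
  species = c("chinook", "coho", "coho", "chinook", "chinook", "coho")
)

# The condition on its own is one TRUE/FALSE per row
is_heavy <- toy$weight > 5
is_heavy

# filter(data frame, condition) keeps only the rows where the condition is TRUE
heavy <- filter(toy, weight > 5)
nrow(heavy)  ## CHECK - should match the number of TRUEs above
```

Note that the variable name is given without the data frame name and $ inside filter(), because filter() looks for variables in the data frame supplied as the first argument.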

Table 1: Comparison operators used in filter() and their results. Note that var generically represents a variable in the original data frame and value is a generic value or level. Both var and value would be replaced with specific items (see examples in following sections).
Comparison Operator          Rows Returned from Original Data Frame
var==value                   All rows where var IS equal to value
var!=value                   All rows where var is NOT equal to value
var %in% c(value1,value2)    All rows where var IS IN vector of values [1]
var>value                    All rows where var is greater than value [2]
var>=value                   All rows where var is greater than or equal to value [3]
var<value                    All rows where var is less than value [4]
var<=value                   All rows where var is less than or equal to value [5]
condition1,condition2        All rows where BOTH conditions are true
condition1 | condition2      All rows where ONE or BOTH conditions are true [6]

 

Isolating One Group

One group can be isolated from the original data frame using the == operator. For example, the code below selects all cohos from the species variable.

cohos <- filter(Mirex,species=="coho")
unique(cohos$species)  ## CHECK - should be just coho
#R>   [1] "coho"
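
A common stumbling block with == is that the match must be exact, including capitalization. The sketch below uses a small made-up data frame (toy is hypothetical) to show that a level typed with the wrong case silently returns zero rows rather than an error, which is why the CHECK step above is worth doing.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  species = c("chinook", "coho", "coho", "chinook"),
  weight  = c(1.2, 0.6, 2.4, 5.1)
)

# Exact match: two equals signs and the level spelled as it appears in the data
cohos <- filter(toy, species == "coho")
nrow(cohos)      ## CHECK - should be 2

# Wrong capitalization matches nothing, so the result has no rows
wrongcase <- filter(toy, species == "Coho")
nrow(wrongcase)  ## CHECK - 0 rows signals a mistyped level
```

Checking unique() on the original variable first, as was done for Mirex$species above, shows exactly how each level is spelled.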

 

Isolating Two Groups

Two groups can be isolated from the original data using the %in% operator. For example, the code below selects all individuals that were in either 1986 or 1996.

yr86n96 <- filter(Mirex,year %in% c(1986,1996))
unique(yr86n96$year)  ## CHECK - should be just 1986 and 1996
#R>   [1] 1986 1996
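
The sketch below, which uses a small made-up data frame (toy is hypothetical), illustrates why %in% rather than == is used with more than one value. With %in%, each row is compared to every value in the vector. With ==, R instead pairs the values with the rows in turn (1986 with row 1, 1996 with row 2, 1986 with row 3, and so on), which rarely gives the intended subset.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  year   = c(1977, 1986, 1996, 1986, 1996, 1999),
  weight = c(0.8, 3.1, 14.0, 2.2, 6.4, 2.4)
)

# %in% keeps every row whose year is one of the listed values
with_in <- filter(toy, year %in% c(1986, 1996))
nrow(with_in)   ## CHECK - should be 4

# The same subset written with two == conditions joined by "or"
with_or <- filter(toy, year == 1986 | year == 1996)
nrow(with_or)   ## CHECK - should also be 4

# == with a vector compares row-by-row against alternating values;
# in this toy data no row happens to line up, so nothing is returned
with_eq <- filter(toy, year == c(1986, 1996))
nrow(with_eq)   ## CHECK - 0 rows here, not the intended 4
```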

 

Eliminating One Group

A group can be eliminated (thus, isolating all of the other groups) with the != operator. For example, the code below eliminates all cohos (leaving just chinook in this example).

notcoho <- filter(Mirex,species!="coho")
unique(notcoho$species)  ## CHECK - should not include coho
#R>   [1] "chinook"
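
The != operator handles only one value. One way to eliminate two or more groups, sketched below with a small made-up data frame (toy is hypothetical), is to wrap a %in% condition in parentheses and negate it with !, which is R's "not" operator.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  year   = c(1977, 1982, 1986, 1992, 1996, 1999),
  weight = c(0.8, 6.2, 3.1, 7.5, 14.0, 2.4)
)

# Keep every row whose year is NOT 1977 and NOT 1999
not77n99 <- filter(toy, !(year %in% c(1977, 1999)))
unique(not77n99$year)  ## CHECK - should not include 1977 or 1999
```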

 

Isolating Relative to a Quantitative Variable

Individuals may be selected based on the relative value of a quantitative variable using the >, >=, <, and <= operators. For example, the code below selects all fish that weighed more than 5 kg.

heavier5kg <- filter(Mirex,weight>5)
min(heavier5kg$weight)  ## CHECK - should be greater than 5
#R>   [1] 5.09

Similarly, the code below selects all fish with a mirex concentration less than or equal to 0.2.

mirexlt2 <- filter(Mirex,mirex<=0.2)
max(mirexlt2$mirex)  ## CHECK - should not be greater than 0.2
#R>   [1] 0.2
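
A range of a quantitative variable, such as all years between 1982 and 1992, can be selected in more than one way. The sketch below uses a small made-up data frame (toy is hypothetical) and the between() helper from dplyr, which includes both endpoints. It is offered as one possible approach rather than the only one.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  year   = c(1977, 1982, 1986, 1992, 1996, 1999),
  weight = c(0.8, 6.2, 3.1, 7.5, 14.0, 2.4)
)

# between(var, low, high) is TRUE when low <= var <= high
yr82to92 <- filter(toy, between(year, 1982, 1992))
range(yr82to92$year)  ## CHECK - should be no less than 1982 and no more than 1992
```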

 

Two Conditions

Must Have Both

Individuals that meet both of two conditions (i.e., must have this and that) are selected by separating the two conditions with a comma in filter(). For example, the code below selects cohos with a weight less than 1 kg.

smallcoho <- filter(Mirex,species=="coho",weight<1)
smallcoho  ## CHECK -- should be all coho and all weights less than 1
#R>     year weight mirex species
#R>   1 1982   0.46  0.10    coho
#R>   2 1982   0.63  0.09    coho
#R>   3 1982   0.70  0.10    coho
#R>   4 1986   0.34  0.02    coho
#R>   5 1986   0.34  0.12    coho
#R>   6 1986   0.68  0.03    coho
#R>   7 1996   0.56  0.06    coho
#R>   8 1996   0.80  0.07    coho
#R>   9 1996   0.90  0.07    coho
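
In filter(), separating conditions with a comma gives the same rows as joining them with &, which is R's "and" operator. The sketch below demonstrates this with a small made-up data frame (toy is hypothetical).

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  species = c("coho", "coho", "chinook", "chinook"),
  weight  = c(0.46, 2.6, 0.41, 7.5)
)

# Two conditions separated by a comma
with_comma <- filter(toy, species == "coho", weight < 1)

# The same two conditions joined with &
with_and <- filter(toy, species == "coho" & weight < 1)

identical(with_comma, with_and)  ## CHECK - should be TRUE
```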

Can Have Either

Individuals that meet one or both of two conditions (i.e., can have this or that) are selected by separating the two conditions with a | in filter(). For example, the code below selects fish from 1992 or that had a weight greater than or equal to 13 kg.

weird <- filter(Mirex,year==1992 | weight>=13)
weird
#R>      year weight mirex species
#R>   1  1992    1.9  0.10    coho
#R>   2  1992    2.0  0.09    coho
#R>   3  1992    2.4  0.12 chinook
#R>   4  1992    2.6  0.15    coho
#R>   5  1992    7.5  0.13 chinook
#R>   6  1992    7.9  0.18    coho
#R>   7  1992    8.6  0.34    coho
#R>   8  1992    9.1  0.27 chinook
#R>   9  1992   10.0  0.48 chinook
#R>   10 1992   10.3  0.25 chinook
#R>   11 1992   10.8  0.45 chinook
#R>   12 1992   12.3  0.28 chinook
#R>   13 1996   14.0  0.21 chinook
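
The "both" and "either" forms can be combined. The sketch below uses a small made-up data frame (toy is hypothetical) to select chinook that were either from 1992 or weighed at least 13 kg. It also shows why grouping matters: R evaluates & before |, so writing everything in one condition without parentheses answers a different question.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  year    = c(1992, 1992, 1996, 1996, 1986),
  weight  = c(1.9, 7.5, 14.0, 13.5, 2.0),
  species = c("coho", "chinook", "chinook", "coho", "chinook")
)

# Must be chinook AND (from 1992 OR at least 13 kg);
# the comma keeps the "or" condition together as one unit
grouped <- filter(toy, species == "chinook", year == 1992 | weight >= 13)
nrow(grouped)    ## CHECK - 2 rows, all chinook

# Without parentheses this reads as (chinook AND 1992) OR (at least 13 kg),
# which also lets in the heavy coho
ungrouped <- filter(toy, species == "chinook" & year == 1992 | weight >= 13)
nrow(ungrouped)  ## CHECK - 3 rows, including one coho
```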

  1. value should be a character, factor, or integer.

  2. value must be numeric.

  3. value must be numeric.

  4. value must be numeric.

  5. value must be numeric.

  6. Note that this “or” operator is a “vertical line,” which is typed by holding Shift and pressing the backslash key.