Module 24 Filtering Data in R
In Module 23, you learned how to retrieve data from the class webpage, enter your own data into a CSV file, load that data into R, and view that data in R. In this module, we will learn how to filter a data frame, that is, how to create smaller data frames that are subsets of a larger one. For example, you may want to create a data frame that contains just male bears from a data frame with both male and female bears, or a data frame that contains only sales during summer months from a data frame that contains all sales. Less often, you may wish to remove a particular individual from the data frame, perhaps because its data are considered erroneous.
This module also uses the bears data frame from Module 23, which is shown below for reference.
library(NCStats)
bears <- read.csv("Bears.csv")
bears
#R> length.cm weight.kg loc
#R> 1 139.0 110 Bayfield
#R> 2 138.0 60 Bayfield
#R> 3 139.0 90 Bayfield
#R> 4 120.5 60 Bayfield
#R> 5 149.0 85 Bayfield
#R> 6 141.0 100 Ashland
#R> 7 141.0 95 Ashland
#R> 8 150.0 85 Douglas
#R> 9 166.0 155 Douglas
#R> 10 151.5 140 Douglas
#R> 11 129.5 105 Douglas
#R> 12 150.0 110 Douglas
24.1 Filtering a Data Frame
It is common to create a new data frame that contains only some of the individuals from an existing data frame. The process of creating the newer, smaller data frame is called filtering and is accomplished with filterD().[80] The filterD() function requires the original data frame as the first argument and a condition statement as the second argument. The condition statement is used to either include or exclude individuals from the original data frame. Condition statements consist of the name of a variable in the original data frame, a comparison operator, and a comparison value (Table 24.1). The results from filterD() should be assigned to an object, which is then the name of the new data frame.
Comparison Operator | Rows Returned from Original Data Frame |
---|---|
var==value | All rows where var IS equal to value |
var!=value | All rows where var is NOT equal to value |
var %in% c(value1,value2) | All rows where var IS IN (or one of the) vector of values[81] |
var>value | All rows where var is greater than value[82] |
var>=value | All rows where var is greater than or equal to value[83] |
var<value | All rows where var is less than value[84] |
var<=value | All rows where var is less than or equal to value[85] |
condition1,condition2 | All rows where BOTH conditions are true |
condition1 \| condition2 | All rows where ONE or BOTH conditions are true[86] |
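It can help to see what a condition statement from Table 24.1 evaluates to on its own. The sketch below is not part of the module: it rebuilds the bears data by hand from the listing above (so it runs without Bears.csv) and uses only base R, with `&` standing in for the comma that joins two conditions in filterD(). The object names (is_bayfield, is_heavy, both, either) are made up for illustration.

```r
# bears rebuilt by hand from the listing above (assumption: same values as Bears.csv)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = c(rep("Bayfield", 5), rep("Ashland", 2), rep("Douglas", 5)),
  stringsAsFactors = FALSE
)

# A condition is evaluated once per row and gives TRUE or FALSE for each row
is_bayfield <- bears$loc == "Bayfield"
is_bayfield                  # TRUE for the first five rows, FALSE for the rest

is_heavy <- bears$weight.kg > 100

# Combining conditions: & means BOTH must be true, | means ONE or BOTH
both   <- is_bayfield & is_heavy
either <- is_bayfield | is_heavy

sum(both)                    # number of rows that meet both conditions
sum(either)                  # number of rows that meet at least one condition

# Rows where the condition is TRUE are the rows that a filter would keep
bears[both, ]
```

Counting the TRUE values with sum() before filtering is a quick way to anticipate how many rows the new data frame should contain.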
The following are examples of new data frames created from bears. The name of the new data frame (i.e., object left of the assignment operator) can be any valid object name. As demonstrated below, the new data frame should be examined after each filtering to ensure that the data frame actually contains the items that you desire.[87]
- Only individuals from Bayfield county.
bf <- filterD(bears,loc=="Bayfield")
bf
#R> length.cm weight.kg loc
#R> 1 139.0 110 Bayfield
#R> 2 138.0 60 Bayfield
#R> 3 139.0 90 Bayfield
#R> 4 120.5 60 Bayfield
#R> 5 149.0 85 Bayfield
- Individuals from either Bayfield or Ashland county.
bfash <- filterD(bears,loc %in% c("Bayfield","Ashland"))
bfash
#R> length.cm weight.kg loc
#R> 1 139.0 110 Bayfield
#R> 2 138.0 60 Bayfield
#R> 3 139.0 90 Bayfield
#R> 4 120.5 60 Bayfield
#R> 5 149.0 85 Bayfield
#R> 6 141.0 100 Ashland
#R> 7 141.0 95 Ashland
- Individuals NOT from Bayfield county.
bfnotbay <- filterD(bears,loc != "Bayfield")
bfnotbay
#R> length.cm weight.kg loc
#R> 1 141.0 100 Ashland
#R> 2 141.0 95 Ashland
#R> 3 150.0 85 Douglas
#R> 4 166.0 155 Douglas
#R> 5 151.5 140 Douglas
#R> 6 129.5 105 Douglas
#R> 7 150.0 110 Douglas
- Individuals with a weight greater than 100 kg.
gt100 <- filterD(bears,weight.kg>100)
gt100
#R> length.cm weight.kg loc
#R> 1 139.0 110 Bayfield
#R> 2 166.0 155 Douglas
#R> 3 151.5 140 Douglas
#R> 4 129.5 105 Douglas
#R> 5 150.0 110 Douglas
- Individuals from Douglas County that weighed at least 150 kg.
do150 <- filterD(bears,loc=="Douglas",weight.kg>=150)
do150
#R> length.cm weight.kg loc
#R> 1 166 155 Douglas
Examine the new data frame after filtering to ensure that it contains the data you intended.
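For readers who want to try the examples above without NCStats installed, a rough base R stand-in is subset(). This is a swapped-in technique, not the module's filterD(): the sketch assumes the hand-built bears data below matches Bears.csv, joins two conditions with `&` rather than a comma, and adds one "or" example (ash_or_150, a made-up name) because none of the examples above uses `|`.

```r
# bears rebuilt by hand from the listing above (assumption: same values as Bears.csv)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = c(rep("Bayfield", 5), rep("Ashland", 2), rep("Douglas", 5)),
  stringsAsFactors = FALSE
)

# Base R stand-ins for the filterD() calls above
bf       <- subset(bears, loc == "Bayfield")
bfash    <- subset(bears, loc %in% c("Bayfield", "Ashland"))
bfnotbay <- subset(bears, loc != "Bayfield")
gt100    <- subset(bears, weight.kg > 100)

# Two conditions that must BOTH be true are joined with & in subset()
do150 <- subset(bears, loc == "Douglas" & weight.kg >= 150)

# ONE or BOTH conditions true: bears from Ashland or weighing at least 150 kg
ash_or_150 <- subset(bears, loc == "Ashland" | weight.kg >= 150)
ash_or_150

# Row counts for comparison with the outputs shown above
c(bf = nrow(bf), bfash = nrow(bfash), bfnotbay = nrow(bfnotbay),
  gt100 = nrow(gt100), do150 = nrow(do150))
```

One visible difference is that subset() keeps the original row labels (e.g., 6, 7, and 9 in ash_or_150), whereas the filterD() results shown above are renumbered from 1.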
24.2 Selecting Entire Variables
As noted in Section 23.2.1, whole variables can be selected from a data frame with the $ notation. Recall that the $ separates the name of the data frame from the name of the variable within that data frame.
- The weights of all bears (i.e., in the bears data frame).
bears$weight.kg
#R> [1] 110 60 90 60 85 100 95 85 155 140 105 110
- The weights of all bears in Ashland and Bayfield counties (i.e., in bfash from above).
bfash$weight.kg
#R> [1] 110 60 90 60 85 100 95
- The location of all bears in just Bayfield county (i.e., in bf from above). [You might use something like this to check your filtering.]
bf$loc
#R> [1] "Bayfield" "Bayfield" "Bayfield" "Bayfield" "Bayfield"
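Selecting a variable with $ becomes a practical check on filtering when it is combined with a summarizing function. The sketch below is one possible way to do this, not the module's own procedure: it assumes the hand-built bears data matches Bears.csv, and it uses base R subset() as a stand-in for filterD() so that it runs without NCStats.

```r
# bears rebuilt by hand from the listing above (assumption: same values as Bears.csv)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = c(rep("Bayfield", 5), rep("Ashland", 2), rep("Douglas", 5)),
  stringsAsFactors = FALSE
)

# subset() stands in for filterD() here
bfash <- subset(bears, loc %in% c("Bayfield", "Ashland"))
gt100 <- subset(bears, weight.kg > 100)

# Which counties remain after filtering? Douglas should not appear.
unique(bfash$loc)

# How many bears from each remaining county?
table(bfash$loc)

# The smallest weight in gt100 should be greater than 100
min(gt100$weight.kg)

# Compare the number of rows before and after filtering
nrow(bears)
nrow(bfash)

# For larger data frames, str() gives a compact overview of the result
str(bfash)
```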
24.3 Selecting Individuals
In some instances, you may need to select or exclude an individual from a data frame. Positions within an object are identified within square brackets. As data frames are two-dimensional objects they are indexed by both a row and a column, in that order. For example, the item in the third row and second column of bears is selected below.
bears[3,2]
#R> [1] 90
An entire row or column may be selected by omitting the other dimension. For example, one could select the entire second column with bears[,2].[88] As a better example, the entire third row is selected below (note how the column designation was omitted).
bears[3,]
#R> length.cm weight.kg loc
#R> 3 139 90 Bayfield
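The relationship between selecting a column by position and selecting it by name can be checked directly. The sketch below assumes the hand-built bears data matches Bears.csv; the object names by_position, by_name, and by_dollar are made up for illustration.

```r
# bears rebuilt by hand from the listing above (assumption: same values as Bears.csv)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = c(rep("Bayfield", 5), rep("Ashland", 2), rep("Douglas", 5)),
  stringsAsFactors = FALSE
)

# Three ways to select the second column (the weights)
by_position <- bears[, 2]            # by column number
by_name     <- bears[, "weight.kg"]  # by column name inside the brackets
by_dollar   <- bears$weight.kg       # with the $ notation

# All three return the same vector of weights
identical(by_position, by_dollar)
identical(by_name, by_dollar)

# A single item can also be selected with a row number and a column name
bears[3, "weight.kg"]
```

Selecting by name is generally easier to read and keeps working if the order of the columns changes, which is one reason to prefer bears$weight.kg over bears[,2].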
Multiple rows are selected by combining row indices together with c(). For example, the third, fifth, and eighth rows are selected below (again, the column index is omitted).
bears[c(3,5,8),]
#R> length.cm weight.kg loc
#R> 3 139 90 Bayfield
#R> 5 149 85 Bayfield
#R> 8 150 85 Douglas
Finally, rows can be excluded by preceding the row indices with a negative sign. For example, the third, fifth, eighth, tenth, and twelfth rows are excluded below.
bears[-c(3,5,8,10,12),]
#R> length.cm weight.kg loc
#R> 1 139.0 110 Bayfield
#R> 2 138.0 60 Bayfield
#R> 4 120.5 60 Bayfield
#R> 6 141.0 100 Ashland
#R> 7 141.0 95 Ashland
#R> 9 166.0 155 Douglas
#R> 11 129.5 105 Douglas
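Selecting or excluding rows with square brackets only displays the result unless it is assigned to an object. The sketch below shows one way to keep the reduced data frame; it assumes the hand-built bears data matches Bears.csv, and the name bears_trim is made up for illustration.

```r
# bears rebuilt by hand from the listing above (assumption: same values as Bears.csv)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = c(rep("Bayfield", 5), rep("Ashland", 2), rep("Douglas", 5)),
  stringsAsFactors = FALSE
)

# Assign the result to a new object so that the reduced data frame is kept
bears_trim <- bears[-c(3, 5, 8, 10, 12), ]

nrow(bears)       # the original data frame is unchanged
nrow(bears_trim)  # five rows were excluded

# The remaining rows keep their original row labels
rownames(bears_trim)

# Optional: renumber the rows from 1 if the original labels are not needed
rownames(bears_trim) <- NULL
bears_trim
```

Keeping the original row labels can be useful for tracing a row back to the full data frame, so renumbering is a matter of preference.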
[80] filterD() requires the NCStats package.
[81] value should be a character, factor, or integer.
[82] value must be numeric.
[83] value must be numeric.
[84] value must be numeric.
[85] value must be numeric.
[86] Note that this “or” operator is a “vertical line” which is typed with the shift-backslash key.
[87] For larger data.frames you should check the structure (using str()) or the headtail() of the new data frame.
[88] But this is also the weight.kg variable and is better selected with bears$weight.kg.