# Module 24 Filtering Data in R

In Module 23, you learned how to retrieve data from the class webpage, enter your own data into a CSV file, load that data into R, and view it in R. In this module, you will learn how to filter a data frame, that is, to create smaller data frames that are subsets of the original. For example, you may want to create a data frame that contains just male bears from a data frame with both male and female bears, or a data frame that contains only sales during summer months from a data frame that contains all sales. Less often, you may wish to remove a particular individual from the data frame, perhaps because its data are considered to be erroneous.

This module also uses the bears data frame from Module 23, which is shown below for reference.

library(NCStats)
bears
#R>     length.cm weight.kg      loc
#R>  1      139.0       110 Bayfield
#R>  2      138.0        60 Bayfield
#R>  3      139.0        90 Bayfield
#R>  4      120.5        60 Bayfield
#R>  5      149.0        85 Bayfield
#R>  6      141.0       100  Ashland
#R>  7      141.0        95  Ashland
#R>  8      150.0        85  Douglas
#R>  9      166.0       155  Douglas
#R>  10     151.5       140  Douglas
#R>  11     129.5       105  Douglas
#R>  12     150.0       110  Douglas

## 24.1 Filtering a data frame

It is common to create a new data frame that contains only some of the individuals from an existing data frame. The process of creating the newer, smaller data frame is called filtering and is accomplished with filterD(). The filterD() function requires the original data frame as the first argument and a condition statement as the second argument. The condition statement is used to either include or exclude individuals from the original data frame. Condition statements consist of the name of a variable in the original data frame, a comparison operator, and a comparison value (Table 24.1). The results from filterD() should be assigned to an object, which is then the name of the new data frame.
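The "data frame first, condition second" pattern can also be sketched in base R for readers who do not have NCStats available. This is only a stand-in under stated assumptions, not the module's method: bears is re-created by hand from the printout above, base subset() takes the place of filterD(), and loc is stored as a factor purely to show why dropping unused levels after filtering (which filterD() is understood to do) can matter.

```r
# Re-create the bears data frame by hand from the values printed above
# (assumption: the module itself loads these data from a CSV file).
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = factor(rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5)))
)

# Base-R stand-in for filterD(bears, loc == "Bayfield"):
# original data frame first, condition statement second.
bf_raw <- subset(bears, loc == "Bayfield")

# subset() keeps all three county levels, even though only one is present.
levels(bf_raw$loc)

# droplevels() removes the levels that no longer occur in the data.
bf <- droplevels(bf_raw)
levels(bf$loc)
```

Leftover levels are harmless when viewing the data but can produce empty groups in later tables and plots, which is one reason to drop them right after filtering.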

Table 24.1: Comparison operators used in filterD() and their results. Note that var generically represents a variable in the original data frame and value is a generic value or level. Both var and value would be replaced with specific items (see examples in main text).
| Comparison Operator | Rows Returned from Original Data Frame |
|---|---|
| var==value | All rows where var IS equal to value |
| var!=value | All rows where var is NOT equal to value |
| var %in% c(value1,value2) | All rows where var IS IN (or one of the) vector of values |
| var>value | All rows where var is greater than value |
| var>=value | All rows where var is greater than or equal to value |
| var<value | All rows where var is less than value |
| var<=value | All rows where var is less than or equal to value |
| condition1,condition2 | All rows where BOTH conditions are true |
| condition1 \| condition2 | All rows where ONE or BOTH conditions are true |
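As a rough illustration of how the operators in Table 24.1 behave, the sketch below counts the rows that each kind of condition keeps. It rests on assumptions that are not part of the module: bears is rebuilt by hand from the printout, base subset() is used instead of filterD(), and because subset() takes a single condition, the comma form for "both conditions" is written with & instead.

```r
# Hand-built copy of bears, taken from the printout in this module.
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# Count the rows kept by each kind of condition statement.
kept <- c(
  equal     = nrow(subset(bears, loc == "Ashland")),
  not_equal = nrow(subset(bears, loc != "Ashland")),
  in_set    = nrow(subset(bears, loc %in% c("Ashland", "Douglas"))),
  greater   = nrow(subset(bears, weight.kg > 100)),
  at_least  = nrow(subset(bears, weight.kg >= 100)),
  less      = nrow(subset(bears, weight.kg < 85)),
  at_most   = nrow(subset(bears, weight.kg <= 85)),
  both      = nrow(subset(bears, loc == "Douglas" & weight.kg >= 150)),  # AND
  either    = nrow(subset(bears, loc == "Ashland" | weight.kg >= 150))   # OR
)
kept
```

The greater and at_least counts differ by one because a single bear weighs exactly 100 kg, which shows why the choice between > and >= deserves a moment of thought.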

The following are examples of new data frames created from bears. The name of the new data frame (i.e., object left of the assignment operator) can be any valid object name. As demonstrated below, the new data frame should be examined after each filtering to ensure that the data frame actually contains the items that you desire.

• Only individuals from Bayfield county.
bf <- filterD(bears,loc=="Bayfield")
bf
#R>    length.cm weight.kg      loc
#R>  1     139.0       110 Bayfield
#R>  2     138.0        60 Bayfield
#R>  3     139.0        90 Bayfield
#R>  4     120.5        60 Bayfield
#R>  5     149.0        85 Bayfield
• Individuals from either Bayfield or Ashland county.
bfash <- filterD(bears,loc %in% c("Bayfield","Ashland"))
bfash
#R>    length.cm weight.kg      loc
#R>  1     139.0       110 Bayfield
#R>  2     138.0        60 Bayfield
#R>  3     139.0        90 Bayfield
#R>  4     120.5        60 Bayfield
#R>  5     149.0        85 Bayfield
#R>  6     141.0       100  Ashland
#R>  7     141.0        95  Ashland
• Individuals NOT from Bayfield county.
bfnotbay <- filterD(bears,loc != "Bayfield")
bfnotbay
#R>    length.cm weight.kg     loc
#R>  1     141.0       100 Ashland
#R>  2     141.0        95 Ashland
#R>  3     150.0        85 Douglas
#R>  4     166.0       155 Douglas
#R>  5     151.5       140 Douglas
#R>  6     129.5       105 Douglas
#R>  7     150.0       110 Douglas
• Individuals with a weight greater than 100 kg.
gt100 <- filterD(bears,weight.kg>100)
gt100
#R>    length.cm weight.kg      loc
#R>  1     139.0       110 Bayfield
#R>  2     166.0       155  Douglas
#R>  3     151.5       140  Douglas
#R>  4     129.5       105  Douglas
#R>  5     150.0       110  Douglas
• Individuals from Douglas county that weighed at least 150 kg.
do150 <- filterD(bears,loc=="Douglas",weight.kg>=150)
do150
#R>    length.cm weight.kg     loc
#R>  1       166       155 Douglas

Examine the new data frame after filtering to ensure that it contains the data you intended.
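Printing the whole data frame works for a dozen bears, but a few summary checks scale better to larger data. The sketch below shows one possible set of checks. It is not taken from the module: bears is rebuilt by hand from the printout and base subset() stands in for filterD().

```r
# Hand-built copy of bears, taken from the printout in this module.
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# Filter on a categorical variable, then check which groups remain.
bfash <- subset(bears, loc %in% c("Bayfield", "Ashland"))
nrow(bfash)        # number of rows kept
table(bfash$loc)   # rows per county; Douglas should not appear

# Filter on a quantitative variable, then check the range of kept values.
gt100 <- subset(bears, weight.kg > 100)
range(gt100$weight.kg)   # the minimum should be greater than 100

# Stop with an error if any kept row violates the intended condition.
stopifnot(all(gt100$weight.kg > 100))
```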

## 24.2 Selecting Entire Variables

As noted in Section 23.2.1, whole variables can be selected from a data frame with the $ notation. Recall that the $ separates the name of the data frame from the name of the variable within that data frame.

• The weights of all bears (i.e., in the bears data frame).
bears$weight.kg
#R>  [1] 110  60  90  60  85 100  95  85 155 140 105 110
• The weights of all bears in Ashland and Bayfield counties (i.e., in bfash from above).
bfash$weight.kg
#R>  [1] 110  60  90  60  85 100  95
• The location of all bears in just Bayfield county (i.e., in bf from above). [You might use something like this to check your filtering.]