Module 24 Filtering Data in R

In Module 23, you learned how to retrieve data from the class webpage, enter your own data into a CSV file, load that data into R, and view that data in R. In this module, we will learn how to filter a data frame; that is, how to create a smaller data frame that contains only some of the individuals from the original. For example, you may want to create a data frame that contains just male bears from a data frame with both male and female bears, or a data frame that contains only sales during summer months from a data frame that contains all sales. Less often, you may wish to remove a particular individual from the data frame, for example because its data are considered erroneous.

This module also uses the bears data frame from Module 23, which is shown below for reference.

library(NCStats)
bears <- read.csv("Bears.csv")
bears
#R>     length.cm weight.kg      loc
#R>  1      139.0       110 Bayfield
#R>  2      138.0        60 Bayfield
#R>  3      139.0        90 Bayfield
#R>  4      120.5        60 Bayfield
#R>  5      149.0        85 Bayfield
#R>  6      141.0       100  Ashland
#R>  7      141.0        95  Ashland
#R>  8      150.0        85  Douglas
#R>  9      166.0       155  Douglas
#R>  10     151.5       140  Douglas
#R>  11     129.5       105  Douglas
#R>  12     150.0       110  Douglas

 

24.1 Filtering a Data Frame

It is common to create a new data frame that contains only some of the individuals from an existing data frame. The process of creating the newer, smaller data frame is called filtering and is accomplished with filterD().[1] The filterD() function requires the original data frame as the first argument and a condition statement as the second argument. The condition statement determines which individuals from the original data frame are included in the new data frame. A condition statement consists of the name of a variable in the original data frame, a comparison operator, and a comparison value (Table 24.1). The result from filterD() should be assigned to an object, whose name then becomes the name of the new data frame.

 

Table 24.1: Comparison operators used in filterD() and their results. Note that var generically represents a variable in the original data frame and value is a generic value or level. Both var and value would be replaced with specific items (see examples in main text).
Comparison Operator          Rows Returned from Original Data Frame
var==value                   All rows where var IS equal to value
var!=value                   All rows where var is NOT equal to value
var %in% c(value1,value2)    All rows where var IS IN (or one of the) vector of values [2]
var>value                    All rows where var is greater than value [3]
var>=value                   All rows where var is greater than or equal to value [4]
var<value                    All rows where var is less than value [5]
var<=value                   All rows where var is less than or equal to value [6]
condition1,condition2        All rows where BOTH conditions are true
condition1 | condition2      All rows where ONE or BOTH conditions are true [7]
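
Each condition statement in Table 24.1 evaluates to one TRUE or FALSE per row, and only the TRUE rows are kept. The sketch below illustrates this idea with base R rather than filterD(), so that it runs without NCStats. The bears data frame is re-created by hand from the listing above, and the object names heavy and bay_or_ash are made up for illustration. Note that the comma form for "BOTH" belongs to filterD(); in base R the two conditions are joined with & instead.

```r
# Re-create bears by hand (values copied from the listing above)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# A comparison gives one logical value (TRUE or FALSE) per row
heavy <- bears$weight.kg > 100
heavy
sum(heavy)               # number of rows that satisfy the condition

# %in% asks whether each value is one of several values
bay_or_ash <- bears$loc %in% c("Bayfield", "Ashland")
sum(bay_or_ash)

# Conditions combine with & (both true) and | (one or both true)
sum(heavy & bay_or_ash)
sum(heavy | bay_or_ash)
```

Counting the TRUE values with sum() before filtering is a quick way to anticipate how many rows the new data frame should have.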

 

The following are examples of new data frames created from bears. The name of the new data frame (i.e., the object to the left of the assignment operator) can be any valid object name. As demonstrated below, the new data frame should be examined after each filter to ensure that it actually contains the items that you desire.[8]

  • Only individuals from Bayfield county.
bf <- filterD(bears,loc=="Bayfield")
bf
#R>    length.cm weight.kg      loc
#R>  1     139.0       110 Bayfield
#R>  2     138.0        60 Bayfield
#R>  3     139.0        90 Bayfield
#R>  4     120.5        60 Bayfield
#R>  5     149.0        85 Bayfield
  • Individuals from either Bayfield or Ashland county.
bfash <- filterD(bears,loc %in% c("Bayfield","Ashland"))
bfash
#R>    length.cm weight.kg      loc
#R>  1     139.0       110 Bayfield
#R>  2     138.0        60 Bayfield
#R>  3     139.0        90 Bayfield
#R>  4     120.5        60 Bayfield
#R>  5     149.0        85 Bayfield
#R>  6     141.0       100  Ashland
#R>  7     141.0        95  Ashland
  • Individuals NOT from Bayfield county.
bfnotbay <- filterD(bears,loc != "Bayfield")
bfnotbay
#R>    length.cm weight.kg     loc
#R>  1     141.0       100 Ashland
#R>  2     141.0        95 Ashland
#R>  3     150.0        85 Douglas
#R>  4     166.0       155 Douglas
#R>  5     151.5       140 Douglas
#R>  6     129.5       105 Douglas
#R>  7     150.0       110 Douglas
  • Individuals with a weight greater than 100 kg.
gt100 <- filterD(bears,weight.kg>100)
gt100
#R>    length.cm weight.kg      loc
#R>  1     139.0       110 Bayfield
#R>  2     166.0       155  Douglas
#R>  3     151.5       140  Douglas
#R>  4     129.5       105  Douglas
#R>  5     150.0       110  Douglas
  • Individuals from Douglas County that weighed at least 150 kg.
do150 <- filterD(bears,loc=="Douglas",weight.kg>=150)
do150
#R>    length.cm weight.kg     loc
#R>  1       166       155 Douglas

Examine the new data frame after filtering to ensure that it contains the data you intended.
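
The examples above do not use the "or" operator from Table 24.1. The following is a minimal sketch of an "or" condition, written with base R's subset() so that it runs without NCStats; the bears data frame is re-created by hand, and the object name ash_or_heavy is made up for illustration. Going by Table 24.1, the same condition would presumably be given to filterD() as the second argument.

```r
# Re-create bears by hand (values copied from the listing above)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# Keep rows that are from Ashland OR weigh at least 140 kg (or both)
ash_or_heavy <- subset(bears, loc == "Ashland" | weight.kg >= 140)
ash_or_heavy             # rows 6, 7, 9, and 10 of bears

# Examine the result, as recommended above
nrow(ash_or_heavy)
```

Unlike the filterD() output shown above, subset() keeps the original row numbers as row names, which can help when tracing rows back to bears.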

24.2 Selecting Entire Variables

As noted in Section 23.2.1, whole variables can be selected from a data frame with the $ notation. Recall that the $ separates the name of the data frame from the name of the variable within that data frame.

  • The weights of all bears (i.e., in the bears data frame).
bears$weight.kg
#R>   [1] 110  60  90  60  85 100  95  85 155 140 105 110
  • The weights of all bears in Ashland and Bayfield counties (i.e., in bfash from above).
bfash$weight.kg
#R>  [1] 110  60  90  60  85 100  95
  • The location of all bears in just Bayfield county (i.e., in bf from above). [You might use something like this to check your filtering.]
bf$loc
#R>  [1] "Bayfield" "Bayfield" "Bayfield" "Bayfield" "Bayfield"
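
Printing a whole variable works for a handful of rows but becomes unwieldy for larger data frames. The sketch below shows a few base R summaries that could serve the same checking purpose. It assumes bf and gt100 hold the same rows as in Section 24.1; here they are re-created with subset() so that the example runs without NCStats.

```r
# Re-create bears by hand (values copied from the listing above)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# Stand-ins for the filtered data frames from Section 24.1
bf <- subset(bears, loc == "Bayfield")
gt100 <- subset(bears, weight.kg > 100)

unique(bf$loc)           # only "Bayfield" should appear
table(bears$loc)         # counts per county in the original data frame
nrow(bf)                 # should match the Bayfield count from table()

range(gt100$weight.kg)   # the minimum should be greater than 100
str(gt100)               # compact view of the structure
```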

 

24.3 Selecting Individuals

In some instances, you may need to select or exclude an individual from a data frame. Positions within an object are identified within square brackets. Because data frames are two-dimensional objects, they are indexed by both a row and a column, in that order. For example, the item in the third row and second column of bears is selected below.

bears[3,2]
#R>  [1] 90

An entire row or column may be selected by omitting the other dimension. For example, one could select the entire second column with bears[,2].[9] As a more useful example, the entire third row is selected below (note how the column designation was omitted).

bears[3,]
#R>    length.cm weight.kg      loc
#R>  3       139        90 Bayfield
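
Column positions can be hard to remember, and base R also accepts column names in quotes in the column slot. The following is a small sketch of that alternative; the bears data frame is re-created by hand so that the example is self-contained.

```r
# Re-create bears by hand (values copied from the listing above)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# Same item as bears[3,2], but the column is given by name
bears[3, "weight.kg"]

# Selecting the second column by position matches the $ notation
identical(bears[, 2], bears$weight.kg)

# Several named columns for one row
bears[3, c("length.cm", "loc")]
```

Naming the column makes the code easier to read and keeps it correct if the order of the columns ever changes.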

Multiple rows are selected by combining row indices with c(). For example, the third, fifth, and eighth rows are selected below (again, the column index is omitted).

bears[c(3,5,8),]
#R>    length.cm weight.kg      loc
#R>  3       139        90 Bayfield
#R>  5       149        85 Bayfield
#R>  8       150        85  Douglas

Finally, rows can be excluded by preceding the row indices with a negative sign. For example, the third, fifth, eighth, tenth, and twelfth rows are excluded below.

bears[-c(3,5,8,10,12),]
#R>     length.cm weight.kg      loc
#R>  1      139.0       110 Bayfield
#R>  2      138.0        60 Bayfield
#R>  4      120.5        60 Bayfield
#R>  6      141.0       100  Ashland
#R>  7      141.0        95  Ashland
#R>  9      166.0       155  Douglas
#R>  11     129.5       105  Douglas
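
When the row number of an individual to exclude is not known in advance, it can be looked up from the individual's values. The sketch below uses base R's which() for this. It is only an illustration: the choice of the 120.5 cm, 60 kg bear as the "erroneous" record is hypothetical, as are the object names bad and cleaned.

```r
# Re-create bears by hand (values copied from the listing above)
bears <- data.frame(
  length.cm = c(139, 138, 139, 120.5, 149, 141, 141, 150, 166, 151.5, 129.5, 150),
  weight.kg = c(110, 60, 90, 60, 85, 100, 95, 85, 155, 140, 105, 110),
  loc = rep(c("Bayfield", "Ashland", "Douglas"), times = c(5, 2, 5))
)

# Find the row number of the (hypothetically) erroneous record
bad <- which(bears$length.cm == 120.5 & bears$weight.kg == 60)
bad

# Guard against an empty match: bears[-integer(0), ] returns ZERO rows,
# not the full data frame
if (length(bad) > 0) {
  cleaned <- bears[-bad, ]
} else {
  cleaned <- bears
}
nrow(cleaned)

# Optionally renumber the rows 1, 2, 3, ...
rownames(cleaned) <- NULL
```

The guard matters because a negative index built from an empty match silently drops every row rather than none.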

  1. filterD() requires the NCStats package.

  2. value should be a character, factor, or integer.

  3. value must be numeric.

  4. value must be numeric.

  5. value must be numeric.

  6. value must be numeric.

  7. Note that this “or” operator is a “vertical line”, which is typed with the shift-backslash key.

  8. For larger data frames, you should check the structure (using str()) or the headtail() of the new data frame.

  9. But this is also the weight.kg variable and is better selected with bears$weight.kg.