MTH107 FAQ

In some instances you may want to create a smaller data frame that contains only those individuals that belong to two (or more) of the groups. The smaller data frame can be constructed using filterD() from FSA (which is loaded with NCStats) with the %in% operator. For example, the following data frame contains measurements on sepals and petals from three species of iris.

> library(NCStats)
> df <- read.csv("data/Iris.csv")
> str(df)

'data.frame':   150 obs. of  5 variables:
 $ seplen : int  50 46 46 51 55 48 52 49 44 50 ...
 $ sepwid : int  33 34 36 33 35 31 34 36 32 35 ...
 $ petlen : int  14 14 10 17 13 16 14 14 13 16 ...
 $ petwid : int  2 3 2 5 2 2 2 1 2 6 ...
 $ species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(df$species)

[1] "setosa"     "versicolor" "virginica"

One can select two (or more) groups with filterD(), where the first argument is the original data frame and the second argument is a condition that uses the grouping (or factor) variable followed by %in% and the names of the groups in quotes within c(). The order of the names within c() does not matter. For example, one can select the versicolor and virginica species with the following code.

> df2 <- filterD(df,species %in% c("virginica","versicolor"))
> levels(df2$species)

[1] "versicolor" "virginica"
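
To see what the %in% condition is doing, and as a fallback in case filterD() is not available in your installed version of FSA, the same selection can be sketched in base R. This sketch assumes that R's built-in iris data frame stands in for data/Iris.csv, so the grouping variable is named Species rather than species. Note that subset() alone does not remove the unused "setosa" level; droplevels() does that as a separate step.

```r
# Assumption: the built-in iris data stands in for data/Iris.csv,
# so the factor is named Species rather than species.
keep <- c("virginica", "versicolor")

# %in% returns one TRUE/FALSE per row (TRUE if that row's species is in keep)
in_keep <- iris$Species %in% keep
table(in_keep)                       # 50 FALSE, 100 TRUE

# Base-R filtering keeps the now-empty "setosa" level ...
iris2 <- subset(iris, Species %in% keep)
levels(iris2$Species)                # "setosa" "versicolor" "virginica"

# ... until it is dropped explicitly
iris2 <- droplevels(iris2)
levels(iris2$Species)                # "versicolor" "virginica"
```

Dropping the unused level matters for later summaries and plots, which would otherwise still show an empty "setosa" group.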

In all ensuing analyses, use whichever data frame contains the individuals of interest. For example, one could do the following to compute summary statistics of petal length for only the versicolor and virginica irises.

> Summarize(~petlen,data=df2,digits=1)

     n   mean     sd    min     Q1 median     Q3    max 
 100.0   49.1    8.3   30.0   43.8   49.0   55.2   69.0

In contrast, the following gives the summary statistics of petal length for ALL three species (i.e., it uses the original data frame).

> Summarize(~petlen,data=df,digits=1)

     n   mean     sd    min     Q1 median     Q3    max 
 150.0   37.6   17.7   10.0   16.0   43.5   51.0   69.0
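
The effect of choosing one data frame over the other can also be checked side by side. The following is a minimal base-R sketch, again assuming the built-in iris data frame in place of data/Iris.csv (its measurements are recorded on a different scale, so the numbers will not match the output above digit for digit). The helper summ() is a hypothetical name used only for this illustration; it is not part of FSA or NCStats.

```r
# Assumption: built-in iris stands in for data/Iris.csv
iris2 <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# Hypothetical helper: sample size, mean, and standard deviation of a vector
summ <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))

res_two <- summ(iris2$Petal.Length)  # versicolor and virginica only
res_all <- summ(iris$Petal.Length)   # all three species

# One row per data frame, so the difference is easy to see
round(rbind(two_species = res_two, all_species = res_all), 2)
```

The two rows differ in n (100 versus 150) and in the mean and standard deviation, which is why it is important to pass the intended data frame to data=.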

This type of filtering or subsetting is also described in this R Tutorial video.