MTH107 FAQ

In some instances you may want to create a smaller data frame that contains only those individuals that belong to one group. The smaller data frame can be constructed using filterD() from FSA (which is loaded with NCStats) with the == operator. For example, the following data frame contains measurements on petals and sepals from three species of iris.

> library(NCStats)
> df <- read.csv("data/Iris.csv",stringsAsFactors=TRUE)  # R >= 4.0 reads text as character unless told otherwise
> str(df)

'data.frame':   150 obs. of  5 variables:
 $ seplen : int  50 46 46 51 55 48 52 49 44 50 ...
 $ sepwid : int  33 34 36 33 35 31 34 36 32 35 ...
 $ petlen : int  14 14 10 17 13 16 14 14 13 16 ...
 $ petwid : int  2 3 2 5 2 2 2 1 2 6 ...
 $ species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(df$species)

[1] "setosa"     "versicolor" "virginica"

One can select just one group with filterD(), where the first argument is the original data frame and the second argument is a condition statement that consists of the grouping (or factor) variable, followed by == (note the two equals signs), followed by the name of a single group in quotes. For example, one can select the versicolor species from the species variable in the df data frame with the following code.

> df2 <- filterD(df,species=="versicolor")
> levels(df2$species)

[1] "versicolor"
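The reason for using filterD() rather than a plain subset is that the unused group names are dropped from the factor, as the levels() result above shows. A minimal base R sketch of the same idea follows. It assumes R's built-in iris data as a stand-in for data/Iris.csv (so the variable names, such as Species, and the units differ from those used above), and the object names tmp and vers are only illustrative.

```r
# Built-in iris data stands in for data/Iris.csv (an assumption; its
# variables are named Species, Petal.Length, etc.)
data(iris)

# Plain subsetting keeps only the versicolor rows ...
tmp <- subset(iris, Species == "versicolor")
nrow(tmp)
levels(tmp$Species)   # ... but all three level names are still listed

# droplevels() removes the levels that no longer have any rows
vers <- droplevels(tmp)
levels(vers$Species)  # only "versicolor" remains
```

Leftover levels matter because later summaries or plots made by group may otherwise show empty groups.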

Similarly, one could select just the virginica species with the following code.

> df3 <- filterD(df,species=="virginica")
> levels(df3$species)

[1] "virginica"
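It may help to see what the condition statement produces on its own. The sketch below, again assuming the built-in iris data rather than data/Iris.csv, shows that == returns one TRUE or FALSE per row, and that a group name typed with different capitalization matches no rows. The names cond and n_bad are only illustrative.

```r
# Built-in iris data used as a stand-in (an assumption)
data(iris)

# The condition is evaluated row by row and gives TRUE or FALSE
cond <- iris$Species == "virginica"
head(cond)   # the first rows of iris are setosa, so these are FALSE
sum(cond)    # number of rows that meet the condition

# R is case-sensitive, so a mis-capitalized group name matches nothing
n_bad <- sum(iris$Species == "Virginica")
n_bad
```

If a filtered data frame unexpectedly has zero rows, the spelling of the group name inside the quotes is a good first thing to check against levels().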

One would then use whichever data frame contains the group of interest in all ensuing analyses. For example, one could do the following to compute summary statistics of petal length for the versicolor irises.

> Summarize(~petlen,data=df2,digits=1)

     n   mean     sd    min     Q1 median     Q3    max 
  50.0   42.6    4.7   30.0   40.0   43.5   46.0   51.0
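For readers who want to check such a summary by hand, the same kinds of statistics can be computed with base R functions. This is only a sketch under the assumption that the built-in iris data stands in for data/Iris.csv, so the numbers will be on a different scale from those shown above. The names vers and x are illustrative.

```r
# Built-in iris data used as a stand-in (an assumption)
data(iris)

# Filter to one species and drop the unused levels
vers <- droplevels(subset(iris, Species == "versicolor"))

# Summary statistics for petal length in the filtered data frame
x <- vers$Petal.Length
c(n = length(x), mean = mean(x), sd = sd(x))
quantile(x)   # min, Q1, median, Q3, max
```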

However, the following are the summary statistics of petal length for ALL three species (i.e., using the original data frame).

> Summarize(~petlen,data=df,digits=1)

     n   mean     sd    min     Q1 median     Q3    max 
 150.0   37.6   17.7   10.0   16.0   43.5   51.0   69.0

And, for comparison purposes, the following summarizes petal length separately for each species using the original data frame.

> Summarize(petlen~species,data=df,digits=1)

     species  n mean  sd min Q1 median   Q3 max
1     setosa 50 14.6 1.7  10 14   15.0 15.8  19
2 versicolor 50 42.6 4.7  30 40   43.5 46.0  51
3  virginica 50 55.5 5.5  45 51   55.5 58.8  69
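The versicolor row of the by-species table matches the summary computed from the filtered data frame, which is a useful check that the filter selected the intended group. A base R sketch of that check is below, assuming the built-in iris data in place of data/Iris.csv. The names grp_means and vers are illustrative.

```r
# Built-in iris data used as a stand-in (an assumption)
data(iris)

# One mean per species, computed from the full (unfiltered) data frame
grp_means <- tapply(iris$Petal.Length, iris$Species, mean)
round(grp_means, 2)

# The same mean obtained by filtering to one species first
vers <- droplevels(subset(iris, Species == "versicolor"))
mean(vers$Petal.Length)

# The two approaches agree for the versicolor group
all.equal(unname(grp_means["versicolor"]), mean(vers$Petal.Length))
```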

This type of filtering or subsetting is also described in this R Tutorial video.