MTH107 FAQ

Base R provides subset() for extracting a subset of the rows of a data.frame. For example, the following code reads the Iris.csv file and then constructs a data.frame that contains only a single species.

> df <- read.csv("data/Iris.csv",stringsAsFactors=TRUE)  # R >= 4.0 reads text as character unless told otherwise
> str(df)

'data.frame':   150 obs. of  5 variables:
 $ seplen : int  50 46 46 51 55 48 52 49 44 50 ...
 $ sepwid : int  33 34 36 33 35 31 34 36 32 35 ...
 $ petlen : int  14 14 10 17 13 16 14 14 13 16 ...
 $ petwid : int  2 3 2 5 2 2 2 1 2 6 ...
 $ species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(df$species)

[1] "setosa"     "versicolor" "virginica"

> tmp <- subset(df,species=="versicolor")

A quick table of results shows that only one species remains in the data.frame.

> xtabs(~species,data=tmp)

species
    setosa versicolor  virginica 
         0         50          0

But, annoyingly, the species factor has retained the levels for the other species that were present in the original data.frame, even though no rows for those species remain. This unwanted behavior can be corrected by using filterD() from FSA (which is loaded with NCStats) instead of subset().

> library(NCStats)
> tmp <- filterD(df,species=="versicolor")
> xtabs(~species,data=tmp)

species
versicolor 
        50
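
If NCStats or FSA is not available, much the same result can likely be had in base R by following subset() with droplevels(). The sketch below is only an illustration under some assumptions: it uses R's built-in iris data.frame as a stand-in for data/Iris.csv (so the variable is named Species rather than species), and the object names df2 and tmp2 are made up for this example.

```r
# Built-in iris stands in for data/Iris.csv (assumption); its factor is `Species`
df2 <- iris

# subset() keeps only the versicolor rows but leaves all three levels attached
tmp2 <- subset(df2, Species == "versicolor")
nlevels(tmp2$Species)

# droplevels() removes factor levels that no longer occur in the data.frame
tmp2 <- droplevels(tmp2)
levels(tmp2$Species)
xtabs(~Species, data = tmp2)
```

Calling droplevels() on the whole data.frame drops unused levels from every factor in it, so it should be applied to a single variable (e.g., droplevels(tmp2$Species)) if other factors need to keep their empty levels.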