Base R provides the subset() function to construct subsets of a data.frame. For example, the following code constructs a data.frame from the Iris.csv file that contains only a single species.
> df <- read.csv("data/Iris.csv",stringsAsFactors=TRUE)  # needed in R >= 4.0 so species is read as a factor
> str(df)
'data.frame': 150 obs. of 5 variables:
$ seplen : int 50 46 46 51 55 48 52 49 44 50 ...
$ sepwid : int 33 34 36 33 35 31 34 36 32 35 ...
$ petlen : int 14 14 10 17 13 16 14 14 13 16 ...
$ petwid : int 2 3 2 5 2 2 2 1 2 6 ...
$ species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> levels(df$species)
[1] "setosa" "versicolor" "virginica"
> tmp <- subset(df,species=="versicolor")
A quick table of results shows that only one species remains in the data.frame.
> xtabs(~species,data=tmp)
species
    setosa versicolor  virginica
         0         50          0
However, the species variable has retained the levels for the other species that were present in the original data.frame, which is why they appear in the table with counts of zero. This unwanted behavior can be avoided by using filterD() from FSA (which is loaded with NCStats) instead of subset().
> library(NCStats)
> tmp <- filterD(df,species=="versicolor")
> xtabs(~species,data=tmp)
species
versicolor
        50
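The same cleanup can also be sketched in base R alone by following subset() with droplevels(), which removes factor levels that have no observations. The example below is only an illustration under stated assumptions: it uses R's built-in iris data as a stand-in for data/Iris.csv, so the factor column is named Species rather than species, and the object names df2 and tmp2 are made up for this sketch.

```r
# Built-in iris data stands in for data/Iris.csv (assumption);
# its factor column is named Species, not species.
df2 <- iris

# subset() keeps only the versicolor rows but retains all three levels
tmp2 <- subset(df2, Species == "versicolor")
levels(tmp2$Species)   # "setosa" "versicolor" "virginica"

# droplevels() removes the levels that no longer have observations
tmp2 <- droplevels(tmp2)
levels(tmp2$Species)   # "versicolor"

# The table now lists only the remaining species
xtabs(~Species, data = tmp2)
```

Because droplevels() acts on every factor in the data.frame, this approach needs no packages beyond base R, though it takes one more step than a single call to filterD().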