Base R provides subset() for constructing subsets of a data.frame. For example, the following code reads the Iris.csv file and constructs a data.frame that contains only two of the three species.
> df <- read.csv("data/Iris.csv",stringsAsFactors=TRUE)
> str(df)
'data.frame': 150 obs. of 5 variables:
$ seplen : int 50 46 46 51 55 48 52 49 44 50 ...
$ sepwid : int 33 34 36 33 35 31 34 36 32 35 ...
$ petlen : int 14 14 10 17 13 16 14 14 13 16 ...
$ petwid : int 2 3 2 5 2 2 2 1 2 6 ...
$ species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> levels(df$species)
[1] "setosa" "versicolor" "virginica"
> tmp <- subset(df,species!="versicolor")
A quick table of results shows that no versicolor observations remain in the data.frame, so only two species are left.
> ( freq <- xtabs(~species,data=tmp) )
species
setosa versicolor virginica
50 0 50
But, annoyingly, the species factor has retained the level for the species that was removed, so versicolor still appears in the table with a count of zero. This also results in a bar chart with an empty bar.
> barplot(freq,xlab="Species",ylab="Frequency")
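The empty level persists because a factor stores its set of levels separately from its values, and subsetting rows does not touch that set. A minimal sketch of this, assuming R's built-in iris data as a stand-in for data/Iris.csv (note that its column is named Species, with a capital S, and its measurements are not in the same units as the file above):

```r
# Built-in iris is used here as a stand-in for data/Iris.csv (assumption).
tmp <- subset(iris, Species != "versicolor")

# No versicolor rows remain after subsetting ...
n_versicolor <- sum(tmp$Species == "versicolor")
print(n_versicolor)

# ... but the versicolor level is still attached to the factor.
kept_levels <- levels(tmp$Species)
print(kept_levels)
```

Any function that tabulates or plots by level, such as xtabs() or barplot(), will therefore still reserve a slot for the removed species.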
This unwanted behavior is corrected by using filterD() from FSA (which is loaded with NCStats) instead of subset().
> library(NCStats)
> tmp <- filterD(df,species!="versicolor")
> ( freq <- xtabs(~species,data=tmp) )
species
setosa virginica
50 50
> barplot(freq,xlab="Species",ylab="Frequency")
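If FSA or NCStats is not installed, a similar result can likely be obtained in base R by wrapping the subset in droplevels(), which removes unused factor levels from a data.frame. The following is only a sketch of that alternative, not the method used above, and it again assumes the built-in iris data (column Species) in place of data/Iris.csv:

```r
# Built-in iris is used here as a stand-in for data/Iris.csv (assumption).
# droplevels() removes factor levels that no longer occur in the data.
tmp <- droplevels(subset(iris, Species != "versicolor"))

# The frequency table now has columns only for the remaining species.
freq <- xtabs(~Species, data = tmp)
print(freq)
```

The resulting freq can be passed to barplot() exactly as before, and the chart should no longer contain an empty bar.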