In some instances you may want to create a smaller data frame that contains only those individuals that belong to one group. The smaller data frame can be constructed using filterD() from FSA (which is loaded with NCStats) with the == operator. For example, the following data frame contains measurements on petals and sepals from three species of iris.
> library(NCStats)
> df <- read.csv("data/Iris.csv")
> str(df)
'data.frame': 150 obs. of 5 variables:
$ seplen : int 50 46 46 51 55 48 52 49 44 50 ...
$ sepwid : int 33 34 36 33 35 31 34 36 32 35 ...
$ petlen : int 14 14 10 17 13 16 14 14 13 16 ...
$ petwid : int 2 3 2 5 2 2 2 1 2 6 ...
$ species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> levels(df$species)
[1] "setosa" "versicolor" "virginica"
One can select just one group with filterD(), where the first argument is the original data frame and the second argument is a condition that uses the grouping (or factor) variable followed by == (note the two equals signs) and the name of a single group in quotes. For example, one can select the versicolor species from the species variable in the df data frame with the following code.
> df2 <- filterD(df,species=="versicolor")
> levels(df2$species)
[1] "versicolor"
Similarly, one could select just the virginica species with the following code.
> df3 <- filterD(df,species=="virginica")
> levels(df3$species)
[1] "virginica"
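If filterD() is not available in your installed version of FSA, the same kind of filtering can be sketched in base R. The example below is only an illustration under stated assumptions: it uses R's built-in iris data frame as a stand-in for data/Iris.csv (so the grouping variable is named Species rather than species), and the object name vers is hypothetical. It also shows why the unused levels need to be dropped after filtering.

```r
# Built-in iris data (assumption: stands in for data/Iris.csv; its
# grouping variable is named Species, not species).
data(iris)

# Keep only the versicolor rows. The factor still remembers all three
# levels, even though two of them no longer occur in the data.
vers <- subset(iris, Species == "versicolor")
levels(vers$Species)

# droplevels() removes the levels that are no longer used, which is
# the extra step that filterD() performs after filtering.
vers <- droplevels(vers)
levels(vers$Species)
nrow(vers)
```

Dropping the unused levels matters because later summaries and plots would otherwise still show empty groups for the species that were filtered out.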
One would then use whichever data frame contains the group of interest in all ensuing analyses. For example, one could do the following to compute summary statistics of petal length for the versicolor irises.
> Summarize(~petlen,data=df2,digits=1)
     n   mean     sd    min     Q1 median     Q3    max
  50.0   42.6    4.7   30.0   40.0   43.5   46.0   51.0
However, the following are the summary statistics of petal length for all three species combined (i.e., computed from the original data frame).
> Summarize(~petlen,data=df,digits=1)
     n   mean     sd    min     Q1 median     Q3    max
 150.0   37.6   17.7   10.0   16.0   43.5   51.0   69.0
For comparison purposes, the following computes the summary statistics of petal length separately for each species.
> Summarize(petlen~species,data=df,digits=1)
     species  n mean  sd min Q1 median   Q3 max
1     setosa 50 14.6 1.7  10 14   15.0 15.8  19
2 versicolor 50 42.6 4.7  30 40   43.5 46.0  51
3  virginica 50 55.5 5.5  45 51   55.5 58.8  69
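The by-group comparison above can also be approximated with base R alone. This is a minimal sketch, again assuming the built-in iris data frame as a stand-in for data/Iris.csv; its measurements are recorded in centimeters rather than the integer units of the file above, so the values differ by a factor of ten. The object name stats is hypothetical.

```r
# Built-in iris data (assumption: stands in for data/Iris.csv).
data(iris)

# tapply() applies a function to petal length separately within each
# level of the grouping variable.
stats <- cbind(
  n    = tapply(iris$Petal.Length, iris$Species, length),
  mean = tapply(iris$Petal.Length, iris$Species, mean),
  sd   = tapply(iris$Petal.Length, iris$Species, sd)
)

# One row per species, rounded for display.
round(stats, 1)
```

Because the groups are computed from the original data frame, no filtering is needed when the goal is only to compare the groups side by side.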
This type of filtering or subsetting is also described in this R Tutorial video.