How to convert seemingly quantitative variables to categorical variables and how to reorder the levels of categorical variables.
Two common “issues” you will run into in this course are the need to turn a variable that R sees as quantitative into a categorical (or factor) variable and the need to rearrange the order of levels within a categorical (or factor) variable. Both cases are described below.
Example data used here is from Mirex.csv [data, meta], which is loaded below.
Mirex <- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/Mirex.csv")
Change a variable to a factor with factor(). For example,
Mirex$year <- factor(Mirex$year)
At times a variable that represents groups (i.e., a categorical variable) will be entered with numeric values. For example, Mirex contains year, which records the year the fish was captured. In this example, there are relatively few years, the years are not contiguous, and hypothesis tests will be used to determine if the mean response differs among years. Thus, year contains “groups” and should be treated as a categorical variable.
R, however, treats year as if it is an integer (int), or quantitative, variable because it simply “sees” numbers.
str(Mirex)
#R> 'data.frame': 122 obs. of 4 variables:
#R> $ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
#R> $ weight : num 0.41 0.45 1.04 1.09 1.24 1.25 1.3 1.34 1.37 1.49 ...
#R> $ mirex : num 0.16 0.19 0.19 0.1 0.13 0.19 0.28 0.16 0.17 0.2 ...
#R> $ species: chr "chinook" "chinook" "chinook" "coho" ...
year is forced to be a factor that defines groupings with factor(). When simply converting to a factor, factor() only requires the variable (in dataframe$var format) as an argument. The result should be saved to a variable in the data frame. For example, year in Mirex is replaced with a factored version below.
Mirex$year <- factor(Mirex$year)
year is now a factor variable.
str(Mirex)
#R> 'data.frame': 122 obs. of 4 variables:
#R> $ year : Factor w/ 6 levels "1977","1982",..: 1 1 1 1 1 1 1 1 1 1 ...
#R> $ weight : num 0.41 0.45 1.04 1.09 1.24 1.25 1.3 1.34 1.37 1.49 ...
#R> $ mirex : num 0.16 0.19 0.19 0.1 0.13 0.19 0.28 0.16 0.17 0.2 ...
#R> $ species: chr "chinook" "chinook" "chinook" "coho" ...
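The effect of factor() can also be checked on a small vector. The sketch below uses a made-up vector, yr, as a hypothetical stand-in for a year column (it is not part of Mirex), and shows that the levels of the new factor are the sorted unique values of the original variable.

```r
# Hypothetical stand-in for a year column stored as integers (not the Mirex data)
yr <- c(1977L, 1977L, 1982L, 1986L, 1982L)
class(yr)        # "integer", so R would treat yr as quantitative

# Convert to a factor so that each distinct year defines a group
yr_f <- factor(yr)
class(yr_f)      # "factor"
levels(yr_f)     # "1977" "1982" "1986"

# Count the observations in each group
table(yr_f)
```

Note that the levels are stored as text labels, so arithmetic such as mean() no longer makes sense on the factored version.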
As a side note, the Mirex structure shows that species is a character (chr) variable rather than a factor. In analyses in this course, character variables will ultimately be treated as factor variables, so there is no need to explicitly convert them to factors.
Change the order of levels by setting the level order in levels= of factor().
R orders the levels of categorical variables alphabetically unless a specific order was set. For example, the levels for species are
unique(Mirex$species)
#R> [1] "chinook" "coho"
Here R will treat chinook as the “first level” because it alphabetically precedes coho. This default order is evident in tables and figures; e.g.,
xtabs(~species,data=Mirex)
#R> species
#R> chinook coho
#R> 67 55
In some analyses, groups will need to be in a different order. The order of levels is controlled by setting the specific order with levels= in factor(). For example, the order of the levels in species is changed below.
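A minimal sketch of this change follows, assuming the desired order is coho followed by chinook (the order that appears in the table that follows). The stand-in data frame is an assumption used only so that the code runs on its own; it is skipped if Mirex was already loaded as above.

```r
# Assumption: a stand-in with the same species counts, built only if
# Mirex has not already been loaded with read.csv()
if (!exists("Mirex")) {
  Mirex <- data.frame(species=rep(c("chinook","coho"),times=c(67,55)))
}

# Replace species with a factor whose levels are in the desired order
Mirex$species <- factor(Mirex$species,levels=c("coho","chinook"))
levels(Mirex$species)   # "coho" "chinook"
```

The names in levels= must match the values in the variable exactly (including capitalization); any value not listed in levels= becomes NA.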
This change in order will then be evident in subsequent tables and figures.
xtabs(~species,data=Mirex)
#R> species
#R> coho chinook
#R> 55 67
If a variable has more groups, then the list of groups in levels= will simply be longer.
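As an illustration with more than two groups, the sketch below uses a made-up variable, size, with three hypothetical groups (it is not part of Mirex). Every group is listed in levels= in the desired order.

```r
# Hypothetical variable with three groups (not from the Mirex data)
size <- c("small","large","medium","small","large")

# Default order is alphabetical
levels(factor(size))    # "large" "medium" "small"

# List every group in levels= in the desired order
size <- factor(size,levels=c("small","medium","large"))
levels(size)            # "small" "medium" "large"

# Confirm that no values were lost to a misspelled level name
anyNA(size)             # FALSE

# Tables now follow the new order
table(size)
```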