Module 10 Factors
Factors are a special case of character variables in R. Factors are primarily used for categorical data, especially if summaries of that data are to be displayed in other than alphabetical order. For example, a simple count of the number of observations in each loc
of the bears
data frame introduced in Section 3.3.1 shows that the categories of the character variable are displayed alphabetically by default.
<- read_csv(file.path("data","Bears.csv"))
bears <- bears %>%
tmp group_by(loc) %>%
summarize(n=n())
tmp
#R> # A tibble: 3 x 2
#R> loc n
#R> <chr> <int>
#R> 1 Ashland 2
#R> 2 Bayfield 3
#R> 3 Douglas 3
To display these results in other than alphabetical order will require the use of factors, which will be illustrated in this module.
Factors have a long and, for some, dubious history in R. The forcats
package was developed to aid working with factors in R. While it is not critical to understanding the methods of this module, the stringsAsFactors: An unauthorized biography and stringsAsFactors = <sigh> are interesting reads about how unforeseen issues can arise when using factors. Many of these issues are addressed when using forcats
, which is loaded with tidyverse
.
10.1 Creating Factors
A variable is converted to a factor class by including that variable in factor()
. The code below creates a new data frame where loc
is now a factor, as indicated by <fct>
below loc
when the tibble is printed.
<- bears %>%
tmp mutate(loc=factor(loc))
tmp
#R> # A tibble: 8 x 3
#R> length.cm weight.kg loc
#R> <dbl> <dbl> <fct>
#R> 1 139 110 Bayfield
#R> 2 120. 60 Bayfield
#R> 3 149 85 Bayfield
#R> 4 141 100 Ashland
#R> 5 141 95 Ashland
#R> 6 150 85 Douglas
#R> 7 130. 105 Douglas
#R> 8 150 110 Douglas
Nothing is changed when the data are displayed as above. However, behind-the-scenes the factor variable consists of numerical codes that are mapped to levels (i.e., the categories) of the factor variable. The order here is still alphabetical because the order of the levels was not specifically controlled when creating the factor variable. Thus, behind-the-scenes “1” is mapped to “Ashland,” “2” to “Bayfield,” and “3” to “Douglas.” This mapping is shown with loc_nums
below.
%>% mutate(loc_nums=as.numeric(loc)) tmp
#R> # A tibble: 8 x 4
#R> length.cm weight.kg loc loc_nums
#R> <dbl> <dbl> <fct> <dbl>
#R> 1 139 110 Bayfield 2
#R> 2 120. 60 Bayfield 2
#R> 3 149 85 Bayfield 2
#R> 4 141 100 Ashland 1
#R> 5 141 95 Ashland 1
#R> 6 150 85 Douglas 3
#R> 7 130. 105 Douglas 3
#R> 8 150 110 Douglas 3
Suppose, however, that we prefer that the levels of the loc
variable be ordered from West to East, as in Douglas than Bayfield and then Ashland. The specific order of the levels can be set by supplying the ordered levels in a vector given to levels=
within factor()
.
<- bears %>%
tmp mutate(loc=factor(loc,levels=c("Douglas","Bayfield","Ashland")))
tmp
#R> # A tibble: 8 x 3
#R> length.cm weight.kg loc
#R> <dbl> <dbl> <fct>
#R> 1 139 110 Bayfield
#R> 2 120. 60 Bayfield
#R> 3 149 85 Bayfield
#R> 4 141 100 Ashland
#R> 5 141 95 Ashland
#R> 6 150 85 Douglas
#R> 7 130. 105 Douglas
#R> 8 150 110 Douglas
Again, the data are not fundamentally altered, but the codes underlying the factor variables are.47
%>% mutate(loc_nums=as.numeric(loc)) tmp
#R> # A tibble: 8 x 4
#R> length.cm weight.kg loc loc_nums
#R> <dbl> <dbl> <fct> <dbl>
#R> 1 139 110 Bayfield 2
#R> 2 120. 60 Bayfield 2
#R> 3 149 85 Bayfield 2
#R> 4 141 100 Ashland 3
#R> 5 141 95 Ashland 3
#R> 6 150 85 Douglas 1
#R> 7 130. 105 Douglas 1
#R> 8 150 110 Douglas 1
When setting the levels of a factor variable make sure that all levels are spelled exactly as they are in the original variable. For example, see the major problem created below when a level is erroneously set (i.e., “bayfield” is not capitalized).
<- bears %>%
tmp mutate(loc=factor(loc,levels=c("Douglas","bayfield","Ashland")))
tmp
#R> # A tibble: 8 x 3
#R> length.cm weight.kg loc
#R> <dbl> <dbl> <fct>
#R> 1 139 110 <NA>
#R> 2 120. 60 <NA>
#R> 3 149 85 <NA>
#R> 4 141 100 Ashland
#R> 5 141 95 Ashland
#R> 6 150 85 Douglas
#R> 7 130. 105 Douglas
#R> 8 150 110 Douglas
Additionally, make sure to include all possible levels when creating the levels.
<- bears %>%
tmp mutate(loc=factor(loc,levels=c("Douglas","Ashland")))
tmp
#R> # A tibble: 8 x 3
#R> length.cm weight.kg loc
#R> <dbl> <dbl> <fct>
#R> 1 139 110 <NA>
#R> 2 120. 60 <NA>
#R> 3 149 85 <NA>
#R> 4 141 100 Ashland
#R> 5 141 95 Ashland
#R> 6 150 85 Douglas
#R> 7 130. 105 Douglas
#R> 8 150 110 Douglas
Given these last two issues it is important to understand what levels exist within a variable. If the variable is a character class (and not yet a factor) then use unique()
to see the list of levels.
unique(bears$loc)
#R> [1] "Bayfield" "Ashland" "Douglas"
However, if the variable is a factor class already then use levels()
to see the list of levels along with their order.
levels(tmp$loc)
#R> [1] "Douglas" "Ashland"
10.2 Changing Factor Order
Factor levels can be manually reordered using levels=
as shown in the previous section. However functions in forcats
provide other methods to simplify reordering levels for common situations.
Levels can be reordered based on the value of another variable with fct_reorder()
. This function takes the factor variable as the first argument and a numeric variable that contains values for which to reorder the factor levels as the second argument. Optionally a third argument called .fun
can be used to calculate a summary of the numeric variable to use for reordering.48
For example, the code below orders the levels in loc
by the median length of the bears.
<- bears %>%
tmp mutate(loc=fct_reorder(loc,length.cm))
levels(tmp$loc)
#R> [1] "Bayfield" "Ashland" "Douglas"
## Checking
%>%
tmp group_by(loc) %>%
summarize(n=n(),
mdn=median(length.cm))
#R> # A tibble: 3 x 3
#R> loc n mdn
#R> <fct> <int> <dbl>
#R> 1 Bayfield 3 139
#R> 2 Ashland 2 141
#R> 3 Douglas 3 150
Alternatively, the code below orders the levels of loc
by the minimum length of the bears (i.e., using min
in .fun=
).
<- bears %>%
tmp mutate(loc=fct_reorder(loc,length.cm,.fun=min))
levels(tmp$loc)
#R> [1] "Bayfield" "Douglas" "Ashland"
## Checking
%>%
tmp group_by(loc) %>%
summarize(n=n(),
min=min(length.cm))
#R> # A tibble: 3 x 3
#R> loc n min
#R> <fct> <int> <dbl>
#R> 1 Bayfield 3 120.
#R> 2 Douglas 3 130.
#R> 3 Ashland 2 141
Specific levels are moved to the beginning of the order of levels with fct_relevel()
. The first argument to this function is the factor variable and subsequent arguments are levels to move to the “beginning of the line.” For example, the following code moves “Bayfield” to the beginning of the order.
<- bears %>%
tmp mutate(loc=fct_relevel(loc,"Bayfield"))
levels(tmp$loc)
#R> [1] "Bayfield" "Ashland" "Douglas"
Alternatively, both “Bayfield” and “Douglas” are moved to the beginning below
<- bears %>%
tmp mutate(loc=fct_relevel(loc,"Bayfield","Douglas"))
levels(tmp$loc)
#R> [1] "Bayfield" "Douglas" "Ashland"
The order of levels for a factor variable are reversed with fct_rev()
.
<- bears %>%
tmp mutate(loc=fct_rev(loc))
levels(tmp$loc)
#R> [1] "Douglas" "Bayfield" "Ashland"
The fct_rev()
function can be used with fct_reorder()
to change the order from ascending to descending order. For example, the code below changes the factor levels to be in descending order of minimum length.
<- bears %>%
tmp mutate(loc=fct_reorder(loc,length.cm,.fun=min),
loc=fct_rev(loc))
## Checking
%>%
tmp group_by(loc) %>%
summarize(n=n(),
min=min(length.cm))
#R> # A tibble: 3 x 3
#R> loc n min
#R> <fct> <int> <dbl>
#R> 1 Ashland 2 141
#R> 2 Douglas 3 130.
#R> 3 Bayfield 3 120.
The fct_infreq()
function is used to order the levels by decreasing frequency of their occurrence.
<- bears %>%
tmp mutate(loc=fct_infreq(loc))
## Checking
%>%
tmp group_by(loc) %>%
summarize(n=n())
#R> # A tibble: 3 x 2
#R> loc n
#R> <fct> <int>
#R> 1 Bayfield 3
#R> 2 Douglas 3
#R> 3 Ashland 2
Of course, using fct_rev()
with fct_infreq()
would order the levels in ascending frequency of occurrence.
<- bears %>%
tmp mutate(loc=fct_rev(fct_infreq(loc)))
## Checking
%>%
tmp group_by(loc) %>%
summarize(n=n())
#R> # A tibble: 3 x 2
#R> loc n
#R> <fct> <int>
#R> 1 Ashland 2
#R> 2 Douglas 3
#R> 3 Bayfield 3
10.3 Changing Factor Levels
In some instances it may be beneficial to change the names of the levels or to collapse or lump levels together. Methods for performing these changes are demonstrated in this section using data on the density and basal area of tree species located on plots in the Apostle Islands that were designated as “Balsam Fir” plots. These data were extracted from Sanders and Grochowski (2012)49 and are stored in APIS_FirPlots.xlsx.
<- readxl::read_excel(file.path("data","APIS_FirPlots.xlsx"))
aip aip
#R> # A tibble: 26 x 4
#R> plot species density basal_area
#R> <dbl> <chr> <dbl> <dbl>
#R> 1 1011 Acer rubrum 11.1 0.4
#R> 2 1011 Betula alleghaniensis 156. 1.7
#R> 3 1011 Betula papyrifera 44.4 0.7
#R> 4 1011 Sorbus decora 22.2 0.2
#R> 5 1011 unknown tree - hardwood 11.1 0.6
#R> 6 1011 Abies balsamea 878. 2.5
#R> 7 1011 unknown tree - softwood 122. 1.2
#R> 8 1031 Acer saccharum 55.6 0.1
#R> 9 1031 Acer sp. 44.4 0.3
#R> 10 1031 Betula papyrifera 33.3 1.5
#R> # ... with 16 more rows
These data are briefly summarized below.
<- aip %>%
aip_sum1 group_by(species) %>%
summarize(n=n(),
ttl_density=sum(density),
ttl_barea=sum(basal_area))
aip_sum1
#R> # A tibble: 14 x 4
#R> species n ttl_density ttl_barea
#R> <chr> <int> <dbl> <dbl>
#R> 1 Abies balsamea 4 1000 2.8
#R> 2 Acer rubrum 1 11.1 0.4
#R> 3 Acer saccharum 3 77.8 0.2
#R> 4 Acer sp. 1 44.4 0.3
#R> 5 Acer spicatum 1 11.1 0
#R> 6 Betula alleghaniensis 3 178. 2.3
#R> 7 Betula papyrifera 2 77.7 2.2
#R> 8 Betula sp. 1 11.1 0.3
#R> 9 Populus tremuloides 1 22.2 1.5
#R> 10 Quercus rubra 1 11.1 0
#R> 11 Sorbus decora 1 22.2 0.2
#R> 12 Thuja occidentalis 1 11.1 0.7
#R> 13 unknown tree - hardwood 4 111 4.3
#R> 14 unknown tree - softwood 2 144. 1.2
The summary above shows several species within a few genera (e.g., Acer). A researcher may wish to collapse these species into one level that represents the genera. For example, one may want to collapse “Acer rubrum,” “Acer saccharum,” “Acer spicatum,” and “Acer sp.” into a single level called “Acer sp.”
The “collapsing” of multiple levels into one level is accomplished with fct_collapse()
. The first argument to this function is the factor variable containing the “old” levels. Subsequent arguments are formed by setting a new level name equal to a vector containing old level names to collapse.
For example the code below uses fct_collapse()
to create a new species2
variable that collapses the multiple “Acer” and “Betula” levels into two levels specific to each genera. To allow ease of comparison of the two “species” variables I then moved species2
to be right after species
and arranged the rows by species
.50
%<>%
aip mutate(species2=fct_collapse(species,
"Acer sp."=c("Acer rubrum","Acer saccharum",
"Acer sp.","Acer spicatum"),
"Betula sp."=c("Betula alleghaniensis",
"Betula papyrifera",
"Betula sp."),
"Unknown"=c("unknown tree - hardwood",
"unknown tree - softwood"))) %>%
relocate(species,.after=species) %>%
arrange(species)
aip
#R> # A tibble: 26 x 5
#R> plot species density basal_area species2
#R> <dbl> <chr> <dbl> <dbl> <fct>
#R> 1 1011 Abies balsamea 878. 2.5 Abies balsamea
#R> 2 1031 Abies balsamea 33.3 0.1 Abies balsamea
#R> 3 1034 Abies balsamea 66.7 0.1 Abies balsamea
#R> 4 1035 Abies balsamea 22.2 0.1 Abies balsamea
#R> 5 1011 Acer rubrum 11.1 0.4 Acer sp.
#R> 6 1031 Acer saccharum 55.6 0.1 Acer sp.
#R> 7 1034 Acer saccharum 11.1 0.1 Acer sp.
#R> 8 1035 Acer saccharum 11.1 0 Acer sp.
#R> 9 1031 Acer sp. 44.4 0.3 Acer sp.
#R> 10 1034 Acer spicatum 11.1 0 Acer sp.
#R> # ... with 16 more rows
To highlight the most abundance species in further analysis below, the levels of species2
were ordered in descending values of the sum of density
(using fct_reorder()
and fct_rev()
).
%<>%
aip mutate(species2=fct_rev(fct_reorder(species2,density,.fun=sum)))
Total density and total basal area by level of species2
, where the levels of species2
are ordered in descending order of the total density is then shown below.
<- aip %>%
aip_sum2 group_by(species2) %>%
summarize(n=n(),
ttl_density=sum(density),
ttl_barea=sum(basal_area))
aip_sum2
#R> # A tibble: 8 x 4
#R> species2 n ttl_density ttl_barea
#R> <fct> <int> <dbl> <dbl>
#R> 1 Abies balsamea 4 1000 2.8
#R> 2 Betula sp. 6 267. 4.8
#R> 3 Unknown 6 255. 5.5
#R> 4 Acer sp. 6 144. 0.9
#R> 5 Sorbus decora 1 22.2 0.2
#R> 6 Populus tremuloides 1 22.2 1.5
#R> 7 Thuja occidentalis 1 11.1 0.7
#R> 8 Quercus rubra 1 11.1 0
The result above indicate that four of the “species” dominated the results. It may be useful (especially if graphing these results) to “lump” the four least abundant “species” into an “Other” category. This type of “lumping” can be accomplished with fct_lump_n()
which requires the factor variable as its first argument andn=
to indicate the top n
levels to not lump into an “Other” category. For example, the code below will lump all levels in species2
after the top four into an “Other” category of the new species3
variable. The summary results illustrate the lumping.
<- aip %>%
aip mutate(species3=fct_lump_n(species2,n=4))
<- aip %>%
aip_sum3 group_by(species3) %>%
summarize(n=n(),
ttl_density=sum(density),
ttl_barea=sum(basal_area))
aip_sum3
#R> # A tibble: 5 x 4
#R> species3 n ttl_density ttl_barea
#R> <fct> <int> <dbl> <dbl>
#R> 1 Abies balsamea 4 1000 2.8
#R> 2 Betula sp. 6 267. 4.8
#R> 3 Unknown 6 255. 5.5
#R> 4 Acer sp. 6 144. 0.9
#R> 5 Other 4 66.6 2.4
Other versions of fct_lump_XXX()
, such as fct_lump_prop()
and fct_lump_min()
, can be used when ordering by frequency rather than be another variable as shown here.
Finally, the names of specific levels can be changed with fct_recode()
. The first argument to this function is the original factor variable. Subsequent arguments are of the form new level name equal to old level name. Any levels not listed in fct_recode()
will be retained with their original names. For example, the code below creates a new species4
variable with new common names for the three “species” levels remaining in species3
. The results are seen in the summary below.
%<>%
aip mutate(species4=fct_recode(species3,
"Maple" = "Acer sp.",
"Birch" = "Betula sp.",
"Balsam Fir" = "Abies balsamea"))
<- aip %>%
aip_sum4 group_by(species4) %>%
summarize(n=n(),
ttl_density=sum(density),
ttl_barea=sum(basal_area))
aip_sum4
#R> # A tibble: 5 x 4
#R> species4 n ttl_density ttl_barea
#R> <fct> <int> <dbl> <dbl>
#R> 1 Balsam Fir 4 1000 2.8
#R> 2 Birch 6 267. 4.8
#R> 3 Unknown 6 255. 5.5
#R> 4 Maple 6 144. 0.9
#R> 5 Other 4 66.6 2.4
10.4 Examples in Context
10.4.1 Pulse of the Nation Survey
Cards Against Humanity Saves America created monthly polls to assess what they called the “Pulse of the Nation.” “Cards” partnered with Survey Sampling International to contact a nationally representative sample of the American public. For the first three polls, they contacted potential respondents on their cell phones and landlines, and a total of about 3000 adults ultimately participated i the poll. Results from the first poll are in 201709-CAH_PulseOfTheNation.csv51, which are read in below.
<- read_csv(file.path("data","201709-CAH_PulseOfTheNation.csv"))
PON PON
#R> # A tibble: 1,000 x 28
#R> Income Gender Age `Age Range` `Political Affiliation` `Do you approve or d~
#R> <dbl> <chr> <dbl> <chr> <chr> <chr>
#R> 1 8000 Female 64 55-64 Democrat Strongly disapprove
#R> 2 68000 Female 56 55-64 Democrat Strongly disapprove
#R> 3 46000 Male 63 55-64 Independent Somewhat Approve
#R> 4 51000 Male 48 45-54 Republican Strongly Approve
#R> 5 100000 Female 32 25-34 Democrat Somewhat Approve
#R> 6 54000 Female 64 55-64 Democrat Strongly disapprove
#R> 7 83000 Male 61 55-64 Independent Somewhat Approve
#R> 8 114000 Female 64 55-64 Republican Somewhat disapprove
#R> 9 90000 Female 64 55-64 Republican Somewhat Approve
#R> 10 5000 Female 68 65+ Democrat Somewhat disapprove
#R> # ... with 990 more rows, and 22 more variables:
#R> # What is your highest level of education? <chr>, Q5OTH1 <chr>,
#R> # What is your race? <chr>, Q6OTH1 <chr>, What is your marital status? <chr>,
#R> # Q7OTH1 <chr>, q8x <chr>,
#R> # What would you say is the likelihood that your current job will be entirely performed by robots or computers within the next decade? <chr>,
#R> # Do you believe that climate change is real and caused by people, real but not caused by people, or not real at all? <chr>,
#R> # How many Transformers movies have you seen? <dbl>, q11x <chr>, ...
The variables in the data frame are generally the actual questions asked in the poll. These are both difficult to read and to work with. To get a better look at these variables I used names()
, which when given a data frame returns all the variable names of that data frame.
names(PON)
#R> [1] "Income"
#R> [2] "Gender"
#R> [3] "Age"
#R> [4] "Age Range"
#R> [5] "Political Affiliation"
#R> [6] "Do you approve or disapprove of how Donald Trump is handling his job as president?"
#R> [7] "What is your highest level of education?"
#R> [8] "Q5OTH1"
#R> [9] "What is your race?"
#R> [10] "Q6OTH1"
#R> [11] "What is your marital status?"
#R> [12] "Q7OTH1"
#R> [13] "q8x"
#R> [14] "What would you say is the likelihood that your current job will be entirely performed by robots or computers within the next decade?"
#R> [15] "Do you believe that climate change is real and caused by people, real but not caused by people, or not real at all?"
#R> [16] "How many Transformers movies have you seen?"
#R> [17] "q11x"
#R> [18] "Do you agree or disagree with the following statement: scientists are generally honest and are serving the public good."
#R> [19] "Do you agree or disagree with the following statement: vaccines are safe and protect children from disease."
#R> [20] "How many books, if any, have you read in the past year?"
#R> [21] "q14x"
#R> [22] "Do you believe in ghosts?"
#R> [23] "What percentage of the federal budget would you estimate is spent on scientific research?"
#R> [24] "q16x"
#R> [25] "Is federal funding of scientific research too high, too low, or about right?"
#R> [26] "True or false: the earth is always farther away from the sun in the winter than in the summer."
#R> [27] "If you had to choose: would you rather be smart and sad, or dumb and happy?"
#R> [28] "Do you think it is acceptable or unacceptable to urinate in the shower?"
For this analysis I want to focus on the Political Affiliation
, Age Range
, and the three variables related to education, opinion about climate change, and opinion about the honesty of scientists. These last three variables have very long names so it will be easier to refer to them by their column number. From above, it is noted that the “education” variable is column 7, the climate change variable is in column 15, and the “scientists honesty” question is in column 18. In the code below, I rename each of the columns (and Age Range
) and then reduce the data frame to just those variables.
%<>%
PON rename(polit_aff=`Political Affiliation`,age_range=`Age Range`,
education=7,climate=15,scihon=18) %>%
select(polit_aff,age_range,education,climate,scihon)
PON
#R> # A tibble: 1,000 x 5
#R> polit_aff age_range education climate scihon
#R> <chr> <chr> <chr> <chr> <chr>
#R> 1 Democrat 55-64 College degree Real and Caused by People Strongly ~
#R> 2 Democrat 55-64 High school DK/REF Somewhat ~
#R> 3 Independent 55-64 Some college Real but not Caused by People Somewhat ~
#R> 4 Republican 45-54 High school Not Real At All Somewhat ~
#R> 5 Democrat 25-34 Some college Real and Caused by People Strongly ~
#R> 6 Democrat 55-64 Some college Real and Caused by People Strongly ~
#R> 7 Independent 55-64 College degree Real but not Caused by People Strongly ~
#R> 8 Republican 55-64 College degree Real and Caused by People Somewhat ~
#R> 9 Republican 55-64 High school Real and Caused by People Somewhat ~
#R> 10 Democrat 65+ Some college Real and Caused by People Somewhat ~
#R> # ... with 990 more rows
All remaining variables are categorical and for my analysis purposes I want each to be a factor. Before converting these variables to factors I examine the levels that appear in each.
unique(PON$polit_aff)
#R> [1] "Democrat" "Independent" "Republican" "DK/REF"
unique(PON$age_range)
#R> [1] "55-64" "45-54" "25-34" "65+" "35-44" "18-24"
unique(PON$education)
#R> [1] "College degree" "High school" "Some college" "Graduate degree"
#R> [5] "DK/REF" "Other"
unique(PON$climate)
#R> [1] "Real and Caused by People" "DK/REF"
#R> [3] "Real but not Caused by People" "Not Real At All"
unique(PON$scihon)
#R> [1] "Strongly Agree" "Somewhat Agree"
#R> [3] "Somewhat Disagree" "Strongly Disagree"
#R> [5] "DK/REF" "Neither Agree nor Disagree"
From this examination of levels, I made the following notes:
- Respondents were allowed to answer “Don’t Know” or refuse to answer which were coded as “DK/REF.” My general preference is to have “DK/REF” answers be the last level.
- The levels for
age_range
will not need to be controlled because the numbers will naturally alphabetically order (as the numbers are all two digits; the order might not be natural if there were some one digit (say “1-17”) or three digit (say “100+”) numbers). - Levels in
education
andscihon
need to be ordered according to their natural order (i.e., increasing education and increasing level of agreement). - I will order the levels for
climate
based on my understanding of what the correct answer is (i.e., I will order the answers from correct to less correct). - I will order the levels of
polit_aff
so that “Independent” is in the middle (which will happen alphabetically) and “DK/REF” is last.
Creation of these factors is below. Note that to move “DK/REF” to the end of polit_aff
I used fct_relevel()
with after=Inf
to move the level to the last position.
%<>%
PON mutate(polit_aff=factor(polit_aff),
polit_aff=fct_relevel(polit_aff,"DK/REF",after=Inf),
age_range=factor(age_range),
education=factor(education,levels=c("High school","Some college",
"College degree","Graduate degree",
"Other","DK/REF")),
climate=factor(climate,levels=c("Real and Caused by People",
"Real but not Caused by People",
"Not Real At All",
"DK/REF")),
scihon=factor(scihon,levels=c("Strongly Disagree","Somewhat Disagree",
"Neither Agree nor Disagree",
"Somewhat Agree","Strongly Agree",
"DK/REF")))
I then used levels()
to double-check that each factor has the levels and the order of levels that I expected.
levels(PON$polit_aff)
#R> [1] "Democrat" "Independent" "Republican" "DK/REF"
levels(PON$age_range)
#R> [1] "18-24" "25-34" "35-44" "45-54" "55-64" "65+"
levels(PON$education)
#R> [1] "High school" "Some college" "College degree" "Graduate degree"
#R> [5] "Other" "DK/REF"
levels(PON$climate)
#R> [1] "Real and Caused by People" "Real but not Caused by People"
#R> [3] "Not Real At All" "DK/REF"
levels(PON$scihon)
#R> [1] "Strongly Disagree" "Somewhat Disagree"
#R> [3] "Neither Agree nor Disagree" "Somewhat Agree"
#R> [5] "Strongly Agree" "DK/REF"
Suppose the first analysis of interest is examining responses to the “scientists are generally honest and are serving the public good” question.
<- PON %>%
sci_sum1 group_by(scihon) %>%
summarize(freq=n())
sci_sum1
#R> # A tibble: 6 x 2
#R> scihon freq
#R> <fct> <int>
#R> 1 Strongly Disagree 104
#R> 2 Somewhat Disagree 133
#R> 3 Neither Agree nor Disagree 19
#R> 4 Somewhat Agree 335
#R> 5 Strongly Agree 349
#R> 6 DK/REF 60
Suppose that the analysts want to simplify this result for their audience by ignoring the “Don’t know or refused to answer” responses and collapsing all disagree answers to one level and all agree answers to one level.
<- PON %>%
tmp filter(scihon!="DK/REF") %>%
mutate(scihon=fct_collapse(scihon,
"Disagree"=c("Strongly Disagree","Somewhat Disagree"),
"Agree"=c("Strongly Agree","Somewhat Agree")))
<- tmp %>%
sci_sum2 group_by(scihon) %>%
summarize(freq=n())
sci_sum2
#R> # A tibble: 3 x 2
#R> scihon freq
#R> <fct> <int>
#R> 1 Disagree 237
#R> 2 Neither Agree nor Disagree 19
#R> 3 Agree 684
For fun, a graph of this summary is below.
Further suppose that the analysts want to summarize the question about climate change separately by political affiliation (for which an answer was provided).
<- PON %>%
polclim_sum1 filter(polit_aff != "DK/REF") %>%
group_by(polit_aff,climate) %>%
summarize(freq=n()) %>%
mutate(perc=freq/sum(freq)*100)
polclim_sum1
#R> # A tibble: 12 x 4
#R> # Groups: polit_aff [3]
#R> polit_aff climate freq perc
#R> <fct> <fct> <int> <dbl>
#R> 1 Democrat Real and Caused by People 221 80.7
#R> 2 Democrat Real but not Caused by People 28 10.2
#R> 3 Democrat Not Real At All 15 5.47
#R> 4 Democrat DK/REF 10 3.65
#R> 5 Independent Real and Caused by People 234 63.6
#R> 6 Independent Real but not Caused by People 69 18.8
#R> 7 Independent Not Real At All 49 13.3
#R> 8 Independent DK/REF 16 4.35
#R> 9 Republican Real and Caused by People 71 34.1
#R> 10 Republican Real but not Caused by People 61 29.3
#R> 11 Republican Not Real At All 61 29.3
#R> 12 Republican DK/REF 15 7.21
Suppose that the researchers want to compute a percentage of respondents that thought climate change was real, whether they had the cause correct or not. Further suppose that they wish to include the “DK/REF” answers with the “Not Real At All” answers under the assumption that if they thought climate change was real they would have responded as such.
<- PON %>%
tmp filter(polit_aff != "DK/REF") %>%
mutate(climate=fct_collapse(climate,
"Real"=c("Real and Caused by People",
"Real but not Caused by People"),
"Not Real"=c("Not Real At All",
"DK/REF")))
<- tmp %>%
polclim_sum2 group_by(polit_aff,climate) %>%
summarize(freq=n()) %>%
mutate(perc=freq/sum(freq)*100) %>%
ungroup()
polclim_sum2
#R> # A tibble: 6 x 4
#R> polit_aff climate freq perc
#R> <fct> <fct> <int> <dbl>
#R> 1 Democrat Real 249 90.9
#R> 2 Democrat Not Real 25 9.12
#R> 3 Independent Real 303 82.3
#R> 4 Independent Not Real 65 17.7
#R> 5 Republican Real 132 63.5
#R> 6 Republican Not Real 76 36.5
And a graphic of the results for fun.
10.4.2 Enterococci on Sydney Beaches
The Beachwatch Water Quality Program of Sydney, Australia measured the density (in colony forming units per 100 ml of water) of Enterococci on Sydney beaches. The data from 2013 to 2018 was made available by the R-Ladies Sydney group and is in sydneybeaches.csv.52 The data are loaded below.
<- read_csv("https://raw.githubusercontent.com/rladiessydney/RYouWithMe/master/sydneybeaches.csv")
sb sb
#R> # A tibble: 3,690 x 8
#R> BeachId Region Council Site Longitude Latitude Date `Enterococci (cf~
#R> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
#R> 1 25 Sydney C~ Randwick~ Clove~ 151. -33.9 02/0~ 19
#R> 2 25 Sydney C~ Randwick~ Clove~ 151. -33.9 06/0~ 3
#R> 3 25 Sydney C~ Randwick~ Clove~ 151. -33.9 12/0~ 2
#R> 4 25 Sydney C~ Randwick~ Clove~ 151. -33.9 18/0~ 13
#R> 5 25 Sydney C~ Randwick~ Clove~ 151. -33.9 30/0~ 8
#R> 6 25 Sydney C~ Randwick~ Clove~ 151. -33.9 05/0~ 7
#R> 7 25 Sydney C~ Randwick~ Clove~ 151. -33.9 11/0~ 11
#R> 8 25 Sydney C~ Randwick~ Clove~ 151. -33.9 23/0~ 97
#R> 9 25 Sydney C~ Randwick~ Clove~ 151. -33.9 07/0~ 3
#R> 10 25 Sydney C~ Randwick~ Clove~ 151. -33.9 25/0~ 0
#R> # ... with 3,680 more rows
For the demonstration here the BeachId
, Council
, and Region
variables are not needed. The density of Enterococci is the last variable but its name is difficult to work with and will be renamed.
%<>%
sb select(-BeachId,-Council,-Region) %>%
rename(density=last_col())
sb
#R> # A tibble: 3,690 x 5
#R> Site Longitude Latitude Date density
#R> <chr> <dbl> <dbl> <chr> <dbl>
#R> 1 Clovelly Beach 151. -33.9 02/01/2013 19
#R> 2 Clovelly Beach 151. -33.9 06/01/2013 3
#R> 3 Clovelly Beach 151. -33.9 12/01/2013 2
#R> 4 Clovelly Beach 151. -33.9 18/01/2013 13
#R> 5 Clovelly Beach 151. -33.9 30/01/2013 8
#R> 6 Clovelly Beach 151. -33.9 05/02/2013 7
#R> 7 Clovelly Beach 151. -33.9 11/02/2013 11
#R> 8 Clovelly Beach 151. -33.9 23/02/2013 97
#R> 9 Clovelly Beach 151. -33.9 07/03/2013 3
#R> 10 Clovelly Beach 151. -33.9 25/03/2013 0
#R> # ... with 3,680 more rows
In one part of their analysis the researchers wanted to produce a plot of the mean and range of densities of Enterococci by beach. To aid understanding, however, they wanted the beaches ordered from south to north. The summary statistics below are calculated by Site
converted to a factor class based on increasing Latitude
. The latitudes are list with negative numbers (because Australia is in the southern hemisphere) so that more negative numbers are further south. Note the use of na.rm=TRUE
in fct_reorder()
because of the missing values at some sites.
<- sb %>%
sb_sum1 mutate(Site=fct_reorder(Site,Latitude)) %>%
group_by(Longitude,Latitude,Site) %>%
summarize(n=n(),
valid_n=sum(!is.na(density)),
mn_density=mean(density,na.rm=TRUE),
sd_density=sd(density,na.rm=TRUE),
min_density=min(density,na.rm=TRUE),
max_density=max(density,na.rm=TRUE)) %>%
ungroup()
sb_sum1
#R> # A tibble: 11 x 9
#R> Longitude Latitude Site n valid_n mn_density sd_density min_density
#R> <dbl> <dbl> <fct> <int> <int> <dbl> <dbl> <dbl>
#R> 1 151. -34.0 Little Ba~ 338 336 45.6 297. 0
#R> 2 151. -34.0 Malabar B~ 343 341 68.1 214. 0
#R> 3 151. -34.0 South Mar~ 338 336 15.7 77.0 0
#R> 4 151. -33.9 Maroubra ~ 338 335 20.2 123. 0
#R> 5 151. -33.9 Coogee Be~ 342 339 39.4 109. 0
#R> 6 151. -34.0 South Mar~ 316 314 63.9 210. 0
#R> 7 151. -33.9 Gordons B~ 324 322 24.9 120. 0
#R> 8 151. -33.9 Clovelly ~ 338 334 10.2 26.4 0
#R> 9 151. -33.9 Bronte Be~ 338 335 31.4 126. 0
#R> 10 151. -33.9 Tamarama ~ 337 335 35.7 129. 0
#R> 11 151. -33.9 Bondi Bea~ 338 334 18.8 51.9 0
#R> # ... with 1 more variable: max_density <dbl>
The desired plot is shown below (with the southern-most beach at the bottom of the y-axis).
Other researchers would prefer that the beaches be ordered from lowest to highest density of Enterococci. The code below creates a summary by Site
s ordered by the mean density
.
<- sb %>%
sb_sum2 mutate(Site=fct_reorder(Site,density,.fun=mean,na.rm=TRUE)) %>%
group_by(Site,Longitude,Latitude) %>%
summarize(n=n(),
valid_n=sum(!is.na(density)),
mn_density=mean(density,na.rm=TRUE),
sd_density=sd(density,na.rm=TRUE),
min_density=min(density,na.rm=TRUE),
max_density=max(density,na.rm=TRUE)) %>%
ungroup()
sb_sum2
#R> # A tibble: 11 x 9
#R> Site Longitude Latitude n valid_n mn_density sd_density min_density
#R> <fct> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
#R> 1 Clovelly ~ 151. -33.9 338 334 10.2 26.4 0
#R> 2 South Mar~ 151. -34.0 338 336 15.7 77.0 0
#R> 3 Bondi Bea~ 151. -33.9 338 334 18.8 51.9 0
#R> 4 Maroubra ~ 151. -33.9 338 335 20.2 123. 0
#R> 5 Gordons B~ 151. -33.9 324 322 24.9 120. 0
#R> 6 Bronte Be~ 151. -33.9 338 335 31.4 126. 0
#R> 7 Tamarama ~ 151. -33.9 337 335 35.7 129. 0
#R> 8 Coogee Be~ 151. -33.9 342 339 39.4 109. 0
#R> 9 Little Ba~ 151. -34.0 338 336 45.6 297. 0
#R> 10 South Mar~ 151. -34.0 316 314 63.9 210. 0
#R> 11 Malabar B~ 151. -34.0 343 341 68.1 214. 0
#R> # ... with 1 more variable: max_density <dbl>
See that “1” is now mapped to “Douglas,” “2” to “Bayfield,” and “3” to “Ashland.”↩︎
Note that
.fun
is only useful if multiple observations of the factor variable are present and it defaults to using the median.↩︎Note that Jessica Grochowski, now Jessica Kirschbaum, is a Northland alum.↩︎
Compare the
species
tospecies2
columns below to confirm how the “refactoring” worked.↩︎Data from this online book.↩︎