Module 29 Chi-Square in R
In Modules 20 and 21 you learned the theory and how to perform the calculations required to complete the 11 steps (Section 17.1) of a hypothesis test for chi-square and goodness-of-fit tests. In this module, you will learn how to perform the required calculations from raw data using R.111
29.1 Chi-Square Tests
29.1.1 Data formats
Consider the following example for this section.
Daniel Weiss (in “100% American”) reported the results of a survey of 300 first-time fathers from four different hospitals (labeled as A, B, C, and D). Each father was asked if he was present (or not) in the delivery room when his child was born. The results of the survey are in FatherPresent.csv (data,meta). Use these data to determine if there is a difference, at the 5% level, in the proportion of fathers present in the delivery room among the four hospitals.
The data are loaded and examined below.
<- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/FatherPresent.csv")
fp str(fp)
#R> 'data.frame': 300 obs. of 2 variables:
#R> $ hospital: chr "A" "A" "A" "A" ...
#R> $ father : chr "Present" "Present" "Present" "Present" ...
headtail(fp)
#R> hospital father
#R> 1 A Present
#R> 2 A Present
#R> 3 A Present
#R> 298 D Absent
#R> 299 D Absent
#R> 300 D Absent
The chi-square test calculations can be made from either raw data, that are then summarized, or data that has already been summarized. Raw data, as are given here, should be in stacked format (as described in Section 23.1.1) and then summarized to a two-way frequency table with xtabs()
using a formula of the form ~cvarRow+cvarCol
as described in Section 26.2.1. The result should be saved to an object for later use.
<- xtabs(~hospital+father,data=fp) ) ( fp.obs
#R> father
#R> hospital Absent Present
#R> A 9 66
#R> B 15 60
#R> C 18 57
#R> D 19 56
IF the raw data do NOT exist, but a summary table is provided, then those summary statistics must be entered into an object using matrix()
. The summarized frequencies must first be entered into a vector with the first COLUMN of values followed by the second COLUMN and so on. This vector is then the first argument to matrix()
, which will also include the number of columns in the frequency table in ncol=
. For example, the same observed table from above would be constructed as follows.
<- matrix(c(9,15,18,19,66,60,57,56),ncol=2) fp.obs
It is useful to augment this matrix with column and row names as follows.
colnames(fp.obs) <- c("Absent","Present")
rownames(fp.obs) <- c("A","B","C","D")
fp.obs
#R> Absent Present
#R> A 9 66
#R> B 15 60
#R> C 18 57
#R> D 19 56
The observed table must contain frequencies, not proportions or percentages (don’t use percTable()
), without marginal totals (don’t use addMargins()
).
29.1.2 Chi-Square test in R
The Chi-Square Test is performed with chisq.test()
, which takes the observed frequency table as the first argument and correct=FALSE
so that the continuity correction is not used.112 The results of chisq.test()
should be assigned to an object. The Chi-Square test statistic and p-value are extracted by simply printing the saved object.
<- chisq.test(fp.obs,correct=FALSE) ) ( fp.chi
#R> Pearson's Chi-squared test with fp.obs
#R> X-squared = 5.0003, df = 3, p-value = 0.1718
The expected frequency table is returned by appending $expected
to the saved object.
$expected fp.chi
#R> Absent Present
#R> A 15.25 59.75
#R> B 15.25 59.75
#R> C 15.25 59.75
#R> D 15.25 59.75
From the results above the expected table for Step 5 is shown at the end, the observed table for Step 6 was shown further above, the χ2 test statistic for Step 7 is 5.000, and the p-value for Step 8 is 0.172 (with 3 df).
Rejecting the null hypothesis in a Chi-Square Test indicates that there is some difference in the distribution of individuals into the levels of the response variable among some of the groups. A post-hoc method for helping determine which groups differ is obtained by observing the Pearson residuals. A Pearson residual is computed for each cell in the table as,
\[ \frac{\text{Observed-Expected}}{\sqrt{\text{Expected}}} \]
which is the appropriately signed square root of the parts in the χ2 test statistic calculation. Therefore, cells that have Pearson residuals far from zero contributed substantially to the large χ2 test statistic that resulted in a small p-value and the ultimate rejection of H0. Patterns in where the large Pearson residuals are found may allow one to qualitatively determine which groups differ and, thus, which levels of the response differ the most. This process will be illustrated more fully in the examples. However, the Pearson residuals are obtained from the saved chisq.test()
object by appending $residuals
.
$residuals fp.chi
#R> Absent Present
#R> A -1.60046100 0.80855778
#R> B -0.06401844 0.03234231
#R> C 0.70420284 -0.35576543
#R> D 0.96027660 -0.48513467
These residuals are all relatively small which is not surprising given that the H0 was not rejected.
29.1.3 Example - Apostle Islands Plants
Below are the 11-steps for completing a full hypothesis test (Section 17.1) for the following situation:
In her Senior Capstone project a Northland College student recorded the dominant (i.e., most abundant) plant species in 100 randomly selected plots on both Devil’s Island and the Bayfield Peninsula (i.e., the mainland). There were a total of six “species” (one group was called “other”) recorded (labeled as A, B, C, D, E, and F). The results are shown in the table below. Determine, at the 5% level, if the frequency of dominant species differs between the two locations.
A | B | C | D | E | F | Sum | |
---|---|---|---|---|---|---|---|
Devil’s Is | 34 | 22 | 14 | 13 | 12 | 5 | 100 |
Mainland | 62 | 12 | 8 | 7 | 6 | 5 | 100 |
Sum | 96 | 34 | 22 | 20 | 18 | 10 | 200 |
As these data are given in summarized format they were entered into a matrix as described above. The entire R code for this analysis is shown in the “R Code and Results” subsection below.
- α = 0.05.
- H0: “the distribution of dominant plants species is the same between Devil’s Island and the Mainland” vs. HA: “the distribution of dominant plants species is NOT the same between Devil’s Island and the Mainland.”
- A Chi-Square Test is required because …
- a categorical variable with six levels (plant species) was recorded and
- two groups are being compared (Devil’s Island and Mainland).
- The data appear to be part of an observational study (the two groups existed and were not created by the researcher) where the plots were randomly selected.
- There are more than five individuals in each cell of the expected table (Table 29.1).
- The statistic is the observed frequency table given in the background information.
- χ2=16.54 with 5 df (Table 29.2).
- p-value=0.0055 (Table 29.2).
- H0 is rejected because the p-value is < α.
- There does appear to be a significant difference in the distribution of the dominant plants between the two sites. A look at the Pearson residuals (Table 29.3) and the row-percentage table (Table 29.4) both suggest that the biggest difference between the two locations is due to “plant A.”113
A | B | C | D | E | F | Sum | |
---|---|---|---|---|---|---|---|
Devil’s Is | 48 | 17 | 11 | 10 | 9 | 5 | 100 |
Mainland | 48 | 17 | 11 | 10 | 9 | 5 | 100 |
Sum | 96 | 34 | 22 | 20 | 18 | 10 | 200 |
X-squared = 16.5442, df = 5, p-value = 0.00545 |
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
Devil’s Is | -2.021 | 1.213 | 0.905 | 0.949 | 1 | 0 |
Mainland | 2.021 | -1.213 | -0.905 | -0.949 | -1 | 0 |
A | B | C | D | E | F | Sum | |
---|---|---|---|---|---|---|---|
Devil’s Is | 34 | 22 | 14 | 13 | 12 | 5 | 100 |
Mainland | 62 | 12 | 8 | 7 | 6 | 5 | 100 |
R Appendix:
<- c(34,22,14,13,12,5,62,12,8,7,6,5)
freq <- matrix(freq,nrow=2,byrow=TRUE)
ai.obs rownames(ai.obs) <- c("Devil's Is","Mainland")
colnames(ai.obs) <- c("A","B","C","D","E","F")
<- chisq.test(ai.obs) )
( ai.chi $expected
ai.chi$residuals
ai.chipercTable(ai.obs,margin=1,digits=1)
<- ai.obs[,-1]
ai.obs1 <- chisq.test(ai.obs1) ) ( ai.chi1
29.2 Goodness-of-Fit Test in R
29.2.1 Data Format
Consider the following example for this section.
Herriges and King (1999) examined modes of fishing for a large number of recreational saltwater users in southern California. One of the questions asked in their Southern California Sportfishing Survey was what “mode” they used for fishing – “from the beach,” “from a fishing pier,” “on a private boat,” or “on a chartered boat.” The results to this question, along with other data not used here, are found in FishingModes.csv (data, meta). One hypothesis of interest states that two-thirds of the users will fish from a boat, split evenly between private and charter boats, while the other one-third will fish from land, also split evenly between those fishing on the beach and those from a pier. Use the data in the
mode
variable of the data file to determine if this hypothesis is supported at the 10% level.
The data are loaded and examined below (the headtail()
only shows a few variables).
<- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/FishingModes.csv")
sf str(sf)
#R> 'data.frame': 1182 obs. of 12 variables:
#R> $ mode : chr "charter" "charter" "boat" "pier" ...
#R> $ price : num 182.9 34.5 24.3 15.1 41.5 ...
#R> $ catch : num 0.5391 0.4671 0.2413 0.0789 0.1082 ...
#R> $ pbeach : num 157.9 15.1 161.9 15.1 106.9 ...
#R> $ ppier : num 157.9 15.1 161.9 15.1 106.9 ...
#R> $ pboat : num 157.9 10.5 24.3 55.9 41.5 ...
#R> $ pcharter: num 182.9 34.5 59.3 84.9 71 ...
#R> $ cbeach : num 0.0678 0.1049 0.5333 0.0678 0.0678 ...
#R> $ cpier : num 0.0503 0.0451 0.4522 0.0789 0.0503 ...
#R> $ cboat : num 0.26 0.157 0.241 0.164 0.108 ...
#R> $ ccharter: num 0.539 0.467 1.027 0.539 0.324 ...
#R> $ income : num 7083 1250 3750 2083 4583 ...
headtail(sf,which=c("mode","price","catch"))
#R> mode price catch
#R> 1 charter 182.930 0.5391
#R> 2 charter 34.534 0.4671
#R> 3 boat 24.334 0.2413
#R> 1180 pier 65.036 0.4522
#R> 1181 beach 36.636 0.5333
#R> 1182 boat 235.436 0.6817
As with the Chi-Square Test, the Goodness-of-Fit test can be made from either raw data, that are then summarized, or data that has already been summarized. Raw data are summarized into a one-way observed frequency table using xtabs()
with a formula of ~cvarRow
as described in Section 25.3.2. The results should be saved to an object for later use.
<- xtabs(~mode,data=sf) ) ( sf.obs
#R> mode
#R> beach boat charter pier
#R> 134 418 452 178
IF the raw data do NOT exist, but a summary table is provided, then those summary statistics must be entered into an object using c()
. Each item can be named in c()
as shown below, which will make the results easier to interpret. For example, the same observed table from above would be constructed as follows.
<- c(beach=134,boat=418,charter=452,pier=178) ) ( sf.obs
#R> beach boat charter pier
#R> 134 418 452 178
29.2.2 Expected Table
The R function to perform a Goodness-of-Fit test described in the next section requires a vector of expected proportions in each group defined by the theoretical distribution in the hypotheses. This vector does not actually have to be proportions, per se, but it must be values that are directly proportional to the expected proportions. Thus, the vector can contain the proportions, percentages, actual expected values, or any other set of values that are related to the expected proportions by a constant.
In this example, the theoretical distribution suggests that “two-thirds of the users will fish from a boat, split evenly between private and charter boats, while the other one-third will fish from land, also split evenly between those fishing on the beach and those from a pier.” With this, the expected proportions are \(\frac{1}{3}\) for a private boat, \(\frac{1}{3}\) for a charter boat, \(\frac{1}{6}\) for beach, and \(\frac{1}{6}\) for a pier. These expected proportions could be entered into a vector in any of the following ways, as each way differs only by a constant value.
<- c(beach=1/6,boat=1/3,charter=1/3,pier=1/6) #proportions
exp.p <- c(beach=16.67,boat=33.33,charter=33.33,pier=16.67) #percentages
exp.p <- c(beach=1,boat=2,charter=2,pier=1) #ratios
exp.p <- c(beach=197,boat=394,charter=394,pier=197) #expected values exp.p
Enter the expected proportions in the same order as the observed frequencies.
29.2.3 Goodness-of-Fit Test in R
The Goodness-of-Fit Test is computed with chisq.test()
with the observed frequency table as the first argument and the following arguments:
p=
: a vector of expected proportions for the levels of the theoretical distribution.rescale.p=TRUE
: rescales the values inp=
to sum to 1. Rescaling is useful if the proportions inp=
were rounded or are expected frequencies.correct=FALSE
: indicates to not use a “continuity correction.”
The results from chisq.test()
should be assigned to an object so that useful information can be extracted.
<- chisq.test(sf.obs,p=exp.p,rescale.p=TRUE,correct=FALSE) ) ( sf.gof
#R> Chi-squared test for given probabilities with sf.obs
#R> X-squared = 31.9797, df = 3, p-value = 5.285e-07
From the results above the χ2 test statistic for Step 7 is 31.980, and the p-value for Step 8 is 0.00000053 (with 3 df). The observed table for Step 6 was shown further above. The expected table for Step 5 is extracted from the chisq.test()
object as follows.
$expected sf.gof
#R> beach boat charter pier
#R> 197 394 394 197
Rejecting H0 in a Goodness-of-Fit test simply indicates that the observed distribution differs from the theoretical distribution, but it is not clear exactly where the difference occurs. The following code can be used to neatly put the observed values, expected values, and Pearson residuals next to each other, which may be helpful for explaining any differences that were observed.
cbind(obs=sf.gof$observed,exp=sf.gof$expected,resids=sf.gof$residuals)
#R> obs exp resids
#R> beach 134 197 -4.488564
#R> boat 418 394 1.209103
#R> charter 452 394 2.921998
#R> pier 178 197 -1.353694
In this case it is clear that those respondents that answered “beach” were much lower than expected and those that answered “charter” were greater than expected.
Another way to identify where differences occurred is to examine confidence intervals for the proportions of individuals in each level. These confidence intervals are extracted from the saved chisq.test()
object with gofCI()
.
gofCI(sf.gof,digits=3)
#R> p.obs p.LCI p.UCI p.exp
#R> beach 0.113 0.097 0.133 0.167
#R> boat 0.354 0.327 0.381 0.333
#R> charter 0.382 0.355 0.410 0.333
#R> pier 0.151 0.131 0.172 0.167
Again, the confidence interval for the proportion that responded “beach” is less than the expected proportion and the proportion that answered “charter” is greater than expected.
29.2.4 Example - Loggerhead Shrikes
Below are the 11-steps for completing a full hypothesis test (Section 17.1) for the following situation:
Bohall-Wood (1987) constructed 24 random 16-km transects along roads in counties near Gainesville, FL. Two observers censused each transect once every 2 weeks from 18 October 1981 to 30 October 1982, by driving 32 km/h and scanning both sides of the road for perched and flying shrikes (Lanius ludovicianus). The habitat, whether the bird was on the roadside or actually in the habitat, and the perch type were recorded for each shrike observed. Habitats were grouped into five categories. The number of shrikes observed in each habitat was 1456 in open areas, 43 in midsuccessional, 112 in scattered trees, 44 in woods, and 6 in wetlands. Separate analyses were used to construct the proportion of habitat available in each of the five habitat types. These results were as follows: 0.358 open, 0.047 midsuccessional, 0.060 scattered trees, 0.531 woods, and 0.004 wetlands. Use these data to determine, at the 5% level, if shrikes are using the habitat in proportion to its availability.
- α=0.05.
- H0: “distribution of habitat use by shrikes is the same as the proportions of available habitat” vs. HA: “distribution of habitat use by shrikes is NOT the same as the proportions of available habitat.”
- A Goodness-of-Fit Test is required because …
- a categorical variable was recorded (habitat use),
- a single group (or population) was considered (shrikes in this area), and
- the observed distribution is compared to a theoretical distribution.
- The data appear to be part of an observational study where the individuals were not randomly selected but the transects upon which they were observed were.
- There are more than five individuals expected in each habitat level (Table 29.5).
- The statistic is the observed frequency table in Table 29.5.
- χ2=2345.1 with 4 df (Table 29.6).
- p-value<0.00005 (Table 29.6).
- H0 is rejected because the p-value<α.
- The shrikes do not appear to use habitats in the same proportions as the availability of the habitat.
- The 95% confidence intervals for the proportion of use in each habitat level are in Table 29.7. From these results it appears that the shrikes use the “open” habitat much more often and the “woods” habitat mush less often than would be expected if they used all habitats in proportion to their availability.
Observed | Expected | Residuals | |
---|---|---|---|
Open | 1456 | 595 | 35.323 |
MidSucc | 43 | 78 | -3.969 |
ScatTree | 112 | 100 | 1.236 |
Woods | 6 | 882 | -29.496 |
Wetland | 44 | 7 | 14.493 |
X-squared = 2345.071, df = 4, p-value < 2.2e-16 |
p.obs | p.LCI | p.UCI | p.exp | |
---|---|---|---|---|
Open | 0.877 | 0.860 | 0.892 | 0.358 |
MidSucc | 0.026 | 0.019 | 0.035 | 0.047 |
ScatTree | 0.067 | 0.056 | 0.081 | 0.060 |
Woods | 0.004 | 0.002 | 0.008 | 0.531 |
Wetland | 0.026 | 0.020 | 0.035 | 0.004 |
R Appendix:
<- c(Open=1456,MidSucc=43,ScatTree=112,Woods=6,Wetland=44) )
( obs <- c(Open=0.358,MidSucc=0.047,ScatTree=0.060,Woods=0.531,Wetland=0.004) )
( p.exp <- chisq.test(obs,p=p.exp,rescale.p=TRUE) )
( shrike.chi cbind(obs=shrike.chi$observed,exp=shrike.chi$expected,resids=shrike.chi$residuals)
gofCI(shrike.chi,digits=3)
29.3 Generic R Code
The following generic codes were used in this module and are provided here so that you can efficiently copy and paste them into your assignment. Note the following:
dfobj
should be replaced with the name of your data frame.cvarPop
andcvarResp
should be replaced with the names of your categorical population/groups and response variables, respectively.NUM
is replaced with numeric values.
Also examine the “R Function Guide” on the class Resources page for more guidance.
Chi-Square Test
( obstbl <- xtabs(~cvarPop+cvarResp,data=dfobj) ) # observed freq table
( chi <- chisq.test(obstbl,correct=FALSE) ) # chi-square and p-value
chi$expected # expected frequencies
percTable(obstbl,margin=1) # row percentage table
Goodness-of-Fit Test
( obstbl <- xtabs(~cvarResp,data=dfobj) ) # observed freq table
( exp.p <- c(lvl1=NUM,lvl2=NUM,lvl3=NUM,...) ) # expected proportions
( gof <- chisq.test(obstbl,p=exp.p,rescale.p=TRUE,correct=FALSE) ) # chi & p
gof$expected # expected frequencies
gofCI(gof,digits=3) # CIs for p-hats
Methods in this module require the
NCStats
package (as always) and theggplot2
package (for making graphs). Both packages are loaded in the first code chunk of the assignment template.↩︎The continuity correction is not used here simply so that the results using R will match hand-calculations. The continuity correction should usually be used.↩︎
When “Plant A” is removed from the observed table, the Chi-Square Test performed on the remaining plant species showed no difference in the distribution of the remaining plants between the two locations (p=0.9239). Thus, most of the difference in plant distributions between Devil’s Island and the Mainland appears to be due primarily to “plant A” with more of “plant A” found on the Mainland than on Devil’s Island.↩︎