Module 29 Chi-Square in R

In Modules 20 and 21 you learned the theory and how to perform the calculations required to complete the 11 steps (Section 17.1) of a hypothesis test for chi-square and goodness-of-fit tests. In this module, you will learn how to perform the required calculations from raw data using R.111

29.1 Chi-Square Tests

29.1.1 Data formats

Consider the following example for this section.

Daniel Weiss (in “100% American”) reported the results of a survey of 300 first-time fathers from four different hospitals (labeled as A, B, C, and D). Each father was asked if he was present (or not) in the delivery room when his child was born. The results of the survey are in FatherPresent.csv (data,meta). Use these data to determine if there is a difference, at the 5% level, in the proportion of fathers present in the delivery room among the four hospitals.

The data are loaded and examined below.

fp <- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/FatherPresent.csv")
str(fp)
#R>  'data.frame':  300 obs. of  2 variables:
#R>   $ hospital: chr  "A" "A" "A" "A" ...
#R>   $ father  : chr  "Present" "Present" "Present" "Present" ...
headtail(fp)
#R>      hospital  father
#R>  1          A Present
#R>  2          A Present
#R>  3          A Present
#R>  298        D  Absent
#R>  299        D  Absent
#R>  300        D  Absent

The chi-square test calculations can be made from either raw data, that are then summarized, or data that has already been summarized. Raw data, as are given here, should be in stacked format (as described in Section 23.1.1) and then summarized to a two-way frequency table with xtabs() using a formula of the form ~cvarRow+cvarCol as described in Section 26.2.1. The result should be saved to an object for later use.

( fp.obs <- xtabs(~hospital+father,data=fp) )
#R>          father
#R>  hospital Absent Present
#R>         A      9      66
#R>         B     15      60
#R>         C     18      57
#R>         D     19      56

 

IF the raw data do NOT exist, but a summary table is provided, then those summary statistics must be entered into an object using matrix(). The summarized frequencies must first be entered into a vector with the first COLUMN of values followed by the second COLUMN and so on. This vector is then the first argument to matrix(), which will also include the number of columns in the frequency table in ncol=. For example, the same observed table from above would be constructed as follows.

fp.obs <- matrix(c(9,15,18,19,66,60,57,56),ncol=2)

It is useful to augment this matrix with column and row names as follows.

colnames(fp.obs) <- c("Absent","Present")
rownames(fp.obs) <- c("A","B","C","D")
fp.obs
#R>    Absent Present
#R>  A      9      66
#R>  B     15      60
#R>  C     18      57
#R>  D     19      56

The observed table must contain frequencies, not proportions or percentages (don’t use percTable()), without marginal totals (don’t use addMargins()).

29.1.2 Chi-Square test in R

The Chi-Square Test is performed with chisq.test(), which takes the observed frequency table as the first argument and correct=FALSE so that the continuity correction is not used.112 The results of chisq.test() should be assigned to an object. The Chi-Square test statistic and p-value are extracted by simply printing the saved object.

( fp.chi <- chisq.test(fp.obs,correct=FALSE) )
#R>  Pearson's Chi-squared test with fp.obs 
#R>  X-squared = 5.0003, df = 3, p-value = 0.1718

The expected frequency table is returned by appending $expected to the saved object.

fp.chi$expected
#R>    Absent Present
#R>  A  15.25   59.75
#R>  B  15.25   59.75
#R>  C  15.25   59.75
#R>  D  15.25   59.75

From the results above the expected table for Step 5 is shown at the end, the observed table for Step 6 was shown further above, the χ2 test statistic for Step 7 is 5.000, and the p-value for Step 8 is 0.172 (with 3 df).

Rejecting the null hypothesis in a Chi-Square Test indicates that there is some difference in the distribution of individuals into the levels of the response variable among some of the groups. A post-hoc method for helping determine which groups differ is obtained by observing the Pearson residuals. A Pearson residual is computed for each cell in the table as,

\[ \frac{\text{Observed-Expected}}{\sqrt{\text{Expected}}} \]

which is the appropriately signed square root of the parts in the χ2 test statistic calculation. Therefore, cells that have Pearson residuals far from zero contributed substantially to the large χ2 test statistic that resulted in a small p-value and the ultimate rejection of H0. Patterns in where the large Pearson residuals are found may allow one to qualitatively determine which groups differ and, thus, which levels of the response differ the most. This process will be illustrated more fully in the examples. However, the Pearson residuals are obtained from the saved chisq.test() object by appending $residuals.

fp.chi$residuals
#R>         Absent     Present
#R>  A -1.60046100  0.80855778
#R>  B -0.06401844  0.03234231
#R>  C  0.70420284 -0.35576543
#R>  D  0.96027660 -0.48513467

These residuals are all relatively small which is not surprising given that the H0 was not rejected.

 

29.1.3 Example - Apostle Islands Plants

Below are the 11-steps for completing a full hypothesis test (Section 17.1) for the following situation:

In her Senior Capstone project a Northland College student recorded the dominant (i.e., most abundant) plant species in 100 randomly selected plots on both Devil’s Island and the Bayfield Peninsula (i.e., the mainland). There were a total of six “species” (one group was called “other”) recorded (labeled as A, B, C, D, E, and F). The results are shown in the table below. Determine, at the 5% level, if the frequency of dominant species differs between the two locations.

A B C D E F Sum
Devil’s Is 34 22 14 13 12 5 100
Mainland 62 12 8 7 6 5 100
Sum 96 34 22 20 18 10 200

 

As these data are given in summarized format they were entered into a matrix as described above. The entire R code for this analysis is shown in the “R Code and Results” subsection below.

  1. α = 0.05.
  2. H0: “the distribution of dominant plants species is the same between Devil’s Island and the Mainland” vs. HA: “the distribution of dominant plants species is NOT the same between Devil’s Island and the Mainland.”
  3. A Chi-Square Test is required because …
    1. a categorical variable with six levels (plant species) was recorded and
    2. two groups are being compared (Devil’s Island and Mainland).
  4. The data appear to be part of an observational study (the two groups existed and were not created by the researcher) where the plots were randomly selected.
  5. There are more than five individuals in each cell of the expected table (Table 29.1).
  6. The statistic is the observed frequency table given in the background information.
  7. χ2=16.54 with 5 df (Table 29.2).
  8. p-value=0.0055 (Table 29.2).
  9. H0 is rejected because the p-value is < α.
  10. There does appear to be a significant difference in the distribution of the dominant plants between the two sites. A look at the Pearson residuals (Table 29.3) and the row-percentage table (Table 29.4) both suggest that the biggest difference between the two locations is due to “plant A.”113
Table 29.1: Expected frequency table for dominant plant species on Devil’s Island and the Bayfield Peninsula.
A B C D E F Sum
Devil’s Is 48 17 11 10 9 5 100
Mainland 48 17 11 10 9 5 100
Sum 96 34 22 20 18 10 200

 

Table 29.2: Results from the Chi-Square Test for differences in the distribution of dominant plant species between Devil’s Island and the Bayfield Peninsula.
X-squared = 16.5442, df = 5, p-value = 0.00545

 

Table 29.3: Pearson residuals from the Chi-Square Test for differences in the distribution of dominant plant species between Devil’s Island and Bayfield Peninsula.
A B C D E F
Devil’s Is -2.021 1.213 0.905 0.949 1 0
Mainland 2.021 -1.213 -0.905 -0.949 -1 0

 

Table 29.4: Percentage of dominant plant species within each location (Devil’s Island and Bayfield Peninsula)
A B C D E F Sum
Devil’s Is 34 22 14 13 12 5 100
Mainland 62 12 8 7 6 5 100

R Appendix:

freq <- c(34,22,14,13,12,5,62,12,8,7,6,5)
ai.obs <- matrix(freq,nrow=2,byrow=TRUE)
rownames(ai.obs) <- c("Devil's Is","Mainland")
colnames(ai.obs) <- c("A","B","C","D","E","F")
( ai.chi <- chisq.test(ai.obs) )
ai.chi$expected
ai.chi$residuals
percTable(ai.obs,margin=1,digits=1)
ai.obs1 <- ai.obs[,-1]
( ai.chi1 <- chisq.test(ai.obs1) )

 

29.2 Goodness-of-Fit Test in R

29.2.1 Data Format

Consider the following example for this section.

Herriges and King (1999) examined modes of fishing for a large number of recreational saltwater users in southern California. One of the questions asked in their Southern California Sportfishing Survey was what “mode” they used for fishing – “from the beach,” “from a fishing pier,” “on a private boat,” or “on a chartered boat.” The results to this question, along with other data not used here, are found in FishingModes.csv (data, meta). One hypothesis of interest states that two-thirds of the users will fish from a boat, split evenly between private and charter boats, while the other one-third will fish from land, also split evenly between those fishing on the beach and those from a pier. Use the data in the mode variable of the data file to determine if this hypothesis is supported at the 10% level.

The data are loaded and examined below (the headtail() only shows a few variables).

sf <- read.csv("https://raw.githubusercontent.com/droglenc/NCData/master/FishingModes.csv")
str(sf)
#R>  'data.frame':  1182 obs. of  12 variables:
#R>   $ mode    : chr  "charter" "charter" "boat" "pier" ...
#R>   $ price   : num  182.9 34.5 24.3 15.1 41.5 ...
#R>   $ catch   : num  0.5391 0.4671 0.2413 0.0789 0.1082 ...
#R>   $ pbeach  : num  157.9 15.1 161.9 15.1 106.9 ...
#R>   $ ppier   : num  157.9 15.1 161.9 15.1 106.9 ...
#R>   $ pboat   : num  157.9 10.5 24.3 55.9 41.5 ...
#R>   $ pcharter: num  182.9 34.5 59.3 84.9 71 ...
#R>   $ cbeach  : num  0.0678 0.1049 0.5333 0.0678 0.0678 ...
#R>   $ cpier   : num  0.0503 0.0451 0.4522 0.0789 0.0503 ...
#R>   $ cboat   : num  0.26 0.157 0.241 0.164 0.108 ...
#R>   $ ccharter: num  0.539 0.467 1.027 0.539 0.324 ...
#R>   $ income  : num  7083 1250 3750 2083 4583 ...
headtail(sf,which=c("mode","price","catch"))
#R>          mode   price  catch
#R>  1    charter 182.930 0.5391
#R>  2    charter  34.534 0.4671
#R>  3       boat  24.334 0.2413
#R>  1180    pier  65.036 0.4522
#R>  1181   beach  36.636 0.5333
#R>  1182    boat 235.436 0.6817

As with the Chi-Square Test, the Goodness-of-Fit test can be made from either raw data, that are then summarized, or data that has already been summarized. Raw data are summarized into a one-way observed frequency table using xtabs() with a formula of ~cvarRow as described in Section 25.3.2. The results should be saved to an object for later use.

( sf.obs <- xtabs(~mode,data=sf) )
#R>  mode
#R>    beach    boat charter    pier 
#R>      134     418     452     178

 

IF the raw data do NOT exist, but a summary table is provided, then those summary statistics must be entered into an object using c(). Each item can be named in c() as shown below, which will make the results easier to interpret. For example, the same observed table from above would be constructed as follows.

( sf.obs <- c(beach=134,boat=418,charter=452,pier=178) )
#R>    beach    boat charter    pier 
#R>      134     418     452     178

 

29.2.2 Expected Table

The R function to perform a Goodness-of-Fit test described in the next section requires a vector of expected proportions in each group defined by the theoretical distribution in the hypotheses. This vector does not actually have to be proportions, per se, but it must be values that are directly proportional to the expected proportions. Thus, the vector can contain the proportions, percentages, actual expected values, or any other set of values that are related to the expected proportions by a constant.

In this example, the theoretical distribution suggests that “two-thirds of the users will fish from a boat, split evenly between private and charter boats, while the other one-third will fish from land, also split evenly between those fishing on the beach and those from a pier.” With this, the expected proportions are \(\frac{1}{3}\) for a private boat, \(\frac{1}{3}\) for a charter boat, \(\frac{1}{6}\) for beach, and \(\frac{1}{6}\) for a pier. These expected proportions could be entered into a vector in any of the following ways, as each way differs only by a constant value.

exp.p <- c(beach=1/6,boat=1/3,charter=1/3,pier=1/6)          #proportions
exp.p <- c(beach=16.67,boat=33.33,charter=33.33,pier=16.67)  #percentages
exp.p <- c(beach=1,boat=2,charter=2,pier=1)                  #ratios
exp.p <- c(beach=197,boat=394,charter=394,pier=197)          #expected values

Enter the expected proportions in the same order as the observed frequencies.

29.2.3 Goodness-of-Fit Test in R

The Goodness-of-Fit Test is computed with chisq.test() with the observed frequency table as the first argument and the following arguments:

  • p=: a vector of expected proportions for the levels of the theoretical distribution.
  • rescale.p=TRUE: rescales the values in p= to sum to 1. Rescaling is useful if the proportions in p= were rounded or are expected frequencies.
  • correct=FALSE: indicates to not use a “continuity correction.”

The results from chisq.test() should be assigned to an object so that useful information can be extracted.

( sf.gof <- chisq.test(sf.obs,p=exp.p,rescale.p=TRUE,correct=FALSE) )
#R>  Chi-squared test for given probabilities with sf.obs 
#R>  X-squared = 31.9797, df = 3, p-value = 5.285e-07

From the results above the χ2 test statistic for Step 7 is 31.980, and the p-value for Step 8 is 0.00000053 (with 3 df). The observed table for Step 6 was shown further above. The expected table for Step 5 is extracted from the chisq.test() object as follows.

sf.gof$expected
#R>    beach    boat charter    pier 
#R>      197     394     394     197

Rejecting H0 in a Goodness-of-Fit test simply indicates that the observed distribution differs from the theoretical distribution, but it is not clear exactly where the difference occurs. The following code can be used to neatly put the observed values, expected values, and Pearson residuals next to each other, which may be helpful for explaining any differences that were observed.

cbind(obs=sf.gof$observed,exp=sf.gof$expected,resids=sf.gof$residuals)
#R>          obs exp    resids
#R>  beach   134 197 -4.488564
#R>  boat    418 394  1.209103
#R>  charter 452 394  2.921998
#R>  pier    178 197 -1.353694

In this case it is clear that those respondents that answered “beach” were much lower than expected and those that answered “charter” were greater than expected.

Another way to identify where differences occurred is to examine confidence intervals for the proportions of individuals in each level. These confidence intervals are extracted from the saved chisq.test() object with gofCI().

gofCI(sf.gof,digits=3)
#R>          p.obs p.LCI p.UCI p.exp
#R>  beach   0.113 0.097 0.133 0.167
#R>  boat    0.354 0.327 0.381 0.333
#R>  charter 0.382 0.355 0.410 0.333
#R>  pier    0.151 0.131 0.172 0.167

Again, the confidence interval for the proportion that responded “beach” is less than the expected proportion and the proportion that answered “charter” is greater than expected.

 

29.2.4 Example - Loggerhead Shrikes

Below are the 11-steps for completing a full hypothesis test (Section 17.1) for the following situation:

Bohall-Wood (1987) constructed 24 random 16-km transects along roads in counties near Gainesville, FL. Two observers censused each transect once every 2 weeks from 18 October 1981 to 30 October 1982, by driving 32 km/h and scanning both sides of the road for perched and flying shrikes (Lanius ludovicianus). The habitat, whether the bird was on the roadside or actually in the habitat, and the perch type were recorded for each shrike observed. Habitats were grouped into five categories. The number of shrikes observed in each habitat was 1456 in open areas, 43 in midsuccessional, 112 in scattered trees, 44 in woods, and 6 in wetlands. Separate analyses were used to construct the proportion of habitat available in each of the five habitat types. These results were as follows: 0.358 open, 0.047 midsuccessional, 0.060 scattered trees, 0.531 woods, and 0.004 wetlands. Use these data to determine, at the 5% level, if shrikes are using the habitat in proportion to its availability.

  1. α=0.05.
  2. H0: “distribution of habitat use by shrikes is the same as the proportions of available habitat” vs. HA: “distribution of habitat use by shrikes is NOT the same as the proportions of available habitat.”
  3. A Goodness-of-Fit Test is required because …
    1. a categorical variable was recorded (habitat use),
    2. a single group (or population) was considered (shrikes in this area), and
    3. the observed distribution is compared to a theoretical distribution.
  4. The data appear to be part of an observational study where the individuals were not randomly selected but the transects upon which they were observed were.
  5. There are more than five individuals expected in each habitat level (Table 29.5).
  6. The statistic is the observed frequency table in Table 29.5.
  7. χ2=2345.1 with 4 df (Table 29.6).
  8. p-value<0.00005 (Table 29.6).
  9. H0 is rejected because the p-value<α.
  10. The shrikes do not appear to use habitats in the same proportions as the availability of the habitat.
  11. The 95% confidence intervals for the proportion of use in each habitat level are in Table 29.7. From these results it appears that the shrikes use the “open” habitat much more often and the “woods” habitat mush less often than would be expected if they used all habitats in proportion to their availability.
Table 29.5: Observed and expected frequencies and Pearson residuals for the Goodness-of-Fit Test for shrike habitat use.
Observed Expected Residuals
Open 1456 595 35.323
MidSucc 43 78 -3.969
ScatTree 112 100 1.236
Woods 6 882 -29.496
Wetland 44 7 14.493

 

Table 29.6: Results from the Goodness-of-Fit Test for shrike habitat use.
X-squared = 2345.071, df = 4, p-value < 2.2e-16

 

Table 29.7: Observed proportions, 95% condidence intervals for the proprtions, and expected proportions for shrike habitat use.
p.obs p.LCI p.UCI p.exp
Open 0.877 0.860 0.892 0.358
MidSucc 0.026 0.019 0.035 0.047
ScatTree 0.067 0.056 0.081 0.060
Woods 0.004 0.002 0.008 0.531
Wetland 0.026 0.020 0.035 0.004

R Appendix:

( obs <- c(Open=1456,MidSucc=43,ScatTree=112,Woods=6,Wetland=44) )
( p.exp <- c(Open=0.358,MidSucc=0.047,ScatTree=0.060,Woods=0.531,Wetland=0.004) )
( shrike.chi <- chisq.test(obs,p=p.exp,rescale.p=TRUE) )
cbind(obs=shrike.chi$observed,exp=shrike.chi$expected,resids=shrike.chi$residuals)
gofCI(shrike.chi,digits=3)

 

29.3 Generic R Code

The following generic codes were used in this module and are provided here so that you can efficiently copy and paste them into your assignment. Note the following:

  • dfobj should be replaced with the name of your data frame.
  • cvarPop and cvarResp should be replaced with the names of your categorical population/groups and response variables, respectively.
  • NUM is replaced with numeric values.

Also examine the “R Function Guide” on the class Resources page for more guidance.

Chi-Square Test

( obstbl <- xtabs(~cvarPop+cvarResp,data=dfobj) )  # observed freq table
( chi <- chisq.test(obstbl,correct=FALSE) )        # chi-square and p-value
chi$expected                                       # expected frequencies
percTable(obstbl,margin=1)                         # row percentage table

Goodness-of-Fit Test

( obstbl <- xtabs(~cvarResp,data=dfobj) )          # observed freq table
( exp.p <- c(lvl1=NUM,lvl2=NUM,lvl3=NUM,...) )     # expected proportions
( gof <- chisq.test(obstbl,p=exp.p,rescale.p=TRUE,correct=FALSE) )  # chi & p
gof$expected                                       # expected frequencies
gofCI(gof,digits=3)                                # CIs for p-hats

 


  1. Methods in this module require the NCStats package (as always) and the ggplot2 package (for making graphs). Both packages are loaded in the first code chunk of the assignment template.↩︎

  2. The continuity correction is not used here simply so that the results using R will match hand-calculations. The continuity correction should usually be used.↩︎

  3. When “Plant A” is removed from the observed table, the Chi-Square Test performed on the remaining plant species showed no difference in the distribution of the remaining plants between the two locations (p=0.9239). Thus, most of the difference in plant distributions between Devil’s Island and the Mainland appears to be due primarily to “plant A” with more of “plant A” found on the Mainland than on Devil’s Island.↩︎