Module 10 Bivariate EDA - Categorical

In this module we consider the relationship between two categorical variables. For example, the General Social Survey (GSS) is a very large survey that has been administered 25 times since 1972. The purpose of the GSS is to gather data on contemporary American society in order to monitor and explain trends in attitudes, behaviors, and attributes. Two questions from a recent GSS are:

  1. What is your highest degree earned? [choices – “less than high school diploma,” “high school diploma,” “junior college,” “bachelors,” or “graduate”; labeled as degree]
  2. How willing would you be to accept cuts in your standard of living in order to protect the environment? [choices – “very willing,” “fairly willing,” “neither willing nor unwilling,” “not very willing,” or “not at all willing”; labeled as grnsol]

An example of these data is shown below.

degree  grnsol
ltHS    vwill
HS      will
HS      unwill
JC      vunwill
grad    vunwill
.       .
.       .

 

These types of data are summarized with two-way frequency tables as shown in Section 10.1 and percentage tables as shown in Section 10.2. Specific questions can be answered from these tables as described in Section 10.3 and illustrated in Section 10.4.

10.1 Frequency Tables

Bivariate categorical data are summarized by counting the number of individuals that have each combination of levels of the two categorical variables. For example, one would count how many respondents had less than a high school degree and were very willing, how many had a high school degree and were willing, and so on.
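The counting step can be sketched in a few lines of Python. This is only an illustration, not part of the module's own workflow: the choice of Python and the names `observations` and `frequencies` are assumptions, and the data are just the five example rows shown earlier.

```python
from collections import Counter

# The five example observations shown earlier, as (degree, grnsol) pairs.
observations = [
    ("ltHS", "vwill"),
    ("HS", "will"),
    ("HS", "unwill"),
    ("JC", "vunwill"),
    ("grad", "vunwill"),
]

# Count how many individuals have each combination of the two variables.
frequencies = Counter(observations)

for (degree, grnsol), count in sorted(frequencies.items()):
    print(f"{degree:<5} {grnsol:<8} {count}")
```

With only five rows every observed combination has a count of 1; applied to the full survey, the same counting would produce the cells of a two-way frequency table.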

The number of individuals of each combination is called a frequency and the frequencies for all combinations are displayed in a two-way frequency table (Table 10.1). For example, 40 of the respondents had less than a high school degree and were very willing to take a cut in their standard of living to protect the environment. Similarly, 542 respondents had a high school degree and were willing to cut their standard of living.

 

Table 10.1: Frequency table of respondent’s highest completed degree (rows) and willingness to cut their standard of living to protect the environment (columns).
        vwill   will  neither  unwill  vunwill    Sum
ltHS       40    145      132     151      178    646
HS         87    542      512     557      392   2090
JC         15     61       64      54       44    238
BS         42    199      179     187       75    682
grad       24    104       83      64       24    299
Sum       208   1051      970    1013      713   3955

A two-way frequency table may be augmented by including row and column totals (as in Table 10.1). The marginal totals for a variable represent the distribution of that categorical variable while ignoring the other variable. For example, there were 238 respondents whose highest completed degree was junior college, and there were 713 respondents who were very unwilling to cut their standard of living to protect the environment.
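As a rough sketch, the marginal totals can be computed from the cell counts of Table 10.1. The representation below (a dictionary of rows named `freq` and a list of column labels named `levels`) is an assumption made for illustration only.

```python
# Cell counts from Table 10.1; columns follow the order in `levels`.
levels = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Row totals: the distribution of degree, ignoring willingness.
row_totals = {degree: sum(counts) for degree, counts in freq.items()}

# Column totals: the distribution of willingness, ignoring degree.
col_totals = {
    level: sum(column) for level, column in zip(levels, zip(*freq.values()))
}

grand_total = sum(row_totals.values())

print(row_totals["JC"])       # 238
print(col_totals["vunwill"])  # 713
print(grand_total)            # 3955
```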

  • Rows are horizontal (left-to-right) and columns are vertical (top-to-bottom).

If one variable can be considered as the response, then this variable should form the columns of the frequency table. For example, “willingness to cut” could be considered the response variable and it was, appropriately, placed as the column variable in Table 10.1.

  • If one of the two variables can be considered a response variable, then it should be placed in the columns.

10.2 Percentage Tables

Two-way frequency tables may be converted to percentage tables to make comparisons between levels of the response variable or between populations easier. For example, it is difficult to determine from Table 10.1 if respondents with a high school degree were more likely to be very willing to cut their standard of living than respondents with a graduate degree, because there are approximately seven times as many respondents with a high school degree. However, once the frequencies are converted to percentages, this comparison is easily made. Three types of percentage tables may be constructed from a frequency table.

10.2.1 Total-Percentage Table

Each value in a total-percentage table is computed by dividing each cell of the frequency table by the total number of ALL individuals in the frequency table and multiplying by 100. For example, the value in the “vwill” column and “ltHS” row of the total-percentage table (Table 10.2) is computed by dividing the value in the “vwill” column and “ltHS” row of the frequency table (i.e., 40; Table 10.1) by the “Sum” of the entire frequency table (i.e., 3955) and multiplying by 100.

 

Table 10.2: Total-percentage table of respondent’s highest completed degree (rows) and willingness to cut their standard of living to protect the environment (columns).
        vwill   will  neither  unwill  vunwill    Sum
ltHS      1.0    3.7      3.3     3.8      4.5   16.3
HS        2.2   13.7     12.9    14.1      9.9   52.8
JC        0.4    1.5      1.6     1.4      1.1    6.0
BS        1.1    5.0      4.5     4.7      1.9   17.2
grad      0.6    2.6      2.1     1.6      0.6    7.5
Sum       5.3   26.5     24.4    25.6     18.0   99.8

 

The value in each cell of a total-percentage table is the percentage OF ALL individuals that have the characteristic of that column AND that row. For example, 1.0% of ALL respondents had less than a high school degree AND were very willing to cut their standard of living to protect the environment. In contrast to the interpretations of the row- and column-percentage tables below, interpretations from the total-percentage table DO refer to ALL individuals.
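A minimal Python sketch of the total-percentage calculation follows, using the counts from Table 10.1. The variable names are illustrative assumptions. The last lines also check one plausible reading of the 99.8 in the “Sum” corner of Table 10.2, namely that it is the total of the already-rounded cells; the unrounded percentages sum to 100.

```python
# Cell counts from Table 10.1 (columns: vwill, will, neither, unwill, vunwill).
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Total number of ALL individuals in the table (3955).
grand_total = sum(sum(counts) for counts in freq.values())

# Divide every cell by the grand total and multiply by 100.
total_pct = {
    degree: [100 * count / grand_total for count in counts]
    for degree, counts in freq.items()
}

print(f"{total_pct['ltHS'][0]:.1f}")  # 1.0 (ltHS AND vwill, out of everyone)

# Unrounded cells sum to 100; cells rounded to one decimal sum to 99.8.
unrounded_sum = sum(sum(row) for row in total_pct.values())
rounded_sum = sum(round(p, 1) for row in total_pct.values() for p in row)
print(f"{unrounded_sum:.1f}")  # 100.0
print(f"{rounded_sum:.1f}")    # 99.8
```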

10.2.2 Row-Percentage Table

A row-percentage table is computed by dividing each cell of the frequency table by the total in the same ROW of the frequency table and multiplying by 100 (Table 10.3). For example, the value in the “vwill” column and “ltHS” row of the row-percentage table is computed by dividing the value in the “vwill” column and “ltHS” row of the frequency table (i.e., 40; Table 10.1) by the “Sum” of the “ltHS” ROW of the frequency table (i.e., 646) and multiplying by 100.

 

Table 10.3: Row-percentage table of respondent’s highest completed degree (rows) and willingness to cut their standard of living to protect the environment (columns).
        vwill   will  neither  unwill  vunwill    Sum
ltHS      6.2   22.4     20.4    23.4     27.6  100.0
HS        4.2   25.9     24.5    26.7     18.8  100.1
JC        6.3   25.6     26.9    22.7     18.5  100.0
BS        6.2   29.2     26.2    27.4     11.0  100.0
grad      8.0   34.8     27.8    21.4      8.0  100.0

 

The value in each cell of a row-percentage table is the percentage OF INDIVIDUALS IN THAT ROW that have the characteristic of that column. For example, 6.2% OF RESPONDENTS WITH LESS THAN A HIGH SCHOOL DEGREE were very willing to cut their standard of living to protect the environment.[1]
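The row-percentage calculation can be sketched the same way: each cell is divided by the total of its own row. The Python below is an illustrative assumption about how one might do this by hand, using the counts from Table 10.1.

```python
# Cell counts from Table 10.1 (columns: vwill, will, neither, unwill, vunwill).
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Divide each cell by the total of its ROW and multiply by 100.
row_pct = {
    degree: [100 * count / sum(counts) for count in counts]
    for degree, counts in freq.items()
}

# Of respondents with less than a high school degree, the percent very willing.
print(f"{row_pct['ltHS'][0]:.1f}")  # 6.2

# Before rounding, every row sums to 100.
for degree, row in row_pct.items():
    print(degree, f"{sum(row):.1f}")
```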

10.2.3 Column-Percentage Table

A column-percentage table is computed by dividing each cell of the frequency table by the total in the same COLUMN of the frequency table and multiplying by 100 (Table 10.4). For example, the value in the “vwill” column and “ltHS” row on the column-percentage table is computed by dividing the value in the “vwill” column and “ltHS” row of the frequency table (i.e., 40; Table 10.1) by the “Sum” of the “vwill” COLUMN of the frequency table (i.e., 208) and multiplying by 100.

 

Table 10.4: Column-percentage table of respondent’s highest completed degree (rows) and willingness to cut their standard of living to protect the environment (columns).
        vwill   will  neither  unwill  vunwill
ltHS     19.2   13.8     13.6    14.9     25.0
HS       41.8   51.6     52.8    55.0     55.0
JC        7.2    5.8      6.6     5.3      6.2
BS       20.2   18.9     18.5    18.5     10.5
grad     11.5    9.9      8.6     6.3      3.4
Sum      99.9  100.0    100.1   100.0    100.1

 

The value in each cell of a column-percentage table is the percentage OF INDIVIDUALS IN THAT COLUMN that have the characteristic of that row. For example, 19.2% OF RESPONDENTS WHO WERE VERY WILLING TO CUT THEIR STANDARD OF LIVING had less than a high school degree.[2]
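For the column-percentage table the divisor changes to the column total. A hedged Python sketch, again with assumed variable names and the counts from Table 10.1:

```python
# Cell counts from Table 10.1 (columns: vwill, will, neither, unwill, vunwill).
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Column totals, in the same order as the columns: [208, 1051, 970, 1013, 713].
col_totals = [sum(column) for column in zip(*freq.values())]

# Divide each cell by the total of its COLUMN and multiply by 100.
col_pct = {
    degree: [100 * count / total for count, total in zip(counts, col_totals)]
    for degree, counts in freq.items()
}

# Of respondents who were very willing, the percent with less than a HS degree.
print(f"{col_pct['ltHS'][0]:.1f}")  # 19.2
```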

10.3 Which Table to Use?

If the question asks for a number of individuals then use the frequency table. For example, a frequency table holds the answer to the question “How many respondents with a graduate degree were (only) willing to cut their standard of living to protect the environment?” (i.e., 104; Table 10.1).

If the question asks for a percentage of ALL individuals then use the total-percentage table. For example, the total-percentage table is used to answer “What percentage of all respondents had a high school degree and were very willing to cut their standard of living?” (i.e., 2.2%; Table 10.2).

If the question asks for a percentage but the percentage is not of ALL individuals, then use either the row- or column-percentage table, depending on which group the question is focused on. If the question is focused on a group represented by a ROW then use the row-percentage table. For example, the question – “What percentage of respondents with a bachelor’s degree were very unwilling to cut their standard of living to protect the environment?” – is focused only on respondents with a bachelor’s degree. Because bachelor’s degrees are shown in a row, the row-percentage table would be used to answer this question (i.e., 11.0%; Table 10.3).

In contrast, the question “What percentage of respondents who were neither willing nor unwilling to cut their standard of living had graduate degrees?” is focused on those respondents who were neither willing nor unwilling to cut their standard of living, a group that is represented by a COLUMN. Thus, the column-percentage table is used to answer this question (i.e., 8.6%; Table 10.4).

Finally, consider this question – “What percentage of all respondents were very willing to cut their standard of living to help the environment?” This question has no restrictions, so the total-percentage table would be used. In addition, this question is concerned only with the COLUMN variable and, thus, the answer will come from the “Sum” ROW. Therefore, 5.3% of all respondents were very willing to cut their standard of living to help the environment (Table 10.2).

  • If the question is about a number of individuals then use the frequency table.
  • If the question is about the percentage of ALL individuals then use the total-percentage table.
  • If the question is about a percentage of only some individuals then determine if the individuals in question are shown in a row (i.e., use the row percentage table) or a column (i.e., use the column percentage table).
  • If the question does not refer to one of the two variables, then the answer will generally come from the margin (the “Sum” row or column) of the other variable.
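These rules amount to choosing the right denominator for a cell. The sketch below encodes that choice in a small hypothetical helper, `lookup` (a name and interface invented here for illustration), and applies it to the questions discussed in this section using the counts from Table 10.1.

```python
levels = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}


def lookup(row, col, of="count"):
    """Return a cell as a count or as a percentage of 'all', 'row', or 'column'."""
    j = levels.index(col)
    cell = freq[row][j]
    if of == "count":      # question asks for a number of individuals
        return cell
    if of == "all":        # percentage of ALL individuals
        denominator = sum(sum(counts) for counts in freq.values())
    elif of == "row":      # percentage of the group shown in a row
        denominator = sum(freq[row])
    elif of == "column":   # percentage of the group shown in a column
        denominator = sum(counts[j] for counts in freq.values())
    else:
        raise ValueError(f"unknown denominator: {of!r}")
    return 100 * cell / denominator


print(lookup("grad", "will"))                        # 104
print(f"{lookup('HS', 'vwill', 'all'):.1f}")         # 2.2
print(f"{lookup('BS', 'vunwill', 'row'):.1f}")       # 11.0
print(f"{lookup('grad', 'neither', 'column'):.1f}")  # 8.6

# A question about only one variable is answered from a margin:
# percentage of all respondents who were very willing.
vwill_total = sum(counts[0] for counts in freq.values())
grand_total = sum(sum(counts) for counts in freq.values())
print(f"{100 * vwill_total / grand_total:.1f}")      # 5.3
```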

 

10.4 Example Calculations

In early 2021, the United States was deeply divided with respect to political allegiances and was just beginning to combat the COVID-19 virus with three different vaccines. The development of the vaccines had begun during Donald Trump’s presidency, but administration of the vaccines had largely been carried out during the first months of Joseph Biden’s presidency. In March 2021, YouGov asked a sample of U.S. adults “Thinking about the vaccine rollout in the US, do you believe the Trump administration or the Biden administration deserves more credit?” The results for those respondents who identified a political affiliation and had an opinion about which administration deserved credit are shown in Table 10.5.

Table 10.5: Frequency table of respondents’ political affiliation (rows) and which presidential administration they think deserves more credit for the COVID-19 vaccine rollout (columns).
               Trump  Biden  Both  Neither   Sum
Democrat         249   2023   436      187  2895
Republican      2013    230   316      144  2703
Independent      323    313   127      108   871
Sum             2585   2566   879      439  6469

Use these results to answer the following questions. [Note that I have highlighted key phrases in the questions that are useful for determining which table to use. These phrases will not be highlighted in the exercises.]

  1. What percent of “Democrats” believe that Donald Trump’s administration deserves credit for the COVID-19 vaccine rollout?
    • Of the 2895 “Democrats” in the sample, 8.6% believed that Donald Trump’s administration deserved credit for the COVID-19 vaccine rollout (= \(\frac{249}{2895} \times 100\)).
  2. What percent of “Republicans” believe that Joseph Biden’s administration deserves credit for the COVID-19 vaccine rollout?
    • Of the 2703 “Republicans” in the sample, 8.5% believed that Joseph Biden’s administration deserved credit for the COVID-19 vaccine rollout (= \(\frac{230}{2703} \times 100\)).
  3. What percent of those that believe that neither administration deserves credit for the COVID-19 vaccine rollout were “Independents”?
    • Of the 439 respondents that believed neither administration deserved credit for the COVID-19 vaccine rollout, 24.6% were “Independents” (= \(\frac{108}{439} \times 100\)).
  4. What percent of all respondents were “Independents”?
    • Of the 6469 respondents in the sample, 13.5% were “Independents” (= \(\frac{871}{6469} \times 100\)).
  5. What percent of all respondents were “Democrats” and believed that both administrations deserved credit for the COVID-19 vaccine rollout?
    • Of the 6469 respondents in the sample, 6.7% were “Democrats” and believed that both administrations deserved credit for the COVID-19 vaccine rollout (= \(\frac{436}{6469} \times 100\)).
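The five answers above can be checked with a short Python sketch built on the counts in Table 10.5. The variable names (`q1` through `q5`, `neither_total`, and so on) are assumptions made for this illustration.

```python
# Cell counts from Table 10.5 (columns: Trump, Biden, Both, Neither).
freq = {
    "Democrat":    [249, 2023, 436, 187],
    "Republican":  [2013, 230, 316, 144],
    "Independent": [323, 313, 127, 108],
}
TRUMP, BIDEN, BOTH, NEITHER = range(4)

grand_total = sum(sum(counts) for counts in freq.values())        # 6469
neither_total = sum(counts[NEITHER] for counts in freq.values())  # 439

# 1. Focus on a ROW group (Democrats): divide by the row total.
q1 = 100 * freq["Democrat"][TRUMP] / sum(freq["Democrat"])
# 2. Focus on a ROW group (Republicans): divide by the row total.
q2 = 100 * freq["Republican"][BIDEN] / sum(freq["Republican"])
# 3. Focus on a COLUMN group ("Neither"): divide by the column total.
q3 = 100 * freq["Independent"][NEITHER] / neither_total
# 4. All respondents, one variable only: use the row margin.
q4 = 100 * sum(freq["Independent"]) / grand_total
# 5. All respondents, both variables: divide the cell by the grand total.
q5 = 100 * freq["Democrat"][BOTH] / grand_total

for number, answer in enumerate([q1, q2, q3, q4, q5], start=1):
    print(f"Question {number}: {answer:.1f}%")
```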

 

10.5 Making an Overall Summary

An overall summary for a categorical bivariate EDA can be constructed by describing how the percentage of individuals in the response levels differs across the groups of the explanatory variable. If you followed the recommendation of having the response variable in the columns then this simplifies to describing how the ROW percentages differ across the rows.

For example, from Table 10.3 it is evident that there was a general increase in the percentage of respondents that were willing (either “willing” or “very willing”) to cut their standard of living to protect the environment as the level of education increased. In other words, it appears that more formally educated individuals were more willing to sacrifice their standard of living to protect the environment.
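One way to make this trend concrete is to combine the “very willing” and “willing” columns within each row. The sketch below does so directly from the counts in Table 10.1, rather than by adding the rounded percentages in Table 10.3, so results may differ from such sums in the last decimal. The name `willing_pct` is an assumption for illustration.

```python
# Cell counts from Table 10.1 (columns: vwill, will, neither, unwill, vunwill),
# with rows listed from lowest to highest degree.
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Row percentage of respondents who were "very willing" or "willing".
willing_pct = {
    degree: 100 * (counts[0] + counts[1]) / sum(counts)
    for degree, counts in freq.items()
}

for degree, pct in willing_pct.items():
    print(f"{degree:<5} {pct:.1f}")
```

The combined percentage rises at each step from “ltHS” to “grad”, which is the pattern the summary describes.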

As another example, it is clear that most respondents feel that the administration that aligns with their party affiliation deserves more of the credit for the COVID-19 vaccine rollout; i.e., most Democrats feel that Biden deserves more credit, whereas most Republicans feel that Trump deserves more credit (Table 10.6). Independent respondents were fairly evenly split about which president deserves more credit.

 

Table 10.6: Row-percentage table of respondent’s political affiliation (rows) and which presidential administration they think deserves more credit for the COVID-19 vaccine rollout (columns).
               Trump  Biden  Both  Neither    Sum
Democrat         8.6   69.9  15.1      6.5  100.1
Republican      74.5    8.5  11.7      5.3  100.0
Independent     37.1   35.9  14.6     12.4  100.0

 

  • If the response variable is properly placed in the columns of the frequency table, then an overall summary of how the response variable differs (or not) across the groups can be made by comparing across rows of the row-percentage table.
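As a hedged illustration of this comparison, the Python sketch below recomputes the row percentages from the counts in Table 10.5 and reports the most common response within each affiliation. The names `row_pct` and `most_common` are assumptions made here.

```python
answers = ["Trump", "Biden", "Both", "Neither"]

# Cell counts from Table 10.5, response variable in the columns.
freq = {
    "Democrat":    [249, 2023, 436, 187],
    "Republican":  [2013, 230, 316, 144],
    "Independent": [323, 313, 127, 108],
}

# Row percentages: each cell divided by its row total, times 100.
row_pct = {
    party: [100 * count / sum(counts) for count in counts]
    for party, counts in freq.items()
}

# The response with the largest row percentage in each group.
most_common = {
    party: answers[row.index(max(row))] for party, row in row_pct.items()
}

for party, row in row_pct.items():
    cells = "  ".join(f"{answer} {pct:.1f}" for answer, pct in zip(answers, row))
    print(f"{party:<12} {cells}  (most common: {most_common[party]})")
```

For Independents the top two percentages are close (37.1 versus 35.9), so reporting only the most common response would hide the near-even split; the full row should be compared.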

 


  1. This statement must be read carefully. OF THE RESPONDENTS WITH LESS THAN A HIGH SCHOOL DEGREE, not of all respondents, 6.2% were very willing to cut their standard of living.

  2. Again, read this carefully. OF THE RESPONDENTS WHO WERE VERY WILLING TO CUT THEIR STANDARD OF LIVING, not of all respondents, 19.2% had less than a high school degree.