Module 10 Bivariate EDA - Categorical
In this module we consider the relationship between two categorical variables. For example, the General Social Survey (GSS) is a very large survey that has been administered 25 times since 1972. The purpose of the GSS is to gather data on contemporary American society in order to monitor and explain trends in attitudes, behaviors, and attributes. Two questions from a recent GSS are:
- What is your highest degree earned? [choices – “less than high school diploma,” “high school diploma,” “junior college,” “bachelors,” or “graduate”; labeled as degree]
- How willing would you be to accept cuts in your standard of living in order to protect the environment? [choices – “very willing,” “fairly willing,” “neither willing nor unwilling,” “not very willing,” or “not at all willing”; labeled as grnsol]
An example of these data is shown below.
degree | grnsol |
---|---|
ltHS | vwill |
HS | will |
HS | unwill |
JC | vunwill |
grad | vunwill |
. | . |
. | . |
These types of data are summarized with two-way frequency tables as shown in Section 10.1 and percentage tables as shown in Section 10.2. Specific questions can be answered from these tables as described in Section 10.3 and illustrated in Section 10.4.
10.1 Frequency Tables
Bivariate categorical data are summarized by counting the number of individuals that have each combination of levels of the two categorical variables. For example, one would count how many respondents had less than a high school degree and were very willing, how many had a high school degree and were willing, and so on.
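The counting step can be sketched in a few lines of Python. This is only an illustration: the list holds the five rows shown in the data excerpt above plus two made-up repeats (labeled hypothetical in the comments) so that some combinations occur more than once, and the names `records` and `freq` are assumptions of this sketch.

```python
from collections import Counter

# The five (degree, grnsol) rows shown in the data excerpt, followed by
# two made-up repeats (hypothetical, for illustration only).
records = [
    ("ltHS", "vwill"),
    ("HS", "will"),
    ("HS", "unwill"),
    ("JC", "vunwill"),
    ("grad", "vunwill"),
    ("HS", "will"),     # hypothetical repeat
    ("ltHS", "vwill"),  # hypothetical repeat
]

# Each key is one (degree, grnsol) combination; each value is its frequency.
freq = Counter(records)

for (degree, grnsol), count in sorted(freq.items()):
    print(f"{degree:>4} & {grnsol:<7}: {count}")
```

Each printed count corresponds to one cell of a two-way frequency table.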
The number of individuals of each combination is called a frequency and the frequencies for all combinations are displayed in a two-way frequency table (Table 10.1). For example, 40 of the respondents had less than a high school degree and were very willing to take a cut in their standard of living to protect the environment. Similarly, 542 respondents had a high school degree and were willing to cut their standard of living.
degree | vwill | will | neither | unwill | vunwill | Sum |
---|---|---|---|---|---|---|
ltHS | 40 | 145 | 132 | 151 | 178 | 646 |
HS | 87 | 542 | 512 | 557 | 392 | 2090 |
JC | 15 | 61 | 64 | 54 | 44 | 238 |
BS | 42 | 199 | 179 | 187 | 75 | 682 |
grad | 24 | 104 | 83 | 64 | 24 | 299 |
Sum | 208 | 1051 | 970 | 1013 | 713 | 3955 |
A two-way frequency table may be augmented by including row and column totals (as in Table 10.1). Each set of marginal totals represents the distribution of one of the categorical variables while ignoring the other. For example, there were 238 respondents whose highest completed degree was junior college, and there were 713 respondents who were very unwilling to cut their standard of living to protect the environment.
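As a minimal sketch, the marginal totals can be recomputed from the cell frequencies of Table 10.1. The dictionary layout and the names `freq`, `row_totals`, and `col_totals` are assumptions made for this illustration.

```python
# Frequencies from Table 10.1: rows are degree, columns are willingness.
columns = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Row totals: the distribution of degree, ignoring willingness.
row_totals = {degree: sum(counts) for degree, counts in freq.items()}

# Column totals: the distribution of willingness, ignoring degree.
col_totals = {
    col: sum(counts[j] for counts in freq.values())
    for j, col in enumerate(columns)
}

grand_total = sum(row_totals.values())

print(row_totals["JC"])       # 238
print(col_totals["vunwill"])  # 713
print(grand_total)            # 3955
```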
- Rows are horizontal (left-to-right) and columns are vertical (top-to-bottom).
If one variable can be considered as the response, then this variable should form the columns of the frequency table. For example, “willingness to cut” could be considered the response variable and it was, appropriately, placed as the column variable in Table 10.1.
- If one of the two variables can be considered a response variable, then it should be placed in the columns.
10.2 Percentage Tables
Two-way frequency tables may be converted to percentage tables to make it easier to compare levels of the response variable or to compare populations. For example, it is difficult to determine from Table 10.1 if respondents with a high school degree were more likely to be very willing to cut their standard of living than respondents with a graduate degree, because there are approximately seven times as many respondents with a high school degree. However, once the frequencies are converted to percentages, this comparison is easy to make. Three types of percentage tables may be constructed from a frequency table.
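The comparison described above can be checked with a short calculation. This sketch uses only the four relevant counts from Table 10.1; the variable names are illustrative.

```python
# Counts from Table 10.1.
hs_vwill, hs_total = 87, 2090      # high school degree
grad_vwill, grad_total = 24, 299   # graduate degree

# Percentage of each degree group that was very willing.
hs_pct = hs_vwill / hs_total * 100
grad_pct = grad_vwill / grad_total * 100

print(f"HS:   {hs_vwill} of {hs_total} = {hs_pct:.1f}%")
print(f"grad: {grad_vwill} of {grad_total} = {grad_pct:.1f}%")
```

Although more respondents with a high school degree were very willing (87 versus 24), the percentage is lower in that group (4.2% versus 8.0%), which is why the raw frequencies alone can mislead.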
10.2.1 Total-Percentage Table
Each value in a total-percentage table is computed by dividing each cell of the frequency table by the total number of ALL individuals in the frequency table and multiplying by 100. For example, the value in the “vwill” column and “ltHS” row of the total-percentage table (Table 10.2) is computed by dividing the value in the “vwill” column and “ltHS” row of the frequency table (i.e., 40; Table 10.1) by the “Sum” of the entire frequency table (i.e., 3955) and multiplying by 100.
degree | vwill | will | neither | unwill | vunwill | Sum |
---|---|---|---|---|---|---|
ltHS | 1.0 | 3.7 | 3.3 | 3.8 | 4.5 | 16.3 |
HS | 2.2 | 13.7 | 12.9 | 14.1 | 9.9 | 52.8 |
JC | 0.4 | 1.5 | 1.6 | 1.4 | 1.1 | 6.0 |
BS | 1.1 | 5.0 | 4.5 | 4.7 | 1.9 | 17.2 |
grad | 0.6 | 2.6 | 2.1 | 1.6 | 0.6 | 7.5 |
Sum | 5.3 | 26.5 | 24.4 | 25.6 | 18.0 | 99.8 |
The value in each cell of a total-percentage table is the percentage OF ALL individuals that have the characteristic of that column AND that row. For example, 1.0% of ALL respondents had less than a high school degree AND were very willing to cut their standard of living to protect the environment. In contrast to the interpretations of the row- and column-percentage tables below, interpretations from the total-percentage table DO refer to ALL individuals.
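A minimal sketch of the total-percentage calculation, assuming the Table 10.1 frequencies are stored in a dictionary of lists (the names `freq` and `total_pct` are choices made for this example):

```python
# Frequencies from Table 10.1: rows are degree, columns are willingness.
columns = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Every cell is divided by the SAME denominator: the grand total.
grand_total = sum(sum(counts) for counts in freq.values())

total_pct = {
    degree: [count / grand_total * 100 for count in counts]
    for degree, counts in freq.items()
}

for degree, pcts in total_pct.items():
    cells = " ".join(f"{p:5.1f}" for p in pcts)
    print(f"{degree:>4} {cells} {sum(pcts):5.1f}")
```

The unrounded percentages sum to 100. The 99.8 shown in the “Sum” corner of Table 10.2 appears to result from adding the already-rounded marginal entries.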
10.2.2 Row-Percentage Table
A row-percentage table is computed by dividing each cell of the frequency table by the total in the same ROW of the frequency table and multiplying by 100 (Table 10.3). For example, the value in the “vwill” column and “ltHS” row of the row-percentage table is computed by dividing the value in the “vwill” column and “ltHS” row of the frequency table (i.e., 40; Table 10.1) by the “Sum” of the “ltHS” ROW of the frequency table (i.e., 646) and multiplying by 100.
degree | vwill | will | neither | unwill | vunwill | Sum |
---|---|---|---|---|---|---|
ltHS | 6.2 | 22.4 | 20.4 | 23.4 | 27.6 | 100.0 |
HS | 4.2 | 25.9 | 24.5 | 26.7 | 18.8 | 100.1 |
JC | 6.3 | 25.6 | 26.9 | 22.7 | 18.5 | 100.0 |
BS | 6.2 | 29.2 | 26.2 | 27.4 | 11.0 | 100.0 |
grad | 8.0 | 34.8 | 27.8 | 21.4 | 8.0 | 100.0 |
The value in each cell of a row-percentage table is the percentage OF INDIVIDUALS IN THAT ROW that have the characteristic of that column. For example, 6.2% OF RESPONDENTS WITH LESS THAN A HIGH SCHOOL DEGREE were very willing to cut their standard of living to protect the environment.[47]
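The row-percentage calculation differs from the total-percentage calculation only in its denominator. The sketch below assumes the same dictionary-of-lists layout for Table 10.1; `row_pct` is an illustrative name.

```python
# Frequencies from Table 10.1: rows are degree, columns are willingness.
columns = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Each cell is divided by the total of ITS OWN ROW.
row_pct = {}
for degree, counts in freq.items():
    row_total = sum(counts)
    row_pct[degree] = [count / row_total * 100 for count in counts]

# Percentage of respondents with less than a high school degree
# who were very willing: 40 / 646 * 100.
print(f"{row_pct['ltHS'][columns.index('vwill')]:.1f}")
```

Because each row uses its own total as the denominator, every row of unrounded percentages sums to 100.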
10.2.3 Column-Percentage Table
A column-percentage table is computed by dividing each cell of the frequency table by the total in the same COLUMN of the frequency table and multiplying by 100 (Table 10.4). For example, the value in the “vwill” column and “ltHS” row of the column-percentage table is computed by dividing the value in the “vwill” column and “ltHS” row of the frequency table (i.e., 40; Table 10.1) by the “Sum” of the “vwill” COLUMN of the frequency table (i.e., 208) and multiplying by 100.
degree | vwill | will | neither | unwill | vunwill |
---|---|---|---|---|---|
ltHS | 19.2 | 13.8 | 13.6 | 14.9 | 25.0 |
HS | 41.8 | 51.6 | 52.8 | 55.0 | 55.0 |
JC | 7.2 | 5.8 | 6.6 | 5.3 | 6.2 |
BS | 20.2 | 18.9 | 18.5 | 18.5 | 10.5 |
grad | 11.5 | 9.9 | 8.6 | 6.3 | 3.4 |
Sum | 99.9 | 100.0 | 100.1 | 100.0 | 100.1 |
The value in each cell of a column-percentage table is the percentage OF ALL INDIVIDUALS IN THAT COLUMN that have the characteristic of that row. For example, 19.2% OF RESPONDENTS WHO WERE VERY WILLING TO CUT THEIR STANDARD OF LIVING had less than a high school degree.[48]
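For the column-percentage table the denominator is the column total. A minimal sketch, again assuming the Table 10.1 frequencies are held in a dictionary of lists (`col_pct` is an illustrative name, keyed first by column and then by row):

```python
# Frequencies from Table 10.1: rows are degree, columns are willingness.
columns = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

# Each cell is divided by the total of ITS OWN COLUMN.
col_pct = {}
for j, col in enumerate(columns):
    col_total = sum(counts[j] for counts in freq.values())
    col_pct[col] = {
        degree: counts[j] / col_total * 100
        for degree, counts in freq.items()
    }

# Percentage of very willing respondents who had less than a
# high school degree: 40 / 208 * 100.
print(f"{col_pct['vwill']['ltHS']:.1f}")
```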
10.3 Which Table to Use?
If the question asks for a number of individuals, then use the frequency table. For example, a frequency table holds the answer to the question “How many respondents with a graduate degree were (only) willing to cut their standard of living to protect the environment?” (i.e., 104).
If the question asks for a percentage of ALL individuals, then use the total-percentage table. For example, the total-percentage table is used to answer “What percentage of all respondents had a high school degree and were very willing to cut their standard of living?” (i.e., 2.2%; Table 10.2).
If the question asks for a percentage but the percentage is not of ALL individuals, then use either the row- or column-percentage table, depending on which group the question is focused on. If the question is focused on a group represented by a ROW, then use the row-percentage table. For example, the question “What percentage of respondents with a bachelor’s degree were very unwilling to cut their standard of living to protect the environment?” is focused only on respondents with a bachelor’s degree. Because bachelor’s degrees are shown in a row, the row-percentage table is used to answer this question (i.e., 11.0%; Table 10.3).
In contrast, the question “What percentage of respondents who were neither willing nor unwilling to cut their standard of living had graduate degrees?” is focused on those respondents who were neither willing nor unwilling to cut their standard of living, a group that is represented by a COLUMN. Thus, a column-percentage table is used to answer this question (i.e., 8.6%; Table 10.4).
Finally, consider this question: “What percentage of all respondents were very willing to cut their standard of living to help the environment?” This question has no restrictions, so the total-percentage table would be used. In addition, this question is concerned only with the COLUMN variable and, thus, the answer will come from the “Sum” ROW. Therefore, 5.3% of all respondents were very willing to cut their standard of living to help the environment.
- If the question is about a number of individuals then use the frequency table.
- If the question is about the percentage of ALL individuals then use the total-percentage table.
- If the question is about a percentage of only some individuals then determine if the individuals in question are shown in a row (i.e., use the row percentage table) or a column (i.e., use the column percentage table).
- If the question does not refer to one of the two variables, then the answer will generally come from the margin (the “Sum” row or column) of the other variable.
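The rules above amount to choosing a numerator and a denominator. The sketch below works the five example questions of this section directly from the Table 10.1 frequencies; the helper functions `count`, `row_total`, and `col_total` are hypothetical names introduced only for this illustration.

```python
# Frequencies from Table 10.1: rows are degree, columns are willingness.
columns = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

def count(row, col):
    """Frequency of one row/column combination."""
    return freq[row][columns.index(col)]

def row_total(row):
    """Number of individuals in one row, ignoring the column variable."""
    return sum(freq[row])

def col_total(col):
    """Number of individuals in one column, ignoring the row variable."""
    j = columns.index(col)
    return sum(counts[j] for counts in freq.values())

grand_total = sum(row_total(row) for row in freq)

# "How many ...?" -> frequency table.
q1 = count("grad", "will")
# Percentage of ALL respondents -> total-percentage table.
q2 = count("HS", "vwill") / grand_total * 100
# Percentage of a group shown in a ROW -> row-percentage table.
q3 = count("BS", "vunwill") / row_total("BS") * 100
# Percentage of a group shown in a COLUMN -> column-percentage table.
q4 = count("grad", "neither") / col_total("neither") * 100
# Percentage of ALL respondents, one variable only -> margin of the
# total-percentage table.
q5 = col_total("vwill") / grand_total * 100

print(q1, f"{q2:.1f} {q3:.1f} {q4:.1f} {q5:.1f}")
```

Rounded to one decimal place, the results should match the answers quoted in the text (104, 2.2%, 11.0%, 8.6%, and 5.3%).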
10.4 Example Calculations
In early 2021, the United States was deeply divided with respect to political allegiances and was just beginning to combat the COVID-19 virus with three different vaccines. The development of the vaccines had begun during Donald Trump’s presidency, but administration of the vaccines had largely been carried out during the first months of Joseph Biden’s presidency. In March 2021, YouGov asked a sample of U.S. adults “Thinking about the vaccine rollout in the US, do you believe the Trump administration or the Biden administration deserves more credit?” The results for those respondents who identified a political affiliation and had an opinion about which administration deserved credit are shown in Table 10.5.
affiliation | Trump | Biden | Both | Neither | Sum |
---|---|---|---|---|---|
Democrat | 249 | 2023 | 436 | 187 | 2895 |
Republican | 2013 | 230 | 316 | 144 | 2703 |
Independent | 323 | 313 | 127 | 108 | 871 |
Sum | 2585 | 2566 | 879 | 439 | 6469 |
Use these results to answer the following questions. [Note that key phrases that are useful for determining which table to use are highlighted in each question. These phrases will not be highlighted in the exercises.]
- What percent of “Democrats” believe that Donald Trump’s administration deserves credit for the COVID-19 vaccine rollout?
  - Of the 2895 “Democrats” in the sample, 8.6% believed that Donald Trump’s administration deserved credit for the COVID-19 vaccine rollout (= \(\frac{249}{2895}\times 100\)).
- What percent of “Republicans” believe that Joseph Biden’s administration deserves credit for the COVID-19 vaccine rollout?
  - Of the 2703 “Republicans” in the sample, 8.5% believed that Joseph Biden’s administration deserved credit for the COVID-19 vaccine rollout (= \(\frac{230}{2703}\times 100\)).
- What percent of those that believe that neither administration deserves credit for the COVID-19 vaccine rollout were “Independents?”
  - Of the 439 respondents that believed neither administration deserved credit for the COVID-19 vaccine rollout, 24.6% were “Independents” (= \(\frac{108}{439}\times 100\)).
- What percent of all respondents were “Independents?”
  - Of the 6469 respondents in the sample, 13.5% were “Independents” (= \(\frac{871}{6469}\times 100\)).
- What percent of all respondents were “Democrats” and believed that both administrations deserved credit for the COVID-19 vaccine rollout?
  - Of the 6469 respondents in the sample, 6.7% were “Democrats” and believed that both administrations deserved credit for the COVID-19 vaccine rollout (= \(\frac{436}{6469}\times 100\)).
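The five answers above can be reproduced from the Table 10.5 frequencies. The following is a sketch under the assumption that the table is stored as a dictionary of lists; the names `answers`, `cell`, `row_total`, and `col_total` are illustrative.

```python
# Frequencies from Table 10.5: rows are affiliation, columns are credit.
columns = ["Trump", "Biden", "Both", "Neither"]
freq = {
    "Democrat":    [249, 2023, 436, 187],
    "Republican":  [2013, 230, 316, 144],
    "Independent": [323, 313, 127, 108],
}

def cell(row, col):
    return freq[row][columns.index(col)]

def row_total(row):
    return sum(freq[row])

def col_total(col):
    j = columns.index(col)
    return sum(counts[j] for counts in freq.values())

grand_total = sum(row_total(row) for row in freq)

answers = [
    # Of Democrats (a ROW group), percent crediting Trump.
    cell("Democrat", "Trump") / row_total("Democrat") * 100,
    # Of Republicans (a ROW group), percent crediting Biden.
    cell("Republican", "Biden") / row_total("Republican") * 100,
    # Of those crediting neither (a COLUMN group), percent Independent.
    cell("Independent", "Neither") / col_total("Neither") * 100,
    # Of ALL respondents, percent Independent (row margin).
    row_total("Independent") / grand_total * 100,
    # Of ALL respondents, percent Democrat AND crediting both.
    cell("Democrat", "Both") / grand_total * 100,
]

for value in answers:
    print(f"{value:.1f}%")
```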
10.5 Making an Overall Summary
An overall summary for a categorical bivariate EDA can be constructed by describing how the percentage of individuals in the response levels differs across the groups of the explanatory variable. If you followed the recommendation of having the response variable in the columns then this simplifies to describing how the ROW percentages differ across the rows.
For example, from Table 10.3 it is evident that there was a general increase in the percentage of respondents that were willing (either “willing” or “very willing”) to cut their standard of living to protect the environment as the level of education increased. In other words, it appears that more formally educated individuals were more willing to sacrifice their standard of living to protect the environment.
As another example, it is clear that most respondents feel that the administration aligned with their party affiliation deserves more of the credit for the COVID-19 vaccine rollout; i.e., most Democrats feel that Biden deserves more credit, whereas most Republicans feel that Trump deserves more credit (Table 10.6). Independent respondents were fairly evenly split about which administration deserves more credit.
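One way to check the first pattern is to combine the “vwill” and “will” counts within each row of Table 10.1 and express the result as a row percentage. This is a sketch only; combining the two columns is a choice made for this illustration, and `willing_pct` is a hypothetical name.

```python
# Frequencies from Table 10.1, with rows in order of increasing education.
columns = ["vwill", "will", "neither", "unwill", "vunwill"]
freq = {
    "ltHS": [40, 145, 132, 151, 178],
    "HS":   [87, 542, 512, 557, 392],
    "JC":   [15, 61, 64, 54, 44],
    "BS":   [42, 199, 179, 187, 75],
    "grad": [24, 104, 83, 64, 24],
}

i_vwill = columns.index("vwill")
i_will = columns.index("will")

# Row percentage of respondents who were "very willing" or "willing".
willing_pct = {
    degree: (counts[i_vwill] + counts[i_will]) / sum(counts) * 100
    for degree, counts in freq.items()
}

for degree, pct in willing_pct.items():
    print(f"{degree:>4}: {pct:.1f}%")
```

The percentages rise from roughly 29% for respondents with less than a high school degree to roughly 43% for those with a graduate degree, which is consistent with the summary above.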
affiliation | Trump | Biden | Both | Neither | Sum |
---|---|---|---|---|---|
Democrat | 8.6 | 69.9 | 15.1 | 6.5 | 100.1 |
Republican | 74.5 | 8.5 | 11.7 | 5.3 | 100.0 |
Independent | 37.1 | 35.9 | 14.6 | 12.4 | 100.0 |
- If the response variable is properly placed in the columns of the frequency table, then an overall summary of how the response variable differs (or not) across the groups can be made by comparing across rows of the row-percentage table.
[47] This statement must be read carefully. OF THE RESPONDENTS WITH LESS THAN A HIGH SCHOOL DEGREE, not of all respondents, 6.2% were very willing to cut their standard of living.
[48] Again, read this carefully. OF THE RESPONDENTS WHO WERE VERY WILLING TO CUT THEIR STANDARD OF LIVING, not of all respondents, 19.2% had less than a high school degree.