Association Between Categorical Variables: Contingency Analysis with Chi Square

Fraser, Cynthia

doi:10.1007/978-3-031-42555-4_3

Cynthia Fraser²

80 Accesses

Abstract

Categorical variables, including nominal (where numbers are simply labels) and ordinal, rank order variables, are described by tabulating their frequencies or probability. If two categorical variables are associated, the frequencies of values of one will depend on the frequencies of values of the other. Chi square tests the hypothesized association between two categorical variables and contingency analysis quantifies their association.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Hardcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Author information

Authors and Affiliations

McIntire School of Commerce, University of Virginia, Charlottesville, VA, USA
Cynthia Fraser

Authors

Cynthia Fraser
View author publications
You can also search for this author in PubMed Google Scholar

3.1 Electronic Supplementary Material(s)

Vastly master (XLSX 128 kb)

Appendices

Excel 3.1: Construct Crosstabulations and Assess Association Between Categorical Variables with PivotTables and PivotCharts

Vastly Tree Free Paper Products

In order to explore the possible association between spend on paper products and household size, begin by making a PivotTable to see the crosstabulation.

Open Vastly.xlsx.

Select filled cells in the Number in Household and Spend categories columns and then insert a PivotTable.

Drag Number in Household to ROW, Spend category to COLUMN, and Spend category to ∑ Values.

A screenshot of an Excel sheet includes 10 columns and 6 rows of data on the left. The column headers are the count of spend categories, column labels 100 +, 16 to 24, 25 to 30, 31 to 49, 5 to 15, 50 to 59, 60 to 100, less than 5, and grand total. The options of pivot table fields are on the right.

Managers agreed that those who reported spending less than $5 monthly on paper products were not of interest, or were not purchasers of paper products. Remove them by clicking the column labels box and unselecting that category.

A screenshot of an Excel sheet includes 9 columns and 6 rows of data. The column headers are count of spend categories, column labels, 16 to 24, 25 to 30, 31 to 49, 5 to 15, 50 to 59, 60 to 100, and grand total. A list of options highlights the disabled checkbox, less than 5.

Reorder spend categories. Select the column label and drag to the position desired.

A screenshot of an Excel sheet includes 9 columns and 6 rows. The column headers are count of spend categories, column labels 5 to 15, 16 to 24, 25 to 30, 31 to 49, 50 to 59, 60 to 100, 100 +, and grand total. Column header of 31 to 49 is selected.

Excel 3.2: Use Chi Square to Test Association

For chi square, make a table of expected cell counts and a table of cell contributions to chi square.

Find expected counts from the row, column and grand totals.

In J5, enter the formula for the expected count, multiplying cells containing the Grand Total of households of 1, J5, and the Grand Total of spend category 5 to 15, B10, and then dividing by the Grand Total, I10:

$$ =\$\mathbf{J}{\mathbf{5}}^{\ast}\mathbf{B}\$\mathbf{10}/\$\mathbf{I}\$\mathbf{10} $$

Down fill and right fill in the remaining cells in the table with this formula.

A screenshot of an Excel sheet includes 11 columns and 6 rows. The column headers are the count of spend categories, column labels 5 to 15, 16 to 24, 25 to 30, 31 to 49, 50 to 59, 60 to 100, 100 +, grand total, 5 to 15, and 16 to 24. The cells in the last two columns are left empty.

A screenshot of an Excel sheet includes 7 columns and 5 rows. The column headers are 5 to 15, 16 to 24, 25 to 30, 31 to 49, 50 to 59, 60 to 100, and 100 plus. All rows are selected.

In a third table, find each cell’s contribution to chi square, the squared difference between expected counts, in the second table, and actual counts, in the first table, divided by expected counts in the second table.

In the first cell, O5, of the third contributions to chi square table, enter:

$$ =\left(\mathbf{B}\mathbf{5}-\mathbf{J}\mathbf{5}\right)\hat{\mkern6mu} \mathbf{2}/\mathbf{J5}. $$

Down fill the column and right fill the rows:

A screenshot of an Excel sheet includes 2 columns and 5 rows. The column headers are expected and contribution to chi-square. The first column is divided into 4 sub-columns titled 31 to 49, 50 to 59, 60 to 100, and 100 plus. The second column is divided into 7 sub-columns, and they are selected.

Use the Excel function SUM(array1,array2) to find contributions to chi square from each of the household sizes.

In X5,=sum(Q5:W5), then downfill.

A screenshot of an Excel sheet has a table of the contribution to chi-square with 8 columns and 5 rows. The column headers are 5 to 15, 16 to 24, 25 to 30, 31 to 49, 50 to 59, 60 to 100, 100 plus, and blank. The last column is selected.

Add the cell contributions in the 5 to 15 spend category column:

In Q10, =sum(q5:q9).

Right fill to add contributions to chi square from the other spend categories and chi square:

A screenshot of an Excel sheet has a table of the contribution to chi-square with 8 columns and 6 rows. The column headers are 5 to 15, 16 to 24, 25 to 30, 31 to 49, 50 to 59, 60 to 100, 100 plus, and blank. The last row is selected.

Find the p value for this chi square using the Excel function

CHISQ.DIST.RT(chisquare,df) with degrees of freedom df of 24

A screenshot of an Excel sheet includes a table with 2 columns and 6 rows. The column headers are 100 plus and row totals. The last cell of the last row is selected. It indicates the values of the chi-square and p-value.

To see the conditional probabilities of spending categories given number in household category, convert cell counts to % of row.

From a data cell in the table, right click, Show Values As, % Row Total.

A screenshot of an Excel sheet has a table overlapped by a list of options. The show values as option is selected and the percent of the row total is selected from the drop-down list.

Make a PivotChart of Spend category by number in household: PivotTable Analyze, PivotChart. Choose a 100% stacked column.

A window titled Insert Chart includes the options of all charts on the left. The column option is selected. A stacked bar graph with 7 categories is on the right. The title reads 100% stacked column.

Choose gradient colors:

A screenshot of an Excel sheet highlights the design menu. The chart styles and a grid of colors are provided. The selected row of colors indicates the monochromatic palette 8 with a light-to-dark gradient.

Add axis titles and a stand alone title featuring your conclusion, as shown in Fig. 3.7:

A stacked bar graph of the percent of household size versus the number in households has 7 categories labeled 5 to 15, 16 to 24, 25 to 30, 31 to 49, 50 to 59, 60 to 100, and 100 plus. The bottom 2 categories have a decreasing trend from left to right, while the top 3 are increasing. — **Fig. 3.7**

Excel 3.3: Conduct Contingency Analysis with Three Categories

To control for household size, make a new pivot table with all three variables, purchase likelihood, spend category and household size. Drag household size, then spend category, to the rows. Drag purchase likelihood to the columns and ∑ values. Again, unselect spend <5 and drag spend categories so that they are in increasing order.

A screenshot of an Excel sheet has 4 columns and 16 rows. The column headers are count of spend categories, column labels likely, unlikely, and grand total.

Find expected counts for each of the household sizes. Use household size subtotals instead of the grand total.

A screenshot of an Excel sheet has a table with 6 columns and 8 rows. The column headers are count of spend categories, column labels likely, unlikely, grand total, likely, and unlikely. Four cells are selected.

To easily see sparse cells, select the expected cells, then choose Conditional Formatting, with Less Than, 5:

A screenshot highlights the conditional formatting option. The highlight cells rules option is selected from the drop-down list. Less than is selected.

A screenshot of an Excel sheet has a table with 6 columns and 8 rows. The last two columns on the right are selected.

Six of the cells have fewer than five expected.

To begin eliminating the sparse cells, combine 100+ with 60 to 100.

Select A11:A12, then combine.

PivotTable Analyze, Group Selection:

A screenshot of an Excel sheet includes a table with 6 columns and 8 rows. The column headers are count of spend categories, column labels likely, unlikely, grand total, likely, and unlikely. Four cells are highlighted and 2 cells are selected.

In the PivotTable Analyze menu, choose the collapse icon to remove the unnecessary labels:

A screenshot of an Excel sheet includes a table with 4 columns and 9 rows. The column headers are count of toilet tissue purchase likelihood, column labels likely, unlikely, and grand total. The last two cells of the first column are selected. The collapse icon in the pivot table analyze menu is selected.

Change the Group 1 label to 60+ .

Update expected counts.

A screenshot of an Excel sheet has a table with 6 columns and 14 rows. The column headers are count of spend categories, column labels likely, unlikely, grand total, likely, and unlikely. Two values are highlighted. The bottom 6 cells in the last two columns are selected.

Two sparse cells remain. Combining spend of 16 to 24 and 5 to 15 and spend of 25 to 30 with 31 to 49 will eliminate those.

Update expected counts.

Find contributions to chi square, then sum by rows and by columns to find separate chisquares (with three degrees of freedom) and pvalues for each of the household size categories.

A screenshot of an Excel sheet has 7 columns and 26 rows. The column headers are count of spend categories, column labels likely, unlikely, grand total, expected, contribution to chi-square, and p-value. The fifth and the sixth columns are divided into 2 sub-columns titled likely and unlikely.

Controlling for household size reveals that only among the largest 5+ households is there a significant association between spend and purchase likelihood. Larger spenders are more likely to purchase, among households of 5 or more.

To illustrate, first show counts as percents of row total. Then copy cells in A and B and paste special (without formulas below the PivotTable. Move subtotals to a third column for each household size:

A screenshot of an Excel sheet includes a table with 2 columns and 8 rows. The column headers are likely and overall within household size. The values are in percentage.

Select these new cells and insert a column chart:

A screenshot of an Excel sheet includes a table with 2 columns and 12 rows on the left and a double bar graph on the right. The bars represent likely and overall within household size. The bars are in 5 categories with 4 values each.

Select the chart, then in Design, Change the Chart type to Combo:

A window titled Change Chart Type indicates the options of all charts on the left. The combo option is selected. A bar graph and the options to choose the chart type and axis of the data series are on the right. The horizontal lines over the bars represent the overall values.

Reduce decimals in the table (which will reduce decimals in the graph), add vertical and horizontal axis titles and chart title.

A bar graph of percent spent within household size likely to purchase versus spend given household size has 4 bars in 5 categories. The horizontal lines at the top of the bars indicate the increasing overall values from left to right.

Case 3.1: Generation, Purchase Frequency and Paper Spend

Tranlin managers seek to identify the generations that spend more than average on paper products, and learn whether promising generations purchase more frequently.

A concept test was conducted with 1522 consumers who reacted to a verbal description of Tranlin’s sustainable toilet tissue. Data in Vastly 2023.xlsx contain consumer paper spend, generation and purchase frequency.

Managers sense that Millennials and Gen Z are likely to be the biggest spenders. If you encounter sparse cells, avoid combining these two generations with each other or with other generations in case they differ.

Spend by Generation

1.
Is paper spend associated with generation? Cite chi square with dfs and pvalue, as well as your conclusion.
2.
Produce a 100% stacked column chart of %s in spending categories by generation. Focus on the highest spenders in your chart title.

Purchase Frequency by Generation

3.
Is purchase frequency associated with generation? Cite chi square with dfs and pvalue, as well as your conclusion.
4.
Produce a 100% stacked column chart of %s in purchase frequency categories by generation. Focus on the most frequent consumers in your chart title.

Case 3.2: Contingency Analysis Case: Generation, Household Size and Interest in Vastly Sustainable Toilet Tissue

Tranlin managers seek to identify the target market(s) to whom their sustainable paper concepts appeal.

A concept test was conducted with 1522 consumers who reacted to a verbal description of Tranlin’s sustainable toilet tissue. Consumers were asked to rate their interest in purchasing Vastly toilet tissue. Data in Vastly.xlsx contain consumer purchase likelihood, generation and household size. Generations, from youngest to oldest, are: Gen Z, Millennials, Gen X, Baby Boomers, Oldest. Order Generations this way in your analysis and graphs.

Purchase Likelihood by Generation

Test the hypothesis that purchase likelihood is associated with generation.

1.
Is purchase likelihood associated with generation? Cite chi square with degrees of freedom and pvalue to support your answer.
2.
Produce a column chart of % likely to purchase by generation, including a reference line of the percent interested overall. (Since there are just two interest categories, show only Likely Trier percents.)

Generation by Household Size

Prior analysis revealed that larger households are more likely to purchase.

3.
Do Millennials or Gen Z live in larger households?

Test the hypothesis that household size is associated with generation in this sample.
4.
Is household size associated with generation? Cite chi square with degrees of freedom and pvalue to support your answer.
5.
Produce a stacked column chart of % of household sizes by generation.

Conclusions

6.
Which should Vastly target?
- ___ women in larger Gen Z households
- ___ women in larger Millennial households
- ___women in larger households in both Gen z and Millennials
- ___ women in larger households in all generations

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Fraser, C. (2024). Association Between Categorical Variables: Contingency Analysis with Chi Square. In: Business Statistics for Competitive Advantage with Excel and JMP . Springer, Cham. https://doi.org/10.1007/978-3-031-42555-4_3

Download citation

DOI: https://doi.org/10.1007/978-3-031-42555-4_3
Published: 05 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42554-7
Online ISBN: 978-3-031-42555-4
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)

Publish with us

Policies and ethics