Geometric Mean Type of Proportional Reduction in Variation Measure for Two-Way Contingency Tables

Urasaki, Wataru; Wada, Yuki; Nakagawa, Tomoyuki; Tahata, Kouji; Tomizawa, Sadao

doi:10.1007/s13571-023-00320-w


Open access
Published: 03 January 2024

Volume 86, pages 139–163, (2024)

Sankhya B


Wataru Urasaki ORCID: orcid.org/0000-0003-0155-4294¹,
Yuki Wada¹^na1,
Tomoyuki Nakagawa²^na1,
Kouji Tahata¹^na1 &
…
Sadao Tomizawa^1,2^na1


Abstract

Traditional analysis of two-way contingency tables with explanatory and response variables focuses on the independence of two variables. However, if the variables do not show independence or a clear relationship, the analysis shifts to the degree of association. Various measures have been proposed to calculate the degree of association. One is the proportional reduction in variation (PRV) measure. This measure describes the PRV from the marginal distribution to the conditional distribution of the response variable. Although conventional PRV measures can assess the association of the entire contingency table, they cannot accurately assess the association for each explanatory variable. In this paper, we propose a geometric mean type of PRV (geoPRV) measure, which aims to sensitively capture the association of each explanatory variable to the response variable. Our approach uses a geometric mean, enabling analysis without underestimating the values when the cells in the contingency table are partially biased. The geoPRV measure can be constructed using any function that satisfies specific conditions. This approach has practical advantages, and in special cases, conventional PRV measures can be expressed as geometric mean types.


1 Introduction

Categorical variables, which take values in distinct categories, are employed in various fields such as medicine, psychology, education, and social science. Consider two categorical variables with R and C categories, respectively. These two variables have $R \times C$ combinations, which can be represented in a table with R rows and C columns. This is called a two-way contingency table, where each (i, j) cell ($i=1,\ldots , R;~j=1,\ldots , C$) displays the observed frequency. Typically, a two-way contingency table is used to evaluate whether two variables are unrelated (i.e., statistically independent). However, if the independence of the two variables is rejected using Pearson’s chi-square test, for example, or the variables are clearly related, then the strength of association can provide useful insight into their relationship.

Previous studies have proposed various association models to examine the association structure of contingency tables. Goodman (1979) proposed the uniform association model. Although this is the simplest association model, it is severely constrained. Consequently, various extended models such as the linear-by-linear association model and the R, C, $R+C$, and RC models have been proposed by Goodman (1985, 1986), Agresti (1983), and Liu & Agresti (2005), respectively. Many authors have presented models related to Goodman’s association model (see, e.g., Goodman, 1979, 1981; Chuang et al., 1985; Gilula & Haberman, 1986; Rom & Sarkar, 1992). For details on the various association models, see Goodman’s papers cited above and the books by Agresti (2010, 2013) and Kateri (2014).

A typical criterion to evaluate an association model is the goodness-of-fit test with the models. Additionally, numerous measures have been proposed as indicators to show the degree of association within the interval from 0 to 1.

Examples include Pearson’s coefficient $\phi ^2$ of mean square contingency, Pearson’s coefficient P of contingency, and Tschuprow’s coefficient T (see, e.g., Bishop et al., 2007, Chap. 11; Agresti, 2013, Chap. 3). However, these measures have weaknesses: $\phi ^2$ does not attain 1 even if the contingency table has a complete association structure (i.e., the maximum departure from independence), and the numbers of rows and columns in the table affect P and T. To solve this problem, Cramér (1946) proposed Cramér’s coefficient V, which becomes 1 if the contingency table has a complete association structure, regardless of the numbers of rows and columns. More recently, Tomizawa et al. (2004) proposed power-divergence type measures $V^2_{t(\lambda )}$ $(t=1,2,3)$ by linking the contingency table and the divergence. Power-divergence type measures evaluate the difference between two probability distributions. (For more information about divergence and power-divergence, see Rényi, 1961; Cressie & Read, 1984; Read & Cressie, 2012, Chap. 2.) These measures calculate the degree of association for each (i, j) cell in the contingency table, and the overall degree of association is obtained by summing over all cells. Hence, these measures can be applied to most contingency tables without distinguishing whether the row and column variables are explanatory or response variables. In actual contingency table analysis, however, the row and column variables are often defined as explanatory or response variables, and ignoring the characteristics of each variable in the analysis is inappropriate.

For an $R \times C$ contingency table with the explanatory variable X and the response variable Y, various measures have been proposed to assess the degree of association. For an ordinal variable, Agresti (2010) in Chap. 7 provides a detailed description of the ordinal measures of association. By contrast, if the category is a nominal variable, using an ordinal measure is not suitable in the analysis. Thus, measures explained by the proportional reduction in variation (PRV) from the marginal distribution of the response Y to the conditional distributions of Y given an explanatory variable X are used to describe the degree of association. Measures constructed by this method are called PRV measures. Examples of PRV measures are the concentration coefficient $\tau $ (Goodman & Kruskal, 1954) and uncertainty coefficient U (Theil, 1970), which employ the Gini concentration and Shannon entropy for the variation measure, respectively. Additionally, when a contingency table has an ordinal category, Tomizawa & Yukawa (2004) proposed $\phi ^{\lambda }$, which incorporates ordinal information.

The PRV measure is a valuable tool to summarize the strength of association of the entire contingency table because the value is easily interpreted due to its construction. That is, the value shows how much more effective it is to predict the response variable Y when the explanatory variable X is known rather than when it is unknown. Sometimes, the analysis aims to evaluate the association of specific categories of explanatory variables. However, conventional PRV measures may not accurately reflect the partial association numerically because they underestimate the strength of association. In the previous studies of models and measures to evaluate the symmetry of the contingency table, Nakagawa et al. (2020), Saigusa et al. (2016), and Saigusa et al. (2019) proposed evaluating the partial symmetry using the geometric mean. On the other hand, little research has examined partial associations.

In this paper, we propose a geometric mean type of PRV (geoPRV) measure via a geometric mean and functions satisfying certain conditions. The geoPRV measure has practical advantages. In special cases, the geoPRV measure allows previously proposed PRV measures to be expressed as geometric mean types. Because the geometric mean sensitively captures the association of each explanatory variable, the analysis does not underestimate the degree of association when cells in the contingency table are partially biased. The proposed measure strongly reflects a partial association structure as it provides the relationship to the response variable by explanatory variable category. Therefore, the geoPRV measure evaluates the strength of association in the same manner as existing PRV measures if the entire contingency table has an association structure. However, the geoPRV measure simultaneously elucidates the strength of a partial association structure numerically.

Furthermore, whether the categorical variable is nominal or ordinal does not influence the geoPRV measure because its value does not change even when the row categories or the column categories are permuted.

The rest of this paper is organized as follows. Section 2 introduces previous research on an extension of generalized PRV (eGPRV) measure and proposes the geoPRV measure. Section 3 presents the approximate confidence intervals of the proposed measure. Section 4 confirms the values and confidence intervals of the proposed measure using several artificial and actual datasets. Additionally, it compares the geoPRV measure with the eGPRV measure. Section 5 presents our conclusions.

2 PRV Measure

In this section, we introduce measures using a function f(x) that satisfies the following conditions: (i) f(x) is a convex function, (ii) $0 \cdot f(0/0)=0$, (iii) $\lim _{x\rightarrow +0}f(x)=0$, and (iv) $f(1)=0$. Kateri & Papaioannou (1994), Momozaki et al. (2023), and Tahata (2022) have introduced examples of the function and proposed models and measures. These proposals are intended to generalize existing models. The measures have application advantages, which can be used to easily construct new models or to make adjustments to fit the analysis using tuning parameters. Section 2.1 overviews conventional PRV measures proposed by Momozaki et al. (2023), while Section 2.2 describes our proposed geoPRV measure and its characteristics.

2.1 Conventional PRV Measure

Consider an $R\times C$ contingency table with nominal categories of the explanatory variable X and the response variable Y. Let $p_{ij}$ denote the probability that an observation will fall in the ith row and jth column of the table ($i=1,\ldots , R;j=1,\ldots , C)$. In addition, denote $p_{i\cdot }$ and $p_{\cdot j}$ as $p_{i\cdot }=\sum _{l=1}^C p_{il}$, $p_{\cdot j}=\sum _{k=1}^Rp_{kj}$. The conventional PRV measure takes the following form

$$ \Phi = \frac{V(Y)-E[V(Y \vert X)]}{V(Y)} = \frac{\displaystyle V(Y)-\sum _{i=1}^Rp_{i\cdot }V(Y \vert X=i)}{V(Y)}, $$

where V(Y) is a measure of the variation for the marginal distribution of Y, and $E[V(Y \vert X)]$ is the expectation for the conditional variation of Y given the distribution of X (see Agresti, 2013, Chap. 2). $\Phi $ uses the weighted arithmetic mean of $V(Y \vert X=i)$, that is, $\sum _{i=1}^Rp_{i\cdot }V(Y \vert X=i)$. Various PRV measures can be expressed by changing the variation measure. Examples include the uncertainty coefficient U for the variation measure $V(Y)=-\sum _{j=1}^C p_{\cdot j}\log p_{\cdot j}$, which is called the Shannon entropy, and the concentration coefficient $\tau $ for the variation measure $V(Y)=1-\sum _{j=1}^C p_{\cdot j}^2$, which is called the Gini concentration. Tomizawa et al. (1997) proposed a generalized PRV measure $T^{(\lambda )}$ that includes U and $\tau $. $T^{(\lambda )}$ uses $V(Y) = \left( 1-\sum _{j=1}^C p_{\cdot j}^{\lambda +1} \right) /\lambda $ as the variation measure, which is the diversity index of degree $\lambda $ of Patil & Taillie (1982) for the marginal distribution $p_{\cdot j}$. Furthermore, Momozaki et al. (2023) proposed an extension of generalized PRV (eGPRV) measure, which incorporates U, $\tau $, and $T^{(\lambda )}$. The eGPRV measure is given as

$$ \Phi _f = \frac{\displaystyle -\sum _{j=1}^{C}f(p_{\cdot j}) - \sum _{i=1}^R p_{i\cdot } \left[ - \sum _{j=1} ^C f \left( \frac{p_{ij}}{p_{i\cdot }} \right) \right] }{\displaystyle - \sum _{j=1}^C f(p_{\cdot j})}. $$

The variation measure used in the eGPRV measure $\Phi _f$ is $V(Y)= -\sum _{j=1}^{C}f(p_{\cdot j})$.
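As a rough illustration of the formula for $\Phi _f$, the following Python sketch evaluates it for a table of cell probabilities. It assumes the power-type function $f(x)=(x^{\lambda +1}-x)/\lambda $ that is applied in Section 4, with $x\log x$ as the $\lambda \rightarrow 0$ limit. The function names `f_power`, `variation`, and `egprv`, as well as the two small example tables, are hypothetical and are not taken from the paper.

```python
import math


def f_power(x, lam):
    """f(x) = (x**(lam + 1) - x) / lam, with x*log(x) as the lam -> 0 limit."""
    if x == 0.0:
        return 0.0  # condition (iii): f(x) -> 0 as x -> +0
    if lam == 0.0:
        return x * math.log(x)
    return (x ** (lam + 1.0) - x) / lam


def variation(dist, lam):
    """Variation measure V = -sum_j f(q_j) of a probability vector q."""
    return -sum(f_power(q, lam) for q in dist)


def egprv(p, lam):
    """Phi_f for a table p (list of rows) of cell probabilities."""
    row_tot = [sum(row) for row in p]
    col_tot = [sum(col) for col in zip(*p)]
    v_y = variation(col_tot, lam)
    # Weighted arithmetic mean of the conditional variations V(Y | X = i).
    mean_cond = sum(
        r * variation([x / r for x in row], lam)
        for row, r in zip(p, row_tot)
        if r > 0.0
    )
    return (v_y - mean_cond) / v_y


# Hypothetical 2x2 tables: independence and complete association.
independent = [[0.25, 0.25], [0.25, 0.25]]
diagonal = [[0.5, 0.0], [0.0, 0.5]]
for lam in (0.0, 0.5, 1.0):
    print(lam, egprv(independent, lam), egprv(diagonal, lam))
```

For the independent table the conditional distributions equal the marginal distribution, so the numerator vanishes; for the diagonal table every conditional variation is zero, so the ratio equals 1.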

2.2 geoPRV Measure

We propose a new PRV measure using the weighted geometric mean of $V(Y \vert X=i)$. Our proposed measure aims to sensitively capture the association of each explanatory variable to the response variable. Assuming that $p_{\cdot j}>0$ and $V(Y \vert X=i)$ is a real number greater than or equal to 0 ($i=1,\ldots , R;~j=1,\ldots , C$), we define the geoPRV measure for $R\times C$ contingency tables as

$$ \Phi _{G} = \frac{\displaystyle V(Y)-\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}}{V(Y)}, $$

where V(Y) is a measure of the variation for the marginal distribution of Y. The geoPRV measure can use the same variation as the conventional PRV measure, for example,

$$ \Phi _{Gf} = \frac{\displaystyle -\sum _{j=1}^C f(p_{\cdot j}) - \prod _{i=1}^R \left[ -\sum _{j=1}^C f \left( \frac{p_{ij}}{p_{i\cdot }} \right) \right] ^{p_{i\cdot }}}{\displaystyle -\sum _{j=1}^C f(p_{\cdot j})}, $$

where the variation measure $V(Y) = -\sum _{j=1}^C f(p_{\cdot j})$. In addition, the following theorem for $\Phi _{Gf}$ holds.

Theorem 1

The measure $\Phi _{Gf}$ satisfies the following conditions:

(i)
$\Phi _f \le \Phi _{Gf}$.
(ii)
$\Phi _{Gf}$ must lie between 0 and 1.
(iii)
$\Phi _{Gf}=0$ is equivalent to the independence of X and Y.
(iv)
$\Phi _{Gf}=1$ is equivalent to $\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}=0$. That is, for at least one s, there exists t such that $p_{st}\ne 0$ and $p_{sj}=0$ for every j with $j\ne t$.

Theorem 2

The value of $\Phi _{Gf}$ is invariant to permutations of the row and column categories.

Appendices A and B give the proofs of Theorems 1 and 2, respectively. The geoPRV measure differs from the conventional PRV measure in that $\Phi _{Gf}=1$ when there exist i and j such that $p_{ij}=p_{i\cdot }\ne 0$. Another important feature of the geoPRV measure is that it takes an equal or higher value than the conventional PRV measure, allowing for a stronger representation of the row and column relationships.
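One way to see property (i) of Theorem 1, sketched here independently of the proof in Appendix A, is through the weighted arithmetic-geometric mean inequality. Because the weights $p_{i\cdot }$ are nonnegative and sum to 1, and each $V(Y \vert X=i)$ is nonnegative,

$$ \prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }} \le \sum _{i=1}^R p_{i\cdot } V(Y \vert X=i). $$

Dividing both sides by $V(Y)>0$ and subtracting from 1 gives

$$ \Phi _{Gf} = 1 - \frac{\displaystyle \prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}}{V(Y)} \ge 1 - \frac{\displaystyle \sum _{i=1}^R p_{i\cdot } V(Y \vert X=i)}{V(Y)} = \Phi _f, $$

with equality when all $V(Y \vert X=i)$ with $p_{i\cdot }>0$ coincide.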

A property of the geoPRV measure is that the larger the value of $\Phi _{G}$, the stronger the association between the response variable Y and the explanatory variable X. That is, the larger the value of $\Phi _{G}$, the more accurate the prediction of the Y category if X is known. By contrast, if the value of $\Phi _{G}$ is 0, the X category does not affect the Y category.
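The contrast between the arithmetic-mean and geometric-mean constructions can be sketched numerically. The Python snippet below computes $\Phi _f$ and $\Phi _{Gf}$ side by side, again assuming the power-type function $f(x)=(x^{\lambda +1}-x)/\lambda $ from Section 4. The helper names and the $3 \times 3$ table, whose first row puts all of its mass in one cell, are hypothetical illustrations rather than the paper's Table 1.

```python
import math


def f_power(x, lam):
    """f(x) = (x**(lam + 1) - x) / lam, with x*log(x) as the lam -> 0 limit."""
    if x == 0.0:
        return 0.0
    if lam == 0.0:
        return x * math.log(x)
    return (x ** (lam + 1.0) - x) / lam


def variation(dist, lam):
    """Variation measure V = -sum_j f(q_j) of a probability vector q."""
    return -sum(f_power(q, lam) for q in dist)


def prv_pair(p, lam):
    """Return (Phi_f, Phi_Gf) for a table p of cell probabilities."""
    row_tot = [sum(row) for row in p]
    col_tot = [sum(col) for col in zip(*p)]
    v_y = variation(col_tot, lam)
    # (p_i., V(Y | X = i)) pairs; clamp tiny negative rounding errors to 0.
    cond = [
        (r, max(0.0, variation([x / r for x in row], lam)))
        for row, r in zip(p, row_tot)
        if r > 0.0
    ]
    arith = sum(r * v for r, v in cond)            # weighted arithmetic mean
    geom = math.prod(v ** r for r, v in cond)      # weighted geometric mean
    return (v_y - arith) / v_y, (v_y - geom) / v_y


# Hypothetical table: the first row is completely concentrated in cell (1,3).
table = [
    [0.00, 0.00, 0.20],
    [0.10, 0.20, 0.10],
    [0.15, 0.15, 0.10],
]
for lam in (0.0, 0.5, 1.0):
    phi_f, phi_gf = prv_pair(table, lam)
    print(lam, phi_f, phi_gf)
```

Because $V(Y \vert X=1)=0$ in this table, the geometric mean is 0 and $\Phi _{Gf}=1$ for every $\lambda $, whereas $\Phi _f$ stays strictly below 1, in line with Theorem 1 (i) and (iv).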

3 Approximate Confidence Interval for the Measure

Since the measure $\Phi _{Gf}$ is unknown, we derive the confidence interval of $\Phi _{Gf}$. Let $n_{ij}$ denote the frequency for a cell (i, j) ($i=1,\ldots ,R;~j=1,\ldots ,C$) and $n=\sum _{i=1}^R\sum _{j=1}^C n_{ij}$. Assume that the observed frequencies $n_{ij}$ have a multinomial distribution. Here, we consider an approximate standard error and large-sample confidence interval for $\Phi _{Gf}$ using the delta method (Bishop et al., 2007, Chap. 14 and Appendix C in Agresti, 2013, Chap. 16). This leads to the following theorem.

Theorem 3

Let $\widehat{\Phi }_{Gf}$ denote a plug-in estimator of $\Phi _{Gf}$. Then $\sqrt{n}( \widehat{\Phi }_{Gf}-\Phi _{Gf} )$ converges in distribution to a normal distribution with a mean of zero and variance $\sigma ^2 [ \Phi _{Gf} ]$, where

$$\begin{aligned} \sigma ^2[\Phi _{Gf}] = \left( \delta ^{(f)}\right) ^2 \left[ \sum _{i=1}^R\sum _{j=1}^Cp_{ij}(\Delta _{ij}^{(f)})^2 - \left( \sum _{i=1}^R\sum _{j=1}^C p_{ij}\Delta _{ij}^{(f)} \right) ^2 \right] , \end{aligned}$$

with

$$\begin{aligned} \delta ^{(f)}&= \frac{\displaystyle \prod _{s=1}^R \left[ -\sum _{t=1}^C f \left( \frac{p_{st}}{p_{s\cdot }} \right) \right] ^{p_{s\cdot }}}{\displaystyle \left( \sum _{t=1}^C f(p_{\cdot t}) \right) ^2},\\ \Delta _{ij}^{(f)}&= f'(p_{\cdot j}) -\varepsilon _{ij}^{(f)}\sum _{t=1}^C f(p_{\cdot t}),\\ \varepsilon _{ij}^{(f)}&= \log \left[ -\sum _{t=1}^C f \left( \frac{p_{it}}{p_{i\cdot }} \right) \right] + \frac{\displaystyle \sum _{t=1}^C \left\{ -\frac{p_{it}}{p_{i\cdot }} f' \left( \frac{p_{it}}{p_{i\cdot }} \right) \right\} + f' \left( \frac{p_{ij}}{p_{i\cdot }} \right) }{\displaystyle \sum _{t=1}^C f \left( \frac{p_{it}}{p_{i\cdot }} \right) }, \end{aligned}$$

and $f'(x)$ is the derivative of the function f(x) with respect to x.

The proof of Theorem 3 is given in Appendix C.

Let $\widehat{\sigma }^2 \left[ \Phi _{Gf} \right] $ denote a plug-in estimator of $\sigma ^2 \left[ \Phi _{Gf} \right] $. From Theorem 3, since $\widehat{\sigma } \left[ \Phi _{Gf} \right] $ is a consistent estimator of $\sigma \left[ \Phi _{Gf} \right] $, $\widehat{\sigma } \left[ \Phi _{Gf} \right] / \sqrt{n}$ is the estimated standard error for $\widehat{\Phi }_{Gf}$, and $\widehat{\Phi }_{Gf} \pm z_{\alpha /2} \widehat{\sigma } \left[ \Phi _{Gf} \right] / \sqrt{n}$ is the approximate $100(1-\alpha )\%$ confidence interval for $\Phi _{Gf}$, where $z_{\alpha /2}$ is the upper $\alpha /2$ quantile of the standard normal distribution.
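The interval construction can be sketched in Python as follows. Instead of coding the closed-form derivatives of Theorem 3, this sketch approximates the partial derivatives $d_{ij}=\partial \Phi _{Gf}/\partial p_{ij}$ by central differences and plugs them into the usual multinomial delta-method variance $\sum p_{ij}d_{ij}^2-(\sum p_{ij}d_{ij})^2$; this substitution is a simplification of ours, not the paper's procedure. It assumes the power-type function $f(x)=(x^{\lambda +1}-x)/\lambda $ and strictly positive cell counts. All names and the count table are hypothetical.

```python
import math
from statistics import NormalDist


def f_power(x, lam):
    """f(x) = (x**(lam + 1) - x) / lam, with x*log(x) as the lam -> 0 limit."""
    if x == 0.0:
        return 0.0
    if lam == 0.0:
        return x * math.log(x)
    return (x ** (lam + 1.0) - x) / lam


def geoprv(p, lam):
    """Phi_Gf evaluated at a (possibly slightly perturbed) table p."""
    row_tot = [sum(row) for row in p]
    col_tot = [sum(col) for col in zip(*p)]
    v_y = -sum(f_power(q, lam) for q in col_tot)
    geom = 1.0
    for row, r in zip(p, row_tot):
        v_i = max(0.0, -sum(f_power(x / r, lam) for x in row))
        geom *= v_i ** r
    return (v_y - geom) / v_y


def geoprv_ci(counts, lam, alpha=0.05, h=1e-6):
    """Plug-in estimate, standard error, and approximate (1 - alpha) CI."""
    n = sum(sum(row) for row in counts)
    p = [[c / n for c in row] for row in counts]
    est = geoprv(p, lam)
    m1 = m2 = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            up = [r[:] for r in p]
            dn = [r[:] for r in p]
            up[i][j] += h
            dn[i][j] -= h
            # Central-difference approximation of d Phi_Gf / d p_ij.
            d = (geoprv(up, lam) - geoprv(dn, lam)) / (2.0 * h)
            m1 += pij * d
            m2 += pij * d * d
    sigma2 = max(m2 - m1 * m1, 0.0)      # delta-method variance
    se = math.sqrt(sigma2 / n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return est, se, (est - z * se, est + z * se)


# Hypothetical 3x3 table of observed frequencies (not from the paper).
counts = [[30, 10, 10], [10, 30, 10], [10, 10, 30]]
print(geoprv_ci(counts, lam=1.0))
```

The numerical gradient keeps the sketch short and independent of the particular f; for tables with zero cells or with some $V(Y \vert X=i)=0$ the derivatives are not well behaved and this sketch should not be relied on.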

4 Numerical Experiments

In this section, we demonstrate the performance of the geoPRV measure $\Phi _{Gf}$ and confirm the difference between $\Phi _{Gf}$ and the conventional PRV measure $\Phi _f$ proposed by Momozaki et al. (2023). We use $\Phi _f$ and $\Phi _{Gf}$, which have the variation measure $V(Y)=-\sum _{j=1}^Cf(p_{\cdot j})$. We apply two functions: $f(x)=\left( x^{\lambda +1} - x \right) /\lambda $ for $\lambda >-1$ and $g(x)=(x - 1)^2/(\omega x + 1 - \omega ) - (x - 1)/(1 - \omega )$ for $0 \le \omega < 1$ (see Ichimori, 2013). The measures based on the former are denoted by $\Phi _f^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$, while those based on the latter are denoted by $\Phi _g^{(\omega )}$ and $\Phi _{Gg}^{(\omega )}$. It should be noted that the values of $\Phi ^{(\lambda )}_f$ and $\Phi ^{(\lambda )}_{Gf}$ at $\lambda = 0$ are defined as the limits as $\lambda \rightarrow 0$. For the tuning parameters, we set $\lambda =0,~0.5,$ or 1.0 and $\omega =0,~0.5$, or 0.9.
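As a short check of how the power-type function connects to the variation measures of Section 2.1, the following lines work out $V(Y)=-\sum _{j=1}^C f(p_{\cdot j})$ for $f(x)=(x^{\lambda +1}-x)/\lambda $, using only $\sum _{j=1}^C p_{\cdot j}=1$. For general $\lambda $,

$$ -\sum _{j=1}^C \frac{p_{\cdot j}^{\lambda +1}-p_{\cdot j}}{\lambda } = \frac{1}{\lambda }\left( 1-\sum _{j=1}^C p_{\cdot j}^{\lambda +1} \right) , $$

which is the diversity index of degree $\lambda $. Setting $\lambda =1$ gives the Gini concentration,

$$ -\sum _{j=1}^C \left( p_{\cdot j}^{2}-p_{\cdot j} \right) = 1-\sum _{j=1}^C p_{\cdot j}^{2}, $$

and, since $\lim _{\lambda \rightarrow 0}(x^{\lambda +1}-x)/\lambda = x\log x$, the limit $\lambda \rightarrow 0$ gives the Shannon entropy,

$$ -\sum _{j=1}^C p_{\cdot j}\log p_{\cdot j}. $$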

4.1 Artificial Dataset 1

Table 1 shows the artificial dataset considered in this study. The dataset clearly shows the difference in the characteristics between conventional PRV measures and the geoPRV measure. Tables 1a, 1b, and 1c represent changes in the partial association structure by moving the cell probability in the first row. In Table 1a, although all cell probabilities in the first row are non-zero values, most of the marginal probability $p_{1\cdot }$ is concentrated in the (1,2) and (1,3) cells. Similarly, the marginal probability $p_{1\cdot }$ in Table 1b is mostly concentrated in the (1,3) cell. By contrast, Table 1c has a complete partial association in the first row because the probability exists only in the (1,3) cell. In the sense that the first row has a bias in the cell probability, Tables 1a, 1b, and 1c have weak, slightly strong, and complete partial association structures in the first row, respectively.

Table 1 $3 \times 3$ probability tables, with a (a) weak, (b) slightly strong, and (c) complete association structure in the first row


Tables 2a and 2b list the values of $\Phi _f^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$, respectively. For instance, Table 2a shows that when Table 1c is analyzed, the measure $\Phi _f^{(\lambda )}=0.2628,~0.1990,$ or 0.1784 for $\lambda $ = 0, 0.5, or 1.0, respectively, and does not capture the complete association structure of the first row. By contrast, $\Phi _{Gf}^{(\lambda )} = 1$ for all values of $\lambda $, allowing the local association structure to be identified. Comparing $\Phi _{Gf}^{(\lambda )} $ and $\Phi _{f}^{(\lambda )}$ for any $\lambda $ from Table 1a to 1c shows that $\Phi _{Gf}^{(\lambda )}$ changes far more than $\Phi _{f}^{(\lambda )}$ because it captures the partial association structure.

Table 2 Values of $\Phi _{f}^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$, applied to Table 1


Table 3 $4 \times 4$ probability tables formed using three cutpoints for each variable at $z_{0.25}, z_{0.50},$ and $z_{0.75}$ from a bivariate normal distribution with the conditions $\mu _1 = \mu _2 = 0$, $\sigma ^2_1 = \sigma ^2_2 = 1$ and $\rho $ increasing by 0.2 increments from 0 to 1


4.2 Artificial Dataset 2

Table 3 shows another artificial dataset, which is used to examine the value of the geoPRV measure $\Phi _{Gf}$ as the association of the entire contingency table changes. The tables are obtained by discretizing a bivariate normal distribution with means $\mu _1 = \mu _2 = 0$ and variances $\sigma ^2_1 = \sigma ^2_2 = 1$, in which the correlation coefficient $\rho $ is changed from 0 to 1 in 0.2 increments, into $4 \times 4$ probability tables using cutpoints at $z_{0.25}, z_{0.50},$ and $z_{0.75}$ for each variable, so that the marginal probabilities are equal. From Theorem 2 and the properties of the PRV measures, when the correlation coefficients have the same absolute values (i.e., when the order of the rows of the contingency table is simply reversed), the values are equal. Consequently, the results for the negative correlation coefficient case are omitted.
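The paper does not state how the cell probabilities of Table 3 were computed. One possible way, sketched below in Python with only the standard library, is to integrate the standard normal density of the row variable against the conditional normal distribution of the column variable over each quartile interval using Simpson's rule. The function name `quartile_table`, the truncation bound, and the number of integration steps are assumptions made for this illustration.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def quartile_table(rho, steps=400, bound=8.0):
    """4x4 cell probabilities of a standard bivariate normal cut at quartiles."""
    q = [STD.inv_cdf(u) for u in (0.25, 0.50, 0.75)]
    if abs(rho) >= 1.0:
        # Degenerate case: all mass lies on the diagonal or anti-diagonal.
        return [
            [0.25 if (j == i if rho > 0 else j == 3 - i) else 0.0
             for j in range(4)]
            for i in range(4)
        ]
    x_cuts = [-bound] + q + [bound]          # truncated range for integration
    y_cuts = [-math.inf] + q + [math.inf]
    s = math.sqrt(1.0 - rho * rho)

    def cond_prob(x, c, d):
        # P(c < Y <= d | X = x), where Y | X = x ~ N(rho * x, 1 - rho**2).
        return STD.cdf((d - rho * x) / s) - STD.cdf((c - rho * x) / s)

    def simpson(g, a, b):
        # Composite Simpson rule with an even number of sub-intervals.
        h = (b - a) / steps
        total = g(a) + g(b)
        for k in range(1, steps):
            total += (4 if k % 2 else 2) * g(a + k * h)
        return total * h / 3.0

    table = []
    for i in range(4):
        row = []
        for j in range(4):
            c, d = y_cuts[j], y_cuts[j + 1]
            row.append(
                simpson(lambda x: STD.pdf(x) * cond_prob(x, c, d),
                        x_cuts[i], x_cuts[i + 1])
            )
        table.append(row)
    return table


for rho in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(rho, [round(v, 4) for v in quartile_table(rho)[0]])
```

By construction every row and column sums to approximately 0.25, and $\rho =0$ yields a table with all cells close to 1/16.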

Table 4 shows the values of $\Phi _f^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$ for each value of $\rho $. The values of $\Phi _f^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$ increase as the absolute value of $\rho $ increases. Additionally, the results confirm that $\Phi _{Gf}^{(\lambda )} = 0$ if and only if the table exhibits independence, and $\Phi _{Gf}^{(\lambda )} = 1.0$ if and only if the table has a structure with partial or complete association. Moreover, if there is a relationship for the entire contingency table, the values of $\Phi _{Gf}^{(\lambda )}$ are larger than those of $\Phi _{f}^{(\lambda )}$ by Theorem 1, but the differences are small.

Table 4 Values of $\Phi _f^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$ for each $\rho $


4.3 Actual Dataset 1

Consider the case where the PRV measure is applied to the data in Table 5, which represents a survey of cannabis use among students conducted at the University of Ioannina (Greece) in 1995 (published in Marselos et al., 1997). The students’ frequency of alcohol consumption is measured on a four-level scale ranging from at most once per month up to more than twice per week, while their trial of cannabis is rated through a three-level variable (never tried, tried once or twice, more often). The first and second rows in the data show a partial bias of the frequency.

Table 5 Students’ survey about cannabis use at the University of Ioannina


Tables 6a and 6b provide estimates of $\Phi _f^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$, respectively. For instance, when $\lambda =1$, the measure $\widehat{\Phi }_f^{(1)}=0.1034$ in Table 6a, and $\widehat{\Phi }_{Gf}^{(1)}=0.2992$ in Table 6b. $\widehat{\Phi }_f^{(1)}$ shows that the weighted arithmetic mean of the conditional variations of trying cannabis is $10.34\%$ smaller than the marginal variation. Similarly, $\widehat{\Phi }_{Gf}^{(1)}$ shows that the weighted geometric mean of the conditional variations of trying cannabis is $29.92\%$ smaller than the marginal variation.

On the basis of the results of these values, the following can be interpreted from Table 5:

(1) Overall, there is a strong association between regular alcohol consumption and cannabis use.
(2) There are fairly strong associations between some alcohol consumption and cannabis use.

Although these interpretations seem to be intuitive when looking at Table 5, analysis using the measures provides an objective interpretation numerically. Hence, it indicates how strongly associated structures are in the contingency table.

Table 6 Estimate of $\Phi _{f}^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$, estimated approximate standard error for $\widehat{\Phi }_{f}^{(\lambda )}$ and $\widehat{\Phi }_{Gf}^{(\lambda )}$, and approximate $95\%$ confidence interval for $\Phi _{f}^{(\lambda )}$ and $\Phi _{Gf}^{(\lambda )}$


Table 7 Occupational status for Japanese father-son pairs


4.4 Actual Dataset 2

By analyzing multiple contingency tables using the measures, it is possible to numerically determine the difference between the associations of the contingency tables. Table 7 shows data from Hashimoto (1999). These data describe the cross-classifications of the father’s and son’s occupational status categories in Japan, which were examined in 1975 and 1985. Here, we consider the father’s status as an explanatory variable and the son’s status as a response variable. Table 7 is an example of a square contingency table because the row and column variables take the same categorical values. Generally, square contingency tables are analyzed using a test for symmetry or a measure for symmetry (e.g., Bowker’s chi-squared statistic by Bowker, 1948, or the power-divergence type measure of departure from symmetry by Tomizawa et al., 1998). Here, we use the PRV measures to evaluate the influence of fathers on their sons’ occupations. By comparing the degree of association in two different years, we also aim to examine the changes in the effects over a 10-year period.

Table 8 First column indicates the estimate $\widehat{\Phi }_{g}^{(\omega )}$ of $\Phi _{g}^{(\omega )}$, the second column indicates the estimated approximate standard error for $\widehat{\Phi }_{g}^{(\omega )}$, and the final column indicates approximate $95\%$ confidence interval for $\Phi _{g}^{(\omega )}$


Tables 8 and 9 give estimates of $\Phi _g^{(\omega )}$ and $\Phi _{Gg}^{(\omega )}$, respectively. Comparing the estimates for each $\omega $ in Tables 8 and 9 shows that the values for both measures are almost the same. In addition, the estimates are slightly larger in Table 8b, suggesting that the association in Table 7b is stronger, but the difference is small because the confidence intervals overlap for every $\omega $. Similarly, the estimates are slightly larger in Table 9b, again suggesting that the association in Table 7b is stronger. However, the confidence intervals do not overlap at $\omega = 0.9$. These results lead to the following interpretations for Tables 7a and 7b:

(1) The occupational status categories of fathers and sons in 1975 and 1985 both have weak associations overall, indicating that individual explanatory variables do not have remarkable associations.
(2) Although the association in Table 7b is slightly larger than that in Table 7a, the results of the confidence intervals indicate that the overall difference is statistically insignificant.
(3) The partial association in Table 7b is slightly larger than that in Table 7a, and the results of the confidence intervals indicate that there may be a difference.

Statistical differences in the results of some confidence intervals, as in (3), are affected by the differences in the characteristics of variation associated with changing the tuning parameters. In this case, it is difficult to give an interpretation by referring to variation because the variation in the special cases did not differ (e.g., $\omega = 0$). However, when special cases show differences in variation, further interpretation can be given by focusing on the characteristics.

Table 9 Estimate of $\Phi _{Gg}^{(\omega )}$, estimated approximate standard error for $\widehat{\Phi }_{Gg}^{(\omega )}$, and approximate $95\%$ confidence interval for $\Phi _{Gg}^{(\omega )}$


5 Conclusion

In this paper, we proposed the geometric mean type of PRV (geoPRV) measure, which uses variation composed of the geometric mean and arbitrary functions that satisfy certain conditions. The proposed measure has three properties that are suitable for examining the degree of association and that conventional measures also satisfy. (i) The measure increases monotonically as the degree of association increases. (ii) The value is 0 when there is a structure of null association. (iii) The value is 1 when there is a complete structure of association. Because the geoPRV measure uses the geometric mean, it can capture the association to the response variables for individual explanatory variables that cannot be investigated by existing PRV measures. Using the existing PRV measures and the geoPRV measure simultaneously makes it possible to examine the association of the entire contingency table as well as partial association. As with the measure $\Phi _{f}$, the geoPRV measure can be analyzed using variations with various characteristics by providing functions and tuning parameters that satisfy specific conditions. Therefore, analysis using the geoPRV measure together with existing PRV measures can lead to a deeper understanding of the data and provide further interpretation.

Because it may be unclear how to set the functions and tuning parameters using solely the geoPRV measure, we suggest the following when selecting the values. The analyst should select the measurement method best suited to the purpose of the analysis. If the optimal measurement method is unclear, consider the data from multiple perspectives by comparing the various results obtained by changing the tuning parameters. In addition, analysis with several functions and tuning parameters may show differences when comparing the confidence intervals, as in Actual Dataset 2. For this reason, analysis with several functions and tuning parameters can lead to secure detection of irregular situations and safe analysis of data, which cannot be characterized beforehand. Statisticians may be interested in mathematically choosing the tuning parameters based on, for example, the characteristics of the data or the relationships between the row and column variables. However, this is beyond the scope of this study and is a topic for future work.

Although various measures of contingency tables have been proposed, several recent studies have conducted analyses using the Goodman-Kruskal PRV measure (e.g., Gea-Izquierdo, 2023; Iordache et al., 2022). We believe that the proposed PRV measure can provide a new perspective that focuses on the association of individual explanatory variables, including the association of the entire contingency table.

Availability of data and materials

Not applicable.

References

Agresti, A. (1983). A survey of strategies for modeling cross-classifications having ordinal variables. Journal of the American Statistical Association, 78(381), 184–198.
Agresti, A. (2010). Analysis of ordinal categorical data (2nd ed.). John Wiley & Sons.
Agresti, A. (2013). Categorical data analysis (3rd ed.). John Wiley & Sons.
Bishop, Y.M., Fienberg, S.E., Holland, P.W. (2007). Discrete multivariate analysis: Theory and practice. Springer Science & Business Media.
Bowker, A.H. (1948). A test for symmetry in contingency tables. Journal of the American Statistical Association, 43(244), 572–574.
Chuang, C., Gheva, D., Odoroff, C. (1985). Methods for diagnosing multiplicative-interaction models for two-way contingency tables. Communications in Statistics-Theory and Methods, 14(9), 2057–2080.
Cramér, H. (1946). Mathematical methods of statistics. Princeton university press.
Cressie, N., & Read, T.R. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3), 440–464.
Gea-Izquierdo, E. (2023). Biological risk of legionella pneumophila in irrigation systems. Revista de Salud Pública, 22, 434–439.
Gilula, Z., & Haberman, S.J. (1986). Canonical analysis of contingency tables by maximum likelihood. Journal of the American Statistical Association, 81(395), 780–788.
Goodman, L.A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74(367), 537–552.
Goodman, L.A. (1981). Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association, 76(374), 320–334.
Goodman, L.A. (1985). The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. The Annals of Statistics, 13, 10–69.
Goodman, L.A. (1986). Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. International Statistical Review/Revue Internationale de Statistique, 243–270.
Goodman, L.A., & Kruskal, W.H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49(268), 732–764.
Hashimoto, K. (1999). Gendai nihon no kaikyuu kouzou (Class structure in modern Japan: Theory, method and quantitative analysis). Toshindo, Tokyo (in Japanese).
Ichimori, T. (2013). On inequalities between f-divergence. Technical Note, IPSJ Journal, 54(11), 2344–2348.
Iordache, A.M., Nechita, C., Voica, C., Pluháček, T., Schug, K.A. (2022). Climate change extreme and seasonal toxic metal occurrence in Romanian freshwaters in the last two decades-case study and critical review. NPJ Clean Water, 5(1), 2.
Kateri, M. (2014). Contingency table analysis. Springer.
Kateri, M., & Papaioannou, T. (1994). f-divergence association models. University of Ioannina.
Liu, I., & Agresti, A. (2005). The analysis of ordered categorical data: An overview and a survey of recent developments. Test, 14, 1–73.
Marselos, M., Boutsouris, K., Liapi, H., Malamas, M., Kateri, M., Papaioannou, T. (1997). Epidemiological aspects of the use of cannabis among university students in Greece. European Addiction Research, 3(4), 184–191.
Momozaki, T., Wada, Y., Nakagawa, T., Tomizawa, S. (2023). Extension of generalized proportional reduction in variation measure for two-way contingency tables. Behaviormetrika, 50(1), 385–398.
Nakagawa, T., Takei, T., Ishii, A., Tomizawa, S. (2020). Geometric mean type measure of marginal homogeneity for square contingency tables with ordered categories. Journal of Mathematics and Statistics, 16(1), 170–175.
Patil, G., & Taillie, C. (1982). Diversity as a concept and its measurement. Journal of the American Statistical Association, 77(379), 548–561.
Read, T.R., & Cressie, N.A. (2012). Goodness-of-fit statistics for discrete multivariate data. Springer Science & Business Media.
Rényi, A. (1961). On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability.
Rom, D., & Sarkar, S.K. (1992). A generalized model for the analysis of association in ordinal contingency tables. Journal of Statistical Planning and Inference, 33(2), 205–212.
Saigusa, Y., Tahata, K., Tomizawa, S. (2016). Measure of departure from partial symmetry for square contingency tables. Journal of Mathematics and Statistics, 12(3), 152–156.
Saigusa, Y., Takami, M., Ishii, A., Nakagawa, T., Tomizawa, S. (2019). Measure for departure from cumulative partial symmetry for square contingency tables with ordered categories. Journal of Statistics: Advances in Theory and Applications, 21(1), 53–70.
Tahata, K. (2022). Advances in quasi-symmetry for square contingency tables. Symmetry, 14(5), 1051.
Theil, H. (1970). On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76(1), 103–154.
Tomizawa, S., Miyamoto, N., Houya, H. (2004). Generalization of Cramér’s coefficient of association for contingency tables: theory and methods. South African Statistical Journal, 38(1), 1–24.
Tomizawa, S., Seo, T., Ebi, M. (1997). Generalized proportional reduction in variation measure for two-way contingency tables. Behaviormetrika, 24, 193–201.
Tomizawa, S., Seo, T., Yamamoto, H. (1998). Power-divergence-type measure of departure from symmetry for square contingency tables that have nominal categories. Journal of Applied Statistics, 25(3), 387–398.
Tomizawa, S., & Yukawa, T. (2004). Proportional reduction in variation measure for two-way contingency tables with ordered categories. Journal of Statistical Research, 38(1), 45–59.


Acknowledgements

The authors are grateful to the editor and the referees for their valuable comments and suggestions.

Funding

Open access funding provided by Tokyo University of Science. This work was supported by JSPS Grant-in-Aid for Scientific Research (C) Number JP20K03756.

Author information

Yuki Wada, Tomoyuki Nakagawa, Kouji Tahata and Sadao Tomizawa contributed equally to this work.

Authors and Affiliations

Department of Information Sciences, Tokyo University of Science, Noda City, 278-8510, Chiba, Japan
Wataru Urasaki, Yuki Wada, Kouji Tahata & Sadao Tomizawa
School of Data Science, Meisei University, Hino City, 191-8506, Tokyo, Japan
Tomoyuki Nakagawa & Sadao Tomizawa

Authors

Wataru Urasaki
Yuki Wada
Tomoyuki Nakagawa
Kouji Tahata
Sadao Tomizawa

Contributions

These authors contributed equally to this work.

Corresponding author

Correspondence to Wataru Urasaki.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

All authors have read and agreed to the published version of the manuscript.

Conflict of interest

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Proof of Theorem 1

Proof

(i):

Let $\phi $ denote the numerator of a fraction

$$ \Phi _{Gf}-\Phi _f = \frac{\displaystyle \sum _{i=1}^R p_{i\cdot } V(Y \vert X=i)-\prod _{i=1}^R\left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}}{\displaystyle -\sum _{j=1}^Cf(p_{\cdot j})}. $$

If there exists $i$ such that $V(Y \vert X=i)=0$, then the product in $\phi$ vanishes and $\phi \ge 0$ is easily verified (i.e., $\Phi _f\le \Phi _{Gf}$). Otherwise, consider the auxiliary function $g(x)=-\log x$, which is convex since $g''(x)=1/x^2>0$, where $g''(x)$ is the second derivative of $g(x)$ with respect to $x$. From Jensen’s inequality,

$$\begin{aligned}{} & {} \sum _{i=1}^R p_{i\cdot }[-\log V(Y \vert X=i)]\ge -\log \left[ \sum _{i=1}^R p_{i\cdot }V(Y \vert X=i) \right] \\\Longleftrightarrow & {} \sum _{i=1}^R \log \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }} \le \log \left[ \sum _{i=1}^R p_{i\cdot }V(Y \vert X=i) \right] \\\Longleftrightarrow & {} \log \prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }} \le \log \left[ \sum _{i=1}^R p_{i\cdot }V(Y \vert X=i) \right] \\\Longleftrightarrow & {} \prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }} \le \left[ \sum _{i=1}^R p_{i\cdot }V(Y \vert X=i) \right] \end{aligned}$$

where $p_{i\cdot }\ge 0$, $\sum _{i=1}^R p_{i\cdot }=1$. Therefore, $\phi \ge 0$, i.e., $\Phi _f\le \Phi _{Gf}$ holds.

(ii):

The inequality $0\le \Phi _f\le 1$ is already proven by Momozaki et al. (2023), and $\Phi _f\le \Phi _{Gf}$ holds as proved above. Hence, $\Phi _{Gf}\ge 0$ holds since $0\le \Phi _f\le \Phi _{Gf}$. In addition, since $\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}\ge 0$, $\Phi _{Gf}\le 1$. Thus, $0\le \Phi _{Gf}\le 1$ holds.

(iii):

Since $0\le \Phi _f\le \Phi _{Gf}$, if $\Phi _{Gf}=0$ then $\Phi _f=0$. Hence, since $p_{ij}=p_{i\cdot }p_{\cdot j}$ holds for $\Phi _f=0$ (Momozaki et al., 2023), $p_{ij}=p_{i\cdot }p_{\cdot j}$ holds for $\Phi _{Gf}=0$. Thus, $\Phi _{Gf}=0\Longrightarrow p_{ij}=p_{i\cdot }p_{\cdot j}$ holds. Moreover, $\Phi _{Gf}=0\Longleftarrow p_{ij}=p_{i\cdot }p_{\cdot j}$ can easily be checked.

(iv):

If $\Phi _{Gf}=1$, then $\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}=0$. That is, for some $s$ ($s=1,\ldots ,R$), $V(Y \vert X=s)=-\sum _{j=1}^Cf\left( \frac{p_{sj}}{p_{s\cdot }} \right) =0$. Thus, there exists $i$ such that $p_{ij}\ne 0$ and $p_{ik}=0$ ($k\ne j$).$\square $
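The properties in Theorem 1 can be checked numerically on small tables. The following is a minimal Python sketch, not the authors' R program. It assumes $f(x)=x\log x$ with $f(0)=0$, positive row totals, and a non-degenerate column marginal; the names `f_shannon`, `variation` and `phi_pair` and the three example tables are hypothetical.

```python
import math

def f_shannon(x):
    """f(x) = x log x with the convention f(0) = 0 (illustrative choice of f)."""
    return x * math.log(x) if x > 0 else 0.0

def variation(probs, f):
    """V = -sum_j f(p_j); clipped at 0 to absorb rounding error."""
    return max(0.0, -sum(f(q) for q in probs))

def phi_pair(table, f=f_shannon):
    """Return (Phi_f, Phi_Gf) for a table of counts with positive row totals."""
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    row_marg = [sum(row) for row in p]
    col_marg = [sum(col) for col in zip(*p)]
    v_y = variation(col_marg, f)  # V(Y), assumed > 0
    v_cond = [variation([x / r for x in row], f) for row, r in zip(p, row_marg)]
    # Weighted arithmetic mean (conventional PRV) and geometric mean (geoPRV)
    arith = sum(r * v for r, v in zip(row_marg, v_cond))
    geom = math.prod(v ** r for r, v in zip(row_marg, v_cond))
    return (v_y - arith) / v_y, (v_y - geom) / v_y

general = [[10, 20, 30], [30, 20, 10], [5, 5, 50]]
independent = [[2, 4], [3, 6]]    # p_ij = p_i. * p_.j
one_pure_row = [[10, 0], [5, 5]]  # V(Y | X=1) = 0

for name, tab in [("general", general),
                  ("independent", independent),
                  ("one pure row", one_pure_row)]:
    phi_f, phi_gf = phi_pair(tab)
    print(f"{name}: Phi_f = {phi_f:.4f}, Phi_Gf = {phi_gf:.4f}")
```

For the independent table both values are zero up to rounding, and for the table with one pure row the geometric mean vanishes, so the geoPRV value equals 1 while the conventional value stays below 1.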

Proof of Theorem 2

Proof

Since the first terms in the denominator and numerator of $\Phi _{Gf}$ are independent of the row category, we focus on the second term in the numerator. This term is given as

$$\begin{aligned} \prod _{i=1}^R \left[ - \sum _{j=1} ^C f \left( \frac{p_{ij}}{p_{i\cdot }} \right) \right] ^{p_{i\cdot }} = \prod _{i=1}^R \left[ -f \left( \frac{p_{i1}}{p_{i\cdot }} \right) - \cdots - f \left( \frac{p_{iC}}{p_{i\cdot }}\right) \right] ^{p_{i\cdot }}, \end{aligned}$$

and its value is invariant to the reordering of the factors in the product. Namely, the value of $\Phi _{Gf}$ is invariant with respect to the permutation of row categories. Similarly, since each sum over $j$ is invariant to the reordering of its terms, the value of $\Phi _{Gf}$ is also invariant with respect to the permutation of column categories.$\square $
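The invariance can also be confirmed by brute force. The sketch below, written in Python rather than the authors' R, evaluates the measure on every row and column permutation of a small table. It assumes $f(x)=x^2-x$ (convex, with $f(0)=f(1)=0$) and positive row totals; the function names and the table are hypothetical.

```python
from itertools import permutations

def f_gini(x):
    """f(x) = x^2 - x, an illustrative convex f with f(0) = f(1) = 0."""
    return x * x - x

def phi_gf(table, f=f_gini):
    """geoPRV value for a table of counts with positive row totals."""
    n = sum(sum(row) for row in table)
    col_marg = [sum(col) / n for col in zip(*table)]
    v_y = -sum(f(q) for q in col_marg)
    geom = 1.0
    for row in table:
        r = sum(row)
        v_i = max(0.0, -sum(f(x / r) for x in row))  # V(Y | X=i)
        geom *= v_i ** (r / n)                       # weight p_i.
    return 1.0 - geom / v_y

table = [[12, 3, 5], [4, 9, 7], [2, 6, 22]]
base = phi_gf(table)

n_rows, n_cols = len(table), len(table[0])
row_perms = [[table[i] for i in order] for order in permutations(range(n_rows))]
col_perms = [[[row[j] for j in order] for row in table]
             for order in permutations(range(n_cols))]

# Largest deviation from the unpermuted value; only rounding error is expected.
max_dev = max(abs(phi_gf(t) - base) for t in row_perms + col_perms)
print(base, max_dev)
```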

Proof of Theorem 3

Proof

Let

$$ \varvec{n}=(n_{11},n_{12},\ldots ,n_{1C},n_{21},\ldots ,n_{RC})^\top , $$

$$ \varvec{p}=(p_{11},p_{12},\ldots ,p_{1C},p_{21},\ldots ,p_{RC})^\top , $$

$\widehat{\varvec{p}}=\varvec{n}/n$, where $\varvec{a}^\top $ denotes the transpose of $\varvec{a}$. Then $\sqrt{n}\left( \widehat{\varvec{p}} - \varvec{p} \right) $ converges in distribution to a normal distribution with mean zero and covariance matrix $\textrm{diag}(\varvec{p}) - \varvec{pp}^\top $, where $\textrm{diag}(\varvec{p})$ is a diagonal matrix with the elements of $\varvec{p}$ on the main diagonal (Bishop et al., 2007, Chap. 14).

The Taylor expansion of the function $\widehat{\Phi }_{Gf}$ around $\varvec{p}$ is given by

$$ \widehat{\Phi }_{Gf} = \Phi _{Gf} + \left( \frac{\partial \Phi _{Gf}}{\partial \varvec{p}^\top } \right) (\widehat{\varvec{p}}-\varvec{p}) + o_p(n^{-1/2}). $$

Since

$$ \sqrt{n}(\widehat{\Phi }_{Gf}-\Phi _{Gf}) = \sqrt{n} \left( \frac{\partial \Phi _{Gf}}{\partial \varvec{p}^\top } \right) (\widehat{\varvec{p}}-\varvec{p}) + o_p(1), $$

$$ \sqrt{n}(\widehat{\Phi }_{Gf}-\Phi _{Gf}) \overset{d}{\rightarrow }\ N(0,\sigma ^2[\Phi _{Gf}]), $$

where

$$ \sigma ^2[\Phi _{Gf}] = \left( \delta ^{(f)}\right) ^2 \left[ \sum _{i=1}^R\sum _{j=1}^Cp_{ij}(\Delta _{ij}^{(f)})^2 - \left( \sum _{i=1}^R\sum _{j=1}^C p_{ij}\Delta _{ij}^{(f)} \right) ^2 \right] , $$

with

$$\begin{aligned} \delta ^{(f)} &= \frac{\displaystyle \prod _{s=1}^R \left[ -\sum _{t=1}^C f \left( \frac{p_{st}}{p_{s\cdot }} \right) \right] ^{p_{s\cdot }}}{\displaystyle \left( \sum _{t=1}^C f(p_{\cdot t}) \right) ^2},\\ \Delta _{ij}^{(f)} &= f'(p_{\cdot j}) -\varepsilon _{ij}^{(f)}\sum _{t=1}^C f(p_{\cdot t}),\\ \varepsilon _{ij}^{(f)} &= \log \left[ -\sum _{t=1}^C f \left( \frac{p_{it}}{p_{i\cdot }} \right) \right] + \frac{\displaystyle \sum _{t=1}^C \left\{ -\frac{p_{it}}{p_{i\cdot }} f' \left( \frac{p_{it}}{p_{i\cdot }} \right) \right\} + f' \left( \frac{p_{ij}}{p_{i\cdot }} \right) }{\displaystyle \sum _{t=1}^C f \left( \frac{p_{it}}{p_{i\cdot }} \right) }, \end{aligned}$$

and $f'(x)$ is the derivative of $f(x)$ with respect to $x$.$\square $
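The passage from the Taylor expansion to $\sigma ^2[\Phi _{Gf}]$ can be sketched as follows. The shorthand $G$, $D$ and $d_{ij}$ is introduced here for illustration only and does not appear in the original proof; the sketch assumes $G>0$, so that $\log G$ is defined, with $\varepsilon _{ij}^{(f)}$ playing the role of $\partial \log G/\partial p_{ij}$.

$$\begin{aligned} G &= \prod _{s=1}^R \left[ V(Y \vert X=s) \right] ^{p_{s\cdot }}, \qquad D = -\sum _{t=1}^C f(p_{\cdot t}), \qquad \Phi _{Gf} = 1-\frac{G}{D},\\ d_{ij} &= \frac{\partial \Phi _{Gf}}{\partial p_{ij}} = -\frac{1}{D}\frac{\partial G}{\partial p_{ij}} + \frac{G}{D^2}\frac{\partial D}{\partial p_{ij}} = -\frac{G}{D^2}\left[ f'(p_{\cdot j}) + D\,\frac{\partial \log G}{\partial p_{ij}} \right] = -\delta ^{(f)}\Delta _{ij}^{(f)},\\ \sigma ^2[\Phi _{Gf}] &= \varvec{d}^\top \left( \textrm{diag}(\varvec{p}) - \varvec{pp}^\top \right) \varvec{d} = \sum _{i=1}^R\sum _{j=1}^C p_{ij}d_{ij}^2 - \left( \sum _{i=1}^R\sum _{j=1}^C p_{ij}d_{ij} \right) ^2. \end{aligned}$$

Substituting $d_{ij}=-\delta ^{(f)}\Delta _{ij}^{(f)}$ into the last line factors out $\left( \delta ^{(f)}\right) ^2$ and gives the stated form of $\sigma ^2[\Phi _{Gf}]$.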

Program (R Code)

Appendix D contains the R function geoPRV(), which uses the geoPRV measure to evaluate the degree of association in an $R \times C$ contingency table. The arguments of the function are:

dat : the two-way contingency table of size $R\times C$, where $R,C > 1$,
f : the function f(x), with a tuning parameter, that defines the geoPRV measure and satisfies the following conditions: (i) f(x) is a convex function, (ii) $0 \cdot f(0/0)=0$, (iii) $\lim _{x\rightarrow +0}f(x)=0$, and (iv) $f(1)=0$,
para : the tuning parameter given to f(x).

If f(x) is used without a tuning parameter, put an arbitrary number in para.

The numerical summaries produced from the function geoPRV() are:

the contingency table under investigation, Data,
the estimate of geoPRV measure, geoPRV,
the estimated approximate standard error for geoPRV measure, SE,
the approximate $95\%$ confidence interval for geoPRV measure, CI.
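For readers who want to see how such summaries fit together, the following Python sketch mirrors the arguments and outputs listed above. It is not the authors' R function geoPRV(): the names `geo_prv`, `phi_g` and `f_power` are hypothetical, the power-type family in `f_power` is an assumed example of an f with a tuning parameter (taken with parameter greater than $-1$), and the gradient in the delta method is approximated by central differences rather than by the closed form of Theorem 3. All cells are assumed to be strictly positive.

```python
import math

def f_power(x, lam):
    """Assumed example family: (x^(lam+1) - x)/lam, or x log x when lam == 0."""
    if x <= 0:
        return 0.0
    if lam == 0:
        return x * math.log(x)
    return (x ** (lam + 1) - x) / lam

def phi_g(p, f, para):
    """geoPRV value for a matrix p of cell probabilities."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    v_y = -sum(f(c, para) for c in col)
    geom = 1.0
    for r_i, cells in zip(row, p):
        v_i = max(0.0, -sum(f(x / r_i, para) for x in cells))
        geom *= v_i ** r_i
    return 1.0 - geom / v_y

def geo_prv(dat, f, para, h=1e-6, z=1.959964):
    """Estimate, approximate SE and 95% CI via a numerical delta method."""
    if any(x <= 0 for r in dat for x in r):
        raise ValueError("this sketch assumes strictly positive cell counts")
    n = sum(sum(r) for r in dat)
    p = [[x / n for x in r] for r in dat]
    est = phi_g(p, f, para)
    flat, grad = [], []
    for i, r in enumerate(p):
        for j, x in enumerate(r):
            up = [list(q) for q in p]
            dn = [list(q) for q in p]
            up[i][j] = x + h
            dn[i][j] = x - h
            # Central-difference approximation of d Phi / d p_ij
            grad.append((phi_g(up, f, para) - phi_g(dn, f, para)) / (2 * h))
            flat.append(x)
    mean = sum(q * g for q, g in zip(flat, grad))
    var = sum(q * g * g for q, g in zip(flat, grad)) - mean ** 2
    se = math.sqrt(max(var, 0.0) / n)
    return {"Data": dat, "geoPRV": est, "SE": se,
            "CI": (est - z * se, est + z * se)}

example = [[20, 10, 5], [8, 15, 12], [4, 9, 30]]  # made-up counts
print(geo_prv(example, f_power, 1.0))
```

Using a numerical gradient keeps the sketch short and independent of the particular f; a production implementation would more likely use the analytic derivatives.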

Therefore, when cannabis.dat is the R object assigned to Table 5 so that

then the function produces the following numerical summaries, which are the results of $\lambda = 1.0$ in Table 6b,

Note that due to the R-language specification, geoPRV(dat, f_lambda, 0) does not work well for analysis using $f(x) = x\log x $. Therefore, the function must be given separately, as in

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Urasaki, W., Wada, Y., Nakagawa, T. et al. Geometric Mean Type of Proportional Reduction in Variation Measure for Two-Way Contingency Tables. Sankhya B 86, 139–163 (2024). https://doi.org/10.1007/s13571-023-00320-w


Received: 16 June 2023
Accepted: 09 November 2023
Published: 03 January 2024
Issue Date: May 2024
DOI: https://doi.org/10.1007/s13571-023-00320-w
