1 Introduction

Categorical variables, which originate from distinct categories, are employed in various fields such as medicine, psychology, education, and social science. Consider two categorical variables with R and C categories, respectively. These two variables have \(R \times C\) combinations, which can be represented in a table with R rows and C columns. This is called a two-way contingency table, where each (ij) cell (\(i=1,\ldots , R;~j=1,\ldots , C\)) displays the observed frequency. Typically, a two-way contingency table is used to evaluate whether two variables are related (i.e., not statistically independent). If the independence of the two variables is rejected using Pearson’s chi-square test, for example, or the variables are clearly related, then the strength of association can provide useful insight into their relationship.

Previous studies have proposed various association models to examine the association structure of contingency tables. Goodman (1979) proposed the uniform association model. Although this is the simplest association model, it is severely constrained. Consequently, various extended models such as the linear-by-linear association model and the R, C, \(R+C\), and RC models have been proposed by Goodman (1985, 1986), Agresti (1983), and Liu & Agresti (2005), respectively. Many authors have presented models related to Goodman’s association model (see, e.g., Goodman, 1979, 1981; Chuang et al., 1985; Gilula & Haberman, 1986; Rom & Sarkar, 1992). For details on the various association models, see Goodman’s paper above and the writings of Agresti (2010, 2013) and Kateri (2014).

A typical criterion to evaluate an association model is the goodness-of-fit test with the models. Additionally, numerous measures have been proposed as indicators to show the degree of association within the interval from 0 to 1.

Examples include Pearson’s coefficient \(\phi ^2\) of mean square contingency, Pearson’s coefficient P of contingency, and Tschuprow’s coefficient T (see, e.g., Bishop et al., 2007, Chap. 11; Agresti, 2013, Chap. 3). However, these measures have weaknesses: \(\phi ^2\) does not attain 1 even if the contingency table has a complete association structure (i.e., the maximum departure from independence), and the numbers of rows and columns in the table affect P and T. To solve this problem, Cramér (1946) proposed Cramér’s coefficient V, which becomes 1 if the contingency table has a complete associative structure for all rows and columns. Recently, Tomizawa et al. (2004) proposed power-divergence type measures \(V^2_{t(\lambda )}\) \((t=1,2,3)\) by linking the contingency table and the divergence. Power-divergence type measures evaluate the difference between two probability distributions (for more information about divergence and power-divergence, see Rényi, 1961; Cressie & Read, 1984; Read & Cressie, 2012, Chap. 2). These measures calculate the degree of association for each (ij) cell in the contingency table, and the overall degree of association is obtained as the sum over all cells. Hence, these measures can be applied to most contingency tables without distinguishing whether the row and column variables are explanatory or response variables. In actual contingency table analysis, however, the row and column variables are often defined as explanatory or response variables, and ignoring the characteristics of each variable in the analysis is inappropriate.

For an \(R \times C\) contingency table with the explanatory variable X and the response variable Y, various measures have been proposed to assess the degree of association. For an ordinal variable, Agresti (2010) in Chap. 7 provides a detailed description of the ordinal measures of association. By contrast, if the category is a nominal variable, using an ordinal measure is not suitable in the analysis. Thus, measures explained by the proportional reduction in variation (PRV) from the marginal distribution of the response Y to the conditional distributions of Y given an explanatory variable X are used to describe the degree of association. Measures constructed by this method are called PRV measures. Examples of PRV measures are the concentration coefficient \(\tau \) (Goodman & Kruskal, 1954) and uncertainty coefficient U (Theil, 1970), which employ the Gini concentration and Shannon entropy for the variation measure, respectively. Additionally, when a contingency table has an ordinal category, Tomizawa & Yukawa (2004) proposed \(\phi ^{\lambda }\), which incorporates ordinal information.

The PRV measure is a valuable tool to summarize the strength of association of the entire contingency table because the value is easily interpreted due to its construction. That is, the value shows how much more effective it is to predict the response variable Y when the explanatory variable X is known rather than when it is unknown. Sometimes, the analysis aims to evaluate the association of specific categories of explanatory variables. However, conventional PRV measures may not accurately reflect the partial association numerically because they underestimate the strength of association. In the previous studies of models and measures to evaluate the symmetry of the contingency table, Nakagawa et al. (2020), Saigusa et al. (2016), and Saigusa et al. (2019) proposed evaluating the partial symmetry using the geometric mean. On the other hand, little research has examined partial associations.

In this paper, we propose a geometric mean type of PRV (geoPRV) measure via a geometric mean and functions satisfying certain conditions. The geoPRV measure has practical advantages. In special cases, the geoPRV measure allows previously proposed PRV measures to be expressed as geometric mean types. Because the geometric mean sensitively captures the association of each explanatory variable, the analysis does not underestimate the degree of association when cells in the contingency table are partially biased. The proposed measure strongly reflects a partial association structure as it provides the relationship to the response variable by explanatory variable category. Therefore, the geoPRV measure evaluates the strength of association in the same manner as existing PRV measures if the entire contingency table has an association structure. However, the geoPRV measure simultaneously elucidates the strength of a partial association structure numerically.

Furthermore, whether the categorical variable is nominal or ordinal does not influence the geoPRV measure because its value does not change when the row categories or the column categories are permuted.

The rest of this paper is organized as follows. Section 2 introduces previous research on an extension of generalized PRV (eGPRV) measure and proposes the geoPRV measure. Section 3 presents the approximate confidence intervals of the proposed measure. Section 4 confirms the values and confidence intervals of the proposed measure using several artificial and actual datasets. Additionally, it compares the geoPRV measure with the eGPRV measure. Section 5 presents our conclusions.

2 PRV Measure

In this section, we introduce measures using a function f(x) that satisfies the following conditions: (i) f(x) is a convex function, (ii) \(0 \cdot f(0/0)=0\), (iii) \(\lim _{x\rightarrow +0}f(x)=0\), and (iv) \(f(1)=0\). Kateri & Papaioannou (1994), Momozaki et al. (2023), and Tahata (2022) have introduced examples of the function and proposed models and measures. These proposals are intended to generalize existing models. Such generalizations are advantageous in applications because they make it easy to construct new models or to adjust the analysis using tuning parameters. Section 2.1 overviews conventional PRV measures, including the measure proposed by Momozaki et al. (2023), while Section 2.2 describes our proposed geoPRV measure and its characteristics.

2.1 Conventional PRV Measure

Consider an \(R\times C\) contingency table with nominal categories of the explanatory variable X and the response variable Y. Let \(p_{ij}\) denote the probability that an observation will fall in the ith row and jth column of the table (\(i=1,\ldots , R;j=1,\ldots , C)\). In addition, denote \(p_{i\cdot }\) and \(p_{\cdot j}\) as \(p_{i\cdot }=\sum _{l=1}^C p_{il}\), \(p_{\cdot j}=\sum _{k=1}^Rp_{kj}\). The conventional PRV measure takes the following form

$$ \Phi = \frac{V(Y)-E[V(Y \vert X)]}{V(Y)} = \frac{\displaystyle V(Y)-\sum _{i=1}^Rp_{i\cdot }V(Y \vert X=i)}{V(Y)}, $$

where V(Y) is a measure of the variation for the marginal distribution of Y, and \(E[V(Y \vert X)]\) is the expectation of the conditional variation of Y given X, taken over the distribution of X (see Agresti, 2013, Chap. 2). \(\Phi \) uses the weighted arithmetic mean of \(V(Y \vert X=i)\), that is, \(\sum _{i=1}^Rp_{i\cdot }V(Y \vert X=i)\). Various PRV measures can be expressed by changing the variation measure. Examples include the uncertainty coefficient U for the variation measure \(V(Y)=-\sum _{j=1}^C p_{\cdot j}\log p_{\cdot j}\), which is called the Shannon entropy, and the concentration coefficient \(\tau \) for the variation measure \(V(Y)=1-\sum _{j=1}^C p_{\cdot j}^2\), which is called the Gini concentration. Tomizawa et al. (1997) proposed a generalized PRV measure \(T^{(\lambda )}\) that includes U and \(\tau \). \(T^{(\lambda )}\) uses \(V(Y) = \left( 1-\sum _{j=1}^C p_{\cdot j}^{\lambda +1} \right) /\lambda \) as the variation measure, which is the diversity index of degree \(\lambda \) of Patil & Taillie (1982) for the marginal distribution \(p_{\cdot j}\). Furthermore, Momozaki et al. (2023) proposed an extension of the generalized PRV (eGPRV) measure, which includes U, \(\tau \), and \(T^{(\lambda )}\) as special cases. The eGPRV measure is given as

$$ \Phi _f = \frac{\displaystyle -\sum _{j=1}^{C}f(p_{\cdot j}) - \sum _{i=1}^R p_{i\cdot } \left[ - \sum _{j=1} ^C f \left( \frac{p_{ij}}{p_{i\cdot }} \right) \right] }{\displaystyle - \sum _{j=1}^C f(p_{\cdot j})}. $$

The variation measure used in the eGPRV measure \(\Phi _f\) is \(V(Y)= -\sum _{j=1}^{C}f(p_{\cdot j})\).
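As an illustration of the definition of \(\Phi _f\), the following is a minimal sketch in Python, not the authors' implementation. The function names `f_lambda`, `variation`, and `egprv` are hypothetical, and the choice \(f(x)=(x^{\lambda +1}-x)/\lambda \) (with the limit \(x\log x\) at \(\lambda =0\)) is assumed only as one convenient example of a function satisfying conditions (i)-(iv).

```python
import math

def f_lambda(x, lam):
    """Example f: (x**(lam+1) - x)/lam for lam > -1, with limit x*log(x) at lam = 0."""
    if x == 0.0:
        return 0.0  # conditions (ii)/(iii): a zero probability contributes nothing
    if lam == 0.0:
        return x * math.log(x)
    return (x ** (lam + 1.0) - x) / lam

def variation(dist, f):
    """Variation measure V = -sum_j f(p_j) of a probability vector."""
    return -sum(f(p) for p in dist)

def egprv(table, f):
    """Phi_f for a table of joint probabilities p_ij (rows: X, columns: Y)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    v_y = variation(col_tot, f)
    # Weighted arithmetic mean of the conditional variations V(Y | X = i);
    # rows with zero marginal probability are skipped.
    mean_cond = sum(
        r * variation([p / r for p in row], f)
        for row, r in zip(table, row_tot)
        if r > 0.0
    )
    return (v_y - mean_cond) / v_y
```

For the hypothetical table `[[0.4, 0.1], [0.1, 0.4]]` with \(\lambda =1\) (the Gini concentration), \(V(Y)=0.5\) and each conditional variation is 0.32, so `egprv` returns approximately 0.36, which is the concentration coefficient \(\tau \) of that table.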

2.2 geoPRV Measure

We propose a new PRV measure using the weighted geometric mean of \(V(Y \vert X=i)\). Our proposed measure aims to sensitively capture the association of each explanatory variable to the response variable. Assuming that \(p_{\cdot j}>0\) and \(V(Y \vert X=i)\) is a real number greater than or equal to 0 (\(i=1,\ldots , R;~j=1,\ldots , C\)), we define the geoPRV measure for \(R\times C\) contingency tables as

$$ \Phi _{G} = \frac{\displaystyle V(Y)-\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}}{V(Y)}, $$

where V(Y) is a measure of the variation for the marginal distribution of Y. The geoPRV measure can use the same variation as the conventional PRV measure, for example,

$$ \Phi _{Gf} = \frac{\displaystyle -\sum _{j=1}^C f(p_{\cdot j}) - \prod _{i=1}^R \left[ -\sum _{j=1}^C f \left( \frac{p_{ij}}{p_{i\cdot }} \right) \right] ^{p_{i\cdot }}}{\displaystyle -\sum _{j=1}^C f(p_{\cdot j})}, $$

where the variation measure \(V(Y) = -\sum _{j=1}^C f(p_{\cdot j})\). In addition, the following theorem for \(\Phi _{Gf}\) holds.

Theorem 1

The measure \(\Phi _{Gf}\) satisfies the following conditions:

(i) \(\Phi _f \le \Phi _{Gf}\).

(ii) \(\Phi _{Gf}\) must lie between 0 and 1.

(iii) \(\Phi _{Gf}=0\) is equivalent to the independence of X and Y.

(iv) \(\Phi _{Gf}=1\) is equivalent to \(\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}=0\). That is, for at least one s, there exists t such that \(p_{st}\ne 0\) and \(p_{sj}=0\) for every j with \(j\ne t\).

Theorem 2

The value of \(\Phi _{Gf}\) is invariant to permutations of the row and column categories.

Appendices A and B give the proofs of Theorems 1 and 2, respectively. The geoPRV measure differs from the conventional PRV measure in that \(\Phi _{Gf}=1\) whenever there exist i and j such that \(p_{ij}=p_{i\cdot }\ne 0\), that is, whenever at least one row has all of its probability in a single cell. Another important feature of the geoPRV measure is that it takes a value equal to or higher than that of the conventional PRV measure, allowing for a stronger representation of the row and column relationships.
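One way to see property (i) of Theorem 1 is through the weighted inequality of arithmetic and geometric means. The following is only a sketch under the assumptions \(V(Y)>0\) and \(V(Y \vert X=i)\ge 0\); Appendix A remains the reference for the full proof. With weights \(p_{i\cdot }\) summing to 1,

$$ \prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }} \le \sum _{i=1}^R p_{i\cdot } V(Y \vert X=i), $$

so that

$$ \Phi _{Gf} = \frac{\displaystyle V(Y)-\prod _{i=1}^R \left[ V(Y \vert X=i) \right] ^{p_{i\cdot }}}{V(Y)} \ge \frac{\displaystyle V(Y)-\sum _{i=1}^R p_{i\cdot } V(Y \vert X=i)}{V(Y)} = \Phi _f . $$

Equality in the first inequality holds when the conditional variations \(V(Y \vert X=i)\) are equal for all rows with \(p_{i\cdot }>0\), which is why the two measures are close when every row shows a similar degree of association.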

A property of the geoPRV measure is that the larger the value of \(\Phi _{G}\), the stronger the association between the response variable Y and the explanatory variable X. That is, the larger the value of \(\Phi _{G}\), the more accurate the prediction of the Y category if X is known. By contrast, if the value of \(\Phi _{G}\) is 0, the X category does not affect the Y category.
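The definition of \(\Phi _{Gf}\) and its contrast with \(\Phi _f\) can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the names `f_lambda`, `cond_variation`, `egprv`, and `geoprv` are hypothetical, and \(f(x)=(x^{\lambda +1}-x)/\lambda \) is assumed as an example function. Conditional variations that are zero (or negative through rounding) are treated as zero, so the product vanishes and the measure equals 1.

```python
import math

def f_lambda(x, lam):
    """Example f: (x**(lam+1) - x)/lam for lam > -1, with limit x*log(x) at lam = 0."""
    if x == 0.0:
        return 0.0
    if lam == 0.0:
        return x * math.log(x)
    return (x ** (lam + 1.0) - x) / lam

def cond_variation(row, f):
    """V(Y | X = i) = -sum_j f(p_ij / p_i.) for one row of joint probabilities."""
    r = sum(row)
    return -sum(f(p / r) for p in row)

def egprv(table, f):
    """Phi_f: uses the weighted arithmetic mean of the conditional variations."""
    v_y = -sum(f(sum(col)) for col in zip(*table))
    mean = sum(sum(row) * cond_variation(row, f) for row in table if sum(row) > 0.0)
    return (v_y - mean) / v_y

def geoprv(table, f):
    """Phi_Gf: uses the weighted geometric mean of the conditional variations."""
    v_y = -sum(f(sum(col)) for col in zip(*table))
    geo = 1.0
    for row in table:
        r = sum(row)
        if r == 0.0:
            continue  # a row with zero marginal probability has exponent 0
        v = cond_variation(row, f)
        if v <= 0.0:
            geo = 0.0  # one row without variation makes the whole product zero
            break
        geo *= v ** r
    return (v_y - geo) / v_y
```

For a hypothetical table such as `[[0.0, 0.0, 0.2], [0.1, 0.2, 0.1], [0.2, 0.1, 0.1]]`, whose first row has all of its probability in one cell, `geoprv` returns 1 while `egprv` stays below 1, which is the behaviour described in Theorem 1 (iv).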

3 Approximate Confidence Interval for the Measure

Since the true value of \(\Phi _{Gf}\) is unknown in practice, we derive a confidence interval for \(\Phi _{Gf}\). Let \(n_{ij}\) denote the frequency for a cell (ij) and \(n=\sum _{i=1}^R\sum _{j=1}^C n_{ij}\) (\(i=1,\ldots ,R;~j=1,\ldots ,C\)). Assume that the observed frequencies \(n_{ij}\) have a multinomial distribution. Here, we consider an approximate standard error and large-sample confidence interval for \(\Phi _{Gf}\) using the delta method (Bishop et al., 2007, Chap. 14; Appendix C in Agresti, 2013, Chap. 16). This leads to the following theorem.

Theorem 3

Let \(\widehat{\Phi }_{Gf}\) denote a plug-in estimator of \(\Phi _{Gf}\). Then \(\sqrt{n}( \widehat{\Phi }_{Gf}-\Phi _{Gf} )\) converges in distribution to a normal distribution with a mean of zero and variance \(\sigma ^2 [ \Phi _{Gf} ]\), where

$$\begin{aligned} \sigma ^2[\Phi _{Gf}] = \left( \delta ^{(f)}\right) ^2 \left[ \sum _{i=1}^R\sum _{j=1}^Cp_{ij}(\Delta _{ij}^{(f)})^2 - \left( \sum _{i=1}^R\sum _{j=1}^C p_{ij}\Delta _{ij}^{(f)} \right) ^2 \right] , \end{aligned}$$

with

$$\begin{aligned} \delta ^{(f)}&= \frac{\displaystyle \prod _{s=1}^R \left[ -\sum _{t=1}^C f \left( \frac{p_{st}}{p_{s\cdot }} \right) \right] ^{p_{s\cdot }}}{\displaystyle \left( \sum _{t=1}^C f(p_{\cdot t}) \right) ^2},\\ \Delta _{ij}^{(f)}&= f'(p_{\cdot j}) -\varepsilon _{ij}^{(f)}\sum _{t=1}^C f(p_{\cdot t}),\\ \varepsilon _{ij}^{(f)}&= \log \left[ -\sum _{t=1}^C f \left( \frac{p_{it}}{p_{i\cdot }} \right) \right] + \frac{\displaystyle \sum _{t=1}^C \left\{ -\frac{p_{it}}{p_{i\cdot }} f' \left( \frac{p_{it}}{p_{i\cdot }} \right) \right\} + f' \left( \frac{p_{ij}}{p_{i\cdot }} \right) }{\displaystyle \sum _{t=1}^C f \left( \frac{p_{it}}{p_{i\cdot }} \right) }, \end{aligned}$$

and \(f'(x)\) is the derivative of the function f(x) with respect to x.

The proof of Theorem 3 is given in Appendix C.

Let \(\widehat{\sigma }^2 \left[ \Phi _{Gf} \right] \) denote a plug-in estimator of \(\sigma ^2 \left[ \Phi _{Gf} \right] \). From Theorem 3, since \(\widehat{\sigma } \left[ \Phi _{Gf} \right] \) is a consistent estimator of \(\sigma \left[ \Phi _{Gf} \right] \), \(\widehat{\sigma } \left[ \Phi _{Gf} \right] / \sqrt{n}\) is the estimated standard error for \(\widehat{\Phi }_{Gf}\), and \(\widehat{\Phi }_{Gf} \pm z_{\alpha /2} \widehat{\sigma } \left[ \Phi _{Gf} \right] / \sqrt{n}\) is the approximate \(100(1-\alpha )\%\) confidence interval for \(\Phi _{Gf}\), where \(z_{\alpha /2}\) is the upper \(\alpha /2\) quantile of the standard normal distribution.
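The construction of the interval can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' procedure: instead of the closed-form derivatives of Theorem 3, the sketch approximates the partial derivatives \(d_{ij}=\partial \Phi _{Gf}/\partial p_{ij}\) by central finite differences and applies the general multinomial delta-method variance \(\sum p_{ij}d_{ij}^2-(\sum p_{ij}d_{ij})^2\). All function names, the step size `h`, and the requirement that every observed count be positive are assumptions of the sketch.

```python
import math
from statistics import NormalDist

def f_lambda(x, lam):
    """Example f: (x**(lam+1) - x)/lam for lam > -1, with limit x*log(x) at lam = 0."""
    if x == 0.0:
        return 0.0
    if lam == 0.0:
        return x * math.log(x)
    return (x ** (lam + 1.0) - x) / lam

def geoprv(table, f):
    """Phi_Gf evaluated at a table of (possibly perturbed) cell probabilities."""
    v_y = -sum(f(sum(col)) for col in zip(*table))
    geo = 1.0
    for row in table:
        r = sum(row)
        v = -sum(f(p / r) for p in row)
        if v <= 0.0:
            return 1.0  # a row without variation gives a zero product
        geo *= v ** r
    return (v_y - geo) / v_y

def geoprv_ci(counts, f, alpha=0.05, h=1e-6):
    """Plug-in estimate, approximate standard error, and confidence interval.

    Assumes multinomial sampling and strictly positive counts so that the
    finite-difference perturbations stay inside the parameter space.
    """
    n = sum(sum(row) for row in counts)
    p = [[c / n for c in row] for row in counts]
    est = geoprv(p, f)
    m1 = m2 = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            up = [r[:] for r in p]
            dn = [r[:] for r in p]
            up[i][j] += h
            dn[i][j] -= h
            d = (geoprv(up, f) - geoprv(dn, f)) / (2.0 * h)  # numerical d_ij
            m1 += pij * d
            m2 += pij * d * d
    var = max(m2 - m1 * m1, 0.0)  # delta-method variance of sqrt(n)*(est - true)
    se = math.sqrt(var / n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return est, se, (est - z * se, est + z * se)
```

A call such as `geoprv_ci([[30, 10, 5], [10, 25, 10], [5, 10, 30]], lambda x: f_lambda(x, 1.0))`, using hypothetical counts that do not come from the paper, returns the estimate, its standard error, and the 95% limits. The numerical gradient is chosen here only to keep the sketch short; the closed-form expressions of Theorem 3 avoid the step-size choice.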

4 Numerical Experiments

In this section, we demonstrate the performance of the geoPRV measure \(\Phi _{Gf}\) and confirm the difference between \(\Phi _{Gf}\) and the conventional PRV measure \(\Phi _f\) proposed by Momozaki et al. (2023). Both measures use the variation measure \(V(Y)=-\sum _{j=1}^Cf(p_{\cdot j})\). We apply two functions: \(f(x)=\left( x^{\lambda +1} - x \right) /\lambda \) for \(\lambda >-1\) and \(g(x)=(x - 1)^2/(\omega x + 1 - \omega ) + (x - 1)/(1 - \omega )\) for \(0 \le \omega < 1\) (see Ichimori, 2013). The measures based on the former are denoted by \(\Phi _f^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\), and those based on the latter by \(\Phi _g^{(\omega )}\) and \(\Phi _{Gg}^{(\omega )}\). The values of \(\Phi ^{(\lambda )}_f\) and \(\Phi ^{(\lambda )}_{Gf}\) at \(\lambda = 0\) are defined as the limits as \(\lambda \rightarrow 0\). For the tuning parameters, we set \(\lambda =0,~0.5,\) or 1.0 and \(\omega =0,~0.5,\) or 0.9.
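As a short check of how the tuning parameter \(\lambda \) connects to the variation measures of Section 2.1, the following worked equations follow directly from the definitions; the intermediate steps are our own working rather than a quotation from the cited papers. Substituting \(f(x)=(x^{\lambda +1}-x)/\lambda \) and using \(\sum _{j=1}^C p_{\cdot j}=1\) gives

$$ -\sum _{j=1}^C f(p_{\cdot j}) = \frac{1}{\lambda } \left( \sum _{j=1}^C p_{\cdot j} - \sum _{j=1}^C p_{\cdot j}^{\lambda +1} \right) = \frac{1}{\lambda } \left( 1-\sum _{j=1}^C p_{\cdot j}^{\lambda +1} \right) , $$

which is the diversity index of degree \(\lambda \). For \(\lambda =1\) this reduces to the Gini concentration \(1-\sum _{j=1}^C p_{\cdot j}^2\). Since \((x^{\lambda }-1)/\lambda \rightarrow \log x\) as \(\lambda \rightarrow 0\),

$$ \lim _{\lambda \rightarrow 0} f(x) = x\log x, \qquad \lim _{\lambda \rightarrow 0} \left[ -\sum _{j=1}^C f(p_{\cdot j}) \right] = -\sum _{j=1}^C p_{\cdot j}\log p_{\cdot j}, $$

which is the Shannon entropy. Hence \(\lambda =0\) and \(\lambda =1\) correspond to the variation measures of the uncertainty coefficient U and the concentration coefficient \(\tau \), respectively.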

4.1 Artificial Dataset 1

Table 1 shows the artificial dataset considered in this study. The dataset clearly shows the difference in the characteristics between conventional PRV measures and the geoPRV measure. Tables 1a, 1b, and 1c represent changes in the partial association structure by moving the cell probability in the first row. In Table 1a, although all cell probabilities in the first row are non-zero values, most of the marginal probability \(p_{1\cdot }\) is concentrated in the (1,2) and (1,3) cells. Similarly, the marginal probability \(p_{1\cdot }\) in Table 1b is mostly concentrated in the (1,3) cell. By contrast, Table 1c has a complete partial association in the first row because the probability exists only in the (1,3) cell. In the sense that the first row has a bias in the cell probability, Tables 1a, 1b, and 1c have weak, slightly strong, and complete partial association structures in the first row, respectively.

Table 1 \(3 \times 3\) probability tables, with a (a) weak, (b) slightly strong, and (c) complete association structure in the first row

Tables 2a and 2b list the values of \(\Phi _f^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\), respectively. For instance, Table 2a shows that when Table 1c is analyzed, the measure \(\Phi _f^{(\lambda )}=0.2628,~0.1990,\) or 0.1784 for \(\lambda \) = 0, 0.5, or 1.0, respectively, and does not capture the complete association structure of the first row. By contrast, \(\Phi _{Gf}^{(\lambda )} = 1\) for all values of \(\lambda \), allowing the local association structure to be identified. Comparing \(\Phi _{Gf}^{(\lambda )} \) and \(\Phi _{f}^{(\lambda )}\) for any \(\lambda \) from Table 1a to 1c, the results show that \(\Phi _{Gf}^{(\lambda )}\) changes far more than \(\Phi _{f}^{(\lambda )}\) because it captures the partial association structure.

Table 2 Values of \(\Phi _{f}^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\), applied to Table 1
Table 3 \(4 \times 4\) probability tables formed using three cutpoints for each variable at \(z_{0.25}, z_{0.50},\) and \(z_{0.75}\) from a bivariate normal distribution with the conditions \(\mu _1 = \mu _2 = 0\), \(\sigma ^2_1 = \sigma ^2_2 = 1\) and \(\rho \) increasing by 0.2 increments from 0 to 1

4.2 Artificial Dataset 2

Table 3 shows another artificial dataset, which is used to examine the value of the geoPRV measure \(\Phi _{Gf}\) as the association of the entire contingency table changes. The tables are obtained by discretizing a bivariate normal distribution with means \(\mu _1 = \mu _2 = 0\) and variances \(\sigma ^2_1 = \sigma ^2_2 = 1\) into \(4 \times 4\) probability tables with cutpoints at \(z_{0.25}, z_{0.50},\) and \(z_{0.75}\) for each variable, so that every marginal category has equal probability; the correlation coefficient \(\rho \) is varied from 0 to 1 in increments of 0.2. From Theorem 2 and the properties of the PRV measures, when the correlation coefficients have the same absolute value (i.e., when the rows of the contingency table are simply swapped), the values of the measures are equal. Consequently, the results for negative correlation coefficients are omitted.
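The construction of such probability tables can be sketched as follows. The paper does not state how the cell probabilities were computed, so this is only one possible approach, assumed for illustration: each cell probability is written as a one-dimensional integral of the standard normal density times a difference of conditional normal distribution functions, and the integral is evaluated with the composite Simpson rule. The names `bvn_table`, `simpson`, `std_cdf`, and `std_pdf`, the truncation bound of 8, and the number of subintervals are all hypothetical choices.

```python
import math
from statistics import NormalDist

def std_cdf(z):
    """Standard normal distribution function; math.erf accepts +/- infinity."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def simpson(g, a, b, n=400):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * g(a + k * h)
    return total * h / 3.0

def bvn_table(rho, probs=(0.25, 0.50, 0.75), bound=8.0):
    """Cell probabilities of a standard bivariate normal cut at the given quantiles."""
    if not -1.0 < rho <= 1.0:
        raise ValueError("rho must satisfy -1 < rho <= 1 in this sketch")
    levels = [0.0, *probs, 1.0]
    size = len(levels) - 1
    if rho == 1.0:
        # Degenerate case Y = X: all probability lies on the main diagonal.
        return [[levels[i + 1] - levels[i] if i == j else 0.0
                 for j in range(size)] for i in range(size)]
    cuts = [NormalDist().inv_cdf(q) for q in probs]
    x_edges = [-bound, *cuts, bound]          # x is truncated at +/- bound
    y_edges = [-math.inf, *cuts, math.inf]
    s = math.sqrt(1.0 - rho * rho)            # conditional sd of Y given X = x
    table = []
    for i in range(size):
        row = []
        for j in range(size):
            lo, hi = y_edges[j], y_edges[j + 1]
            def integrand(x, lo=lo, hi=hi):
                # density of X times P(lo < Y <= hi | X = x)
                return std_pdf(x) * (std_cdf((hi - rho * x) / s)
                                     - std_cdf((lo - rho * x) / s))
            row.append(simpson(integrand, x_edges[i], x_edges[i + 1]))
        table.append(row)
    return table
```

Calling `bvn_table(rho)` for `rho` in 0, 0.2, ..., 1.0 yields six \(4 \times 4\) tables whose row and column totals are all approximately 0.25; the table for `rho = 0` has every cell approximately equal to 0.0625, and the table for `rho = 1` is diagonal.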

Table 4 shows the values of \(\Phi _f^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\) for each value of \(\rho \). The values of \(\Phi _f^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\) increase as the absolute value of \(\rho \) increases. Additionally, the results confirm that \(\Phi _{Gf}^{(\lambda )} = 0\) when the table has an independence structure and \(\Phi _{Gf}^{(\lambda )} = 1.0\) when the table has a structure with partial or complete association. Moreover, when there is an association over the entire contingency table, the values of \(\Phi _{Gf}^{(\lambda )}\) are larger than or equal to those of \(\Phi _{f}^{(\lambda )}\), as guaranteed by Theorem 1, but the differences are small.

Table 4 Values of \(\Phi _f^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\) for each \(\rho \)

4.3 Actual Dataset 1

Consider the case where the PRV measure is applied to the data in Table 5, which come from a survey of cannabis use among students conducted at the University of Ioannina (Greece) in 1995 (published in Marselos et al., 1997). The students’ frequency of alcohol consumption is measured on a four-level scale ranging from at most once per month up to more than twice per week, while their trial of cannabis is rated through a three-level variable (never tried, tried once or twice, tried more often). The first and second rows in the data show a partial bias of the frequency.

Table 5 Students’ survey about cannabis use at the University of Ioannina

Tables 6a and 6b provide estimates of \(\Phi _f^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\), respectively. For instance, when \(\lambda =1\), \(\widehat{\Phi }_f^{(1)}=0.1034\) in Table 6a and \(\widehat{\Phi }_{Gf}^{(1)}=0.2992\) in Table 6b. The value of \(\widehat{\Phi }_f^{(1)}\) shows that the weighted arithmetic mean of the conditional variations of trying cannabis is \(10.34\%\) smaller than the marginal variation. Similarly, the value of \(\widehat{\Phi }_{Gf}^{(1)}\) shows that the weighted geometric mean of the conditional variations of trying cannabis is \(29.92\%\) smaller than the marginal variation.

On the basis of the results of these values, the following can be interpreted from Table 5:

(1):

Overall, there is a strong association between regular alcohol consumption and cannabis use.

(2):

There are fairly strong associations between some alcohol consumption and cannabis use.

Although these interpretations seem to be intuitive when looking at Table 5, analysis using the measures provides an objective interpretation numerically. Hence, it indicates how strongly associated structures are in the contingency table.

Table 6 Estimate of \(\Phi _{f}^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\), estimated approximate standard error for \(\widehat{\Phi }_{f}^{(\lambda )}\) and \(\widehat{\Phi }_{Gf}^{(\lambda )}\), and approximate \(95\%\) confidence interval for \(\Phi _{f}^{(\lambda )}\) and \(\Phi _{Gf}^{(\lambda )}\)
Table 7 Occupational status for Japanese father-son pairs

4.4 Actual Dataset 2

By analyzing multiple contingency tables using the measures, it is possible to numerically determine the difference between the associations of the contingency tables. Table 7 shows data from Hashimoto (1999). These data describe the cross-classifications of the father’s and son’s occupational status categories in Japan, which were examined in 1975 and 1985. We consider the father’s status as the explanatory variable and the son’s status as the response variable. Table 7 is an example of a square contingency table because the row and column variables take the same categorical values. Generally, square contingency tables are analyzed using a test for symmetry or a measure for symmetry (e.g., Bowker’s chi-squared statistic by Bowker, 1948; the power-divergence type measure of departure from symmetry by Tomizawa et al., 1998). Here, we use the PRV measures to evaluate the influence of fathers on their sons’ occupations. By comparing the degree of association in two different years, we also aim to examine the changes in the effects over a 10-year period.

Table 8 First column indicates the estimate \(\widehat{\Phi }_{g}^{(\omega )}\) of \(\Phi _{g}^{(\omega )}\), the second column indicates the estimated approximate standard error for \(\widehat{\Phi }_{g}^{(\omega )}\), and the final column indicates approximate \(95\%\) confidence interval for \(\Phi _{g}^{(\omega )}\)

Tables 8 and 9 give estimates of \(\Phi _g^{(\omega )}\) and \(\Phi _{Gg}^{(\omega )}\), respectively. Comparing the estimates for each \(\omega \) in Tables 8 and 9 shows that the values of both measures are almost the same. In addition, the estimates are slightly larger in Table 8b, suggesting a stronger association in Table 7b, but the difference is small because the confidence intervals overlap for all \(\omega \). Similarly, the estimates are slightly larger in Table 9b, again suggesting a stronger association in Table 7b. However, the confidence intervals do not overlap at \(\omega = 0.9\). These results lead to the following interpretations for Tables 7a and 7b:

(1):

The occupational status categories of fathers and sons in 1975 and 1985 both have weak associations overall, indicating that individual explanatory variables do not have remarkable associations.

(2):

Although the association in Table 7b is slightly larger than that in Table 7a, the results of the confidence intervals indicate that the overall difference is statistically insignificant.

(3):

The partial association in Table 7b is slightly larger than that in Table 7a, and the results of the confidence intervals indicate that there may be a difference.

Statistical differences in the results of some confidence intervals, as in (3), are affected by the differences in the characteristics of variation associated with changing the tuning parameters. In this case, it is difficult to give an interpretation by referring to variation because the variation in the special cases did not differ (e.g., \(\omega = 0\)). However, when special cases show differences in variation, further interpretation can be given by focusing on the characteristics.

Table 9 Estimate of \(\Phi _{Gg}^{(\omega )}\), estimated approximate standard error for \(\widehat{\Phi }_{Gg}^{(\omega )}\), and approximate \(95\%\) confidence interval for \(\Phi _{Gg}^{(\omega )}\)

5 Conclusion

In this paper, we proposed the geometric mean type of PRV (geoPRV) measure, which uses variation composed of the geometric mean and arbitrary functions that satisfy certain conditions. The proposed measure has three properties suitable for examining the degree of association, which conventional measures also satisfy. (i) The measure increases monotonically as the degree of association increases. (ii) The value is 0 when there is a structure of null association. (iii) The value is 1 when there is a complete structure of association. Because the geoPRV measure uses the geometric mean, it can capture the association to the response variable for individual categories of the explanatory variable, which cannot be investigated by existing PRV measures. Analyses that use the existing PRV measures and the geoPRV measure simultaneously can examine the association of the entire contingency table as well as partial association. As with the measure \(\Phi _{f}\), the geoPRV measure can be analyzed using variations with various characteristics by providing functions and tuning parameters that satisfy specific conditions. Therefore, analysis using the geoPRV measure together with existing PRV measures can lead to a deeper understanding of the data and provide further interpretation.

Because it may be unclear how to set the functions and tuning parameters using solely the geoPRV measure, we suggest the following when selecting the values. When a variation measure suited to the purpose of the analysis is known, the analyst should select it. If the optimal choice is unclear, consider the data from multiple perspectives by comparing the various results obtained by changing the tuning parameters. In addition, analysis with several functions and tuning parameters may show differences when comparing the confidence intervals, as in Actual Dataset 2. For this reason, analysis with several functions and tuning parameters can lead to reliable detection of irregular situations and safe analysis of data that cannot be characterized beforehand. Statisticians may be interested in choosing the tuning parameters mathematically, based on, for example, the characteristics of the data or the relationships between row and column variables. However, this is beyond the scope of this study and is a topic for future work.

Although various measures for contingency tables have been proposed, several recent studies have conducted analyses using the Goodman-Kruskal PRV measure (e.g., Gea-Izquierdo, 2023; Iordache et al., 2022). We believe that the proposed PRV measure can provide a new perspective that focuses on the association of individual explanatory variables, including the association of the entire contingency table.