Geometric Mean Type of Proportional Reduction in Variation Measure for Two-Way Contingency Tables

In a two-way contingency table analysis with explanatory and response variables, the analyst first asks whether the two variables are independent. If a test rejects independence, or the variables are clearly related, interest shifts to the degree of their association. Various measures have been proposed to quantify this degree, one of which is the proportional reduction in variation (PRV) measure, which describes the PRV from the marginal distribution to the conditional distributions of the response. The conventional PRV measures can assess the association of the entire contingency table, but they cannot accurately assess the association for each explanatory variable. In this paper, we propose a geometric mean type of PRV (geoPRV) measure that aims to sensitively capture the association of each explanatory variable to the response variable by using a geometric mean, and that enables analysis without underestimation when there is partial bias in cells of the contingency table. Furthermore, the geoPRV measure can be constructed from any function that satisfies specific conditions, which gives it flexibility in application and makes it possible to express conventional PRV measures as geometric mean types in special cases.


Introduction
Categorical variables are formed from categories and are employed in various fields such as medicine, psychology, education, and social science. Consider two categorical variables, one consisting of R categories and the other consisting of C categories. These two variables have R × C combinations, which can be represented in a table with R rows and C columns. This is called a two-way contingency table, where each (i, j) cell (i = 1, 2, . . ., R; j = 1, 2, . . ., C) displays only the observed frequency. Typically, the two-way contingency table is used to evaluate whether the two variables are statistically independent. If independence is rejected, for example by Pearson's chi-square test, or the variables are clearly related, we are interested in the strength of their association.
As a method to investigate the associative structure of the contingency table, association models have been proposed by Gilula and Haberman (1986), Goodman (1981, 1985), and Rom and Sarkar (1992). This method can determine whether there is a relationship between the row and column variables through goodness-of-fit tests of the models. However, it focuses only on whether a relationship exists, and it cannot quantify the degree of association.
Instead of the goodness-of-fit test with the models, a variety of measures that express the degree of association within the interval from 0 to 1 have been proposed by Agresti (2003), Bishop et al. (2007), Cramér (1999), Everitt (1992), Tomizawa et al. (2004), and Tschuprow (1925, 1939). These measures calculate the degree of deviation from independence for each (i, j) cell in the contingency table and derive the degree of association from the sum over all cells. Because of this construction, these measures can be applied to most contingency tables without distinguishing whether the row and column variables are explanatory or response variables. However, in actual contingency table analysis, there are cases where the row and column variables are defined as explanatory or response variables. In such cases, it is not appropriate to analyze each variable while ignoring its role.
Alternative measures have been proposed by Goodman and Kruskal (1954) and Theil (1970), which are explained by the proportional reduction in variation (PRV) from the marginal distribution to the conditional distributions of the response. Measures constructed in this way are called PRV measures. The PRV measure is an important tool for summarizing the strength of association of the entire contingency table because the way it is constructed makes its values easy to interpret. However, we sometimes want to focus on the association of some categories of the explanatory variable, and conventional PRV measures underestimate this strength and thus may not accurately reflect the partial association numerically. In the study of models and scales for evaluating the symmetry of the contingency table, Nakagawa et al. (2020), Saigusa et al. (2016), and Saigusa et al. (2019) proposed evaluating partial symmetry by using the geometric mean. On the other hand, little research has been done on partial association.
In this paper, we propose a geometric mean type of PRV (geoPRV) measure via a geometric mean and functions satisfying certain conditions. Therefore, the geoPRV measure has application advantages and makes it possible to express previously proposed PRV measures as geometric mean types in special cases. By using the geometric mean to sensitively capture the association of each explanatory variable, analysis can be performed without underestimating the degree of association when cells in the contingency table are partially biased. In addition, the geoPRV measure enables us to identify local association structures. Furthermore, the geoPRV measure can be used regardless of whether the categorical variable is nominal or ordinal, because its value does not change when row or column categories are permuted. The rest of this paper is organized as follows. Section 2 introduces previous research on an extension of the generalized PRV (eGPRV) measure and proposes the geoPRV measure. Section 3 presents the approximate confidence intervals of the proposed measures. Section 4 confirms the values and confidence intervals of the proposed measure using several artificial and actual data sets, and compares them with the eGPRV measure. Section 5 presents our conclusions.

PRV Measure
In this section, we introduce measures using a function f(x) that satisfies the following conditions: Examples of the function are introduced, and models and measures using it have been proposed by Kateri and Papaioannou (1994), Momozaki et al. (2022), and Tahata (2022). These proposals are intended to generalize existing models and measures, and they have application advantages that make it easy to construct new ones and allow adjustments with tuning parameters to fit the analysis. Section 2.1 reviews some conventional PRV measures following Momozaki et al. (2022). In Section 2.2, we propose a geometric mean type of PRV measure and describe its characteristics.

Conventional PRV Measure
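Consider an $R \times C$ contingency table with nominal categories of the explanatory variable $X$ and the response variable $Y$. Let $p_{ij}$ denote the probability that an observation will fall in the $i$th row and $j$th column of the table ($i = 1, \ldots, R$; $j = 1, \ldots, C$). In addition, $p_{i\cdot}$ and $p_{\cdot j}$ are denoted as

$$p_{i\cdot} = \sum_{j=1}^{C} p_{ij}, \qquad p_{\cdot j} = \sum_{i=1}^{R} p_{ij}.$$

The conventional PRV measure has the form

$$\Phi = \frac{V(Y) - E[V(Y \mid X)]}{V(Y)},$$

where $V(Y)$ is a measure of variation for the marginal distribution of $Y$, and

$$E[V(Y \mid X)] = \sum_{i=1}^{R} p_{i\cdot}\, V(Y \mid X = i)$$

is the expectation for the conditional variation of $Y$ given the distribution of $X$ (see Agresti, 2003). $\Phi$ thus uses the weighted arithmetic mean of $V(Y \mid X = i)$. By changing the variation measure, various PRV measures can be expressed, such as the uncertainty coefficient $U$ for the variation measure $V(Y) = -\sum_{j=1}^{C} p_{\cdot j} \log p_{\cdot j}$, called Shannon entropy, and the concentration coefficient $\tau$ for the variation measure $V(Y) = 1 - \sum_{j=1}^{C} p_{\cdot j}^{2}$ (see Agresti, 2003). Tomizawa et al. (1997) proposed a generalized PRV measure $T^{(\lambda)}$ that includes $U$ and $\tau$ by using $V(Y) = \left(1 - \sum_{j=1}^{C} p_{\cdot j}^{\lambda+1}\right)/\lambda$ as the variation measure, which is the Patil and Taillie (1982) diversity index of degree $\lambda$ for the marginal distribution $p_{\cdot j}$. Furthermore, Momozaki et al. (2022) proposed an extension of the generalized PRV (eGPRV) measure that includes $U$, $\tau$, and $T^{(\lambda)}$:

$$\Phi_f = \frac{-\sum_{j=1}^{C} f(p_{\cdot j}) - \sum_{i=1}^{R} p_{i\cdot}\left[-\sum_{j=1}^{C} f\!\left(\frac{p_{ij}}{p_{i\cdot}}\right)\right]}{-\sum_{j=1}^{C} f(p_{\cdot j})}.$$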
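The construction of the conventional PRV measures can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the variation takes the form $V(Y) = -\sum_j f(p_{\cdot j})$ with $f(x) = (x^{\lambda+1} - x)/\lambda$ (and its limit $x \log x$ at $\lambda = 0$), so that $\lambda = 0$ gives $U$ and $\lambda = 1$ gives $\tau$. The function names `f_power`, `variation`, and `prv`, and the example counts, are hypothetical.

```python
import math

def f_power(x, lam):
    """f(x) = (x**(lam + 1) - x) / lam, with the lam -> 0 limit x * log(x)."""
    if x == 0.0:
        return 0.0  # convention f(0) = 0
    if lam == 0:
        return x * math.log(x)
    return (x ** (lam + 1) - x) / lam

def variation(dist, lam):
    """Variation -sum_j f(q_j) of a probability vector."""
    return -sum(f_power(q, lam) for q in dist)

def prv(table, lam):
    """Conventional PRV: uses the weighted arithmetic mean of V(Y | X = i)."""
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]
    row_marg = [sum(row) for row in p]
    col_marg = [sum(col) for col in zip(*p)]
    v_y = variation(col_marg, lam)
    expected = sum(
        p_i * variation([p_ij / p_i for p_ij in row], lam)
        for row, p_i in zip(p, row_marg)
        if p_i > 0
    )
    return (v_y - expected) / v_y

# Hypothetical counts, for illustration only.
table = [[10, 5], [3, 12]]
print(prv(table, 0))  # uncertainty coefficient U
print(prv(table, 1))  # concentration coefficient tau
```

An independent table gives a value of 0 up to rounding, and a table in which every row falls in a single column gives 1.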

Geometric Mean Type of PRV Measure
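We propose a new PRV measure by using the weighted geometric mean of $V(Y \mid X = i)$, which aims to sensitively capture the association of each explanatory variable to the response variable. Assume that $p_{\cdot j} > 0$ and that $V(Y \mid X = i)$ is a real number greater than or equal to 0 ($i = 1, \ldots, R$; $j = 1, \ldots, C$). We propose a geometric mean type of PRV (geoPRV) measure for $R \times C$ contingency tables defined as

$$\Phi_G = \frac{V(Y) - \prod_{i=1}^{R} \left[V(Y \mid X = i)\right]^{p_{i\cdot}}}{V(Y)},$$

where $V(Y)$ is a measure of variation for the marginal distribution of $Y$. The geoPRV measure can use the same variation as the conventional PRV measure, for example,

$$\Phi_{Gf} = \frac{-\sum_{j=1}^{C} f(p_{\cdot j}) - \prod_{i=1}^{R} \left[-\sum_{j=1}^{C} f\!\left(\frac{p_{ij}}{p_{i\cdot}}\right)\right]^{p_{i\cdot}}}{-\sum_{j=1}^{C} f(p_{\cdot j})},$$

where the variation measure is $V(Y) = -\sum_{j=1}^{C} f(p_{\cdot j})$. In addition, the following theorems for $\Phi_{Gf}$ hold.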
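Theorem 1. The measure $\Phi_{Gf}$ satisfies the following conditions:
(i) $\Phi_{Gf}$ must lie between 0 and 1.
(ii) $\Phi_{Gf} = 0$ when there is a structure of null association, that is, when $X$ and $Y$ are independent.
(iii) $\Phi_{Gf} = 1$ when, for at least one $s$, there exists $t$ such that $p_{st} \neq 0$ and $p_{sj} = 0$ for every $j$ with $j \neq t$.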
Theorem 2. The value of Φ Gf is invariant to permutations of row and column categories.
For proofs of Theorem 1 and Theorem 2, see Appendix A and Appendix B, respectively. The geoPRV measure differs from the conventional PRV measure in that $\Phi_{Gf} = 1$ when there exist $i$ and $j$ such that $p_{ij} = p_{i\cdot} \neq 0$. Another important feature of the geoPRV measure is that it takes values higher than or equal to those of the conventional PRV measure, because a weighted geometric mean never exceeds the corresponding weighted arithmetic mean. This allows a stronger representation of row and column relationships.
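The ordering between the two measures can be made explicit with the weighted arithmetic-geometric mean inequality. The derivation below is a sketch under two assumptions: that the geoPRV measure replaces the weighted arithmetic mean $\sum_i p_{i\cdot} V(Y \mid X = i)$ used by the conventional measure $\Phi$ with the weighted geometric mean $\prod_i V(Y \mid X = i)^{p_{i\cdot}}$, and that $V(Y) > 0$.

```latex
\prod_{i=1}^{R} V(Y \mid X = i)^{p_{i\cdot}}
  \;\le\; \sum_{i=1}^{R} p_{i\cdot}\, V(Y \mid X = i)
\quad\Longrightarrow\quad
\Phi_G
  = 1 - \frac{\prod_{i=1}^{R} V(Y \mid X = i)^{p_{i\cdot}}}{V(Y)}
  \;\ge\; 1 - \frac{\sum_{i=1}^{R} p_{i\cdot}\, V(Y \mid X = i)}{V(Y)}
  = \Phi .
```

Equality holds when all conditional variations $V(Y \mid X = i)$ with $p_{i\cdot} > 0$ coincide. A single row with $V(Y \mid X = i) = 0$ already drives the product to zero, which gives $\Phi_G = 1$ even when the arithmetic mean stays well above zero.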
A property of the geoPRV measure is that the larger the value of $\Phi_G$, the stronger the association between the response variable $Y$ and the explanatory variable $X$. In other words, the larger the value of $\Phi_G$, the more accurately one can predict the $Y$ category when the $X$ category is known than when it is not. In contrast, if the value of $\Phi_G$ is 0, the $Y$ category is not affected by the $X$ category at all.
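To make these properties concrete, the following minimal sketch computes the arithmetic-mean and geometric-mean measures side by side. It is an illustration under stated assumptions, not the authors' code: the geoPRV measure is taken to be $(V(Y) - \prod_i V(Y \mid X = i)^{p_{i\cdot}})/V(Y)$, the variation is the Patil and Taillie index of degree `lam` (Shannon entropy at `lam = 0`), and the function names and counts are hypothetical.

```python
import math

def variation(dist, lam):
    """Patil-Taillie variation of degree lam (Shannon entropy when lam == 0)."""
    if lam == 0:
        return -sum(q * math.log(q) for q in dist if q > 0)
    return (1.0 - sum(q ** (lam + 1) for q in dist)) / lam

def prv_pair(table, lam):
    """Return (arithmetic-mean PRV, geometric-mean PRV) for a table of counts."""
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]
    v_y = variation([sum(col) for col in zip(*p)], lam)
    arith, geo = 0.0, 1.0
    for row in p:
        p_i = sum(row)
        if p_i == 0:
            continue  # empty rows carry no weight
        # Clamp tiny negative rounding errors before taking a fractional power.
        v_i = max(0.0, variation([p_ij / p_i for p_ij in row], lam))
        arith += p_i * v_i
        geo *= v_i ** p_i
    return (v_y - arith) / v_y, (v_y - geo) / v_y

# Hypothetical counts: the first row falls entirely in the third column.
table = [[0, 0, 20], [10, 15, 5], [12, 8, 10]]
phi, phi_geo = prv_pair(table, 1)
print(phi, phi_geo)
```

In this hypothetical table the first row has zero conditional variation, so the geometric-mean measure equals 1 while the arithmetic-mean measure stays below 1. Permuting rows or columns leaves both values unchanged up to rounding.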

Approximate Confidence Interval for the Measure
Since the measure $\Phi_G$ is unknown, we derive a confidence interval for $\Phi_G$. Let $n_{ij}$ denote the frequency for cell $(i, j)$, and let $n = \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}$ ($i = 1, 2, \ldots, R$; $j = 1, 2, \ldots, C$). Assuming that the observed frequencies $\{n_{ij}\}$ have a multinomial distribution, we consider an approximate standard error and large-sample confidence interval for $\Phi_G$ using the delta method (Bishop et al., 2007, and Appendix C in Agresti, 2010).
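Theorem 3. Let $\hat{\Phi}_{Gf}$ denote the estimator of $\Phi_{Gf}$ obtained by replacing $p_{ij}$ with $\hat{p}_{ij} = n_{ij}/n$. Then $\sqrt{n}\,(\hat{\Phi}_{Gf} - \Phi_{Gf})$ converges in distribution to a normal distribution with mean zero and variance $\sigma^2[\Phi_{Gf}]$, where $f'(x)$, which appears in the expression for $\sigma^2[\Phi_{Gf}]$, is the derivative of the function $f(x)$ with respect to $x$.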
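The delta-method interval can be approximated numerically even without a closed-form variance. The sketch below is an assumption-laden stand-in, not the authors' formula for $\sigma^2[\Phi_{Gf}]$: it replaces the analytic derivatives with central differences, uses the multinomial covariance $\mathrm{diag}(p) - pp^\top$, assumes the Patil and Taillie variation of degree `lam`, and requires every cell count to be positive. The names `geo_prv` and `delta_method_ci` and the step size `h` are hypothetical.

```python
import math
from statistics import NormalDist

def geo_prv(weights, lam=1.0):
    """geoPRV of a table of positive weights (normalised internally)."""
    total = sum(sum(row) for row in weights)
    p = [[w / total for w in row] for row in weights]

    def variation(dist):
        if lam == 0:
            return -sum(q * math.log(q) for q in dist if q > 0)
        return (1.0 - sum(q ** (lam + 1) for q in dist)) / lam

    v_y = variation([sum(col) for col in zip(*p)])
    geo = 1.0
    for row in p:
        p_i = sum(row)
        geo *= max(0.0, variation([p_ij / p_i for p_ij in row])) ** p_i
    return (v_y - geo) / v_y

def delta_method_ci(counts, lam=1.0, level=0.95, h=1e-6):
    """Plug-in estimate and approximate CI; assumes every cell count is positive."""
    n = sum(sum(row) for row in counts)
    p_hat = [[c / n for c in row] for row in counts]
    estimate = geo_prv(p_hat, lam)
    s1 = s2 = 0.0
    for i, row in enumerate(p_hat):
        for j, p_ij in enumerate(row):
            up = [list(r) for r in p_hat]
            down = [list(r) for r in p_hat]
            up[i][j] += h
            down[i][j] -= h
            # Central-difference approximation of the partial derivative.
            d_ij = (geo_prv(up, lam) - geo_prv(down, lam)) / (2 * h)
            s1 += p_ij * d_ij ** 2
            s2 += p_ij * d_ij
    # Quadratic form grad' (diag(p) - p p') grad, clamped against rounding.
    variance = max(0.0, s1 - s2 ** 2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(variance / n)
    return estimate, (estimate - half_width, estimate + half_width)

# Hypothetical counts, for illustration only.
estimate, interval = delta_method_ci([[30, 10, 5], [8, 25, 12], [6, 9, 35]])
print(estimate, interval)
```

Because the standard error scales with $1/\sqrt{n}$, multiplying every count by four leaves the estimate unchanged and halves the interval width. Near the boundary, for example when a row has zero conditional variation, the measure is not differentiable and this approximation should not be trusted.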

Numerical Experiments
In this section, we confirm the performance of the geoPRV measure $\Phi_{Gf}$ and the difference between $\Phi_{Gf}$ and the conventional PRV measure $\Phi_f$ proposed by Momozaki et al. (2022). We use $\Phi_f$ and $\Phi_{Gf}$ with two variation measures (see Ichimori, 2013): the measures based on the former are expressed as $\Phi_f^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$, while those based on the latter are expressed as $\Phi_g^{(\omega)}$ and $\Phi_{Gg}^{(\omega)}$. For the tuning parameters, we set $\lambda = 0, 0.5, 1.0$ and $\omega = 0, 0.5, 0.9$.

Artificial data 1
Consider the artificial data in Table 1. These data are designed to show clearly the difference in characteristics between the conventional PRV measures and the geoPRV measure. Table 1c shows the case where the explanatory variable in the first row has a complete association structure with the response variable in the third column. On the other hand, Table 1a and Table 1b show the cases where the explanatory variable in the first row has a weak or a slightly strong association structure with the response variable, respectively.

The values of $\Phi_f^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ are provided in Table 2a and Table 2b, respectively. For instance, Table 2a shows that when Table 1c is analyzed, the measure $\Phi_f^{(\lambda)} = 0.2628, 0.1990, 0.1784$ for $\lambda = 0, 0.5, 1.0$, and it does not capture the complete association structure of the first row. In contrast, $\Phi_{Gf}^{(\lambda)} = 1$ for all $\lambda$, allowing us to identify the local complete association structure. Similarly, consider the results of $\Phi_{Gf}^{(\lambda)}$ and $\Phi_f^{(\lambda)}$ for any $\lambda$ from Table 1a to Table 1c. As can be seen from these results, the simulation also shows that Φ

Artificial data 2
Consider the artificial data in Table 3. These data are intended to examine the value of the geoPRV measure $\Phi_{Gf}$ as the association of the entire contingency table changes. Therefore, we obtained data suitable for this study by converting the bivariate normal distribution with means $\mu_1 = \mu_2 = 0$ and variances $\sigma_1^2 = \sigma_2^2 = 1$, in which the correlation coefficient was changed from 0 to 1 in steps of 0.2, into 4 × 4 contingency tables with equal-interval frequency. From Theorem 2 and the properties of the PRV measures, when the absolute values of the correlation coefficients are the same, i.e., when the rows of the contingency table are simply swapped, the values are equal, so the results for negative correlation coefficients are omitted.
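One way to build tables of this kind can be sketched as follows. This is a hedged illustration, not the authors' procedure: it draws a Monte Carlo sample instead of computing exact cell probabilities, and it assumes that the cut points are the quartiles of the standard normal distribution, which is only one possible reading of "equal-interval frequency". The function name `normal_table` and its parameters are hypothetical.

```python
import math
import random
from bisect import bisect_right
from statistics import NormalDist

def normal_table(rho, n=10000, k=4, seed=0):
    """Cross-classify n draws from a standard bivariate normal into a k x k table."""
    rng = random.Random(seed)
    # Assumed cut points: standard normal quantiles at 1/k, 2/k, ..., (k-1)/k.
    cuts = [NormalDist().inv_cdf(m / k) for m in range(1, k)]
    scale = math.sqrt(1.0 - rho * rho)
    table = [[0] * k for _ in range(k)]
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 has unit variance and correlation rho with z1.
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)
        table[bisect_right(cuts, z1)][bisect_right(cuts, z2)] += 1
    return table

# Correlation coefficients from 0 to 1 in steps of 0.2, as in the text.
for rho in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(rho, normal_table(rho, n=2000))
```

With `rho = 1.0` both coordinates coincide, so all counts fall on the diagonal, which corresponds to a complete association structure in every row.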
Table 4 shows the values of $\Phi_f^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ for each value of $\rho$. We observe that the values of $\Phi_f^{(\lambda)}$ and Φ

Actual data 1
Consider the case where the PRV measures are applied to the data in Table 5, a survey of cannabis use among students conducted at the University of Ioannina (Greece) in 1995 and published in Marselos et al. (1997). The students' frequency of alcohol consumption is measured on a four-level scale ranging from at most once per month up to more frequently than twice per week, while their trial of cannabis is measured through a three-level variable (never tried, tried once or twice, more often). We can see the partial bias of the frequency for the first and second rows in the data. The estimates of $\Phi_f^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ are provided in Table 6a and Table 6b, respectively. For instance, when $\lambda = 1$, the measure $\Phi_f^{(1)} = 0.1034$ in Table 6a, and $\Phi_{Gf}^{(1)} = 0.2992$ in Table 6b. $\Phi_f^{(1)}$ shows that the average conditional variation of trying cannabis is 10.34% smaller than the marginal variation, and similarly $\Phi_{Gf}^{(1)}$ shows that the geometric mean of the conditional variations of trying cannabis is 29.92% smaller. Based on these values, the following can be interpreted from Table 5:
(1) There is a strong association overall between alcohol consumption and cannabis use experience.
(2) There are fairly strong associations between some levels of alcohol consumption and cannabis use experience.
These interpretations seem intuitive when looking at Table 5. However, by analyzing the data with the measures, we have been able to present an objective numerical interpretation and to show how strong the association structures in the contingency table are.

Actual data 2
By analyzing multiple contingency tables using the measures, it is possible to determine numerically how much the associations of the contingency tables differ. Therefore, consider the data in Table 7, taken from Hashimoto (1999). These data describe the cross-classifications of the father's and son's occupational status categories in Japan, which were examined in 1975 and 1985. In addition, we can consider the father's status as an explanatory variable and the son's status as a response variable, since the father's occupational status category seems to have an influence on the son's. The analysis of Table 7 aims to show what differences there are in the associations of occupational status categories for fathers and sons in 1975 and 1985. Table 8 and Table 9 give the estimates of $\Phi_g^{(\omega)}$ and $\Phi_{Gg}^{(\omega)}$, respectively. Comparing the estimates for each $\omega$ in Table 8 and Table 9, we can see that the values for both measures are almost the same. In addition, comparing Table 8a and Table 8b, the estimate is slightly larger in Table 8b, so it can be assumed that the association is stronger in Table 8b, but there is little difference because all the confidence intervals overlap. Comparing Table 9a and Table 9b in the same way, the estimate is again slightly larger in Table 9b. However, the confidence intervals do not overlap at $\omega = 0.9$. From these results, the following can be interpreted for Table 7a and Table 7b:
(1) The occupational status categories of fathers and sons in 1975 and 1985 both have weak associations overall, further indicating that individual explanatory variables do not have remarkable associations.
(2) Although the association of Table 7b is slightly larger than that of Table 7a, the confidence intervals indicate that there is no statistical difference.
(3) The partial association in Table 7b is slightly larger than that in Table 7a, and the confidence intervals indicate that there may be a statistical difference.
When only some of the confidence intervals indicate a statistical difference, as in (3), the result is affected by differences in the characteristics of the variation associated with changing the tuning parameters. In this case, it is difficult to give an interpretation by referring to the variation because there was no difference in the variation in the special cases (e.g., $\omega = 0$). However, when there are differences in variation in the special cases, further interpretation can be given by focusing on those characteristics.

Conclusion
In this paper, we proposed a geometric mean type of PRV (geoPRV) measure that uses a variation composed of a geometric mean and arbitrary functions that satisfy certain conditions. We showed that the proposed measure has the following three properties, which are suitable for examining the degree of association and are also satisfied by the conventional measures: (i) the measure increases monotonically as the degree of association increases; (ii) the value is 0 when there is a structure of null association; and (iii) the value is 1 when there is a complete structure of association. Furthermore, by using geometric means, the geoPRV measure can capture the association with the response variable for individual explanatory variables, which could not be investigated by the existing PRV measures. Analyses that use the existing PRV measures and the geoPRV measure simultaneously can examine both the association of the entire contingency table and the partial association. Also, like the measure $\Phi_f$, the geoPRV measure can be computed with variations of various characteristics by supplying functions and tuning parameters that satisfy the conditions. Therefore, analysis using the geoPRV measure together with the existing measures can lead to a deeper understanding of the data and provide further interpretation. While various measures for contingency tables have been proposed, several studies in recent years have conducted analyses using the Goodman-Kruskal PRV measure (e.g., Gea-Izquierdo, 2023; Iordache et al., 2022). We believe that the new PRV measure in this paper, when examined and compared together with the existing Goodman-Kruskal PRV measure, may provide a new perspective that pays attention to the association of individual explanatory variables in addition to the association of the entire contingency table.
term is and the values are invariant to the reordering of the sums. Namely, the value of $\Phi_{Gf}$ is invariant with respect to the permutation of row categories. Similarly, the value of $\Phi_{Gf}$ is also invariant with respect to the permutation of column categories.

C Proof of Theorem 3
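Proof. Let $p = (p_{11}, \ldots, p_{RC})^\top$ and let $\hat{p}$ be the corresponding vector of sample proportions. Then $\sqrt{n}\,(\hat{p} - p)$ converges in distribution to a normal distribution with mean zero and the covariance matrix $\mathrm{diag}(p) - pp^\top$, where $\mathrm{diag}(p)$ is a diagonal matrix with the elements of $p$ on the main diagonal (Bishop et al., 2007). The Taylor expansion of the function $\hat{\Phi}_{Gf}$ around $p$ is given by

$$\hat{\Phi}_{Gf} = \Phi_{Gf} + \left(\frac{\partial \Phi_{Gf}}{\partial p^\top}\right)(\hat{p} - p) + o_p\!\left(n^{-1/2}\right).$$

Therefore, $\sqrt{n}\,(\hat{\Phi}_{Gf} - \Phi_{Gf})$ has the asymptotic normal distribution stated in Theorem 3.

[Table header residue: ρ = 0.0, 0.2, 0.4, 0.6, 0.8, 1]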

Table 1: The 3 × 3 probability tables, which have (a) a weak, (b) a slightly strong, and (c) a complete association structure in the first row.

Table 2: The value of Φ

Table 5: Students' survey about cannabis use at the University of Ioannina. Column variable: "I tried cannabis . . ."

Table 8 and Table 9 give the estimates of $\Phi_g^{(\omega)}$ and $\Phi_{Gg}^{(\omega)}$, respectively. Comparing the estimates for each $\omega$ in Table 8 and Table 9, we can see that the values for