1 Introduction

Contingency tables and their analysis are important in various fields, such as medicine, psychology, education, and social science. Typically, contingency tables are used to evaluate whether the row and column variables are statistically independent. If the independence of the two variables is rejected, for example, through Pearson’s chi-squared test, or if they are clearly related, then we are interested in the strength of their association. Many coefficients have been proposed to measure the strength of association between the two variables, namely, to measure the degree of departure from independence. Pearson’s coefficient \(\phi ^2\) of mean square contingency, the coefficient of contingency P, and Tschuprow’s coefficient T (Tschuprow 1925; Tschuprow 1939) serve as prime examples (see, e.g., Bishop et al. 2007; Everitt 1992; Agresti 2003). These measures represent the strength of association on the interval from 0 to 1, where the value 0 indicates independence in the contingency table. However, the problem with \(\phi ^2\) is that it does not attain 1 even when the contingency table has a complete association structure (i.e., maximum departure from independence). Similarly, P and T do not always attain the value 1, depending on the number of rows and columns in the table. To address this issue, Cramér (1946) proposed Cramér’s coefficient \(V^2\), which reaches the value 1 whenever the contingency table has a complete association structure, for any numbers of rows and columns. Specifically, \(V^2\) indicates the strength of association in the contingency table as \(0 \le V^2 \le 1\), with the value 0 identifying the independence structure and the value 1 identifying the complete association structure.

Rényi (1961) introduced a class of measures of divergence between two distributions. Recent studies have linked contingency table analysis with such divergences. Tomizawa et al. (2004) proposed measures \(V^2_{t(\lambda )}\) (\(t=1, 2, 3\)) based on the power-divergence with parameter \(\lambda \ge 0\). This study extended the measure, previously limited to \(V^2\) (\(\lambda = 1\)), and showed the measures to be members of a single-parameter family that includes a measure based on the KL-divergence (\(\lambda = 0\)). (For more details on the power-divergence, see Cressie and Read (1984) and Read and Cressie (1988).) Furthermore, the f-divergence was introduced by Ali and Silvey (1966) and Csiszár (1963) as a useful generalization of the relative entropy that retains some of its major properties; it is also called the \(\phi\)-divergence. In contingency table analysis, a considerable amount of literature has been published on modeling using the f-divergence (e.g., Kateri and Papaioannou 1994; Kateri and Papaioannou 1997; Kateri and Agresti 2007; Fujisawa and Tahata 2020; Tahata 2022; Yoshimoto et al. 2019). Many studies on goodness-of-fit tests using the f-divergence have also demonstrated its usefulness (e.g., Pardo 2018; Felipe et al. 2014, 2018). However, discussions of association measures based on the f-divergence remain limited.

In this paper, we propose a wider class of measures than the conventional ones via the f-divergence. This study’s contribution is to prove that any measure constructed from a function f(x) satisfying the conditions of the f-divergence has the properties desirable for measuring the strength of association in contingency tables. This result allows analysts to easily construct new measures from whichever divergence has properties suited to their purposes. As an example, we conduct numerical experiments with a measure based on the \(\theta\)-divergence. Furthermore, we can give interpretations of the association between rows and columns in the contingency table that cannot be obtained with the conventional measures.

The rest of this paper is organized as follows. Section 2 proposes new measures to express the strength of association between the row and column variables in two-way contingency tables and shows that the proposed measures have the properties desirable for measuring the strength of association. Section 3 presents the relationship between the measures and the correlation coefficient when a bivariate normal distribution is assumed for the latent variables of the contingency table. Section 4 presents a numerical study. Section 5 derives approximate confidence intervals for the proposed measures. Section 6 presents analysis examples applying the power-divergence and the \(\theta\)-divergence to actual data. Finally, Section 7 provides some concluding remarks.

2 Generalized measure

We consider association measures using the f-divergence for an \(r \times c\) contingency table. Let \(p_{ij}\) denote the probability that an observation falls in the ith row and jth column of the table \((i = 1, \dots , r; j=1, \dots , c)\), and let \(p_{i\cdot }\) and \(p_{\cdot j}\) denote the marginal probabilities \(p_{i \cdot } = \sum ^c_{t=1} p_{it}\) and \(p_{\cdot j} = \sum ^r_{s=1} p_{sj}\). Hereinafter, we assume that \(\{p_{i\cdot } \ne 0,\) \(p_{\cdot j} \ge 0\}\) when \(r \le c\) and \(\{p_{i\cdot } \ge 0,\) \(p_{\cdot j} \ne 0\}\) when \(r > c\).
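As a minimal illustration of this notation, the following sketch (in Python with numpy; the \(3\times 3\) table entries are hypothetical) computes a table's marginal probabilities:

```python
import numpy as np

p = np.array([[0.20, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.30]])   # hypothetical cell probabilities p_ij

p_row = p.sum(axis=1)                # marginals p_i. = sum_t p_it
p_col = p.sum(axis=0)                # marginals p_.j = sum_s p_sj
assert np.isclose(p.sum(), 1.0)      # cell probabilities sum to one
```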

In Sason and Verdú (2016), the f-divergence from P to Q is defined as \(I_f(P;Q) = \int f(dP/dQ) dQ\), where f is a convex function and \(P \ll Q\). For the \(r \times c\) contingency table, P and Q are given as discrete distributions \(\{p_{ij}\}\) and \(\{q_{ij}\}\). Accordingly, we have \(dP/dQ = \{p_{ij}/q_{ij}\}\). Thus, the f-divergence from \(\{p_{ij}\}\) to \(\{q_{ij}\}\) is given as

$$\begin{aligned} I_f(P;Q) = I_f(\{p_{ij}\};\{q_{ij}\})&= \sum _{i}\sum _{j} q_{ij} f\left( \frac{p_{ij}}{ q_{ij}} \right) , \end{aligned}$$

where f(x) is a once-differentiable and strictly convex function on \((0, +\infty )\) with \(f(1) = 0\), \(\lim _{x \rightarrow 0}f(x) = 0\), \(0f(0/0) = 0\), and \(0f(a/0) = a\lim _{x \rightarrow \infty }f(x) / x\) (see Csiszár 2004). By choosing the function f, many important divergences are obtained as special cases of the f-divergence, such as the KL-divergence (\(f(x) = x\log x\)), the Pearson’s divergence (\(f(x) = x^2-x\)), the power-divergence (\(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\)), and the \(\theta\)-divergence (\(f(x) = (x-1)^2/(\theta x + 1 - \theta ) + (x-1)/(1 - \theta )\)) (see, e.g., Sason and Verdú 2016; Ichimori 2013). Furthermore, the f-divergence belongs to the class of monotone and regular divergences. This class was introduced in Cencov (2000) and studied in Corcuera and Giummolé (1998) as a wide class of divergences invariant with respect to Markov embeddings. Monotone and regular divergences are often used as measures of goodness of prediction (see, e.g., Geisser 1993; Corcuera and Giummolè 1999). The aim of studying such measures is to quantify how well a row or column variable predicts the other variable. Therefore, we consider the measures based on the f-divergence to be appropriate for measuring the association and a natural generalization of those of Tomizawa et al. (2004).
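As a concrete illustration, the discrete f-divergence above can be computed directly. The following is a minimal sketch (ours, not from any library; numpy assumed), together with the generating functions named in the text:

```python
import numpy as np

def f_divergence(p, q, f):
    """I_f(P;Q) = sum_ij q_ij f(p_ij/q_ij); cells with q_ij = 0 are skipped,
    which matches the convention 0 f(0/0) = 0 when q_ij = p_i. p_.j and a
    marginal vanishes (then p_ij = 0 as well)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = q > 0
    return float(np.sum(q[m] * f(p[m] / q[m])))

# generating functions named in the text
f_kl      = lambda x: np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)       # KL
f_pearson = lambda x: x**2 - x                                                        # Pearson
f_power   = lambda lam: lambda x: (x**(lam + 1) - x) / (lam * (lam + 1))              # power, lam > 0
f_theta   = lambda th: lambda x: (x - 1)**2 / (th * x + 1 - th) + (x - 1) / (1 - th)  # theta
```

For instance, `f_divergence(p, np.outer(p.sum(1), p.sum(0)), f_kl)` evaluates the KL-divergence between a table \(\{p_{ij}\}\) and its independence counterpart \(\{p_{i \cdot }p_{\cdot j}\}\).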

Measures that present the strength of association between row and column variables are proposed for three cases: (I) the row variable is the response and the column variable is the explanatory variable; (II) the row variable is the explanatory and the column variable is the response variable; (III) neither variable is designated as response or explanatory. Accordingly, we define measures for the asymmetric situations (Cases I and II) and for the symmetric situation (Case III).

The following three properties should be possessed by the measures: (i) the measures are contained within a fixed interval (e.g., from 0 to 1); (ii) when the measure is minimal, the row and column variables are statistically independent; (iii) when the measure is maximal, the categories of one variable can be identified from those of the other. Conventional measures satisfy all of these properties. In the remainder of this section, we prove that the proposed measures satisfy them as well.

2.1 Case I

For an asymmetric situation wherein the column variable is the explanatory variable and the row variable is the response variable, we propose the following measure that presents the strength of association between the row and column variables by

$$\begin{aligned} V^2_{1(f)}&= \frac{I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})}{K_{1(f)}}, \end{aligned}$$

where

$$\begin{aligned} I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})&= \sum ^r_{i=1} \sum ^c_{j=1}p_{i \cdot }p_{\cdot j} f\left( \frac{p_{ij}}{p_{i \cdot }p_{\cdot j}} \right) , \\ K_{1(f)}&= \sum ^r_{i=1}p^2_{i \cdot } f\left( \frac{1}{p_{i \cdot }} \right) . \end{aligned}$$

Then, the following theorem for the measure \(V^2_{1(f)}\) is obtained.

Theorem 1

For each convex function f,

  1. (i)

    \(0 \le V^2_{1(f)} \le 1\).

  2. (ii)

    \(V^2_{1(f)} = 0\) if and only if a structure of null association exists in the table (i.e., \(\{p_{ij} = p_{i \cdot }p_{\cdot j}\})\).

  3. (iii)

    \(V^2_{1(f)} = 1\) if and only if a structure of complete association exists; that is, for each column j \((j = 1, 2, \dots , c)\), there uniquely exists \(i_j\) such that \(p_{i_j, j} > 0\) and \(p_{ij} = 0\) for all other \(i(\ne i_j)\) (assuming \(p_{i\cdot } > 0\) for all i).

The proof of Theorem 1 is provided in the Supplementary Material. Similar to the interpretation of measure \(V^2_{1(\lambda )}\), \(V^2_{1(f)}\) indicates the degree to which the prediction of the row category of an individual can be improved when the column category of the individual is known. In this sense, \(V^2_{1(f)}\) shows the strength of association between the row and column variables. Examples for particular choices of the f-divergence are given below. When \(f(x) = x \log x\), \(I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})\) is identical to the KL-divergence, and \(V^2_{1(f)}\) is represented by

$$\begin{aligned} V_{KL}&= \frac{\displaystyle \sum\nolimits ^r_{i=1} \sum\nolimits ^c_{j=1}p_{ij}\log \left( \frac{p_{ij}}{p_{i \cdot }p_{\cdot j}} \right) }{- \displaystyle \sum\nolimits ^r_{i=1} p_{i\cdot }\log p_{i\cdot }} \end{aligned}$$

and \(V_{KL}\) is identical to Theil’s uncertainty coefficient U (see Theil 1970). When \(f(x) = x^2-x\), the Pearson’s divergence is derived, and \(V^2_{1(f)}\) is identical to Cramér’s coefficient \(V^2\) with \(r \le c\). When \(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\), \(V^2_{1(f)}\) is identical to the power-divergence-type measure

$$\begin{aligned} V^2_{1(\lambda )}&= \frac{\displaystyle \sum\nolimits ^r_{i=1} \sum\nolimits ^c_{j=1}p_{ij} \left[ \left( \frac{p_{ij}}{p_{i \cdot }p_{\cdot j}} \right) ^{\lambda } - 1 \right] }{\displaystyle \sum\nolimits ^r_{i=1} p^{1-\lambda }_{i\cdot } - 1}. \end{aligned}$$

Further, in the case of \(f(x) = (x-1)^2/(\theta x + 1 - \theta ) + (x-1)/(1 - \theta )\) for \(0 \le \theta < 1\), \(I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})\) is identical to the \(\theta\)-divergence, and \(V^2_{1(f)}\) is represented by the \(\theta\)-divergence-type measure

$$\begin{aligned} V^2_{1(\theta )}&= \frac{\displaystyle \sum\nolimits ^r_{i=1} \sum\nolimits ^c_{j=1}\frac{(p_{ij}-p_{i \cdot }p_{\cdot j})^2 }{\theta p_{ij} + (1 - \theta )p_{i \cdot }p_{\cdot j}}}{\displaystyle \sum\nolimits ^r_{i=1} \frac{p_{i\cdot }(1-p_{i\cdot })}{(1-\theta ) \left( \theta + (1-\theta )p_{i\cdot } \right) } }. \end{aligned}$$

Measure \(V^2_{1(\theta )}\), like \(V^2_{1(\lambda )}\), is a single-parameter measure and one of the generalizations of \(V^2\); it agrees with \(V^2\) at \(\theta = 0\). The numerator coincides with the triangular discrimination \(\Delta\) at \(\theta = 0.5\) (see Dragomir et al. 2000; Topsoe 2000). Unlike the power-divergence, the \(\theta\)-divergence can measure departures from independence in a manner similar to the Euclidean distance; in particular, the triangular discrimination \(\Delta\) measures a symmetric distance between \(\{p_{ij}\}\) and \(\{p_{i \cdot }p_{\cdot j}\}\). In the numerical experiments discussed in Sections 4 and 6, we treat the \(\theta\)-divergence-type measure as an example of a new single-parameter measure obtained by extending \(V^2\) and compare it with the conventional one. Moreover, analysis suited to various contingency tables can be performed by changing the function.
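The following sketch (ours, under the definitions above; function names are illustrative and positive marginals are assumed) implements the power- and \(\theta\)-divergence-type measures:

```python
import numpy as np

def V2_1_power(p, lam):
    """Power-divergence-type measure V^2_{1(lambda)}, lam > 0;
    equals Cramer's V^2 at lam = 1 when r <= c."""
    p = np.asarray(p, float)
    r, c = p.sum(axis=1), p.sum(axis=0)
    q = np.outer(r, c)                 # independence table p_i. * p_.j
    m = p > 0                          # cells with p_ij = 0 contribute 0
    num = np.sum(p[m] * ((p[m] / q[m])**lam - 1))
    den = np.sum(r**(1 - lam)) - 1
    return num / den

def V2_1_theta(p, theta):
    """theta-divergence-type measure V^2_{1(theta)}, 0 <= theta < 1;
    equals Cramer's V^2 at theta = 0 when r <= c."""
    p = np.asarray(p, float)
    r, c = p.sum(axis=1), p.sum(axis=0)
    q = np.outer(r, c)
    num = np.sum((p - q)**2 / (theta * p + (1 - theta) * q))
    den = np.sum(r * (1 - r) / ((1 - theta) * (theta + (1 - theta) * r)))
    return num / den
```

For instance, with the hypothetical table `p` from Section 2, `V2_1_theta(p, 0.0)` recovers Cramér’s coefficient \(V^2\) for \(r \le c\).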

2.2 Case II

For the asymmetric situation wherein the row and column variables are the explanatory and response variables, respectively, we propose the following measure, which presents the strength of association between the row and column variables:

$$\begin{aligned} V^2_{2(f)}&= \frac{I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})}{K_{2(f)}}, \end{aligned}$$

where

$$\begin{aligned} K_{2(f)}&= \sum ^c_{j=1}p^2_{\cdot j} f\left( \frac{1}{p_{\cdot j}} \right) . \end{aligned}$$

Therefore, the following theorem is obtained for measure \(V^2_{2(f)}\).

Theorem 2

For each convex function f,

  1. (i)

    \(0 \le V^2_{2(f)} \le 1\).

  2. (ii)

    \(V^2_{2(f)} = 0\) if and only if a structure of null association exists in the table (i.e., \(\{p_{ij} = p_{i \cdot }p_{\cdot j}\})\).

  3. (iii)

    \(V^2_{2(f)} = 1\) if and only if a structure of complete association exists; that is, for each row i \((i = 1, 2, \ldots , r)\), \(j_i\) uniquely exists such that \(p_{i, j_i} > 0\) and \(p_{ij} = 0\) for all other \(j(\ne j_i)\) (assuming \(p_{\cdot j} > 0\) for all j).

The proof of Theorem 2 is obtained in a manner similar to that of Theorem 1. \(V^2_{2(f)}\) coincides with \(V^2_{1(f)}\) computed after interchanging the rows and columns of the table, and it has no special characteristics compared with \(V^2_{1(f)}\). However, it is introduced because of its importance in Case III.

2.3 Case III

In an \(r \times c\) contingency table wherein explanatory and response variables are undefined, using \(V^2_{1(f)}\) or \(V^2_{2(f)}\) alone is inappropriate if we are interested in the degree to which knowledge of the value of one variable helps us predict the value of the other. For this symmetric situation, we propose the following measure, which combines the ideas of both \(V^2_{1(f)}\) and \(V^2_{2(f)}\):

$$\begin{aligned} V^2_{3(f)}&= h^{-1} \left( w_1h\left( V^2_{1(f)} \right) + w_2h\left( V^2_{2(f)} \right) \right) , \end{aligned}$$

where h is a monotonic function and \(w_1 + w_2 = 1\) \((w_1, w_2 \ge 0)\). Then, the following theorem is attained for measure \(V^2_{3(f)}\).

Theorem 3

For each convex function f,

  1. (i)

    \(0 \le V^2_{3(f)} \le 1\).

  2. (ii)

    \(V^2_{3(f)} = 0\) if and only if a structure of null association exists in the table (i.e., \(\{p_{ij} = p_{i \cdot }p_{\cdot j}\})\).

  3. (iii)

    \(V^2_{3(f)} = 1\) if and only if a structure of complete association exists; that is, at most one nonzero probability appears in each row or each column (assuming all marginal probabilities are nonzero).

The proof of Theorem 3 is provided in the Supplementary Material. We can show that, if \(h(u) = \log u\) and \(w_1=w_2\), \(V^2_{3(f)}\) is denoted by

$$\begin{aligned} V^2_{G(f)}&= \frac{I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})}{\sqrt{K_{1(f)}K_{2(f)}}} = \sqrt{V^2_{1(f)} V^2_{2(f)}}, \end{aligned}$$

and if \(h(u) = 1/u\) and \(w_1=w_2\), \(V^2_{3(f)}\) is represented by

$$\begin{aligned} V^2_{H(f)}&= \frac{2I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})}{K_{1(f)} + K_{2(f)}} = \frac{2V^2_{1(f)} V^2_{2(f)}}{V^2_{1(f)} + V^2_{2(f)}}. \end{aligned}$$

Notably, \(V^2_{G(f)}\) and \(V^2_{H(f)}\) are the geometric and harmonic means of \(V^2_{1(f)}\) and \(V^2_{2(f)}\), respectively. We confirm that, when \(f(x) = x^2-x\) with \(r=c\), \(V^2_{3(f)}\) is identical to Cramér’s coefficient \(V^2\). Moreover, for \(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\), \(V^2_{3(f)}\) is consistent with Miyamoto’s measure \(G^2_{(\lambda )}\) (Miyamoto et al. 2007).

For an \(r \times r\) contingency table with the same row and column classifications, \(V^2_{3(f)}=1\) if and only if, after interchanging some row and column categories, the main diagonal cell probabilities in the \(r \times r\) table are nonzero and the off-diagonal cell probabilities are all zero; that is, all observations concentrate on the main diagonal cells. When predicting the category values of an individual, \(V^2_{3(f)}\) specifies the degree to which the prediction can be improved if knowledge about the value of one variable exists. In this sense, \(V^2_{3(f)}\) also indicates the strength of association between the row and column variables. If only the marginal distributions \(\{p_{i\cdot }\}\) and \(\{p_{\cdot j}\}\) are known, we predict the values of the individual row and column categories in terms of probabilities under the independence structure.
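The combination rule and its geometric and harmonic special cases can be sketched as follows (ours; it assumes an asymmetric measure such as `V2_1_theta` above, and uses the fact that \(V^2_{2(f)}\) equals \(V^2_{1(f)}\) applied to the transposed table):

```python
import numpy as np

def V2_3(p, v2_1, h, h_inv, w1=0.5):
    """V^2_{3(f)} = h^{-1}( w1 h(V^2_{1(f)}) + w2 h(V^2_{2(f)}) ), w2 = 1 - w1;
    v2_1 is any asymmetric measure, applied to p and to p transposed."""
    p = np.asarray(p, float)
    return h_inv(w1 * h(v2_1(p)) + (1 - w1) * h(v2_1(p.T)))

# h(u) = log u gives the geometric mean; h(u) = 1/u gives the harmonic mean
def V2_G(p, v2_1):
    return V2_3(p, v2_1, np.log, np.exp)

def V2_H(p, v2_1):
    return V2_3(p, v2_1, lambda u: 1.0 / u, lambda u: 1.0 / u)
```

With equal weights, `V2_G` reduces to \(\sqrt{V^2_{1(f)} V^2_{2(f)}}\) and `V2_H` to \(2V^2_{1(f)}V^2_{2(f)}/(V^2_{1(f)}+V^2_{2(f)})\), as displayed above.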

Theorem 4

For any fixed convex function f and monotonic function h,

  1. 1.

    \(\min (V^2_{1(f)}, V^2_{2(f)}) \le V^2_{3(f)} \le \max (V^2_{1(f)}, V^2_{2(f)})\),

  2. 2.

    \(\min (V^2_{1(f)}, V^2_{2(f)}) \le V^2_{H(f)} \le V^2_{G(f)} \le \max (V^2_{1(f)}, V^2_{2(f)})\).

The proof of Theorem 4 is provided in the Supplementary Material. When \(f(x) = x^2-x\) with \(r=c\), we observe that \(V^2_{1(f)} = V^2_{2(f)} = V^2_{3(f)} = V^2_{H(f)} = V^2_{G(f)} = V^2\) (Cramér’s coefficient).

3 Relationship between measures and bivariate normal distribution

In the analysis of two-way contingency tables, Tallis (1962), Lancaster and Hamdan (1964), Kirk (1973), and Divgi (1979) proposed approaches based on the bivariate normal distribution. These approaches assume that the row and column classifications result from continuous random variables with a bivariate normal distribution, that is, that the sample contingency table comes from a discretized bivariate normal distribution. In many contexts, this assumption is invalid and a more general approach is needed. Therefore, Goodman (1981, 1985) presented an approximation to the correlation structure of discrete bivariate distributions based on the association model, and Becker (1989) made a similar proposal based on the KL-divergence. Nevertheless, assuming a bivariate normal distribution is important for examining the correlation structure of a contingency table, and previous studies have considered the association based on this model. In this section, we explain the relationship between the measures \(V^2_{t(f)}\) (\(t = 1,2,3\)) and the correlation coefficient \(\rho\) when a bivariate normal distribution can be assumed for the latent variables underlying the contingency table.

Assuming the latent variables, the (i,j) cell probability \(p_{ij}\) of the \(r \times c\) contingency table is denoted as

$$\begin{aligned} p_{ij}&= P(X = i, Y=j) \\&= P(x_{i-1}< X^* \le x_i, y_{j-1} < Y^* \le y_j) \\&= f_{X^*, Y^*}({\tilde{x}}_i, {\tilde{y}}_j)\Delta _{x_i} \Delta _{y_j}, \end{aligned}$$

where \(x_{i-1} < {\tilde{x}}_i \le x_i\), \(y_{j-1} < {\tilde{y}}_j \le y_j\) and \(f_{X^*, Y^*}({\tilde{x}}_i, {\tilde{y}}_j)\) is a continuous joint density function of random variables \(X^*\) and \(Y^*\). \(\Delta _{x_i}\) and \(\Delta _{y_j}\) are the width of intervals \((x_{i-1}, x_{i}]\) and \((y_{j-1}, y_{j}]\), respectively. In this situation, it is possible to approximate \(I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})\) as follows:

$$\begin{aligned} \begin{aligned}&I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\}) \\&= \sum ^r_{i=1} \sum ^c_{j=1} f_{X^*}({\tilde{x}}_i) f_{Y^*}({\tilde{y}}_j) f\left( \frac{f_{X^*,Y^*}({\tilde{x}}_i, {\tilde{y}}_j)}{f_{X^*}({\tilde{x}}_i) f_{Y^*}({\tilde{y}}_j)} \right) \Delta _{x_i} \Delta _{y_j} \\&\xrightarrow [\Delta _{x_i} \Delta _{y_j} \rightarrow 0]{} \int ^{\infty }_{-\infty } \int ^{\infty }_{-\infty } f_{X^*}(x) f_{Y^*}(y) f\left( \frac{f_{X^*,Y^*}(x, y)}{f_{X^*}(x) f_{Y^*}(y)} \right) dx dy, \end{aligned} \end{aligned}$$
(1)

where \(f_{X^*}(x)\) and \(f_{Y^*}(y)\) are marginal probability density functions of \(f_{X^*,Y^*}(x, y)\).

Let \(X^*\) and \(Y^*\) be random variables distributed according to the bivariate normal distribution with joint density function

$$\begin{aligned} \begin{aligned} f_{X^*,Y^*}(x, y)&= \frac{1}{2\pi \sigma _x \sigma _y \sqrt{1-\rho ^2}} \exp \left[ -\frac{1}{2(1-\rho ^2)} \right. \\&\quad \left. \qquad \left\{ \left( \frac{x-\mu _x}{\sigma _x} \right) ^2 - 2\rho \left( \frac{x-\mu _x}{\sigma _x} \right) \left( \frac{y-\mu _y}{\sigma _y} \right) + \left( \frac{y-\mu _y}{\sigma _y} \right) ^2 \right\} \right] \\&\quad -\infty< x< +\infty , \quad -\infty< y < +\infty \end{aligned} \end{aligned}$$

where \(\rho\) is the correlation coefficient between \(X^*\) and \(Y^*\), taking values from \(-1\) to 1. In the formula, the standard deviations \(\sigma _x\) and \(\sigma _y\) are positive constants, whereas the means \(\mu _x\) and \(\mu _y\) may be any real constants. When applying \(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\), the relationship between the power-divergence and the correlation coefficient \(\rho\) is expressed as

$$\begin{aligned} I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})&\approx \frac{1}{\lambda (\lambda + 1)} \left\{ (1-\rho ^2)^{-\frac{\lambda }{2}}(1-\lambda ^2 \rho ^2)^{-\frac{1}{2}}-1 \right\} , \end{aligned}$$
(2)

where \(\lambda < 1/\vert \rho \vert\). Therefore, values of \(\lambda\) less than 1 are preferable under this assumption. To capture the relationship between the measures and the correlation coefficient \(\rho\) at \(\lambda = 0\), obtained as the continuous limit as \(\lambda \rightarrow 0\) (i.e., \(f(x) = x\log x\)), the divergence can be expressed as

$$\begin{aligned} I_f(\{p_{ij}\};\{p_{i \cdot }p_{\cdot j}\})&\approx - \frac{1}{2}\log (1-\rho ^2). \end{aligned}$$
(3)

When we consider the latent variables and approximate the divergence, the relationship can be expressed as in (2) and (3). These equations show that the divergence is monotonically increasing with respect to \(\vert \rho \vert\). Therefore, the measures capture this relationship while remaining bounded above.
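As a quick numerical check, the right-hand sides of (2) and (3) can be evaluated directly; the following sketch (ours; numpy assumed) shows that both increase with \(\vert\rho\vert\):

```python
import numpy as np

def power_div_approx(rho, lam):
    """Right-hand side of (2); requires lam < 1/|rho|."""
    return ((1 - rho**2)**(-lam / 2) * (1 - lam**2 * rho**2)**(-0.5) - 1) / (lam * (lam + 1))

def kl_approx(rho):
    """Right-hand side of (3), the lambda -> 0 limit."""
    return -0.5 * np.log(1 - rho**2)

for rho in [0.0, 0.2, 0.4, 0.6, 0.8]:
    print(f"rho={rho:.1f}  KL~{kl_approx(rho):.4f}  power(lam=0.5)~{power_div_approx(rho, 0.5):.4f}")
```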

This section showed the relationship between the measures and the correlation coefficient \(\rho\) using the bivariate normal distribution and \(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\) as examples. However, for the \(\theta\)-divergence and more general divergences, it is difficult to calculate (1) in closed form. Therefore, in the next section, we confirm numerically that the value of the measures increases monotonically as the correlation coefficient moves away from 0, even when the \(\theta\)-divergence is applied.

4 Numerical study

This section compares the measures across functions and parameters. In the numerical study, we use artificial data generated from discrete bivariate distributions with zero means and unit variances, as in Goodman (1981, 1985) and Becker (1989). The bivariate normal distribution is partitioned using cut-points that generate uniform marginal distributions. For instance, when creating a \(4\times 4\) probability table, we split the bivariate normal distribution using \(z_{0.25}\), \(z_{0.50}\), and \(z_{0.75}\) as cut-points. The \(4\times 4\) artificial probability tables created for the numerical study are given in the Supplementary Material. The benefit of this method is that the strength of association between the row and column variables in the contingency table is known from the bivariate normal distribution, which makes it appropriate for examining the measures. For the comparison of the measures, we use Tomizawa’s power-divergence-type measures (\(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\) for \(0 \le \lambda \le 1\)) and the newly proposed \(\theta\)-divergence-type measures (\(f(x) = (x-1)^2/(\theta x + 1 - \theta ) + (x-1)/(1 - \theta )\) for \(0 \le \theta < 1\)), each of which is a single-parameter divergence extending Cramér’s coefficient \(V^2\).
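The table construction can be sketched as follows (SciPy assumed; \(\pm 8\) standard deviations stand in for \(\pm\infty\)):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def bvn_table(rho, probs=(0.25, 0.50, 0.75)):
    """Discretize a standard bivariate normal into a probability table using
    quartile cut-points, which yield uniform marginal distributions."""
    edges = np.concatenate(([-8.0], norm.ppf(probs), [8.0]))   # +/-8 ~ +/-infinity
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    F = lambda x, y: mvn.cdf(np.array([x, y]))
    k = len(edges) - 1
    p = np.empty((k, k))
    for i in range(k):
        for j in range(k):   # rectangle probability via inclusion-exclusion on the CDF
            p[i, j] = (F(edges[i + 1], edges[j + 1]) - F(edges[i], edges[j + 1])
                       - F(edges[i + 1], edges[j]) + F(edges[i], edges[j]))
    return p / p.sum()       # renormalize away tiny truncation error

p = bvn_table(rho=0.4)       # e.g., a 4x4 table of the kind summarized in Table 1
```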

Table 1 presents the values of the measures \(V^2_{t(f)}\) (\(t=1,2,3\)) for each \(4\times 4\) probability table with \(\rho = 0.0, 0.4, 0.8, 1.0\). Notably, in the case of \(r \times r\) artificial contingency tables, \(\{p_{i \cdot }\}\) and \(\{p_{\cdot j}\}\) are each constant, so that \(V^2_{1(f)} = V^2_{2(f)} = V^2_{3(f)}\). Table 1 shows that, as the correlation moves away from 0, the values of \(V^2_{t(f)}\) approach 1.0. Further, the measures equal 0 if and only if \(\rho = 0\) (a structure of null association), and they equal 1 if and only if \(\rho = 1.0\) (a structure of complete association). The sharp increase near \(\rho = 1.0\) can be explained by the relationship between the measures and the correlation coefficient \(\rho\) shown in the previous section. Another important finding is how each measure increases at \(\rho = 0.4, 0.8\). In the case of \(V^2\) (\(\lambda = 1, \theta = 0\)), the increase of \(V^2\) with \(\rho\) is slower than that of most other measures. These results suggest that \(V^2\) may not accurately distinguish small differences in the strength of association when comparing multiple contingency tables generated from bivariate normal distributions; a broader family of measures therefore permits a more careful analysis. The same is true for the power-divergence-type measures, which have an increasing trend similar to \(V^2\). It may thus be better to use the \(\theta\)-divergence-type measures with \(\theta =0.7\) to detect small differences in the strength of association. Values of \(V^2_{t(f)}\) for other \(\rho\), together with coverage probabilities, are provided in the Supplementary Material.

Table 1 Values of measures \(V^2_{t(f)}\) \((t =1, 2, 3)\) setting (a) the power-divergence for any \(\lambda\) and (b) the \(\theta\)-divergence for any \(\theta\) in \(4\times 4\) probability tables with \(\rho = 0, 0.4, 0.8, 1.0\)

5 Approximate confidence intervals for measure

In the previous section, we confirmed the values of the proposed measures with simulated data. However, when analyzing real data, \(\{p_{ij}\}\) is unknown, and so are the values of the measures. Hence, it is necessary to construct confidence intervals. In this section, we construct asymptotic confidence intervals using the delta method. Let \(\{n_{ij}\}\) denote the observed frequencies from a multinomial distribution, and let n denote the total number of observations, namely, \(n = \sum ^r_{i=1} \sum ^c_{j=1}n_{ij}\). The approximate standard error and large-sample confidence interval are obtained for \(V^2_{t(f)}\) \((t =1, 2, 3)\) using the delta method, which is described in, for example, Agresti (2003) and Bishop et al. (2007). The estimator \({\hat{V}}^2_{t(f)}\) of \(V^2_{t(f)}\) is given by \(V^2_{t(f)}\) with \(\{p_{ij}\}\) replaced by \(\{ {\hat{p}}_{ij}\}\), where \({\hat{p}}_{ij} = n_{ij}/n\). By the delta method, \(\sqrt{n}({\hat{V}}^2_{t(f)} - V^2_{t(f)})\) is asymptotically normally distributed (i.e., as \(n \rightarrow \infty\)) with mean 0 and variance \(\sigma ^2[V^2_{t(f)}]\). Refer to the Supplementary Material for the form of \(\sigma ^2[V^2_{t(f)}]\).

We define f(x) as once-differentiable and strictly convex, and let \(f'(x)\) denote the first derivative of f(x) with respect to x. Let \({\hat{\sigma }}^2[V^2_{t(f)}]\) be \(\sigma ^2[V^2_{t(f)}]\) with \(\{p_{ij}\}\) replaced by \(\{ {\hat{p}}_{ij}\}\). Then, an estimated standard error of \({\hat{V}}^2_{t(f)}\) is \({\hat{\sigma }}[V^2_{t(f)}] / \sqrt{n}\), and an approximate \(100(1-\alpha )\) percent confidence interval of \(V^2_{t(f)}\) is \({\hat{V}}^2_{t(f)} \pm z_{\alpha /2} {\hat{\sigma }}[V^2_{t(f)}] / \sqrt{n}\), where \(z_{\alpha /2}\) is the upper \(\alpha /2\) percentage point of the standard normal distribution.
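The closed-form variance is not reproduced here. As a generic stand-in for readers who want interval estimates without deriving \(\sigma ^2[V^2_{t(f)}]\), the following percentile-bootstrap sketch (our illustration, not the paper’s delta-method interval) applies to any plug-in measure:

```python
import numpy as np

def bootstrap_ci(counts, measure, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a plug-in measure of an observed frequency
    table; `measure` maps a probability table to a scalar (e.g., V2_1_theta)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, float)
    n = int(counts.sum())
    p_hat = (counts / n).ravel()
    stats = [measure(rng.multinomial(n, p_hat).reshape(counts.shape) / n)
             for _ in range(B)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```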

6 Examples

In this section, we explain the benefits of using the f-divergence to extend the measures, with some actual data examples. We use Tomizawa’s power-divergence-type measures (\(f(x)=(x^{\lambda +1}-x)/\lambda (\lambda +1)\) for \(\lambda = 0.0, 0.6, 1.0, 1.2, 1.5\)) and the newly proposed \(\theta\)-divergence-type measures (\(f(x) = (x-1)^2/(\theta x + 1 - \theta ) + (x-1)/(1 - \theta )\) for \(\theta = 0.0, 0.3, 0.5, 0.7, 0.9\)), each of which is a single-parameter divergence extending Cramér’s coefficient \(V^2\). We observe the estimates of the measures and their confidence intervals.

6.1 Example 1

Consider the data in Table 2, taken from the 2006 General Social Survey. The data show the relationship between family income and education in the United States, separately for the black and white categories of race. By applying the measure \(V^2_{1(f)}\), we consider the degree to which the prediction of family income for an individual can be improved when the individual’s educational degree is known, for each racial category.

Table 2 Data on educational degrees and family income, by race

Table 3 shows the estimates of the measures, standard errors, and \(95\%\) confidence intervals. Tables 3(a1, a2) and 3(b1, b2) show the results of the analysis of Tables 2(a) and 2(b), respectively. One interesting finding is that the confidence intervals for \(V^2_{1(f)}\) do not contain zero for any \(\lambda\) or \(\theta\). Thus, both data sets exhibit an association structure from every point of view, not only that of Cramér’s coefficient \(V^2\). Another important finding concerns the comparison of the confidence intervals. For the conventional power-divergence-type measures, a comparison of Tables 3(a1) and 3(b1) shows that the confidence intervals overlap for all \(\lambda\). Meanwhile, when \(\theta = 0.9\) in Tables 3(a2) and 3(b2), the confidence intervals do not overlap; Table 2(a), whose estimate is closer to 0, is closer to independence. Therefore, this analysis reveals the merit of the measures extended with the f-divergence: they express differences that do not appear with the conventional ones.

Table 3 Estimate of the measure \(V^2_{1(f)}\), estimated approximate standard error for \({\hat{V}}^2_{1(f)}\), and approximate \(95\%\) confidence interval of \(V^2_{1(f)}\) applying (a1, b1) the power-divergence for any \(\lambda\) and (a2, b2) the \(\theta\)-divergence for any \(\theta\)

6.2 Example 2

Consider the data in Table 4, obtained from Tomizawa (1985). These tables provide information on the unaided distance vision of 4746 university students aged 18 to about 25 and 3168 elementary students aged 6 to about 12. In Table 4, the row and column variables are the right and left eye grades, respectively, with the categories ordered from the highest grade (1) to the lowest grade (4). As the right and left eye grades have similar classifications, we apply measure \(V^2_{H(f)}\).

Table 4 Unaided distance vision data for university and elementary students

Table 5 provides the estimates of the measures, standard errors, and confidence intervals. Tables 5(a1, a2) and 5(b1, b2) show the results of the analysis of Tables 4(a) and 4(b), respectively. The results show that both data sets have a strong association structure in terms of the estimates and confidence intervals for all \(\lambda\) and \(\theta\). Comparing the values of the measures between Tables 5(a1, a2) and 5(b1, b2), we found that the strength of association between the right and left eyes is greater for university students in terms of the estimates for each parameter. Comparing the confidence intervals leads to a similar conclusion: the confidence intervals for all \(\lambda\) overlap, but those for \(\theta = 0.5, 0.7, 0.9\) do not. Another interesting finding is that, unlike in Example 1, the confidence intervals in Example 2 do not overlap even when \(\theta = 0.5, 0.7\). In terms of the triangular discrimination \(\Delta\) (\(\theta = 0.5\)), a view not available with \(V^2\) or the power-divergence-type measures, this result provides evidence that Table 4(a) has a stronger association structure. Therefore, the extension by the f-divergence helps us perform the analysis safely.

Table 5 Estimate of measures \(V^2_{H(f)}\), estimated approximate standard error for \({\hat{V}}^2_{H(f)}\), and approximate \(95\%\) confidence interval of \(V^2_{H(f)}\) applying (a1, b1) the power-divergence for any \(\lambda\) and (a2, b2) the \(\theta\)-divergence for any \(\theta\)

Remark 1

(Brief guideline for choosing functions and parameters) In an analysis with our proposed measures \(V^2_{t(f)}\) \((t=1,2,3)\), users need to choose a divergence and set its parameters. These choices should be determined by the notion of distance from which users want to examine the data. An instance of how to choose is described below, along with a limitation of Cramér’s coefficient \(V^2\). Kvålseth (2018) points out some limitations of \(V^2\); although the main objective of this study is to generalize \(V^2\), the generalization can also provide an improvement. One limitation is that the degree of association may be overestimated when the observed frequencies are small. This limitation also applies to Tomizawa’s power-divergence-type measures \(V^2_{t(\lambda )}\) \((t=1,2,3)\), which are not improved in this respect by the generalization with the parameter \(\lambda\). In such cases, an evaluation can be given from a point of view similar to the Euclidean distance by using the \(\theta\)-divergence (except \(\theta =0\)). As an example, consider the artificial data in Table 6. The data are clearly very near independence, with the elementwise distance \(\vert p_{ij}-p_{i\cdot }p_{\cdot j}\vert\) being either 0 or 0.01 and \(\sum ^3_{i=1} \sum ^3_{j=1} \vert p_{ij}-p_{i\cdot }p_{\cdot j}\vert = 0.04\). However, Table 7 shows that Cramér’s coefficient and Tomizawa’s power-divergence-type measure both take large values, whereas the \(\theta\)-divergence-type measure stays close to the Euclidean distance. In this way, it is necessary to choose the divergence and set the parameters according to the viewpoint from which users evaluate the degree of departure from independence.

Table 6 Artificial data to show differences from Cramér’s coefficient \(V^2\) and Tomizawa’s power-divergence-type measures \(V^2_{t(\lambda )} (t=1,2,3)\)
Table 7 Values of measures \(V^2_{H(f)}\) setting (a) the power-divergence for any \(\lambda\) and (b) the \(\theta\)-divergence for any \(\theta\), computed for Table 6

7 Conclusion

We found that the strength of association between the row and column variables in two-way contingency tables can be safely analyzed with the proposed measures \(V^2_{t(f)}\) \((t =1, 2, 3)\), which generalize Cramér’s coefficient \(V^2\) via the f-divergence. First, this study proved that any measure constructed from a function f(x) satisfying the conditions of the f-divergence has the properties desirable for measuring the strength of association in contingency tables. Hence, we can easily construct a new measure from a divergence that has the properties essential for the analyst. Furthermore, we can give interpretations of the association between rows and columns in contingency tables that cannot be obtained with a conventional measure. Second, we showed the relationship between the proposed measures \(V^2_{t(f)}\) and the bivariate normal distribution. We found that the relationship between the power-divergence and the correlation coefficient \(\rho\) has an approximate closed form, which is most succinct at \(\lambda = 0\).

Measures \(V^2_{t(f)}\) always range between 0 and 1, independently of the dimensions r and c and the sample size n. Thus, they are useful for comparing the strength of association between the row and column variables across several tables. This is crucial for checking the relative magnitude of the strength of association against the degree of complete association. Specifically, \(V^2_{1(f)}\) (\(V^2_{2(f)}\)) is effective when the row and column variables are the response (explanatory) and explanatory (response) variables, respectively, while \(V^2_{3(f)}\) is useful when explanatory and response variables are not defined. In practice, we first need to check whether independence holds by using a test statistic, such as Pearson’s chi-squared statistic. If a structure of association is found, the next step is to measure its strength by using \(V^2_{t(f)}\); if the table is determined to be independent, applying \(V^2_{t(f)}\) may not be meaningful. Furthermore, \(V^2_{t(f)}\) is invariant under any permutation of the categories, so it can be applied to data on a nominal or ordinal scale.

When using our proposed measures \(V^2_{t(f)}\), a brief guideline for choosing the divergence and setting the parameters is described in Remark 1. If users do not have a particular divergence or parameter in mind, they should examine the data in depth by applying several divergences and parameters, as in the selection method described in Momozaki et al. (2023), with which we agree. Statisticians may be interested in choosing the divergence and parameters mathematically, based on, for example, characteristics of the data or relationships between the row and column variables. However, this would be difficult to discuss in this article, so these aspects are left for future studies. In addition, extending the measures to multi-way contingency tables remains necessary.