1 Introduction

Consider \(R \times R\) square contingency tables with the same row and column ordinal classifications. Such tables are used in many disciplines, including data science, engineering, and medical research; see, for example, Agresti [1].

Let \(s_{k}\) be the ordered score of category k for all \(k=1,\dots ,R\), where \(s_{1}<\cdots < s_{R}\). We consider data sets that: (i) can be assigned known ordered scores for all categories, (ii) can be assigned known ordered scores for all except one category, and (iii) cannot be assigned known ordered scores for any category.

Typical examples of types (i) and (ii) are categorical variables defined as intervals of a continuous variable. When there is clear information about the category intervals, it is recommended that the interval midpoints (midpoint scores) be assigned as ordered scores instead of equally spaced scores (see Graubard and Korn [2] and Senn [3]).

For the analysis of data sets of type (i), the ordinal quasi-symmetry (OQS) model proposed by Agresti [1] is often used. The OQS model indicates the asymmetric structure of the cell probabilities with respect to the main-diagonal cells of the table. The OQS model assumes that the row category k and column category k are assigned the same known score \(s_{k}\) for all \(k=1,\dots ,R\). This assumption is natural for square contingency tables with the same row and column ordinal classifications.

Table 1 Cross-classification of 1995 income data (in units of 10,000 yen) for individuals (male) and their spouses (female) in Japan, derived from the Social Science Japan Data Archive (available at https://nesstar.iss.u-tokyo.ac.jp/webview/)

We consider the data set in Table 1, which presents the cross-classification of 1995 income data for married couples in Japan. Individual and spouse incomes are categorized as “less than 70”, “from 70 to less than 150”, “from 150 to less than 450”, and “450 or more”. We assign the midpoints 35, 110, and 300 as the ordered scores for the first, second, and third categories, respectively. However, as the fourth category is unbounded above, we cannot assign a known ordered score to it.

Gautam [4] suggests that the ordered scores for a data set with an open-ended category should be assigned as follows: the scores \(s_{1}\) to \(s_{R-1}\) are midpoint scores, and the score \(s_{R}\) is unknown. The score \(s_{R}\) can then be expressed as \(s_{R}=w_{0}+w\), where \(w_{0}\) is the smallest value of the interval for the open-ended category, and \(w\ge 0\) is unknown. Gautam [4] assumes that the row category k and column category k are assigned the same score \(s_{k}\) for all \(k=1,\dots ,R\). As alternative ordered scores for a data set with an open-ended category, Aktas and Wu [5] introduce standardized z-scores for the row and column categories, and propose a model that indicates the asymmetric structure depending on the standardized z-scores. However, the standardized z-scores assign different scores to the row category k and the column category k for all \(k=1,\dots ,R\). Therefore, for the analysis of data sets of type (ii), we are interested in a model that indicates the asymmetric structure of the cell probabilities depending on the scores proposed by Gautam [4].

We further consider the data set in Table 2, obtained from Agresti [1], which presents the cross-classification of occupational status categories for father and son dyads in Britain. In this case, the scores \(s_{1}\) to \(s_{R}\) are treated as unknown because it is difficult to assign a known ordered score to any category.

For the analysis of data sets of type (iii), the ridit score type quasi-symmetry (RQS) model proposed by Iki et al. [6] is often used. The RQS model indicates the asymmetric structure of the cell probabilities depending on the ridit scores. The ridit scores are unknown ordered scores, defined as the average of the row and column marginal ridits. Thus, the RQS model also assumes that the row category k and column category k are assigned the same unknown score \(s_{k}\) for all \(k=1,\dots ,R\).

Bagheban and Zayeri [7] propose a power parameter score. Although Bagheban and Zayeri [7] treat the power parameter score as a known ordered score, we modify it so that it can be treated as an unknown ordered score. Thus, for the analysis of data sets of type (iii), we are interested in proposing a model that indicates the asymmetric structure of the cell probabilities depending on the power parameter score.

Table 2 Cross-classification of occupational status categories for father and son dyads in Britain, obtained from Agresti [1]

This study proposes two original asymmetry models based on non-integer scores for square contingency tables with the same row and column ordinal classifications. One of the proposed models is useful for data sets with an open-ended category, namely type (ii). The other is applicable to data sets that cannot be assigned known ordered scores for any category, namely type (iii).

The remainder of this paper is organized as follows. Sect. 2 proposes two original models based on non-integer scores. Sect. 3 demonstrates the utility of these proposed models as applied to the real-world data presented in Tables 1 and 2. We conclude the paper in Sect. 4.

2 Proposed Models

2.1 Existing Models Based on Non-integer Scores

Let \(p_{ij}\) denote the probability that an observation will fall in the \((i,j)\)th cell of the table (\(i=1,\dots ,R;j=1,\dots ,R\)).

This study focuses on a model having the following formula:

$$\begin{aligned} p_{ij} = \delta ^{s_j-s_i}p_{ji} \quad (i<j). \end{aligned}$$

Note that \(s_{k}\) is the ordered score of category k for all \(k=1,\dots ,R\), where \(s_{1}<\cdots < s_{R}\). This model can represent various models depending on how \(s_k\) is set.

As the model based on integer scores (i.e., \(s_k = k\) for all \(k=1,\dots ,R\)), the linear diagonals-parameter symmetry (LDPS) model proposed by Agresti [8] was defined as

$$\begin{aligned} p_{ij} = \delta ^{j-i}p_{ji} \quad (i<j). \end{aligned}$$

We now introduce existing models based on non-integer scores (i.e., scores \(s_k\) that need not equal k). When we can assign the known ordered scores \(s_{k}\) to the category k for all \(k=1,\dots ,R\), we can use the ordinal quasi-symmetry (OQS) model proposed by Agresti [1], defined as

$$\begin{aligned} p_{ij} = \delta ^{s_j-s_i}p_{ji} \quad (i<j). \end{aligned}$$

Note that the OQS model corresponds to the type (i) data set. The OQS model with equally spaced scores (\(s_k = s_1 + (k-1)d\) for \(k=1,\dots ,R\)) is equivalent to the LDPS model. The OQS model with \(\delta =1\) is identical to the symmetry (S) model proposed by Bowker [9]. Kateri and Agresti [10] considered the OQS model based on f-divergence; see also Saigusa et al. [11].

Let X and Y denote the row and column variables, respectively. Let \(p_{k\cdot }=\sum ^{R}_{l=1}p_{kl}\) and \(p_{\cdot k}=\sum ^{R}_{l=1}p_{lk}\) denote the marginal probabilities, and let \(F^{X}_{k} = \sum ^{k}_{l=1}p_{l\cdot }\) and \(F^{Y}_{k} = \sum ^{k}_{l=1}p_{\cdot l}\) denote the marginal distribution functions for \(k=1,\dots ,R\), where \(F^{X}_{R}=F^{Y}_{R}=1\). Then the marginal ridits are defined as

$$\begin{aligned} r^{X}_k = \sum ^{k-1}_{l=1} p_{l \cdot } + \frac{p_{k \cdot }}{2} \quad \mathrm{and} \quad r^{Y}_k = \sum ^{k-1}_{l=1} p_{\cdot l} + \frac{p_{\cdot k}}{2} \quad (k=1,\dots ,R), \end{aligned}$$

see Bross [12].

When we cannot assign the known ordered scores \(s_{k}\) to the category k for all \(k=1,\dots ,R\), we adopt the RQS model proposed by Iki et al. [6]:

$$\begin{aligned} p_{ij} = \delta ^{s_j-s_i}p_{ji} \quad (i<j), \end{aligned}$$

where \(s_{k}=(r^{X}_{k} + r^{Y}_{k})/2\) for all \(k=1,\dots ,R\). Note that \(\{s_{k}\}\) in the RQS model are unspecified (i.e., the unknown ordered scores), and the RQS model corresponds to the data set of type (iii).
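The ridit-based scores can be illustrated with a short computation. The following is a minimal sketch, assuming the cell probabilities are supplied as a nested list; the function names and the 3 × 3 probabilities are hypothetical and are not taken from the paper. Under the RQS model itself the probabilities are unknown, so in practice the scores would be evaluated at estimated values.

```python
def marginal_ridits(p):
    """Return (r_X, r_Y), the row and column marginal ridits of a square table p."""
    R = len(p)
    row = [sum(p[k][l] for l in range(R)) for k in range(R)]  # p_{k.}
    col = [sum(p[l][k] for l in range(R)) for k in range(R)]  # p_{.k}

    def ridits(marginal):
        out, cumulative = [], 0.0
        for m in marginal:
            # sum of the preceding marginal probabilities plus half of the current one
            out.append(cumulative + m / 2.0)
            cumulative += m
        return out

    return ridits(row), ridits(col)


def rqs_scores(p):
    """Average of the row and column marginal ridits, s_k = (r_X_k + r_Y_k) / 2."""
    r_x, r_y = marginal_ridits(p)
    return [(x + y) / 2.0 for x, y in zip(r_x, r_y)]


# Hypothetical cell probabilities (they sum to 1); not data from the paper.
p = [[0.20, 0.10, 0.05],
     [0.05, 0.20, 0.10],
     [0.05, 0.05, 0.20]]
scores = rqs_scores(p)
```

Because every marginal probability here is positive, the resulting scores are strictly increasing in k, as the model formula requires.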

We highlight that a model corresponding to the data set of type (ii) does not exist in a similar form to the OQS and RQS models.

2.2 Proposed Models Based on Non-integer Scores

We propose two original models based on non-integer scores corresponding to the data set of types (ii) and (iii). First, we propose an original model corresponding to the data set of type (ii), defined as

$$\begin{aligned} p_{ij} = \delta ^{s_j-s_i}p_{ji} \quad (i<j), \end{aligned}$$

where \(s_1\) to \(s_{R-1}\) are known, and \(s_R\) is unknown. Specifically, \(s_{1}\) to \(s_{R-1}\) are assigned known ordered scores (e.g., midpoint scores), and \(s_{R}\) is defined as \(s_{R}=w_{0}+w\), where \(w_{0}\) is the smallest value of the interval for the open-ended category, and \(w\ (\ge 0)\) is unspecified. We refer to this model as the open-ended category type asymmetry (OEAS) model.
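The construction of the OEAS scores can be sketched as follows. This is only an illustration: the function name is hypothetical, the lower bound 0 for the first income category is an assumption (consistent with the midpoint 35 used for Table 1), and the value of w is arbitrary, since w is estimated from the data in the model.

```python
def oeas_scores(bounds, w):
    """Midpoint scores for the bounded categories plus w0 + w for the open-ended one.

    bounds: increasing interval boundaries; the last element is w0, the lower
            bound of the open-ended category.
    w:      non-negative offset (unknown in the model; supplied here by hand).
    """
    if w < 0:
        raise ValueError("w must be non-negative")
    midpoints = [(lo + hi) / 2.0 for lo, hi in zip(bounds[:-1], bounds[1:])]
    return midpoints + [bounds[-1] + w]


# Income boundaries of Table 1 in units of 10,000 yen; the lower bound 0 is assumed.
bounds = [0, 70, 150, 450]
scores = oeas_scores(bounds, w=10.0)  # w = 10 is an arbitrary illustrative value
```

Only the last score changes with w, so the model adds a single parameter to the OQS model with midpoint scores.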

Second, we consider a model corresponding to the data set of type (iii). Bagheban and Zayeri [7] consider a power parameter score as follows:

$$\begin{aligned} k^{a} \quad (k=1,\dots ,R), \end{aligned}$$

where \(a > 0\). The power parameter score has the following properties:

  (1) if \(a < 1\) then the difference in scores between category \(k+1\) and k decreases as k increases;

  (2) if \(a > 1\) then the difference in scores between category \(k+1\) and k increases as k increases;

  (3) if \(a = 1\) then the power parameter score is equivalent to the equally spaced score.
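These three properties can be checked numerically. The sketch below uses hypothetical helper names and arbitrary values of a chosen only for illustration.

```python
def power_scores(R, a):
    """Power parameter scores k**a for k = 1, ..., R (requires a > 0)."""
    if a <= 0:
        raise ValueError("a must be positive")
    return [k ** a for k in range(1, R + 1)]


def score_gaps(scores):
    """Differences between the scores of adjacent categories."""
    return [high - low for low, high in zip(scores, scores[1:])]


gaps_concave = score_gaps(power_scores(4, 0.5))  # a < 1: gaps shrink as k increases
gaps_convex = score_gaps(power_scores(4, 2))     # a > 1: gaps grow as k increases
gaps_linear = score_gaps(power_scores(4, 1))     # a = 1: equally spaced scores
```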

Bagheban and Zayeri [7] treated a as known but did not discuss how to select the optimal value of a. In the OQS model, Ando [13] used the power parameter score as a known ordered score and selected the optimal value of a by a grid search. In contrast, we propose the following original model treating a as unknown:

$$\begin{aligned} p_{ij} = \delta ^{s_j-s_i}p_{ji} \quad (i<j), \end{aligned}$$

where \(s_{k}=k^a\) for \(k=1,\dots ,R\), and \(a \ (>0)\) is unknown. We refer to this model as the power parameter type asymmetry (PPAS) model. The PPAS model with \(a=1\) is identical to the LDPS model.

Under the OEAS and PPAS models, the following properties hold:

  (1) if \(\delta >1\) then \(F^{X}_{k} > F^{Y}_{k}\) for all \(k=1,\dots ,R-1\) because \(p_{ij} > p_{ji}\) for all \(i<j\);

  (2) if \(\delta <1\) then \(F^{X}_{k} < F^{Y}_{k}\) for all \(k=1,\dots ,R-1\) because \(p_{ij} < p_{ji}\) for all \(i<j\);

  (3) if \(\delta = 1\) then the S model holds because \(p_{ij} = p_{ji}\) for all \(i<j\).

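From properties (1) and (2), the parameter \(\delta\) in the OEAS or PPAS model allows us to infer whether X is stochastically greater than Y or vice versa: \(\delta < 1\) indicates that X tends to be greater than Y, and \(\delta > 1\) indicates the reverse.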
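The step from the cell-wise inequalities to the marginal distribution functions can be made explicit. As a short derivation, assuming only the definitions of \(F^{X}_{k}\) and \(F^{Y}_{k}\) given in Sect. 2.1, the terms with both indices at most k cancel in pairs, so that for \(k=1,\dots ,R-1\),

$$\begin{aligned} F^{X}_{k} - F^{Y}_{k} = \sum ^{k}_{i=1}\sum ^{R}_{j=1} (p_{ij} - p_{ji}) = \sum ^{k}_{i=1}\sum ^{R}_{j=k+1} (p_{ij} - p_{ji}) = \sum ^{k}_{i=1}\sum ^{R}_{j=k+1} \left( \delta ^{s_j-s_i} - 1\right) p_{ji}. \end{aligned}$$

Since \(s_j > s_i\) for \(i<j\), every factor \(\delta ^{s_j-s_i}-1\) has the same sign as \(\delta -1\), which gives properties (1) to (3) provided that at least one \(p_{ji}\) in the sum is positive.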

2.3 Goodness-of-Fit Test

Let \(n_{ij}\) denote the observed frequency in the \((i,j)\)th cell of the table (\(i, j = 1,\dots ,R\)). Assume that a multinomial distribution applies to the \(R\times R\) table. The maximum likelihood estimates of the expected frequencies under each model can be obtained by applying the Newton–Raphson method to the log-likelihood equations.
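To illustrate the kind of iteration involved, the following minimal sketch fits \(\delta\) by Newton–Raphson in the simpler case where all scores are known (the OQS setting). It is not the authors' implementation: it relies on the assumption that, because the model restricts only the ratios \(p_{ij}/p_{ji}\), the diagonal cells and the pair totals \(n_{ij}+n_{ji}\) are fitted exactly, so the likelihood reduces to a one-parameter logistic form in \(\log \delta\). The function name and the counts are hypothetical. The OEAS and PPAS models would additionally require estimating w or a, which this sketch does not cover.

```python
import math


def fit_delta_known_scores(n, s, tol=1e-10, max_iter=100):
    """Newton-Raphson for beta = log(delta) under p_ij = delta**(s_j - s_i) * p_ji, i < j."""
    R = len(n)
    pairs = [(i, j) for i in range(R) for j in range(i + 1, R)]
    beta = 0.0  # start from symmetry (delta = 1)
    for _ in range(max_iter):
        grad, hess = 0.0, 0.0
        for i, j in pairs:
            d = s[j] - s[i]
            total = n[i][j] + n[j][i]
            # conditional probability of cell (i, j) given the pair {(i, j), (j, i)}
            theta = 1.0 / (1.0 + math.exp(-d * beta))
            grad += d * (n[i][j] - total * theta)
            hess -= d * d * total * theta * (1.0 - theta)
        if hess == 0.0:  # no off-diagonal observations: nothing to estimate
            break
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    # Fitted frequencies: diagonal and pair totals are reproduced exactly.
    m = [[float(n[i][j]) for j in range(R)] for i in range(R)]
    for i, j in pairs:
        total = n[i][j] + n[j][i]
        theta = 1.0 / (1.0 + math.exp(-(s[j] - s[i]) * beta))
        m[i][j] = total * theta
        m[j][i] = total * (1.0 - theta)
    return math.exp(beta), m


# Hypothetical 3 x 3 counts with integer scores (the LDPS special case).
n = [[50, 30, 10],
     [20, 60, 25],
     [5, 15, 40]]
delta_hat, fitted = fit_delta_known_scores(n, s=[1, 2, 3])
```

For scores on a large scale, such as incomes, a damped step or rescaled scores may be needed to keep the exponentials within floating-point range.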

Each model can be tested for goodness-of-fit using the likelihood ratio chi-square statistic (denoted by \(G^{2}\)) with the corresponding degrees of freedom. The test statistic \(G^{2}\) of model M is given as

$$\begin{aligned} G^{2}(M)=2\sum ^{R}_{i=1}\sum ^{R}_{j=1} n_{ij} \log \left( \frac{n_{ij}}{\hat{m}_{ij}}\right) , \end{aligned}$$

where \(\hat{m}_{ij}\) is the maximum likelihood estimate (MLE) of the expected frequency \(m_{ij}\) under model M.
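As a concrete illustration of the statistic, the sketch below evaluates \(G^{2}\) for the S model, whose fitted frequencies have the closed form \(\hat{m}_{ij}=(n_{ij}+n_{ji})/2\). The function names and the counts are hypothetical, and cells with \(n_{ij}=0\) are taken to contribute zero to the sum.

```python
import math


def g2(n, m):
    """Likelihood ratio chi-square statistic for observed n and fitted m."""
    total = 0.0
    for row_n, row_m in zip(n, m):
        for observed, expected in zip(row_n, row_m):
            if observed > 0:  # convention: a zero count contributes nothing
                total += observed * math.log(observed / expected)
    return 2.0 * total


def symmetry_fit(n):
    """Fitted frequencies under the S model: average of each symmetric pair of cells."""
    R = len(n)
    return [[(n[i][j] + n[j][i]) / 2.0 for j in range(R)] for i in range(R)]


# Hypothetical 3 x 3 counts; not data from the paper.
n = [[50, 30, 10],
     [20, 60, 25],
     [5, 15, 40]]
g2_symmetry = g2(n, symmetry_fit(n))
```

The same `g2` function applies to any model once its fitted frequencies are available.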

The number of degrees of freedom for both the OEAS and PPAS models is \((R^{2}-R-4)/2\). Note that this is one less than that of the LDPS, OQS, and RQS models, and two less than that of the S model.

Applied economists often use the Akaike information criterion (AIC) as a quick method for choosing the best-fitting model among alternatives. The AIC is defined as

$$\begin{aligned} \text {AIC} = -2(\text {the maximum log likelihood}) + 2(\text {the number of parameters}), \end{aligned}$$

for each model; see Akaike [14]. This criterion recommends the model with the minimum AIC as the best-fitting model. When two models are compared, only the difference between their AICs is required. It is therefore possible to ignore the constant common to the AICs of all models, and use a modified AIC defined as

$$\begin{aligned} \text {AIC}^{+} = G^{2} - 2(\text {the number of degrees of freedom}). \end{aligned}$$

Thus, the model with the minimum AIC\(^{+}\) (i.e., the minimum AIC) is the best-fitting model among the applied models.
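The equivalence holds because \(G^{2}\) is twice the gap between the maximum log likelihoods of the saturated model and model M, and the number of parameters equals \(R^{2}-1\) minus the degrees of freedom, so AIC and \(\text {AIC}^{+}\) differ by a quantity that does not depend on M. A minimal sketch of the bookkeeping follows; the function names and the \(G^{2}\) values in the usage example are hypothetical, and the degrees of freedom are those stated above.

```python
def degrees_of_freedom(model, R):
    """Degrees of freedom for the models discussed, for an R x R table."""
    df_symmetry = R * (R - 1) // 2  # S model
    extra_parameters = {"S": 0, "LDPS": 1, "OQS": 1, "RQS": 1, "OEAS": 2, "PPAS": 2}
    return df_symmetry - extra_parameters[model]


def aic_plus(g2, df):
    """Modified AIC: G^2 minus twice the degrees of freedom."""
    return g2 - 2 * df


def best_model(results):
    """results maps a model name to (G^2, df); return the name with the smallest AIC+."""
    return min(results, key=lambda name: aic_plus(*results[name]))


# Hypothetical G^2 values for a 4 x 4 table; not results from the paper.
results = {name: (g2_value, degrees_of_freedom(name, 4))
           for name, g2_value in [("S", 40.0), ("LDPS", 12.0), ("OEAS", 3.0)]}
chosen = best_model(results)
```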

3 Application to Real-World Data

3.1 Application to Income Data

We apply the S, LDPS, and OEAS models to the data set in Table 1. The ordered scores \(s_1, s_2\), and \(s_3\) of the OEAS model are assigned as 35, 110, and 300 respectively, and \(s_4\) is assigned \(450+w\) (\(w\ge 0\)).

Table 3 shows the MLEs of the expected frequencies under the OEAS model. The goodness-of-fit results in Table 4 reveal that (1) the S and LDPS models fit poorly, (2) the OEAS model fits well, and (3) the OEAS model fits significantly better than the LDPS model.

Under the OEAS model, the MLEs of \(\delta\) and w are \(\hat{\delta }=0.986\) and \(\hat{w}=39.203\), respectively. Thus, the MLE of \(s_4\) is \(\hat{s}_4 = 489.203\). Since \(s_2-s_1 = 75\), \(s_3-s_2 = 190\), and \(\hat{s}_4-s_3 = 189.203\), the differences \(s_{k+1}-s_k\) for \(k=1,\dots ,R-1\) are not constant. Since \(\hat{\delta } < 1\), we infer that the incomes of male individuals tend to be higher than those of their female spouses.
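The reported estimates can be turned into the implied ratios \(p_{ij}/p_{ji}=\delta ^{s_j-s_i}\), which show how the asymmetry strengthens with the distance between categories. The sketch below uses the rounded estimates quoted above, so the ratios are approximate; the variable names are illustrative.

```python
delta_hat = 0.986  # rounded estimate reported in the text
w_hat = 39.203     # rounded estimate reported in the text

# Midpoint scores for the bounded categories and w0 + w for the open-ended one.
scores = [35.0, 110.0, 300.0, 450.0 + w_hat]
gaps = [high - low for low, high in zip(scores, scores[1:])]

# Implied ratio p_ij / p_ji for each pair i < j (categories numbered from 1).
ratios = {
    (i + 1, j + 1): delta_hat ** (scores[j] - scores[i])
    for i in range(len(scores))
    for j in range(i + 1, len(scores))
}
```

Every ratio is below 1, and the ratio is smallest for the pair of categories 1 and 4, whose scores are furthest apart.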

Table 3 The maximum likelihood estimates of expected frequencies under the open-ended category type asymmetry model applied to the data set in Table 1, shown in parentheses in the second line
Table 4 Values of the likelihood ratio chi-square statistic (\(G^2\)) and the modified Akaike information criterion (\(\text {AIC}^{+}\)) for each model applied to the data shown in Table 1

3.2 Application to Occupational Status Data

We apply the S, LDPS, RQS, and PPAS models to the data set in Table 2.

Table 5 shows the MLEs of expected frequencies under the PPAS model. The results in Table 6 reveal that (1) the S and LDPS models fit poorly, (2) the RQS and PPAS models fit well, and (3) the PPAS model is preferred over the RQS model when values of \(\text {AIC}^{+}\) are compared.

Under the PPAS model, the MLEs of \(\delta\) and a are \(\hat{\delta }=1.000013\) and \(\hat{a}=6.441\), respectively. As \(\hat{a} > 1\), the difference in scores between categories \(k+1\) and k increases as k increases. Indeed, the estimated score differences are \(\hat{s}_2-\hat{s}_1 = 85.891\), \(\hat{s}_3-\hat{s}_2 = 1096.712\), \(\hat{s}_4-\hat{s}_3 = 6366.527\), and \(\hat{s}_5-\hat{s}_4 = 24230.730\). As \(\hat{\delta } >1\), we infer that the occupational statuses of fathers tend to be higher than those of their sons.
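Although \(\hat{\delta }\) is very close to 1, the estimated score differences are large, so the implied ratios \(p_{ij}/p_{ji}=\delta ^{s_j-s_i}\) are not necessarily negligible. The sketch below recomputes the score differences and the ratios for adjacent categories from the rounded estimates quoted above; results therefore agree with the reported differences only approximately, and the variable names are illustrative.

```python
delta_hat = 1.000013  # rounded estimate reported in the text
a_hat = 6.441         # rounded estimate reported in the text
R = 5                 # the estimated score differences above run up to category 5

scores = [k ** a_hat for k in range(1, R + 1)]
gaps = [high - low for low, high in zip(scores, scores[1:])]

# Implied ratio p_{k,k+1} / p_{k+1,k} for adjacent categories.
adjacent_ratios = [delta_hat ** gap for gap in gaps]
```

Because the gaps grow with k, the implied ratios move further from 1 for the higher-numbered categories.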

Table 5 The maximum likelihood estimates of expected frequencies under the power parameter type asymmetry model applied to the data set in Table 2, shown in parentheses in the second line
Table 6 Values of the likelihood ratio chi-square statistic (\(G^2\)) and the modified Akaike information criterion (\(\text {AIC}^{+}\)) for each model applied to the data shown in Table 2

4 Conclusion

This study introduced three types of data sets, namely those that: (i) can be assigned known ordered scores for all categories, (ii) can be assigned known ordered scores for all except one category, and (iii) cannot be assigned known ordered scores for any category. This study proposed two original asymmetry models based on non-integer scores corresponding to data sets of types (ii) and (iii). The proposed models are simple asymmetry models, and are therefore easy to apply and interpret. The applications demonstrate that the proposed models are applicable to real-world data.