Abstract
Square contingency tables with ordinal classifications are used in many disciplines, including data science, engineering, and medical research. This study proposes two original asymmetry models based on non-integer scores for the analysis of square contingency tables. The ordinal quasi-symmetry model applies to data sets for which known ordered scores can be assigned to all categories. When equally spaced scores are assigned to the categories, the ordinal quasi-symmetry model is equivalent to the linear diagonals-parameter symmetry model. The ordinal quasi-symmetry model, however, is not applicable to data sets for which known ordered scores cannot be assigned to all categories. This study addresses this issue. The proposed models apply to data sets for which known ordered scores (i) can be assigned to all except one category or (ii) cannot be assigned to any category. For the real-world data analyzed, these two models provide a better fit than existing models.
1 Introduction
Consider \(R \times R\) square contingency tables with the same row and column ordinal classifications. Such tables are used in many disciplines, including data science, engineering, and medical research (see, for example, Agresti [1]).
Let \(s_{k}\) be the ordered score of category k for all \(k=1,\dots ,R\), where \(s_{1}<\cdots < s_{R}\). We consider three types of data sets, namely those for which known ordered scores (i) can be assigned to all categories, (ii) can be assigned to all except one category, and (iii) cannot be assigned to any category.
Typical examples of types (i) and (ii) are categorical variables defined as intervals of a continuous variable. When there is clear information about the category intervals, it is recommended that the interval midpoints (midpoint scores) be assigned as ordered scores instead of equally spaced scores (see Graubard and Korn [2] and Senn [3]).
For the analysis of data sets of type (i), the ordinal quasi-symmetry (OQS) model proposed by Agresti [1] is often used. The OQS model indicates an asymmetric structure of the cell probabilities with respect to the main-diagonal cells of the table. The OQS model assumes that row category k and column category k are assigned the same known score \(s_{k}\) for all \(k=1,\dots ,R\). This assumption is natural for square contingency tables with the same row and column ordinal classifications.
We consider the data set in Table 1, which presents the cross-classification of 1995 income data for married couples in Japan. Individual and spouse incomes are categorized as “less than 70”, “from 70 to less than 150”, “from 150 to less than 450”, and “450 or more”. We assign 35, 110, and 300 as the ordered scores for the first, second, and third categories, respectively. However, because the fourth category is unbounded above, we cannot assign it a known ordered score.
Gautam [4] suggests that the ordered scores for a data set with an open-ended category be assigned as follows: the scores \(s_{1}\) to \(s_{R-1}\) are midpoint scores, and the score \(s_{R}\) is unknown. The score \(s_{R}\) can therefore be expressed as \(s_{R}=w_{0}+w\), where \(w_{0}\) is the smallest value of the interval for the open-ended category and \(w\ge 0\) is unknown. Gautam [4] assumes that row category k and column category k are assigned the same score \(s_{k}\) for all \(k=1,\dots ,R\). As alternative ordered scores for a data set with an open-ended category, Aktas and Wu [5] introduce standardized z-scores for the row and column categories and propose a model that indicates an asymmetric structure depending on these z-scores. However, the standardized z-scores assign different scores to row category k and column category k for all \(k=1,\dots ,R\). Therefore, for the analysis of data sets of type (ii), we are interested in a model that indicates an asymmetric structure of the cell probabilities depending on the scores proposed by Gautam [4].
We further consider the data set in Table 2, obtained from Agresti [1], which presents the cross-classification of occupational status categories for father and son dyads in Britain. In this case, the scores \(s_{1}\) to \(s_{R}\) are treated as unknown because it is difficult to assign a known ordered score to any category.
For the analysis of data sets of type (iii), the ridit score type quasi-symmetry (RQS) model proposed by Iki et al. [6] is often used. The RQS model indicates an asymmetric structure of the cell probabilities depending on the ridit scores. The ridit scores are unknown ordered scores, defined as the average of the row and column marginal ridits. Thus, the RQS model also assumes that row category k and column category k are assigned the same unknown score \(s_{k}\) for all \(k=1,\dots ,R\).
Bagheban and Zayeri [7] propose a power parameter score. Although Bagheban and Zayeri [7] treat the power parameter score as a known ordered score, we modify it so that it can be treated as an unknown ordered score. Thus, for the analysis of data sets of type (iii), we are interested in proposing a model that indicates an asymmetric structure of the cell probabilities depending on the power parameter score.
This study proposes two original asymmetry models based on non-integer scores for square contingency tables with the same row and column ordinal classifications. One of the proposed models is useful for data sets with an open-ended category, namely type (ii). The other is applicable to data sets for which known ordered scores cannot be assigned to any category, namely type (iii).
The remainder of this paper is organized as follows. Sect. 2 proposes two original models based on non-integer scores. Sect. 3 demonstrates the utility of these proposed models as applied to the real-world data presented in Tables 1 and 2. We conclude the paper in Sect. 4.
2 Proposed Models
2.1 Existing Models Based on Non-integer Scores
Let \(p_{ij}\) denote the probability that an observation will fall in the (i, j)th cell of the table (\(i=1,\dots ,R;j=1,\dots ,R\)).
This study focuses on a model having the following formula:
\[ p_{ij} = \delta ^{s_{j}-s_{i}}\, p_{ji} \quad (i<j). \]
Note that \(s_{k}\) is the ordered score of category k for all \(k=1,\dots ,R\), where \(s_{1}<\cdots < s_{R}\). This model can represent various models depending on how \(s_k\) is set.
As a model based on integer scores (i.e., \(s_k = k\) for all \(k=1,\dots ,R\)), the linear diagonals-parameter symmetry (LDPS) model proposed by Agresti [8] is defined as
\[ p_{ij} = \delta ^{j-i}\, p_{ji} \quad (i<j). \]
We now introduce existing models based on non-integer scores (i.e., \(s_k \ne k\) for all \(k=1,\dots ,R\)). When known ordered scores \(s_{k}\) can be assigned to category k for all \(k=1,\dots ,R\), the ordinal quasi-symmetry (OQS) model proposed by Agresti [1] is defined as
\[ p_{ij} = \delta ^{s_{j}-s_{i}}\, p_{ji} \quad (i<j), \]
where \(\{s_{k}\}\) are known.
Note that the OQS model corresponds to data sets of type (i). The OQS model with equally spaced scores (\(s_k = s_1 + (k-1)d\) for \(k=1,\dots ,R\)) is equivalent to the LDPS model. The OQS model with \(\delta =1\) is identical to the symmetry (S) model proposed by Bowker [9]. Kateri and Agresti [10] considered the OQS model based on f-divergence; see also Saigusa et al. [11].
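The equivalence with the LDPS model can be checked in one line. The following worked step is a sketch that assumes the OQS model is written as \(p_{ij} = \delta ^{s_{j}-s_{i}} p_{ji}\) for \(i<j\) and that the spacing d is positive; the symbol \(\delta ^{*}\) is introduced here only for illustration. With \(s_k = s_1 + (k-1)d\) we have \(s_{j}-s_{i} = (j-i)d\), so
\[ p_{ij} = \delta ^{(j-i)d}\, p_{ji} = \left( \delta ^{d}\right) ^{j-i} p_{ji} = \left( \delta ^{*}\right) ^{j-i} p_{ji} \quad (i<j), \qquad \delta ^{*} = \delta ^{d}, \]
which has the LDPS form with asymmetry parameter \(\delta ^{*}\). Conversely, any LDPS parameter \(\delta ^{*}>0\) corresponds to \(\delta = (\delta ^{*})^{1/d}\), so the two models describe the same set of probability tables.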
Let X and Y denote the row and column variables, respectively. Let \(p_{k\cdot }=\sum ^{R}_{l=1}p_{kl}\) and \(p_{\cdot k}=\sum ^{R}_{l=1}p_{lk}\) denote the marginal probabilities, and let \(F^{X}_{k} = \sum ^{k}_{l=1}p_{l\cdot }\) and \(F^{Y}_{k} = \sum ^{k}_{l=1}p_{\cdot l}\) denote the marginal distribution functions for \(k=1,\dots ,R\), where \(F^{X}_{R}=1\) and \(F^{Y}_{R} = 1\). Then the marginal ridits are defined as
\[ r^{X}_{k} = \frac{F^{X}_{k-1}+F^{X}_{k}}{2}, \qquad r^{Y}_{k} = \frac{F^{Y}_{k-1}+F^{Y}_{k}}{2} \quad (k=1,\dots ,R), \]
with \(F^{X}_{0}=F^{Y}_{0}=0\); see Bross [12].
When known ordered scores \(s_{k}\) cannot be assigned to category k for all \(k=1,\dots ,R\), we adopt the RQS model proposed by Iki et al. [6]:
\[ p_{ij} = \delta ^{s_{j}-s_{i}}\, p_{ji} \quad (i<j), \]
where \(s_{k}=(r^{X}_{k} + r^{Y}_{k})/2\) for all \(k=1,\dots ,R\). Note that \(\{s_{k}\}\) in the RQS model are unspecified (i.e., the unknown ordered scores), and the RQS model corresponds to the data set of type (iii).
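As an illustration of how such scores behave, the following minimal sketch computes the average of the row and column marginal ridits. It assumes the usual ridit definition of Bross, \(r_{k}=(F_{k-1}+F_{k})/2\) with \(F_{0}=0\), and it plugs in sample proportions from a table of counts in place of the unknown probabilities; the function name and the example table are hypothetical and are not taken from the paper.

```python
def ridit_scores(table):
    """Average of row and column marginal ridits for a square table of counts.

    Sample proportions are used as plug-in estimates of the marginal
    probabilities (an assumption of this sketch).
    """
    R = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[k]) / n for k in range(R)]
    col_p = [sum(table[l][k] for l in range(R)) / n for k in range(R)]

    def ridits(p):
        out, cum = [], 0.0
        for pk in p:
            out.append(cum + pk / 2.0)  # equals (F_{k-1} + F_k) / 2
            cum += pk
        return out

    rx, ry = ridits(row_p), ridits(col_p)
    return [(x + y) / 2.0 for x, y in zip(rx, ry)]


# Hypothetical 3 x 3 table of counts, for illustration only.
example = [[10, 5, 2],
           [3, 12, 6],
           [1, 4, 7]]
print(ridit_scores(example))
```

Because every category in the example has a positive marginal count, the resulting scores are strictly increasing and lie between 0 and 1, so they can play the role of the ordered scores \(s_{1}<\cdots <s_{R}\).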
We highlight that a model corresponding to the data set of type (ii) does not exist in a similar form to the OQS and RQS models.
2.2 Proposed Models Based on Non-integer Scores
We propose two original models based on non-integer scores corresponding to data sets of types (ii) and (iii). First, we propose an original model corresponding to data sets of type (ii), defined as
\[ p_{ij} = \delta ^{s_{j}-s_{i}}\, p_{ji} \quad (i<j), \]
where \(s_1\) to \(s_{R-1}\) are known and \(s_R\) is unknown. That is, \(s_{1}\) to \(s_{R-1}\) are assigned known ordered scores (e.g., midpoint scores), and \(s_{R}\) is defined as \(s_{R}=w_{0}+w\), where \(w_{0}\) is the smallest value of the interval for the open-ended category and \(w\ (\ge 0)\) is unspecified. We refer to this model as the open-ended category type asymmetry (OEAS) model.
Second, we consider a model corresponding to data sets of type (iii). Bagheban and Zayeri [7] consider a power parameter score as follows:
\[ s_{k} = k^{a} \quad (k=1,\dots ,R), \]
where \(a > 0\). The power parameter score has the following properties:
(1) if \(a < 1\) then the difference in scores between category \(k+1\) and k decreases as k increases;
(2) if \(a > 1\) then the difference in scores between category \(k+1\) and k increases as k increases;
(3) if \(a = 1\) then the power parameter score is equivalent to the equally spaced score.
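The three properties can be checked numerically. The sketch below uses the power parameter score \(s_{k}=k^{a}\); the function names and the chosen values of a are illustrative assumptions.

```python
def power_scores(R, a):
    """Power parameter scores s_k = k**a for k = 1, ..., R."""
    return [k ** a for k in range(1, R + 1)]


def score_gaps(scores):
    """Differences s_{k+1} - s_k between adjacent categories."""
    return [hi - lo for lo, hi in zip(scores, scores[1:])]


# Illustrative values of a on each side of 1 (not taken from the paper).
for a in (0.5, 1.0, 2.0):
    print(a, score_gaps(power_scores(5, a)))
```

For \(a=0.5\) the printed gaps shrink, for \(a=1\) they are all equal to 1, and for \(a=2\) they grow, matching properties (1) to (3).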
Bagheban and Zayeri [7] treated a as known but did not discuss how to select the optimal value of a. For the OQS model, Ando [13] used the power parameter score as known ordered scores and selected the optimal value of a by a grid search. In contrast, we propose the following original model, which treats a as unknown:
\[ p_{ij} = \delta ^{s_{j}-s_{i}}\, p_{ji} \quad (i<j), \]
where \(s_{k}=k^a\) for \(k=1,\dots ,R\), and \(a \ (>0)\) is unknown. We refer to this model as the power parameter type asymmetry (PPAS) model. The PPAS model with \(a=1\) is identical to the LDPS model.
Under the OEAS and PPAS models, the following properties hold:
(1) if \(\delta >1\) then \(F^{X}_{k} > F^{Y}_{k}\) for all \(k=1,\dots ,R-1\) because \(p_{ij} > p_{ji}\) for all \(i<j\);
(2) if \(\delta <1\) then \(F^{X}_{k} < F^{Y}_{k}\) for all \(k=1,\dots ,R-1\) because \(p_{ij} < p_{ji}\) for all \(i<j\);
(3) if \(\delta = 1\) then the S model holds because \(p_{ij} = p_{ji}\) for all \(i<j\).
By properties (1) and (2), the parameter \(\delta\) in the OEAS or PPAS model indicates whether X is stochastically greater than Y or vice versa.
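These properties follow from \(F^{X}_{k}-F^{Y}_{k}=\sum _{i\le k}\sum _{j>k}(p_{ij}-p_{ji})\), and they can be verified numerically. The sketch below assumes the model is written as \(p_{ij}=\delta ^{s_{j}-s_{i}}p_{ji}\) for \(i<j\); it builds a probability table of that form from an arbitrary constant lower triangle and compares the marginal distribution functions. The function names, the scores, and the values of \(\delta\) are hypothetical.

```python
def asymmetric_table(scores, delta, base=1.0):
    """Probability table with p_ij = delta**(s_j - s_i) * p_ji for i < j.

    The diagonal and lower triangle are set to the constant `base`
    (an arbitrary choice of this sketch), then the table is normalised.
    """
    R = len(scores)
    p = [[base] * R for _ in range(R)]
    for i in range(R):
        for j in range(i + 1, R):
            p[i][j] = delta ** (scores[j] - scores[i]) * p[j][i]
    total = sum(sum(row) for row in p)
    return [[v / total for v in row] for row in p]


def marginal_cdfs(p):
    """Return (F^X_1..F^X_R, F^Y_1..F^Y_R) for a probability table p."""
    R = len(p)
    fx, fy, cx, cy = [], [], 0.0, 0.0
    for k in range(R):
        cx += sum(p[k])                        # row marginal p_{k.}
        cy += sum(p[l][k] for l in range(R))   # column marginal p_{.k}
        fx.append(cx)
        fy.append(cy)
    return fx, fy


scores = [1.0, 2.5, 4.0, 7.0]  # hypothetical ordered scores
for delta in (1.2, 0.8, 1.0):
    fx, fy = marginal_cdfs(asymmetric_table(scores, delta))
    print(delta, [round(x - y, 4) for x, y in zip(fx[:-1], fy[:-1])])
```

The printed differences \(F^{X}_{k}-F^{Y}_{k}\) are positive for \(\delta >1\), negative for \(\delta <1\), and zero for \(\delta =1\).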
2.3 Goodness-of-Fit Test
Let \(n_{ij}\) denote the observed frequency in the (i, j)th cell of the table (\(i, j = 1,\dots ,R\)). Assume that a multinomial distribution applies to the \(R\times R\) table. The maximum likelihood estimates of the expected frequencies under each model can be obtained by applying the Newton–Raphson method to the log-likelihood equations.
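To make the estimation step concrete, here is a minimal Newton–Raphson sketch for the simpler case in which all scores are known (as in the LDPS and OQS models). It is not the author's implementation. It assumes the model form \(p_{ij}=\delta ^{s_{j}-s_{i}}p_{ji}\) for \(i<j\) and uses the fact that, under multinomial sampling, the diagonal cells and the pair totals \(n_{ij}+n_{ji}\) are fitted exactly, so only \(\beta =\log \delta\) has to be solved for. The function name and the example table are hypothetical. For the OEAS and PPAS models the additional parameter (w or a) would also have to be estimated, which this sketch does not attempt.

```python
import math


def fit_fixed_scores(table, scores, tol=1e-10, max_iter=100):
    """MLE of delta and fitted frequencies for p_ij = delta**(s_j - s_i) * p_ji.

    Scores are treated as known. Assumes at least one off-diagonal pair
    has a positive total, otherwise the information is zero.
    """
    R = len(table)
    pairs = [(i, j) for i in range(R) for j in range(i + 1, R)]
    beta = 0.0  # beta = log(delta); start from the symmetry model

    def upper_share(i, j, b):
        # Conditional probability of cell (i, j) given the pair {(i, j), (j, i)}.
        return 1.0 / (1.0 + math.exp(-b * (scores[j] - scores[i])))

    for _ in range(max_iter):
        score_fn, info = 0.0, 0.0
        for i, j in pairs:
            d = scores[j] - scores[i]
            n_pair = table[i][j] + table[j][i]
            pi = upper_share(i, j, beta)
            score_fn += d * (table[i][j] - n_pair * pi)   # first derivative
            info += d * d * n_pair * pi * (1.0 - pi)      # minus second derivative
        step = score_fn / info
        beta += step
        if abs(step) < tol:
            break

    fitted = [[float(v) for v in row] for row in table]  # diagonal fitted exactly
    for i, j in pairs:
        n_pair = table[i][j] + table[j][i]
        pi = upper_share(i, j, beta)
        fitted[i][j] = n_pair * pi
        fitted[j][i] = n_pair * (1.0 - pi)
    return math.exp(beta), fitted


# Hypothetical 3 x 3 table of counts with integer scores (LDPS case).
counts = [[20, 12, 6],
          [8, 25, 10],
          [3, 7, 30]]
delta_hat, m_hat = fit_fixed_scores(counts, [1.0, 2.0, 3.0])
print(delta_hat)
```

With widely spaced scores such as interval midpoints, the exponential in this sketch may need rescaling of the scores or step-size control to remain numerically stable.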
Each model can be tested for goodness-of-fit using the likelihood ratio chi-square statistic (denoted by \(G^{2}\)) with the corresponding degrees of freedom. The test statistic \(G^{2}\) of model M is given as
\[ G^{2}(\text {M}) = 2\sum ^{R}_{i=1}\sum ^{R}_{j=1} n_{ij}\log \left( \frac{n_{ij}}{\hat{m}_{ij}}\right) , \]
where \(\hat{m}_{ij}\) is the maximum likelihood estimate (MLE) of the expected frequency \(m_{ij}\) under model M.
The number of degrees of freedom for both the OEAS and PPAS models is \((R^{2}-R-4)/2\). Note that this is one less than that of the LDPS, OQS, and RQS models, and two less than that of the S model.
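The statistic and the degrees of freedom are straightforward to compute once fitted frequencies are available. The sketch below assumes the standard likelihood ratio form \(G^{2}=2\sum _{i}\sum _{j}n_{ij}\log (n_{ij}/\hat{m}_{ij})\), with cells where \(n_{ij}=0\) contributing zero, and encodes the degrees of freedom stated above; the function names and the small tables are hypothetical.

```python
import math


def g_squared(observed, fitted):
    """Likelihood ratio chi-square statistic; cells with n_ij = 0 contribute 0."""
    total = 0.0
    for obs_row, fit_row in zip(observed, fitted):
        for n, m in zip(obs_row, fit_row):
            if n > 0:
                total += n * math.log(n / m)
    return 2.0 * total


def degrees_of_freedom(R, model):
    """Degrees of freedom for an R x R table under the named model."""
    # The S model has R(R-1)/2 df; each free asymmetry parameter removes one.
    free_params = {"S": 0, "LDPS": 1, "OQS": 1, "RQS": 1, "OEAS": 2, "PPAS": 2}
    return R * (R - 1) // 2 - free_params[model]


# Hypothetical observed and fitted frequencies, for illustration only.
obs = [[10, 6], [2, 12]]
fit = [[10.0, 4.0], [4.0, 12.0]]  # the symmetry-model fit for this table
print(g_squared(obs, fit), degrees_of_freedom(2, "S"))
```

For a four-category table such as the income data, `degrees_of_freedom(4, "OEAS")` gives \((16-4-4)/2=4\).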
Applied economists often use the Akaike information criterion (AIC) as a quick method for choosing the best-fitting model among alternatives. The AIC is defined as
\[ \text {AIC} = -2(\text {maximum log-likelihood}) + 2(\text {number of free parameters}) \]
for each model; see Akaike [14]. This criterion recommends the model with the minimum AIC as the best-fitting model. When two models are compared, only the difference between their AICs is required. It is therefore possible to ignore the constant common to the AICs of all models and use a modified AIC defined as
\[ \text {AIC}^{+} = G^{2} - 2(\text {number of degrees of freedom}). \]
Thus, the model with the minimum AIC\(^{+}\) (i.e., the minimum AIC) is the best-fitting model among the applied models.
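Model selection by this criterion reduces to simple arithmetic. The sketch below assumes the modified criterion is computed as \(G^{2}\) minus twice the degrees of freedom; the function names and the numerical values are hypothetical and are not results from the paper.

```python
def aic_plus(g2, df):
    """Modified AIC: G^2 minus twice the degrees of freedom."""
    return g2 - 2 * df


def best_model(results):
    """Return the model name with the smallest modified AIC.

    `results` maps a model name to a (G^2, df) pair.
    """
    return min(results, key=lambda name: aic_plus(*results[name]))


# Hypothetical goodness-of-fit results, for illustration only.
results = {"Model A": (12.0, 5), "Model B": (6.5, 4)}
for name, (g2, df) in results.items():
    print(name, aic_plus(g2, df))
print(best_model(results))
```

Here Model B would be preferred because its modified AIC (\(6.5-8=-1.5\)) is smaller than that of Model A (\(12-10=2\)).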
3 Application to Real-World Data
3.1 Application to Income Data
We apply the S, LDPS, and OEAS models to the data set in Table 1. The ordered scores \(s_1, s_2\), and \(s_3\) of the OEAS model are assigned as 35, 110, and 300, respectively, and \(s_4\) is assigned \(450+w\) (\(w\ge 0\)).
Table 3 shows the MLEs of the expected frequencies under the OEAS model. The goodness-of-fit results in Table 4 reveal that (1) the S and LDPS models fit poorly, (2) the OEAS model fits well, and (3) the OEAS model fits significantly better than the LDPS model.
Under the OEAS model, the MLEs of \(\delta\) and w are \(\hat{\delta }=0.986\) and \(\hat{w}=39.203\), respectively. Thus, the MLE of \(s_4\) is \(\hat{s}_4 = 489.203\). Since \(s_2-s_1 = 75\), \(s_3-s_2 = 190\), and \(\hat{s}_4-s_3 = 189.203\), the differences \(s_{k+1}-s_k\) for \(k=1,\dots ,R-1\) are not constant. Since \(\hat{\delta } < 1\), we infer that male individuals' incomes tend to be higher than those of their female spouses.
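The score arithmetic for the open-ended category can be reproduced as follows. This sketch assumes that the first income category has a lower bound of 0 (which yields the midpoint 35 used above) and takes the reported estimate \(\hat{w}=39.203\) as given; the function name is hypothetical.

```python
def open_ended_scores(bounds, w):
    """Midpoint scores for bounded intervals plus w0 + w for the open-ended one.

    `bounds` lists the ascending cut points; the last category is
    [bounds[-1], infinity), so its score is bounds[-1] + w.
    """
    midpoints = [(lo + hi) / 2.0 for lo, hi in zip(bounds, bounds[1:])]
    return midpoints + [bounds[-1] + w]


# Cut points of the income categories; the lower bound 0 is an assumption.
scores = open_ended_scores([0, 70, 150, 450], w=39.203)
gaps = [hi - lo for lo, hi in zip(scores, scores[1:])]
print(scores)
print(gaps)
```

The resulting gaps are 75, 190, and approximately 189.203, in line with the differences reported in the text.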
3.2 Application to Occupational Status Data
We apply the S, LDPS, RQS, and PPAS models to the data set in Table 2.
Table 5 shows the MLEs of the expected frequencies under the PPAS model. The results in Table 6 reveal that (1) the S and LDPS models fit poorly, (2) the RQS and PPAS models fit well, and (3) the PPAS model is preferred over the RQS model when the values of \(\text {AIC}^{+}\) are compared.
Under the PPAS model, the MLEs of \(\delta\) and a are \(\hat{\delta }=1.000013\) and \(\hat{a}=6.441\), respectively. As \(\hat{a} > 1\), the difference in scores between categories \(k+1\) and k increases as k increases. Indeed, \(\hat{s}_2-\hat{s}_1 = 85.891\), \(\hat{s}_3-\hat{s}_2 = 1096.712\), \(\hat{s}_4-\hat{s}_3 = 6366.527\), and \(\hat{s}_5-\hat{s}_4 = 24230.730\). As \(\hat{\delta } >1\), we infer that the occupational statuses of fathers tend to be higher than those of their sons.
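As a rough check on the reported score differences, they can be recomputed from \(s_{k}=k^{a}\) with the rounded estimate \(\hat{a}=6.441\). Because the estimate is rounded to three decimals, the recomputed values are only expected to agree with the reported ones approximately; the variable names are hypothetical.

```python
a_hat = 6.441  # rounded MLE of a reported in the text

# Five categories, since the text reports differences up to s_5 - s_4.
fitted_scores = [k ** a_hat for k in range(1, 6)]
fitted_gaps = [hi - lo for lo, hi in zip(fitted_scores, fitted_scores[1:])]

# Differences reported in the text, for comparison.
reported_gaps = [85.891, 1096.712, 6366.527, 24230.730]
for computed, reported in zip(fitted_gaps, reported_gaps):
    print(round(computed, 3), reported)
```

The recomputed gaps agree with the reported values to within a small relative error and increase rapidly with k, as expected for \(\hat{a}>1\).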
4 Conclusion
This study introduced three types of data sets, namely those for which known ordered scores (i) can be assigned to all categories, (ii) can be assigned to all except one category, and (iii) cannot be assigned to any category. This study proposed two original asymmetry models based on non-integer scores corresponding to data sets of types (ii) and (iii). The proposed models are simple asymmetry models and are therefore easy to apply and interpret. The findings demonstrate that the proposed models are applicable to real-world data.
Data Availability
The data set of Table 1 is available at https://nesstar.iss.u-tokyo.ac.jp/webview/.
Abbreviations
- OQS: Ordinal quasi-symmetry
- LDPS: Linear diagonals-parameter symmetry
- S: Symmetry
- RQS: Ridit score type quasi-symmetry
- OEAS: Open-ended category type asymmetry
- PPAS: Power parameter type asymmetry
- MLE: Maximum likelihood estimate
- AIC: Akaike information criterion
References
Agresti, A.: Categorical data analysis, 2nd edn. Wiley, New York (2002)
Graubard, B.I., Korn, E.L.: Choice of column scores for testing independence in ordered \(2 \times k\) contingency tables. Biometrics 43, 471–476 (1987)
Senn, S.: Drawbacks to noninteger scoring for ordered categorical data. Biometrics 63, 296–299 (2007)
Gautam, S.: Test for linear trend in \(2 \times k\) ordered tables with open-ended categories. Biometrics 53, 1163–1169 (1997)
Aktas, S., Wu, S.: Marginal homogeneity model for ordered categories with open ends in square contingency tables. REVSTAT-Statistical Journal 13(3), 233–243 (2015)
Iki, K., Tahata, K., Tomizawa, S.: Ridit score type quasi-symmetry and decomposition of symmetry for square contingency tables with ordered categories. Austrian Journal of Statistics 38, 183–192 (2009)
Bagheban, A.A., Zayeri, F.: A generalization of the uniform association model for assessing rater agreement in ordinal scales. Journal of Applied Statistics 37, 1265–1273 (2010)
Agresti, A.: A simple diagonals-parameter symmetry and quasi-symmetry model. Statistics & Probability Letters 1, 313–316 (1983)
Bowker, A.H.: A test for symmetry in contingency tables. Journal of the American Statistical Association 43, 572–574 (1948)
Kateri, M., Agresti, A.: A class of ordinal quasi-symmetry models for square contingency tables. Statistics & Probability Letters 77(6), 598–603 (2007)
Saigusa, Y., Tahata, K., Tomizawa, S.: Orthogonal decomposition of symmetry model using the ordinal quasi-symmetry model based on f-divergence for square contingency tables. Statistics & Probability Letters 101, 33–37 (2015)
Bross, I.D.J.: How to use ridit analysis. Biometrics 14, 18–38 (1958)
Ando, S.: Asymmetry models based on ordered score and separations of symmetry model for square contingency tables. Biometrical Letters 58(1), 27–39 (2021)
Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974)
Acknowledgements
The author would like to thank the anonymous reviewers and the editors for their comments and suggestions to improve this paper. The data set in Table 1 used in this analysis, namely Social Stratification and Mobility (SSM95A), was provided by the Social Science Japan Data Archive, Center for Social Research and Data Archives, Institute of Social Science, The University of Tokyo.
Funding
The author solely funded this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical Approval
Not applicable.
Consent for Publication
Not applicable.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ando, S. Asymmetry Models Based on Non-integer Scores for Square Contingency Tables. J Stat Theory Appl 21, 21–30 (2022). https://doi.org/10.1007/s44199-022-00039-z
DOI: https://doi.org/10.1007/s44199-022-00039-z