Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm

Abstract

We design a probability distribution for ordinal data by modeling the process generating data, which is assumed to rely only on order comparisons between categories. Contrariwise, most competitors often either forget the order information or add a non-existent distance information. The data generating process is assumed, from optimality arguments, to be a stochastic binary search algorithm in a sorted table. The resulting distribution is natively governed by two meaningful parameters (position and precision) and has very appealing properties: decrease around the mode, shape tuning from uniformity to a Dirac, identifiability. Moreover, it is easily estimated by an EM algorithm since the path in the stochastic binary search algorithm can be considered as missing values. Using then the classical latent class assumption, the previous univariate ordinal model is straightforwardly extended to model-based clustering for multivariate ordinal data. Parameters of this mixture model are estimated by an AECM algorithm. Both simulated and real data sets illustrate the great potential of this model by its ability to parsimoniously identify particularly relevant clusters which were unsuspected by some traditional competitors.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Notes

  1. 1.

    Ranking data occur when some judges are asking to sort m objects, quoted by \(1,\ldots ,m\), according to a preference order. Thus, a ranking data is a permutation of the whole data set \(\{1,\ldots ,m\}\). Ordinal data occur when some judges are asking to give a note among the set of ordered notes \(1<\cdots <m\). Thus, an ordinal data is a unique element of \(\{1,\ldots ,m\}\).

  2. 2.

    We will do the slight abuse of notation \(\hbox {p}(x;\mu ,\pi )=\hbox {p}\Big (\{x\};\mu ,\pi \Big )\).

  3. 3.

    For avoiding the label cluster dependency, we keep the best distance over all label permutations.

  4. 4.

    Note that a two component multinomial model is not performed in this univariate situation since it is not identifiable.

  5. 5.

    http://www.aeres-evaluation.com/

  6. 6.

    All partitions are better displayed in the first and the third axes than in the first and second axes.

  7. 7.

    We conjecture that all propositions hold for any m value. In particular, we have proved the simplest of them (Propositions (5.1), (5.2), (5.4)) in this general situation (these general proofs are not given in this paper).

References

  1. Agresti, A.: Analysis of Ordinal Categorical Data. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (2010)

    Google Scholar 

  2. Allman, E.S., Matias, C., Rhodes, J.A.: Identifiability of parameters in latent structure models with many observed variables. Ann. Stat. 37(6A), 3099–3132 (2009)

    MathSciNet  Article  MATH  Google Scholar 

  3. Bartholomew, D.J., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, Hoboken (2011)

    Google Scholar 

  4. Biernacki, C., Jacques, J.: A generative model for rank data based on sorting algorithm. Comput. Stat. Data Anal. 58, 162–176 (2013)

    MathSciNet  Article  Google Scholar 

  5. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. J. Pattern Recogn. Soc. 28, 781–793 (1995)

    Article  Google Scholar 

  6. D’Elia, A., Piccolo, D.: A mixture model for preferences data analysis. Comput. Stat. Data Anal. 49(3), 917–934 (2005)

    MathSciNet  Article  MATH  Google Scholar 

  7. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977). With discussion

    MathSciNet  MATH  Google Scholar 

  8. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Soc. 97, 611–612 (2002)

    MathSciNet  Article  MATH  Google Scholar 

  9. Giordan, M., Diana, G.: A clustering method for categorical ordinal data. Commun. Stat. Theory Methods 40, 1315–1334 (2011)

    MathSciNet  Article  MATH  Google Scholar 

  10. Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications. J. Am. Stat. Assoc. 49, 732–764 (1954)

    MATH  Google Scholar 

  11. Goodman, L.A.: Explanatory latent structure models using both identifiable and unidentifiable models. Biometrika 61, 215–231 (1974)

    MathSciNet  Article  MATH  Google Scholar 

  12. Gouget, C.: Utilisation des modèles de mélange pour la classification automatique de données ordinales. Ph.D. thesis, Université de Technologie de Compiègne (2006)

  13. Hartigan, J.A., Wong, M.A.: Algorithm as 1326: a k-means clustering algorithm. Appl. Stat. 28, 100–108 (1978)

    Article  MATH  Google Scholar 

  14. Iannario, M., Piccolo, D.: A program in r for cub models inference. Technical report, Universita di Napoli Federico II, Version 2.0, www.dipstat.unina.it (2009)

  15. Jacques, J., Biernacki, C.: Model-based clustering for multivariate partial ranking data. J. Stat. Plan. Inference 149, 201–217 (2014)

    MathSciNet  Article  MATH  Google Scholar 

  16. Jacques, J., Grimonprez, Q., Biernacki, C.: Rankcluster: an r package for clustering multivariate partial rankings. R J. (in press) (2014)

  17. Jacques, J., Preda, C.: Functional data clustering: a survey. Adv. Data Anal. Classif. 8(3), 231–255 (2014)

    MathSciNet  Article  Google Scholar 

  18. Jollois, F-X., Nadif, M.: Classification de données ordinales : modèles et algorithmes. In: Proceedings of the 41th Conference of the French Statistical Society, Bordeaux, France (2011)

  19. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, Hoboken (1990)

    Google Scholar 

  20. Kendall, M.G.: The treatment of ties in ranking problems. Biometrika 33, 239–251 (1945)

    MathSciNet  Article  MATH  Google Scholar 

  21. Knuth, D.E.: Sorting and Searching: The Art of Computer Programming, vol. 3, 2nd edn. Addison-Wesley Professional, Boston (1998)

    Google Scholar 

  22. Lewis, S.J.G., Foltynie, T., Blackwell, A.D., Robbins, T.W., Owen, A.M., Barker, R.A.: Heterogeneity of parkinson’s disease in the early clinical stages using a data driven approach. J. Neurol. Neurosurg. Psychiatry 76, 343–348 (2003)

    Article  Google Scholar 

  23. Lipsitz, S.R., Kim, K., Zhao, L.: Analysis of repeated categorical data using generalized estimating equations. Stat. Med. 13, 1149–1163 (1994)

    Article  Google Scholar 

  24. Manisera, M., Zuccolotto, P.: Modeling rating data with nonlinear CUB models. Comput. Stat. Data Anal. 78, 100–118 (2014)

  25. Marbac, M., Biernacki, C., Vandewalle, V.: Model-based clustering of Gaussian copulas for mixed data. ArXiv e-prints, May (2014)

  26. Matechou, E., Liu, I., Pledger, S., Arnold, R.: Biclustering models for ordinal data. In: NZSA 2011 Conference, Auckland, New-Zealand (2011)

  27. McCullagh, P.: Regression models for ordinal data. J. R. Stat. Soc. Ser. B 42, 109–142 (1980)

    MathSciNet  MATH  Google Scholar 

  28. McLachlan, G., Peel, D.: Applied Probability and Statistics. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (2000)

    Google Scholar 

  29. McParland, D., Gormley, C.: Algorithms from and for Nature and Life: Studies in Classification, Data Analysis, and Knowledge Organization, chapter Clustering Ordinal Data via Latent Variable Models. Springer, New York (2013)

    Google Scholar 

  30. Melnykov, V., Maitra, R.: Finite mixture models and model-based clustering. Stat. Surv. 4, 80–116 (2010)

    MathSciNet  Article  MATH  Google Scholar 

  31. Nelsen, R.B.: An Introduction to Copulas. Springer, New York (1999)

    Google Scholar 

  32. Podani, J.: Braun-blanquet’s legacy and data analysis in vegetation science. J. Veg. Sci. 17, 113–117 (2006)

    Article  Google Scholar 

  33. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

    MathSciNet  Article  MATH  Google Scholar 

  34. Somers, R.H.: A new assymmetric measure of association for ordinal variables. Am. Sociol. Rev. 27, 799–811 (1962)

    Article  Google Scholar 

  35. Stevens, S.S.: On the theory of scales of measurement. Science 103(2684), 677–680 (1946)

    Article  MATH  Google Scholar 

  36. Vermunt, J.K.: A general class of nonparametric models for ordinal categorical data. Sociol. Methodol. 29, 187–223 (1999)

    Article  Google Scholar 

  37. Vermunt, J.K., Magidson, J.: Technical Guide for Latent GOLD 4.0: Basic and Advanced. Statistical Innovations Inc., Belmont (2005)

    Google Scholar 

  38. Wolfe, J.H.: Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5, 329–350 (1970)

    Article  Google Scholar 

  39. Xu, R., Wunsch, D.C.: Clustering. Wiley, Hoboken (2009)

    Google Scholar 

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Julien Jacques.

Appendices

Appendix 1: The BOS properties: statements and proofs

From Equations (7), (8) and (10) in Sect. 2.2, it is easily seen that the BOS model has a polynomial expression related to \(\pi \) of degree less than or equal to \(m-1\):

$$\begin{aligned} \hbox {p}(x ; \mu ,\pi ) = \sum _{j=0}^{m-1} a_j(m,\mu ,x) \pi ^{j}. \end{aligned}$$
(18)

From this statement, we have developed a formal Matlab code computing the exact expression of each coefficient \(a_j(m,\mu ,x)\) for any x, \(\mu \), and m values. For instance, for \(m=5\), \(\mu =2\), and \(x=4\), we obtain the following exact expression:

$$\begin{aligned} \hbox {p}(4 ; 2,\pi )= & {} a_0(5,2,4) + a_1(5,2,4)\pi + a_2(5,2,4)\pi ^{2}\\&+\, a_3(5,2,4)\pi ^{3} + a_4(5,2,4)\pi ^4\\= & {} \frac{1}{5} - \frac{33}{200}\pi - \frac{457}{7200}\pi ^{2} + \frac{2}{75}\pi ^{3} + \frac{13}{7200}\pi ^4. \end{aligned}$$

This formal calculus tool is used in the proofs of all the following properties. Its advantage is to provide very straightforward proofs. Its drawback is to provide proofs for only some selected values of m. However, we have chosen \(m\in \{1,\ldots ,8\}\) since it answers to most practical situationsFootnote 7. For instance in the reference book Agresti (2010) all ordinal data have less than 8 categories.

Proposition 5.1

(\(\pi =0\): Uniformity.) If \(\pi =0\), then \(\forall (\mu ,x) \in \{1,\ldots ,m\}^2\), \(\hbox {p}(x;\mu ,\pi )=m^{-1}\).

Proof

The formal calculus leads to \(a_0(m,\mu ,x)=m^{-1}\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows by setting \(\pi =0\) in (18).

Proposition 5.2

(\(\pi =1\): Dirac in \(\mu \).) If \(\pi =1\), then \(\forall (x,\mu ) \in \{1,\ldots ,m\}^2\), \(\hbox {p}(x;\mu ,\pi )={\mathbb I}(x=\mu )\).

Proof

The formal calculus leads to \(\sum _{j=0}^{m-1}a_j(m,\mu ,\mu )=1\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows by setting \(\pi =1\) in (18).

Proposition 5.3

(\(\mu \): mode if \(\pi >0\).) If \(\pi >0\), then \(\forall (\mu ,x) \in \{1,\ldots ,m\}^2\) such that \(\mu \ne x\), \(\hbox {p}(\mu ;\mu ,\pi ) > \hbox {p}(x;\mu ,\pi )\).

Proof

For \(\pi =1\), Proposition 5.2 already established that \(\mu \) is the mode. We consider now \(0<\pi <1\). From (18), \(\hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi )\) is a polynomial of degree less than or equal to \(m-1\). Moreover, from Proposition 5.1, \(\pi =0\) is a root of \(\hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi )\), thus we can factorize it by \(\pi \)

$$\begin{aligned} \hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi ) = \pi \left( \sum _{j=0}^{m-2} b_j(m,\mu ,x) \pi ^j \right) .\nonumber \\ \end{aligned}$$
(19)

Since \(0<\pi < 1\) implies that \(0<\pi ^j< 1\) for any \(j>1\), the following lower bound is directly obtained

$$\begin{aligned}&\hbox {p}(\mu ;\mu ,\pi ) - \hbox {p}(x;\mu ,\pi ) \nonumber \\&\quad > \pi \left( b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x) \right) , \end{aligned}$$
(20)

where \(b_j^-(m,\mu ,x)=\min (0,b_j(m,\mu ,x))\). All \(b_j(m,\mu ,x)\) are obtained from a formal polynomial Euclidian division for all \(m\in \{1,\ldots ,8\}\), allowing to show by formal calculus that \( b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x) > 0\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows.

Proposition 5.4

(\(\mu \): absolute growing of its probability with \(\pi \). ) \(\forall \mu \in \{1,\ldots ,m\}\), \(\hbox {p}(\mu ;\mu ,\pi )\) is an increasing function of \(\pi \).

Proof

The formal calculus leads to \(a_j(m,\mu ,\mu )\ge 0\) for all \(m\in \{1,\ldots ,8\}\) and for all \(j\in \{1,\ldots ,m-1\}\). Conclusion follows directly from (18).

Proposition 5.5

( \(\mu \): relative growing of its probability with \(\pi \).) \(\forall (\mu ,x) \in \{1,\ldots ,m\}^2\) such that \(\mu \ne x\), \(\hbox {p}(\mu ;\mu ,\pi )- \hbox {p}(x;\mu ,\pi )\) is an increasing function of \(\pi \).

Proof

The key is to prove the differential (in \(\pi \)) expression \(\hbox {p}'(\mu ;\mu ,\pi )- \hbox {p}'(x;\mu ,\pi )>0\). From (18), \(\hbox {p}'(\mu ;\mu ,\pi ) - \hbox {p}'(x;\mu ,\pi )\) is a polynomial of degree less than or equal to \(m-2\)

$$\begin{aligned} \hbox {p}'(\mu ;\mu ,\pi ) - \hbox {p}'(x;\mu ,\pi ) = \sum _{j=0}^{m-2} b_j(m,\mu ,x) \pi ^j. \end{aligned}$$
(21)

Since \(0<\pi < 1\) implies that \(0<\pi ^j< 1\) for any \(j>1\), the following lower bound is directly obtained

$$\begin{aligned} \hbox {p}'(\mu ;\mu ,\pi ) - \hbox {p}'(x;\mu ,\pi ) > b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x),\nonumber \\ \end{aligned}$$
(22)

where \(b_j^-(m,\mu ,x)=\min (0,b_j(m,\mu ,x))\). All \(b_j(m,\mu ,x)\) are obtained from a formal derivation of a polynomial for all \(m\in \{1,\ldots ,8\}\), allowing to show by formal calculus again that \( b_0(m,\mu ,x) + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x) > 0\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows.

Proposition 5.6

(Decreasing around \(\mu \) if \(0<\pi <1\).) \(\forall (x,x') \in \{1,\ldots ,m\}^2\) such that \(x'<x<\mu \) or \(\mu >x>x'\), \(\hbox {p}(x;\mu ,\pi )> \hbox {p}(x';\mu ,\pi )\).

Proof

From (18), \(\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi )\) is a polynomial of degree less than or equal to \(m-1\). Moreover, from Propositions 5.1 and 5.2 respectively, \(\pi =0\) and \(\pi =1\) are roots of \(\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi )\), thus we can factorize it by \(\pi (1-\pi )\)

$$\begin{aligned}&\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi ) \nonumber \\&\quad = \pi (1-\pi )\left( \sum _{j=0}^{m-3} b_j(m,\mu ,x,x') \pi ^j \right) . \end{aligned}$$
(23)

Since \(0<\pi < 1\) implies that \(0<\pi ^j< 1\) for any \(j>1\), the following lower bound is directly obtained

$$\begin{aligned}&\hbox {p}(x;\mu ,\pi ) - \hbox {p}(x';\mu ,\pi ) \nonumber \\&\quad > \pi (1-\pi )\left( b_0(m,\mu ,x,x') + \sum _{j=1}^{m-3} b_j^-(m,\mu ,x,x') \right) ,\nonumber \\ \end{aligned}$$
(24)

where \(b_j^-(m,\mu ,x,x')=\min (0,b_j(m,\mu ,x,x'))\). All \(b_j\) \((m,\mu ,x,x')\) are obtained from a formal polynomial Euclidian division for all \(m\in \{1,\ldots ,8\}\), allowing to show by formal calculus again that \( b_0(m,\mu ,x,x') + \sum _{j=1}^{m-2} b_j^-(m,\mu ,x,\) \(x') > 0\) for all \(m\in \{1,\ldots ,8\}\). Conclusion follows.

Proposition 5.7

(Identifiability) If \(0<\pi \le 1\), identifiability holds.

Proof

The identifiability problem could concern \(\mu \) and/or \(\pi \).

  • First, there exists no couple \((\mu ,\mu ')\in \{1,\ldots ,m\}^2\) with \(\mu \ne \mu '\) such that \(\hbox {p}(x;\mu ,\pi )=\hbox {p}(x;\mu ',\pi ')\) for any \(x\in \{1,\ldots ,m\}\) and any \((\pi ,\pi ')\in (0,1]^2\). Indeed, otherwise both \(\mu \) and \(\mu '\) should be modes of both distributions, which is impossible by unicity of the mode (Propositions 5.3 and 5.6).

  • Second, there exists no couple \((\pi ,\pi ')\in (0,1]^2\) with \(\pi \ne \pi '\) such that \(\hbox {p}(x;\mu ,\pi )=\hbox {p}(x;\mu ,\pi ')\) for any \(x\in \{1,\ldots ,m\}\) and any \(\mu \in \{1,\ldots ,m\}\) since \(\hbox {p}(x;\mu ,\pi )\) is a polynomial of order at least one. This last statement comes from the fact that \(\hbox {p}(x;\mu ,\pi )\) varies from 1 / m to 0 or 1 when \(\pi \) varies (respectively when \(x\ne \mu \) and \(x=\mu \)) from Propositions 5.1 and 5.2, thus depends on the value of \(\pi \).

Appendix 2: AERES data

See Table 11.

Table 11 Bachelor’s degrees evaluations (March 2011) of 23 French universities on the ordinal scale A\(+\), A, B and C. The four evaluation criteria are described in Sect. 4.2

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Biernacki, C., Jacques, J. Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm. Stat Comput 26, 929–943 (2016). https://doi.org/10.1007/s11222-015-9585-2

Download citation

Keywords

  • Ordinal data
  • Binary search algorithm
  • Latent variables
  • AECM algorithm