Statistics and Computing, Volume 29, Issue 1, pp 139–160

Robust clustering tools based on optimal transportation

  • E. del Barrio
  • J. A. Cuesta-Albertos
  • C. Matrán
  • A. Mayo-Íscar
Article

Abstract

A robust clustering method for probabilities in Wasserstein space is introduced. This new ‘trimmed k-barycenters’ approach relies on recent results on barycenters in Wasserstein space that make feasible the intensive computation required by clustering algorithms. The possibility of trimming the most discrepant distributions yields a gain in stability and robustness, which is highly convenient in this setting. As a notable application, we consider a parallelized clustering setup in which each of m units processes a portion of the data and produces a clustering report, encoded as k probabilities. We prove that the trimmed k-barycenter of the \(m\times k\) reports produces a consistent aggregation, which we regard as the result of a ‘wide consensus’. We also prove that a weighted version of trimmed k-means algorithms based on k-barycenters in Wasserstein space keeps the descending character of the concentration step, guaranteeing convergence to local minima. We illustrate the methodology with simulated and real data examples. These include clustering populations by age distributions and analysis of cytometric data.
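To make the idea of trimmed k-barycenters concrete, the following is a minimal sketch restricted to distributions on the real line. It is not the authors' algorithm: it is unweighted, and it relies on the one-dimensional fact that the squared 2-Wasserstein distance is the mean squared gap between quantile functions, so that a barycenter is the pointwise average of quantile functions. All function names, the `init` parameter, the grid size, and the simulated data are assumptions made for illustration.

```python
import random

def quantile_vector(sample, grid_size=50):
    """Empirical quantile function evaluated on a midpoint grid in (0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    return [xs[min(n - 1, int((j + 0.5) * n / grid_size))] for j in range(grid_size)]

def w2_squared(q1, q2):
    """Squared 2-Wasserstein distance in 1D, approximated on the quantile grid."""
    return sum((a - b) ** 2 for a, b in zip(q1, q2)) / len(q1)

def barycenter(qs):
    """1D Wasserstein barycenter: pointwise average of the quantile vectors."""
    return [sum(col) / len(qs) for col in zip(*qs)]

def trimmed_k_barycenters(qs, k, alpha, init=None, n_iter=20):
    """Alternate assignment, trimming and barycenter updates (illustrative only).

    init: hypothetical parameter, indices of the distributions used as
    starting centers (defaults to the first k).
    Returns the centers and one label per distribution; -1 marks a trimmed one.
    """
    init = list(range(k)) if init is None else init
    centers = [qs[i] for i in init]
    n_keep = len(qs) - int(alpha * len(qs))
    for _ in range(n_iter):
        # Assignment step: distance from each distribution to its nearest center.
        assign = []
        for i, q in enumerate(qs):
            d, c = min((w2_squared(q, ctr), c) for c, ctr in enumerate(centers))
            assign.append((d, i, c))
        # Trimming step: discard the fraction alpha with the largest distances.
        kept = sorted(assign)[:n_keep]
        # Update step: each center becomes the barycenter of its kept members.
        new_centers = []
        for c in range(k):
            members = [qs[i] for _, i, cc in kept if cc == c]
            new_centers.append(barycenter(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = [-1] * len(qs)
    for _, i, c in kept:
        labels[i] = c
    return centers, labels

# Simulated example: five samples near 0, five near 5, and one far outlier.
rng = random.Random(0)

def draw(mu):
    return [rng.gauss(mu, 1.0) for _ in range(200)]

samples = []
for _ in range(5):
    samples.append(draw(0.0))
    samples.append(draw(5.0))
samples.append(draw(50.0))

qs = [quantile_vector(s) for s in samples]
centers, labels = trimmed_k_barycenters(qs, k=2, alpha=0.1)
print(labels)
```

With a trimming level of 0.1 and eleven distributions, exactly one distribution is discarded in each iteration. In this setup the discarded one is the sample centered at 50, so it does not pull either barycenter away from the two main groups. Outside the real line the barycenter has no such closed form, and an iterative scheme would be needed.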

Keywords

Cluster prototypes; k-barycenter; Trimmed barycenter; Robust aggregation; Wasserstein distance; Monge–Kantorovich problem; Transport maps; Trimmed distributions; Parallelized inference; Bragging; Subragging; Trimmed k-means algorithm

Mathematics Subject Classification

Primary 62H30, 62G35; Secondary 62G20, 62P99

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Departamento de Estadística e Investigación Operativa and IMUVA, Universidad de Valladolid, Valladolid, Spain
  2. Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria, Santander, Spain