Noise-free latent block model for high dimensional data

Abstract

Co-clustering is a powerful and efficient approach to unsupervised learning because it partitions a dataset along both its observations and its variables. However, in a high-dimensional context, co-clustering methods may fail to provide meaningful results due to the presence of noisy and/or irrelevant features. In this paper, we tackle this issue by proposing a novel co-clustering model that assumes the existence of a noise cluster containing all irrelevant features. A variational expectation-maximization-based algorithm is derived for this task, in which automatic variable selection and the joint clustering of objects and variables are achieved within a Bayesian framework. Experimental results on synthetic datasets show the efficiency of our model in the context of high-dimensional noisy data. Finally, we highlight the interest of the approach on two real datasets collected to study genetic diversity across the world.
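To make the idea of a noise cluster concrete, the following is a minimal sketch of how a synthetic matrix with this structure could be generated. It is not the authors' code or experimental protocol: the binary (Bernoulli) data type, the number of clusters, the block parameters in `alpha`, the noise probability of 0.5, and the function name `simulate_noisy_blocks` are all assumptions made purely for illustration.

```python
import numpy as np


def simulate_noisy_blocks(n_rows=60, n_cols=40, n_noise_cols=20, seed=0):
    """Toy binary matrix with block structure plus a column noise cluster.

    Informative columns follow a 2 x 2 block pattern (row cluster x column
    cluster). Noise columns share one distribution for every row, so they
    carry no information about the row partition.
    """
    rng = np.random.default_rng(seed)

    # Latent partitions of the rows and of the informative columns.
    row_labels = rng.integers(0, 2, size=n_rows)
    col_labels = rng.integers(0, 2, size=n_cols)

    # Assumed Bernoulli parameter of each (row cluster, column cluster) block.
    alpha = np.array([[0.9, 0.1],
                      [0.1, 0.9]])

    # Per-cell success probability, looked up from the block parameters.
    probs = alpha[row_labels][:, col_labels]
    informative = rng.random((n_rows, n_cols)) < probs

    # Noise columns: same assumed probability regardless of the row cluster.
    noise = rng.random((n_rows, n_noise_cols)) < 0.5

    data = np.hstack([informative, noise]).astype(int)
    # Label 2 marks the noise cluster in the full column partition.
    full_col_labels = np.concatenate([col_labels, np.full(n_noise_cols, 2)])
    return data, row_labels, full_col_labels


data, row_labels, col_labels = simulate_noisy_blocks()
print(data.shape, np.bincount(col_labels))
```

In this sketch, a co-clustering method that models a noise cluster would ideally assign the last `n_noise_cols` columns to that cluster and recover the row partition from the informative columns only.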

Notes

  1. https://archive.ics.uci.edu/ml/datasets.html.

  2. The datasets can be found here: https://github.com/laclauc/NFLB and the code will be available upon publication.

  3. https://rosenberglab.stanford.edu/data/rosenbergEtAl2002/diversitydata.stru.

  4. https://rosenberglab.stanford.edu/nativedata.html.

Author information

Corresponding author

Correspondence to Charlotte Laclau.

Additional information

Responsible editor: Jesse Davis, Elisa Fromont, Derek Greene, Björn Bringmann.

About this article

Cite this article

Laclau, C., Brault, V. Noise-free latent block model for high dimensional data. Data Min Knowl Disc 33, 446–473 (2019). https://doi.org/10.1007/s10618-018-0597-3

Keywords

  • Latent block model
  • Feature selection
  • Clustering
  • Biclustering
  • High dimensional data