Data Mining and Knowledge Discovery, Volume 26, Issue 2, pp 217–254

Parameter-less co-clustering for star-structured heterogeneous data

  • Dino Ienco
  • Céline Robardet
  • Ruggero G. Pensa
  • Rosa Meo

Abstract

The availability of data represented with multiple features coming from heterogeneous domains is increasingly common in real-world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for each of the feature domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less: it requires neither the number of row clusters nor the number of column clusters for the given feature spaces. Our approach optimizes the Goodman–Kruskal τ, a measure of cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend τ to evaluate co-clustering solutions and, in particular, we apply it in a higher-dimensional setting. We propose the algorithm CoStar, which optimizes τ by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while remaining computationally efficient.
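To make the objective concrete, the sketch below computes the classical two-way Goodman–Kruskal τ from a contingency table of co-occurrence counts. It illustrates only the base association measure, not the authors' multi-way extension or the CoStar local search; the function name and the toy table are illustrative assumptions.

```python
import numpy as np

def goodman_kruskal_tau(counts):
    """Goodman-Kruskal tau: proportional reduction in the error of predicting
    the column variable once the row variable is known (value in [0, 1])."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                           # joint proportions p_ij
    row = p.sum(axis=1)                       # row marginals p_i.
    col = p.sum(axis=0)                       # column marginals p_.j
    err_marginal = 1.0 - np.sum(col ** 2)     # prediction error ignoring rows
    nonzero = row > 0
    err_conditional = 1.0 - np.sum((p[nonzero] ** 2).sum(axis=1) / row[nonzero])
    return (err_marginal - err_conditional) / err_marginal

# Toy example: rows could be object clusters, columns feature clusters.
table = [[40, 5], [3, 52]]
print(round(goodman_kruskal_tau(table), 3))   # about 0.703: strong association
```

In the star-structured setting, the paper evaluates such τ values jointly over the object partition and one partition per feature space; the two-way function above captures only the underlying measure between a single pair of partitions.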

Keywords

Co-clustering; Star-structured data; Multi-view data

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Dino Ienco (1, 3), Email author
  • Céline Robardet (2)
  • Ruggero G. Pensa (1)
  • Rosa Meo (1)

  1. Department of Computer Science, University of Torino, Torino, Italy
  2. Université de Lyon, CNRS, INSA-Lyon, LIRIS UMR5205, Villeurbanne, France
  3. IRSTEA Montpellier, UMR TETIS, Montpellier, France
