Skip to main content
Log in

Hierarchical co-clustering: off-line and incremental approaches

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Clustering data is challenging especially for two reasons. The dimensionality of the data is often very high which makes the cluster interpretation hard. Moreover, with high-dimensional data the classic metrics fail in identifying the real similarities between objects. The second challenge is the evolving nature of the observed phenomena which makes the datasets accumulating over time. In this paper we show how we propose to solve these problems. To tackle the high-dimensionality problem, we propose to apply a co-clustering approach on the dataset that stores the occurrence of features in the observed objects. Co-clustering computes a partition of objects and a partition of features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters in a hierarchical fashion, and it consists of two hierarchies: one on the objects and one on the features. The two hierarchies are coupled because the clusters at a certain level in one hierarchy are coupled with the clusters at the same level of the other hierarchy and form the co-clusters. Each cluster of one of the two hierarchies thus provides insights on the clusters of the other hierarchy. Another novelty of the proposed solution is that the number of clusters is possibly unlimited. Nevertheless, the produced hierarchies are still compact and therefore more readable because our method allows multiple splits of a cluster at the lower level. As regards the second challenge, the accumulating nature of the data makes the datasets intractably huge over time. In this case, an incremental solution relieves the issue because it partitions the problem. In this paper we introduce an incremental version of our algorithm of hierarchical co-clustering. It starts from an intermediate solution computed on the previous version of the data and it updates the co-clustering results considering only the added block of data. This solution has the merit of speeding up the computation with respect to the original approach that would recompute the result on the overall dataset. In addition, the incremental algorithm guarantees approximately the same answer than the original version, but it saves much computational load. We validate the incremental approach on several high-dimensional datasets and perform an accurate comparison with both the original version of our algorithm and with the state of the art competitors as well. The obtained results open the way to a novel usage of the co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally thus “incorporating” data gradually into an on-going co-clustering solution.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Anagnostopoulos A, Dasgupta A, Kumar R (2008) Approximation algorithms for co-clustering. In: Proceedings of PODS 2008. ACM Press, Vancouver, pp 201–210

  • Banerjee A, Dhillon I, Ghosh J, Merugu S, Modha D (2007) A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. J Mach Learn Res 8: 1919–1986

    MATH  MathSciNet  Google Scholar 

  • Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of the ACM SIGKDD 2004. ACM Press, Seattle, pp 79–88

  • Cherukuri VS, Candan K (2008) Propagation-vectors for trees (PVT): concise yet effective summaries for hierarchical data and trees. In: Proceeding of LSDS-IR 2008. ACM Press, Napa Valley, pp 3–10

  • Cho H, Dhillon IS, Guan Y, Sra S (2004) Minimum sum-squared residue co-clustering of gene expression data. In: Proceedings of SIAM SDM 2004, Lake Buena Vista, USA

  • Costa G, Manco G, Ortale R (2008) A hierarchical model-based approach to co-clustering high-dimensional data. In: Proceedings of ACM SAC 2008. ACM Press, Fortaleza, pp 886–890

  • Deodhar M, Ghosh J (2010) SCOAL: a framework for simultaneous co-clustering and learning from complex data. Trans Knowl Discov Data 4(3): 11:1–11:31

    Google Scholar 

  • Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of ACM SIGKDD 2003. ACM Press, Washington, pp 89–98

  • Ester M, Kriegel HP, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. In: Proceedings of VLDB 1998. Morgan Kaufmann, New York, pp 323–333

  • Fisher D (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2): 139–172

    Google Scholar 

  • Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3: 1289–1305

    MATH  Google Scholar 

  • George T, Merugu S (2005) A scalable collaborative filtering framework based on co-clustering. In: Proceedings of ICDM 2005. IEEE Computer Society, Houston, pp 625–628

  • Gong J, Oard D (2009) Selecting hierarchical clustering cut points for web person-name disambiguation. In: Proceedings of ACM SIGIR 2009. ACM Press, Boston, pp 778–779

  • Goodman LA, Kruskal WH (1954) Measures of association for cross classification. J Am Stat Assoc 49: 732–764

    MATH  Google Scholar 

  • Goodman LA, Kruskal WH (1959) Measure of association for cross classification II: further discussion and references. J Am Stat Assoc 54: 123–163

    Article  MATH  Google Scholar 

  • Han J, Kamber M (2000) Data mining: concepts and techniques. The Morgan Kaufmann series in data management systems. Morgan Kaufmann, Burlington

    Google Scholar 

  • Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337): 123–129

    Article  Google Scholar 

  • Heard NA, Holmes CC, Stephens DA, Hand DJ, Dimopoulos G (2005) Bayesian coclustering of anopheles gene expression time series: study of immune defense response to multiple experimental challenges. Proc Natl Acad Sci 102(47): 16939–16944

    Article  Google Scholar 

  • Hosseini M, Abolhassani H (2007) Hierarchical co-clustering for web queries and selected urls. In: Proceedings of WISE 2007, LNCS, vol 4831. Springer, Nancy, pp 653–662

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1): 193–218

    Article  Google Scholar 

  • Ienco D, Pensa R, Meo R (2009) Parameter-free hierarchical co-clustering by n-ary splits. In: Proceedings of ECML PKDD 2009, part I. Lecture notes in computer science, vol 5781. Springer, Bled, pp 580–595

  • Ienco D, Robardet C, Pensa R, Meo R (2012) Parameter-less co-clustering for star-structured heterogeneous data. Data Min Knowl Discov. doi:10.1007/s10618-012-0248-z

  • Jaszkiewicz A (2002) Genetic local search for multi-objective combinatorial optimization. Eur J Oper Res 137(1): 50–71

    Article  MATH  MathSciNet  Google Scholar 

  • Khoshneshin M, Street W (2010) Incremental collaborative filtering via evolutionary co-clustering. In: Proceedings of ACM RecSys 2010. ACM Press, Barcelona, pp 325–328

  • Kim J, Candan K (2006) CP/CV: concept similarity mining without frequency information from domain describing taxonomies. In: Proceedings of ACM CIKM 2006. ACM Press, Arlington, pp 483–492

  • Kluger Y, Basri R, Chang J, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13: 703–716

    Article  Google Scholar 

  • Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1): 1:1– 1:58

    Article  Google Scholar 

  • Paquete L, Stützle T (2006) A study of stochastic local search algorithms for the biobjective qap with correlated flow matrices. Eur J Oper Res 169(3): 943–959

    Article  MATH  Google Scholar 

  • Robardet C (2002) Contribution à à la classification non supervisée: proposition d’une methode de bi-partitionnement. Ph.D. thesis, Université Claude Bernard, Lyon 1

  • Robardet C, Feschet F (2001a) Comparison of three objective functions for conceptual clustering. In: Proceedings of PKDD 2001, LNCS, vol 2168. Springer, Freiburg, pp 399–410

  • Robardet C, Feschet F (2001b) Efficient local search in conceptual clustering. In: Proceedings of DS 2001, LNCS, vol 2226. Springer, Washington, DC, pp 323–335

  • Sahoo N, Callan J, Krishnan R, Duncan G, Padman R (2006) Incremental hierarchical clustering of text documents. In: Proceedings of ACM CIKM 2006. ACM Press, Arlington, pp 357–366

  • Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: Proceedings of ACM SIGIR 2000. ACM Press, Athens, pp 208–215

  • Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617

    MathSciNet  Google Scholar 

  • Zhang Y, Li T (2011) Extending consensus clustering to explore multiple clustering views. In: Proceedings of SIAM SDM 2011, pp 920–931

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ruggero G. Pensa.

Additional information

Responsible editor: Ian Davidson.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Pensa, R.G., Ienco, D. & Meo, R. Hierarchical co-clustering: off-line and incremental approaches. Data Min Knowl Disc 28, 31–64 (2014). https://doi.org/10.1007/s10618-012-0292-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-012-0292-8

Keywords

Navigation