HICC: an entropy splitting-based framework for hierarchical co-clustering

Cheng, Wei; Zhang, Xiang; Pan, Feng; Wang, Wei

doi:10.1007/s10115-015-0823-x

Regular Paper
Published: 10 February 2015

Volume 46, pages 343–367, (2016)
Knowledge and Information Systems


Abstract

Two-dimensional contingency tables, or co-occurrence matrices, arise frequently in important applications such as text analysis and web-log mining. As a fundamental research topic, co-clustering aims to generate a meaningful partition of the contingency table to reveal hidden relationships between rows and columns. Traditional co-clustering algorithms usually produce a predefined number of flat partitions of both rows and columns, which do not reveal relationships among clusters. To address this limitation, hierarchical co-clustering algorithms have recently attracted considerable research interest. Although successful in various applications, existing hierarchical co-clustering algorithms are usually based on heuristics and lack a solid theoretical foundation. In this paper, we present a new co-clustering algorithm, HICC, that has a solid theoretical foundation. It simultaneously constructs a hierarchical structure of both row and column clusters that retains sufficient mutual information between the rows and columns of the contingency table. We develop an efficient and effective greedy algorithm that grows a co-cluster hierarchy by successively performing the row-wise or column-wise split that leads to the maximal mutual information gain. Extensive experiments on both synthetic and real datasets demonstrate that our algorithm reveals essential relationships among row (and column) clusters and achieves better clustering precision than existing algorithms. Moreover, the experiments on real datasets show that HICC can effectively reveal hidden relationships between rows and columns in the contingency table.
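The greedy step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mutual_information`, `aggregate`, `best_split`) are hypothetical, mutual information is measured in bits, and the exhaustive search over two-way splits is a stand-in that is only practical for toy tables. The abstract does not say how HICC generates candidate splits or how it initializes the hierarchy. Because a single split of a 1 x 1 co-cluster leaves the mutual information at zero, the example assumes the columns are already grouped into two clusters.

```python
from itertools import combinations

import numpy as np


def mutual_information(joint):
    """Mutual information (bits) between the rows and columns of a count table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize counts to a joint distribution
    px = joint.sum(axis=1, keepdims=True)    # row marginal, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)    # column marginal, shape (1, m)
    mask = joint > 0                         # skip empty cells (0 * log 0 = 0)
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())


def aggregate(table, row_labels, col_labels):
    """Sum the counts inside every (row cluster, column cluster) block."""
    table = np.asarray(table, dtype=float)
    r = np.unique(np.asarray(row_labels), return_inverse=True)[1]
    c = np.unique(np.asarray(col_labels), return_inverse=True)[1]
    out = np.zeros((r.max() + 1, c.max() + 1))
    np.add.at(out, (r[:, None], c[None, :]), table)
    return out


def best_split(table, row_labels, col_labels):
    """Return (gain, axis, new_labels) for the single two-way split of one
    row or column cluster that raises the mutual information the most.
    Assumption: every bipartition of every cluster is tried (toy sizes only)."""
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    base = mutual_information(aggregate(table, row_labels, col_labels))
    best = (0.0, None, None)
    for axis, labels in (("row", row_labels), ("col", col_labels)):
        new_id = labels.max() + 1
        for cluster in np.unique(labels):
            members = np.flatnonzero(labels == cluster)
            rest = members[1:]  # members[0] stays put, so mirrored splits are skipped
            for k in range(1, len(rest) + 1):
                for moved in combinations(rest, k):
                    trial = labels.copy()
                    trial[list(moved)] = new_id
                    if axis == "row":
                        mi = mutual_information(aggregate(table, trial, col_labels))
                    else:
                        mi = mutual_information(aggregate(table, row_labels, trial))
                    if mi - base > best[0]:
                        best = (mi - base, axis, trial)
    return best


# Toy contingency table with two clean diagonal blocks.
table = np.array([[5, 5, 0, 0],
                  [5, 5, 0, 0],
                  [0, 0, 5, 5],
                  [0, 0, 5, 5]])
gain, axis, labels = best_split(table, [0, 0, 0, 0], [0, 0, 1, 1])
print(axis, gain, labels)
```

On this toy table the selected split is row-wise and separates the first two rows from the last two, which is the only split that makes the row clusters fully predictive of the column clusters. Repeating the step on the updated labels, and recording which cluster was split each time, would grow a hierarchy in the spirit of the abstract's description.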

Notes

http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.

Acknowledgments

This work was supported in part by NSF through Grants NSF IIS-1218036 and NSF IIS-1162374.

Author information

Authors and Affiliations

Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Wei Cheng
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, Ohio
Xiang Zhang
Microsoft, Raymond, MS, USA
Feng Pan
Department of Computer Science, University of California, Los Angeles, CA, 90095-1596, USA
Wei Wang

Corresponding author

Correspondence to Wei Cheng.

About this article

Cite this article

Cheng, W., Zhang, X., Pan, F. et al. HICC: an entropy splitting-based framework for hierarchical co-clustering. Knowl Inf Syst 46, 343–367 (2016). https://doi.org/10.1007/s10115-015-0823-x

Received: 06 February 2014
Revised: 17 September 2014
Accepted: 31 January 2015
Published: 10 February 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10115-015-0823-x
