Abstract
In traditional co-clustering, the only basis for the clustering task is a given relationship matrix describing the strengths of the relationships between pairs of elements from the two domains. Relying on this single input matrix, co-clustering discovers relationships holding among groups of elements from the two input domains. In many real-life applications, however, additional background knowledge or metadata about one or both of the input domains may be available and, if leveraged properly, such metadata can play a significant role in the effectiveness of the co-clustering process. How additional metadata affects co-clustering, however, depends on how the process is modified to be context-aware. In this paper, we propose, compare, and evaluate three alternative strategies (metadata-driven, metadata-constrained, and metadata-injected co-clustering) for embedding available contextual knowledge into the co-clustering process. Experimental results show that it is possible to leverage the available metadata to discover contextually-relevant co-clusters, without significant overheads in terms of information-theoretic co-cluster quality or execution cost.
Notes
The matrix is re-normalized after the application of the combination function to ensure that information-theoretic co-clustering, which treats the values in the matrix as probability distributions, can be applied. Due to this re-normalization, the combination function sum() is equivalent to average(): the two functions would differ only by a scaling factor of 2, which is absorbed by the re-normalization.
Additional information
This work is partially supported by NSF Grant NSF-III1016921, "One Size Does Not Fit All: Empowering the User with User-Driven Integration."
Cite this article
Schifanella, C., Sapino, M.L. & Candan, K.S. On context-aware co-clustering with metadata support. J Intell Inf Syst 38, 209–239 (2012). https://doi.org/10.1007/s10844-011-0151-x