Abstract
Data clustering is a popular approach for automatically discovering classes, concepts, or groups of patterns. In practice, this discovery process should avoid redundancy with existing knowledge about class structures or groupings and instead reveal novel, previously unknown aspects of the data. To address this problem, we present an extension of the information bottleneck framework, called the coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, the objective can be optimized with an alternating optimization scheme that accommodates different types of numeric and non-numeric attributes. We discuss extensions of the technique to semi-supervised classification and to the enumeration of successive non-redundant clusterings, and present experimental results for applications in text mining and computer vision.
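As a rough illustration of the score at the heart of the method (not the authors' implementation), the conditional mutual information I(C; Y | Z) — measuring what a cluster assignment C reveals about the data Y beyond the known side information Z — can be computed from an empirical joint distribution. The function name and the small discrete setting below are assumptions for the sketch:

```python
import numpy as np

def conditional_mutual_information(p):
    """Compute I(C; Y | Z) in nats for a joint distribution p[c, y, z].

    p is a 3-D array of non-negative entries summing to 1, indexed by
    cluster c, data value y, and known (side-information) label z.
    Uses I(C; Y | Z) = sum p(c,y,z) log[ p(c,y,z) p(z) / (p(c,z) p(y,z)) ].
    """
    p = np.asarray(p, dtype=float)
    pz = p.sum(axis=(0, 1))   # marginal p(z)
    pcz = p.sum(axis=1)       # marginal p(c, z)
    pyz = p.sum(axis=0)       # marginal p(y, z)
    mi = 0.0
    for c in range(p.shape[0]):
        for y in range(p.shape[1]):
            for z in range(p.shape[2]):
                if p[c, y, z] > 0:
                    mi += p[c, y, z] * np.log(
                        p[c, y, z] * pz[z] / (pcz[c, z] * pyz[y, z])
                    )
    return mi
```

A clustering that merely reproduces the known grouping Z scores near zero under this criterion, whereas one that captures structure orthogonal to Z scores high — which is why maximizing it (subject to compression constraints on C) favors non-redundant clusterings.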
Gondek, D., Hofmann, T. Non-redundant data clustering. Knowl Inf Syst 12, 1–24 (2007). https://doi.org/10.1007/s10115-006-0009-7