Knowledge and Information Systems, Volume 12, Issue 1, pp 1–24

Non-redundant data clustering

  • David Gondek
  • Thomas Hofmann
Regular Paper

Abstract

Data clustering is a popular approach for automatically finding classes, concepts, or groups of patterns. In practice, this discovery process should avoid redundancies with existing knowledge about class structures or groupings, and reveal novel, previously unknown aspects of the data. In order to deal with this problem, we present an extension of the information bottleneck framework, called coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, one can apply an alternating optimization scheme that can be used in conjunction with different types of numeric and non-numeric attributes. We discuss extensions of the technique to the tasks of semi-supervised classification and enumeration of successive non-redundant clusterings. We present experimental results for applications in text mining and computer vision.
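As a rough illustration of the conditional mutual information score mentioned in the abstract, the sketch below evaluates I(C; Y | Z) for a small discrete joint distribution. The symbols are assumptions made for this example, since the abstract does not name them: C stands for the new cluster assignment, Y for the observed features, and Z for the already known grouping. The function and helper names are hypothetical, and this only evaluates the score for a given clustering; it is not the alternating optimization scheme of the paper.

```python
import math


def conditional_mutual_information(joint):
    """Return I(C; Y | Z) in nats for a joint table joint[c][y][z].

    Assumed (hypothetical) roles: C = cluster, Y = feature, Z = known grouping.
    """
    p_z, p_cz, p_yz = {}, {}, {}
    # Accumulate the marginals p(z), p(c, z) and p(y, z).
    for c, plane in enumerate(joint):
        for y, row in enumerate(plane):
            for z, p in enumerate(row):
                p_z[z] = p_z.get(z, 0.0) + p
                p_cz[(c, z)] = p_cz.get((c, z), 0.0) + p
                p_yz[(y, z)] = p_yz.get((y, z), 0.0) + p
    total = 0.0
    # Sum p(c,y,z) * log[ p(z) p(c,y,z) / (p(c,z) p(y,z)) ] over nonzero cells.
    for c, plane in enumerate(joint):
        for y, row in enumerate(plane):
            for z, p in enumerate(row):
                if p > 0.0:
                    ratio = p * p_z[z] / (p_cz[(c, z)] * p_yz[(y, z)])
                    total += p * math.log(ratio)
    return total


def build_joint(cluster_of):
    """Toy joint over binary C, Y, Z with Y and Z uniform and independent.

    cluster_of(y, z) deterministically picks the cluster for each (y, z) cell.
    """
    joint = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    for y in range(2):
        for z in range(2):
            joint[cluster_of(y, z)][y][z] = 0.25
    return joint


# A clustering that merely copies the known grouping Z is fully redundant.
redundant = conditional_mutual_information(build_joint(lambda y, z: z))
# A clustering that follows the feature Y, independent of Z, is novel.
novel = conditional_mutual_information(build_joint(lambda y, z: y))
```

In this toy setting the redundant clustering scores zero, while the clustering that tracks Y scores log 2 nats, which is the behaviour one would want from a score that rewards structure not already explained by Z.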

Keywords

Non-redundant clustering, Exploratory data mining, Information bottleneck

Copyright information

© Springer-Verlag London Limited 2006

Authors and Affiliations

  • David Gondek (1)
  • Thomas Hofmann (1)
  1. Department of Computer Science, Brown University, Providence, USA
