Abstract
This paper studies the Iterative Double Clustering (IDC) meta-clustering algorithm, an extension of the recent Double Clustering (DC) method of Slonim and Tishby, which exhibited impressive performance on text categorization tasks [1]. Using synthetically generated data, we empirically demonstrate that whenever the DC procedure succeeds in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a dramatically better classification with minor additional computational resources. We demonstrate that the IDC algorithm is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained on a small training set. Finally, we propose a natural extension of IDC for (semi-supervised) transductive learning, where we are given both labeled and unlabeled examples, and present preliminary empirical results showing the plausibility of the extended method in a semi-supervised setting.
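The IDC loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: plain k-means (with farthest-point initialization) stands in for the information-bottleneck clustering subroutine used in the paper, and the function names `kmeans` and `iterative_double_clustering` are illustrative. Each round clusters words by their distribution over the current document clusters, then re-represents documents over the word clusters and clusters the documents.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means, a stand-in here for the information-bottleneck
    clustering subroutine of the paper (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization keeps the initial centers spread out.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def iterative_double_clustering(counts, k_words, k_docs, rounds=3):
    """Hypothetical rendering of the IDC loop on a docs-by-words count
    matrix: alternately cluster words by their distribution over (clusters
    of) documents, then cluster documents represented over word clusters."""
    n_docs, n_words = counts.shape
    eps = 1e-12
    # First round: represent each word by its distribution over raw documents.
    word_repr = (counts / np.maximum(counts.sum(axis=0, keepdims=True), eps)).T
    doc_labels = np.zeros(n_docs, dtype=int)
    for r in range(rounds):
        word_labels = kmeans(word_repr, k_words, seed=r)
        # Represent each document over the word clusters, normalized.
        doc_repr = np.zeros((n_docs, k_words))
        for j in range(k_words):
            doc_repr[:, j] = counts[:, word_labels == j].sum(axis=1)
        doc_repr /= np.maximum(doc_repr.sum(axis=1, keepdims=True), eps)
        doc_labels = kmeans(doc_repr, k_docs, seed=r)
        # Subsequent rounds: represent each word over the new doc clusters.
        word_repr = np.zeros((n_words, k_docs))
        for c in range(k_docs):
            word_repr[:, c] = counts[doc_labels == c, :].sum(axis=0)
        word_repr /= np.maximum(word_repr.sum(axis=1, keepdims=True), eps)
    return doc_labels
```

On a two-topic co-occurrence matrix with attribute noise, the returned document labels should recover the topic partition; the iteration re-estimates the word clusters from progressively cleaner document clusters, which is the intuition behind IDC's robustness to noise.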
References
Noam Slonim and Naftali Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR 2000, 2000.
A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall, New Jersey, 1988.
N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, 1999.
N. Slonim and N. Tishby. Agglomerative information bottleneck. In NIPS99, 1999.
L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In Proceedings of SIGIR’98, 1998.
N. Slonim and N. Tishby. The power of word clustering for text classification. To appear in the European Colloquium on IR Research, ECIR, 2001.
T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.
K. Rose. Deterministic annealing for clustering, compression, classification, regression and related optimization problems. Proceedings of the IEEE, 86(11):2210–2238, 1998.
J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In NIPS97, 1997.
I.D. Guedalia, M. London, and M. Werman. A method for on-line clustering of non-stationary data. Neural Computation, 11:521–540, 1999.
The 20 Newsgroups data set. http://www.ai.mit.edu/jrennie/20_newsgroups/.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
El-Yaniv, R., Souroujon, O. (2003). Iterative Double Clustering for Unsupervised and Semi-supervised Learning. In: De Raedt, L., Flach, P. (eds) Machine Learning: ECML 2001. ECML 2001. Lecture Notes in Computer Science(), vol 2167. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44795-4_11
Print ISBN: 978-3-540-42536-6
Online ISBN: 978-3-540-44795-5