Contextual Document Clustering

Dobrynin, Vladimir; Patterson, David; Rooney, Niall

doi:10.1007/978-3-540-24752-4_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

In this paper we present a novel algorithm for document clustering. The approach is based on distributional clustering: subject-related words that have a narrow context are identified and used as meta-tags for that subject. These contextual words form the basis for creating thematic clusters of documents. As in other research on document clustering, we evaluate the quality of the approach on document categorization problems and show that it outperforms the information-theoretic sequential information bottleneck method.
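
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, under stated assumptions, and not the authors' method. It assumes that the "context" of a word is the pooled word distribution of the documents containing it, that "narrow" means low Shannon entropy, and that a document joins the context closest to it in Jensen-Shannon divergence. The function names (word_contexts, entropy, js_divergence, cluster) and the parameter n_contexts are hypothetical.

```python
import math
from collections import Counter


def word_contexts(docs):
    """Assumed definition: the context of word z is the distribution
    p(y | z) over all words y in the documents that contain z."""
    pooled = {}
    for doc in docs:
        counts = Counter(doc)
        for z in counts:
            pooled.setdefault(z, Counter()).update(counts)
    contexts = {}
    for z, counts in pooled.items():
        total = sum(counts.values())
        contexts[z] = {y: n / total for y, n in counts.items()}
    return contexts


def entropy(p):
    """Shannon entropy in bits; lower means a narrower context."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mean(a):
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


def cluster(docs, n_contexts):
    """Pick the n_contexts lowest-entropy words as cluster seeds
    (an assumption), then label each document with the nearest seed."""
    contexts = word_contexts(docs)
    # Ties on entropy are broken alphabetically to keep the result stable.
    narrow = sorted(contexts, key=lambda z: (entropy(contexts[z]), z))
    narrow = narrow[:n_contexts]
    labels = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        p_doc = {y: n / total for y, n in counts.items()}
        nearest = min(narrow, key=lambda z: (js_divergence(p_doc, contexts[z]), z))
        labels.append(nearest)
    return narrow, labels
```

Each document is a list of tokens, and cluster(docs, n_contexts) returns the selected seed words together with one seed label per document. In this sketch a word that occurs in a single document trivially has a low-entropy context, so a practical version would probably also need some filter on very rare and very frequent words; that filter is omitted here.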

Author information

Authors and Affiliations

Faculty of Applied Mathematics and Control Processes, St Petersburg State University, 198904, Petrodvoretz, St. Petersburg, Russia
Vladimir Dobrynin
Nikel, Faculty of Informatics, University of Ulster,
David Patterson & Niall Rooney

Editor information

Editors and Affiliations

School of Computing and Technology, David Goldman Informatics Centre, University of Sunderland, St. Peter’s Campus, SR6 0DD, Sunderland, UK
Sharon McDonald
School of Computing and Technology, University of Sunderland, St. Peter’s Campus, St. Peter’s Way, SR6 0DD, Sunderland, United Kingdom
John Tait

About this paper

Cite this paper

Dobrynin, V., Patterson, D., Rooney, N. (2004). Contextual Document Clustering. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_13

DOI: https://doi.org/10.1007/978-3-540-24752-4_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4
eBook Packages: Springer Book Archive
