Contextual Document Clustering

  • Conference paper
Advances in Information Retrieval (ECIR 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Abstract

In this paper we present a novel algorithm for document clustering. The approach is based on distributional clustering: subject-related words that have a narrow context are identified and used as meta-tags for their subject. These contextual words form the basis for creating thematic clusters of documents. As in other research on document clustering, we evaluate the quality of the approach on document categorization problems and show that it outperforms the information-theoretic sequential information bottleneck method.
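
The abstract does not say how a "narrow context" is measured or how documents are attached to contexts. The following is only a minimal sketch of one possible reading, under assumptions that are not taken from the paper: a word's context is the word distribution over the documents that contain it, narrowness is measured by Shannon entropy, rare words are filtered with a hypothetical `min_df` threshold, and each document joins the context with the smallest Jensen-Shannon divergence. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def context_distribution(word, docs):
    """Word distribution over all documents that contain `word` (assumed definition)."""
    counts = Counter()
    for doc in docs:
        if word in doc:
            counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def entropy(dist):
    """Shannon entropy in bits; used here as an assumed measure of context narrowness."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def contextual_clusters(docs, num_contexts, min_df=2):
    """Pick the `num_contexts` narrowest word contexts and assign each document to one.

    `docs` is a list of non-empty token lists. `min_df` (hypothetical) drops words
    that occur in too few documents, since their contexts are trivially narrow.
    """
    doc_freq = Counter(w for doc in docs for w in set(doc))
    candidates = [w for w, df in doc_freq.items() if df >= min_df]
    contexts = {w: context_distribution(w, docs) for w in candidates}
    # Lowest entropy first; ties are broken alphabetically for determinism.
    narrow = sorted(candidates, key=lambda w: (entropy(contexts[w]), w))[:num_contexts]

    assignments = []
    for doc in docs:
        counts = Counter(doc)
        doc_dist = {w: c / len(doc) for w, c in counts.items()}
        best = min(narrow, key=lambda w: (js_divergence(doc_dist, contexts[w]), w))
        assignments.append(best)
    return narrow, assignments


if __name__ == "__main__":
    toy_docs = [
        ["goal", "match", "team"],
        ["goal", "team", "score"],
        ["stock", "market", "price"],
        ["stock", "price", "bank"],
    ]
    tags, labels = contextual_clusters(toy_docs, num_contexts=2)
    print(tags, labels)
```

In this sketch the selected words play the role of the meta-tags mentioned in the abstract, and each cluster is simply the set of documents assigned to the same tag. It does not implement the sequential information bottleneck baseline.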


Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dobrynin, V., Patterson, D., Rooney, N. (2004). Contextual Document Clustering. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_13

  • DOI: https://doi.org/10.1007/978-3-540-24752-4_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21382-6

  • Online ISBN: 978-3-540-24752-4

  • eBook Packages: Springer Book Archive