A Study on Optimal Parameter Tuning for Rocchio Text Classifier

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Abstract

A current trend in operational text categorization is the design of fast classification tools. Several studies have recently been carried out on improving the accuracy of fast but less accurate classifiers. In particular, enhanced versions of the Rocchio text classifier, characterized by high performance, have been proposed. However, even in these extended formulations, the problem of tuning its parameters is still neglected. This paper presents a study of the parameters of the Rocchio text classifier, carried out to achieve its maximal accuracy. The result is a model for the automatic selection of parameters. Its main feature is to bound the search space so that optimal parameters can be selected quickly. The space is bounded by giving a feature selection interpretation of the Rocchio parameters. The benefit of the approach has been assessed via extensive cross-evaluation over three corpora in two languages. Comparative analysis shows that the performance achieved is relatively close to that of the best text categorization (TC) models (e.g., Support Vector Machines).
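To illustrate what a feature selection reading of the Rocchio parameters can look like, the sketch below uses the textbook Rocchio profile: each feature weight is beta times its mean weight in positive documents minus gamma times its mean weight in negative documents, with non-positive weights dropped. This formulation, the names `rocchio_profile` and `surviving_features`, and the toy documents are assumptions made for illustration; the abstract does not state the paper's exact formula or tuning procedure.

```python
from typing import Dict, List

Vector = Dict[str, float]  # feature name -> weight (e.g. a tf-idf value)


def rocchio_profile(positives: List[Vector], negatives: List[Vector],
                    beta: float = 1.0, gamma: float = 1.0) -> Vector:
    """Textbook Rocchio profile; features with non-positive weight are dropped.

    Assumed formulation, not taken from the paper.
    """
    features = set()
    for doc in positives + negatives:
        features.update(doc)

    profile: Vector = {}
    for f in features:
        pos = sum(d.get(f, 0.0) for d in positives) / len(positives) if positives else 0.0
        neg = sum(d.get(f, 0.0) for d in negatives) / len(negatives) if negatives else 0.0
        weight = beta * pos - gamma * neg
        if weight > 0:  # clipping at zero removes the feature from the profile
            profile[f] = weight
    return profile


def surviving_features(positives: List[Vector], negatives: List[Vector],
                       gammas: List[float], beta: float = 1.0) -> Dict[float, List[str]]:
    """For each gamma, list the features that remain in the profile."""
    return {g: sorted(rocchio_profile(positives, negatives, beta, g)) for g in gammas}


# Hypothetical toy data: two positive and two negative training documents.
positives = [{"wheat": 1.0, "price": 0.5}, {"wheat": 0.8, "export": 0.4}]
negatives = [{"price": 0.6, "bank": 1.0}, {"price": 0.4, "export": 0.1}]

selected = surviving_features(positives, negatives, gammas=[0.0, 1.0, 8.0])
for gamma, feats in selected.items():
    print(gamma, feats)
```

In this toy setting, raising gamma with beta fixed shrinks the set of surviving features step by step, so the parameter acts like a feature selection threshold. Past the point where the profile stops changing, larger values need not be searched, which is one way a search space of this kind could be bounded.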

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Moschitti, A. (2003). A Study on Optimal Parameter Tuning for Rocchio Text Classifier. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_30

  • DOI: https://doi.org/10.1007/3-540-36618-0_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive