Document Classification and Routing

A Probabilistic Approach

  • Chapter

Part of the Text, Speech and Language Technology book series (TLTB, volume 7)

Abstract

This chapter describes a document classification and routing system that uses a probabilistic approach to determine the “flavor” of a text. The necessary probabilities are estimated from relevant training documents. The chapter discusses the development, refinement, and testing of the system’s ability to route 120,000 documents into 50 topics, as well as the mathematical model on which the system is based.
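
As a rough illustration of what estimating topic probabilities from training documents can look like, the sketch below scores a text against per-topic multinomial word models. It is a generic example, not the chapter's actual model: the function names `train` and `route`, the add-one smoothing, the equal topic priors, and the toy training data are all assumptions made for this example.

```python
import math
from collections import Counter


def train(docs_by_topic):
    """Estimate smoothed per-topic word probabilities from training documents."""
    vocab = {w for docs in docs_by_topic.values() for d in docs for w in d.split()}
    models = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        # Add-one (Laplace) smoothing is an assumption of this sketch.
        models[topic] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return models


def route(text, models):
    """Return the topic whose word model gives the text the highest log-likelihood."""
    scores = {}
    for topic, probs in models.items():
        # Words outside the training vocabulary are ignored; priors are taken as equal.
        scores[topic] = sum(math.log(probs[w]) for w in text.split() if w in probs)
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
training = {
    "sports": ["the team won the game", "a late goal won the match"],
    "finance": ["the bank raised interest rates", "stock prices fell on the market"],
}
models = train(training)
print(route("the team scored a goal", models))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long documents, which matters when routing a large collection.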

Keywords

  • Relevant Document
  • Multinomial Distribution
  • Document Frequency
  • Boolean Expression
  • Document Classification

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 1999 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Guthrie, L., Guthrie, J., Leistensnider, J. (1999). Document Classification and Routing. In: Strzalkowski, T. (ed.) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_12

  • DOI: https://doi.org/10.1007/978-94-017-2388-6_12

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5209-4

  • Online ISBN: 978-94-017-2388-6

  • eBook Packages: Springer Book Archive