Abstract
This chapter describes a document classification and routing system that uses a probabilistic approach to determine the “flavor” of a text. The necessary probabilities are determined from relevant training documents. The chapter discusses the development, refinement, and testing of the system’s ability to route 120,000 documents into 50 topics, as well as the mathematical model on which the system is based.
Keywords
- Relevant Document
- Multinomial Distribution
- Document Frequency
- Boolean Expression
- Document Classification
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 1999 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Guthrie, L., Guthrie, J., Leistensnider, J. (1999). Document Classification and Routing. In: Strzalkowski, T. (ed.) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_12
DOI: https://doi.org/10.1007/978-94-017-2388-6_12
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5209-4
Online ISBN: 978-94-017-2388-6
eBook Packages: Springer Book Archive
