Abstract
This paper introduces a robust, portable system for categorizing unknown words. It is based on a multi-component architecture where each component is responsible for identifying one class of unknown words. The focus of this paper is the component that identifies spelling errors. The misspelling identifier uses a decision tree architecture to combine multiple types of evidence about the unknown word. The misspelling identifier is evaluated using data from live closed captions, a genre replete with a wide variety of unknown words.
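The idea of a decision tree that combines several kinds of evidence about an unknown word can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the three evidence types (capitalization, word length, edit distance to the nearest lexicon entry), the thresholds, and the names `Evidence`, `extract_evidence`, and `is_misspelling` are all hypothetical. The tree is also written by hand here, whereas a real system would more likely induce it from labeled data.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class Evidence:
    """Hypothetical evidence types; the paper's actual features may differ."""
    capitalized: bool
    length: int
    min_distance: int  # distance to the closest known word


def extract_evidence(word: str, lexicon: list[str]) -> Evidence:
    return Evidence(
        capitalized=word[:1].isupper(),
        length=len(word),
        min_distance=min(edit_distance(word.lower(), w) for w in lexicon),
    )


def is_misspelling(ev: Evidence) -> bool:
    """Hand-written tree with illustrative thresholds (assumptions)."""
    if ev.min_distance > 2:
        # Far from every known word: more likely a name or other new word.
        return False
    if ev.capitalized:
        # Capitalized words are treated cautiously, since they may be names.
        return ev.min_distance == 1 and ev.length >= 6
    return True


lexicon = ["weather", "president", "report"]
print(is_misspelling(extract_evidence("wether", lexicon)))
print(is_misspelling(extract_evidence("Clinton", lexicon)))
```

In this toy setup, "wether" lies one edit away from "weather" and is flagged as a misspelling, while "Clinton" is far from every lexicon entry and is left for another component to categorize.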
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Toole, J. (1999). Categorizing Unknown Words: A Decision Tree-Based Misspelling Identifier. In: Foo, N. (eds) Advanced Topics in Artificial Intelligence. AI 1999. Lecture Notes in Computer Science, vol 1747. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46695-9_11
DOI: https://doi.org/10.1007/3-540-46695-9_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66822-0
Online ISBN: 978-3-540-46695-6
eBook Packages: Springer Book Archive