Categorizing Unknown Words: A Decision Tree-Based Misspelling Identifier

Toole, Janine

doi:10.1007/3-540-46695-9_11

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1747)

Abstract

This paper introduces a robust, portable system for categorizing unknown words. It is based on a multi-component architecture in which each component is responsible for identifying one class of unknown words. The focus of this paper is the component that identifies spelling errors. The misspelling identifier uses a decision tree architecture to combine multiple types of evidence about the unknown word. It is evaluated using data from live closed captions, a genre replete with a wide variety of unknown words.
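As a rough illustration of how a decision tree can combine several kinds of evidence about an unknown word, the following minimal sketch may help. It is not the paper's implementation: the toy lexicon, the three evidence types (edit distance to the nearest known word, capitalization, word length), the thresholds, and all function names are assumptions chosen for the example, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon; a real system would use a full dictionary.
LEXICON = {"weather", "tonight", "president", "temperature", "government"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class Evidence:
    """Illustrative evidence about one unknown word (assumed feature set)."""
    min_distance: int   # edit distance to the closest lexicon word
    capitalized: bool   # does the word start with an uppercase letter?
    length: int         # number of characters


def extract_evidence(word: str) -> Evidence:
    return Evidence(
        min_distance=min(edit_distance(word.lower(), w) for w in LEXICON),
        capitalized=word[:1].isupper(),
        length=len(word),
    )


def is_misspelling(word: str) -> bool:
    """Hand-written decision tree; thresholds are illustrative, not learned."""
    ev = extract_evidence(word)
    if ev.min_distance > 2:
        return False    # far from every known word: likely another class
    if ev.capitalized and ev.min_distance == 2:
        return False    # capitalized and only loosely similar: likely a name
    return ev.length >= 4   # very short strings carry too little evidence
```

Under these assumptions, `is_misspelling("goverment")` returns `True` because the word is one edit away from a lexicon entry, while `is_misspelling("Toole")` returns `False` because no lexicon entry is close. In a learned tree the split order and thresholds would come from training data rather than being fixed by hand.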

Author information

Authors and Affiliations

Natural Language Lab, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
Janine Toole

Editor information

Editors and Affiliations

School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, 2052, Australia
Norman Foo

About this paper

Cite this paper

Toole, J. (1999). Categorizing Unknown Words: A Decision Tree-Based Misspelling Identifier. In: Foo, N. (ed.) Advanced Topics in Artificial Intelligence. AI 1999. Lecture Notes in Computer Science, vol 1747. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46695-9_11

DOI: https://doi.org/10.1007/3-540-46695-9_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66822-0
Online ISBN: 978-3-540-46695-6
eBook Packages: Springer Book Archive
