Categorizing Unknown Words: A Decision Tree-Based Misspelling Identifier

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1747)

Abstract

This paper introduces a robust, portable system for categorizing unknown words. The system is based on a multi-component architecture in which each component is responsible for identifying one class of unknown words. The focus of this paper is the component that identifies spelling errors. This misspelling identifier uses a decision tree architecture to combine multiple types of evidence about the unknown word. It is evaluated using data from live closed captions, a genre replete with a wide variety of unknown words.
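
To make the architecture concrete, here is a minimal sketch of how a misspelling component might combine several kinds of evidence in a decision tree. Everything in it is an assumption for illustration: the abstract does not name the features, thresholds, or lexicon, so `LEXICON`, `Evidence`, `gather_evidence`, `is_misspelling`, and `categorize` are hypothetical, and the hand-written branches stand in for a tree that would normally be learned from labeled data.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon; the abstract does not describe the real one.
LEXICON = {"the", "weather", "report", "tonight", "president"}


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Evidence:
    """Assumed evidence types; chosen for illustration only."""
    min_distance: int   # edit distance to the closest lexicon entry
    length: int         # number of characters in the unknown word
    capitalized: bool   # whether the word starts with an uppercase letter


def gather_evidence(word: str) -> Evidence:
    return Evidence(
        min_distance=min(edit_distance(word.lower(), w) for w in LEXICON),
        length=len(word),
        capitalized=word[:1].isupper(),
    )


def is_misspelling(word: str) -> bool:
    """Hand-written decision tree; thresholds are illustrative, not learned."""
    ev = gather_evidence(word)
    if ev.capitalized:
        return False              # assume capitalized unknowns are more likely names
    if ev.min_distance <= 1:
        return True               # one edit away from a known word
    if ev.min_distance == 2:
        return ev.length >= 7     # tolerate two edits only in longer words
    return False


def categorize(word: str) -> str:
    """One component per class; only the misspelling component is sketched."""
    if word.lower() in LEXICON:
        return "known"
    if is_misspelling(word):
        return "misspelling"
    return "other"
```

In this sketch, `categorize("wether")` is routed to the misspelling class because it is one edit away from a lexicon entry, while a capitalized unknown such as `"Toole"` falls through to `"other"`, where a separate component (for example, a name identifier) could take over.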

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Toole, J. (1999). Categorizing Unknown Words: A Decision Tree-Based Misspelling Identifier. In: Foo, N. (eds) Advanced Topics in Artificial Intelligence. AI 1999. Lecture Notes in Computer Science, vol 1747. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46695-9_11

  • DOI: https://doi.org/10.1007/3-540-46695-9_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66822-0

  • Online ISBN: 978-3-540-46695-6

  • eBook Packages: Springer Book Archive
