High Performance Part-of-Speech Tagging of Bulgarian

  • Veselka Doychinova
  • Stoyan Mihov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3192)

Abstract

This paper presents an accurate and highly efficient rule-based part-of-speech tagger for Bulgarian. All four stages – tokenization, dictionary application, unknown-word guessing, and contextual part-of-speech disambiguation – are implemented as a pipeline of a couple of deterministic finite-state bimachines and transducers. We describe the Bulgarian ambiguity classes and give a detailed evaluation and error analysis of our tagger. The overall precision of the tagger is over 98.4% for full disambiguation, and the processing speed is over 34K words/sec on a personal computer. The same methodology has also been applied to English. The presented implementation meets the specific demands of the Semantic Web.
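The four-stage pipeline named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' finite-state implementation: plain functions stand in for the bimachines and transducers, and the lexicon, tag names, suffix rule, and contextual rule are hypothetical toy values chosen only to show how the stages compose (English is used for readability).

```python
import re

# Hypothetical toy lexicon: word -> set of possible tags (its ambiguity class).
LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB"},
    "rusts": {"VERB"},
}

def tokenize(text):
    """Stage 1: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def apply_dictionary(tokens):
    """Stage 2: attach the ambiguity class from the lexicon, or None if unknown."""
    return [(tok, LEXICON.get(tok.lower())) for tok in tokens]

def guess_unknown(tagged):
    """Stage 3: guess tags for unknown tokens from surface clues (toy rules)."""
    out = []
    for tok, tags in tagged:
        if tags is None:
            if not tok.isalnum():
                tags = {"PUNCT"}
            elif tok.endswith("ly"):
                tags = {"ADV"}
            else:
                tags = {"NOUN"}
        out.append((tok, tags))
    return out

def disambiguate(tagged):
    """Stage 4: apply one toy contextual rule: after a determiner, prefer NOUN."""
    out = []
    prev = None
    for tok, tags in tagged:
        if len(tags) > 1 and prev == "DET" and "NOUN" in tags:
            tag = "NOUN"
        else:
            tag = sorted(tags)[0]  # arbitrary but deterministic fallback
        out.append((tok, tag))
        prev = tag
    return out

def tag_text(text):
    """Run the four stages in sequence, each consuming the previous output."""
    return disambiguate(guess_unknown(apply_dictionary(tokenize(text))))

print(tag_text("The can rusts quickly."))
```

Each stage is a deterministic function of the previous stage's output, which is the property that lets a real system of this kind compile the stages into finite-state devices and run them in a single left-to-right pass.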

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Veselka Doychinova (1)
  • Stoyan Mihov (1)
  1. Institute for Parallel Processing, Bulgarian Academy of Sciences