
Named Entity Recognition for Highly Inflectional Languages: Effects of Various Lemmatization and Stemming Approaches

  • Michal Konkol
  • Miloslav Konopík
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)

Abstract

In this paper, we study the effects of various lemmatization and stemming approaches on the named entity recognition (NER) task for Czech, a highly inflectional language. Lemmatizers are regarded as a necessary component of Czech NER systems and have been used in all papers published on Czech NER so far. It is therefore of the utmost importance to explore their benefits and limits, as well as the differences between simple and complex methods. Our experiments are evaluated on the standard Czech Named Entity Corpus 1.1 as well as on the newly created version 2.0.

Keywords

Named Entity Recognition, Lemmatization, Stemming

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Michal Konkol (1)
  • Miloslav Konopík (1)
  1. Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
