Language Resources and Evaluation, Volume 50, Issue 4, pp 863–878

FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish

  • Miikka Silfverberg
  • Teemu Ruokolainen
  • Krister Lindén
  • Mikko Kurimo
Project Notes

Abstract

This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and a model cascade. Lemmatization combines a rule-based morphological analyzer, OMorFi, with a data-driven lemmatization model. The toolkit is readily applicable to tagging and lemmatization of running text, with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive in the computational efficiency of learning new models and of assigning analyses to novel sentences.
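
To make the tagging approach named in the abstract more concrete, the following is a minimal Python sketch of an averaged structured perceptron decoded with beam search. It is not the FinnPos implementation: the feature templates, the class name `AveragedPerceptronTagger`, the beam width, and the toy Finnish sentences are all assumptions chosen for illustration, it uses a plain full-sequence update, and it omits the model cascade and the lemmatizer entirely.

```python
from collections import defaultdict


def features(words, i, prev_tag):
    """Hypothetical feature templates: word form, 3-char suffix, previous tag."""
    w = words[i].lower()
    suf = w[-3:]
    return [f"w={w}", f"suf={suf}", f"prev={prev_tag}", f"prev+suf={prev_tag}|{suf}"]


class AveragedPerceptronTagger:
    """Illustrative sketch only; names and defaults are assumptions."""

    def __init__(self, tags, beam_width=3):
        self.tags = sorted(tags)
        self.beam_width = beam_width
        self.w = defaultdict(float)       # current weights, keyed by (feature, tag)
        self.totals = defaultdict(float)  # accumulated weight sums for averaging
        self.stamps = defaultdict(int)    # step at which each weight last changed
        self.step = 0                     # number of training examples processed

    def _score(self, feats, tag):
        return sum(self.w.get((f, tag), 0.0) for f in feats)

    def decode(self, words):
        """Left-to-right beam search over tag sequences."""
        beam = [(0.0, [])]
        for i in range(len(words)):
            candidates = []
            for score, seq in beam:
                prev = seq[-1] if seq else "<s>"
                feats = features(words, i, prev)
                for tag in self.tags:
                    candidates.append((score + self._score(feats, tag), seq + [tag]))
            # Stable sort: ties keep beam order, then alphabetical tag order.
            candidates.sort(key=lambda c: -c[0])
            beam = candidates[: self.beam_width]
        return beam[0][1]

    def _update(self, key, delta):
        # Lazily credit the old value for the steps during which it was in effect.
        self.totals[key] += (self.step - self.stamps[key]) * self.w[key]
        self.stamps[key] = self.step
        self.w[key] += delta

    def train(self, data, epochs=5):
        """Call once: the weights are replaced by their averages at the end."""
        for _ in range(epochs):
            for words, gold in data:
                pred = self.decode(words)
                if pred != gold:
                    for i in range(len(words)):
                        gold_prev = gold[i - 1] if i else "<s>"
                        pred_prev = pred[i - 1] if i else "<s>"
                        # Reward the gold analysis, penalize the predicted one.
                        for f in features(words, i, gold_prev):
                            self._update((f, gold[i]), +1.0)
                        for f in features(words, i, pred_prev):
                            self._update((f, pred[i]), -1.0)
                self.step += 1
        # Replace each weight by its average over all processed examples.
        for key in list(self.w):
            self.totals[key] += (self.step - self.stamps[key]) * self.w[key]
            self.w[key] = self.totals[key] / self.step


# Toy training data (an assumption for illustration, not from either treebank).
data = [
    (["koira", "juoksee"], ["NOUN", "VERB"]),
    (["kissa", "nukkuu"], ["NOUN", "VERB"]),
    (["koira", "nukkuu"], ["NOUN", "VERB"]),
]
tagger = AveragedPerceptronTagger(tags={"NOUN", "VERB"})
tagger.train(data)
print(tagger.decode(["kissa", "juoksee"]))
```

The toy data is far too small to say anything about accuracy; it only shows the training and decoding loop end to end. A realistic morphological tagger would use full morphological labels rather than two coarse tags, and far richer feature templates.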

Keywords

Morphological tagging, Data-driven lemmatization, Averaged perceptron, Finnish, Open-source

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Miikka Silfverberg (1, corresponding author)
  • Teemu Ruokolainen (2)
  • Krister Lindén (1)
  • Mikko Kurimo (2)
  1. University of Helsinki, Helsinki, Finland
  2. Aalto University, Helsinki, Finland
