Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields

  • Jakub Waszczuk
  • Witold Kieraś
  • Marcin Woliński
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

The paper presents a system for joint morphosyntactic disambiguation and segmentation of Polish based on conditional random fields (CRFs). The system is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG). We rely on constrained linear-chain CRFs generalized to work directly on DAGs, which allows us to perform segmentation as a by-product of morphosyntactic disambiguation. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. We evaluate our system on historical corpora of Polish, where segmentation ambiguities are more prominent than in contemporary Polish, and show that our system significantly outperforms several baseline segmentation methods.
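The idea that segmentation falls out of disambiguation can be illustrated with a minimal decoding sketch. It is not the authors' implementation. It assumes a first-order model in which each DAG edge carries one candidate segment and tag, node ids are topologically ordered, and the score of a path is the sum of hand-set edge and tag-transition scores (a trained CRF would derive these from learned feature weights). The `Edge` type, the `best_path` function, the simplified tags, and all numeric scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    start: int  # DAG node where the segment begins
    end: int    # DAG node where the segment ends
    form: str   # surface form of the segment
    tag: str    # candidate morphosyntactic tag (simplified to the class)


def best_path(edges, emit, trans, first, last):
    """Return the highest-scoring path of edges from node `first` to node `last`."""
    incoming = defaultdict(list)  # node -> indices of reachable edges ending there
    best, back = {}, {}           # per-edge best prefix score and backpointer
    # Assumption: node ids are topologically ordered (start < end for every edge).
    for i in sorted(range(len(edges)), key=lambda k: edges[k].start):
        e = edges[i]
        local = emit.get((e.form, e.tag), 0.0)
        if e.start == first:
            best[i], back[i] = local, None
        elif incoming[e.start]:
            # Best predecessor edge, taking the tag-to-tag transition into account.
            score, prev = max(
                (best[j] + trans.get((edges[j].tag, e.tag), 0.0), j)
                for j in incoming[e.start]
            )
            best[i], back[i] = score + local, prev
        else:
            continue  # edge is not reachable from `first`
        incoming[e.end].append(i)
    # Pick the best edge that reaches the final node, then follow backpointers.
    i = max((j for j in best if edges[j].end == last), key=best.get)
    path = []
    while i is not None:
        path.append(edges[i])
        i = back[i]
    return path[::-1]


# Illustrative DAG for "Miałem rację": "Miałem" is either one noun segment
# or a past-tense verb "Miał" followed by the agglutinate "em".
edges = [
    Edge(0, 1, "Miał", "praet"),
    Edge(1, 2, "em", "aglt"),
    Edge(0, 2, "Miałem", "subst"),
    Edge(2, 3, "rację", "subst"),
]
emit = {}  # no edge-level scores in this toy example
trans = {("praet", "aglt"): 2.0, ("aglt", "subst"): 1.0, ("subst", "subst"): 0.5}

path = best_path(edges, emit, trans, first=0, last=3)
segments = [(e.form, e.tag) for e in path]
print(segments)
```

Because the decoder selects a single path through the DAG, the chosen edges fix both the tags and the segment boundaries at once, so no separate segmentation step is needed.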

Keywords

  • Word segmentation
  • Morphosyntactic tagging
  • Historical Polish
  • Conditional random fields

Acknowledgements

The work reported here was partially supported by the National Science Centre, Poland, under grant DEC-2014/15/B/HS2/03119.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jakub Waszczuk (1)
  • Witold Kieraś (2)
  • Marcin Woliński (2)
  1. Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  2. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland