Exploiting languages proximity for part-of-speech tagging of three French regional languages

  • Pierre Magistry
  • Anne-Laure Ligozat
  • Sophie Rosset
Original Paper


This paper presents experiments in part-of-speech tagging of low-resource languages. It addresses the case where neither labeled data in the target language nor a parallel corpus is available. We rely only on the proximity of the target language to a better-resourced language. We conduct experiments on three French regional languages. We exploit this proximity with two main strategies: delexicalization and transposition. The general idea is to learn a model on the (better-resourced) source language and then apply it to the (regional) target language. Delexicalization deals with the difference in vocabulary by creating abstract representations of the data. Transposition consists of modifying the target corpus so that the source models can be applied to it. We compare several methods and propose different strategies to combine them and improve the state of the art in part-of-speech tagging in this difficult scenario.
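As a rough illustration of the two strategies, the following Python sketch shows what delexicalization and transposition might look like at the token level. It is not the authors' implementation: the feature set (word shape, suffix, capped length), the function names, and the toy Alsatian-to-German lexicon entries are all assumptions made for this example, since the abstract does not specify the representations or resources actually used.

```python
def word_shape(token: str) -> str:
    """Collapse a token into a coarse shape, e.g. 'Strasbourg' -> 'Xx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch  # keep punctuation as-is
        # Merge runs of the same class so the shape ignores word length.
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def delexicalize(token: str, suffix_len: int = 3) -> dict:
    """Replace a word form with abstract features (hypothetical feature set)."""
    return {
        "shape": word_shape(token),
        "suffix": token[-suffix_len:].lower(),
        "length": min(len(token), 10),  # cap so long words share one bucket
    }


def transpose(tokens: list, lexicon: dict) -> list:
    """Rewrite target-language tokens as source-language tokens where a
    mapping is known; unknown tokens are left unchanged."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


# Toy lexicon, illustrative only (assumed Alsatian -> German pairs).
toy_lexicon = {"esch": "ist", "net": "nicht"}

features = [delexicalize(t) for t in ["Es", "esch", "Ditsch"]]
transposed = transpose(["Es", "esch", "net"], toy_lexicon)
```

In such a setup, a tagger trained on delexicalized source-language features could be applied directly to delexicalized target-language tokens, whereas transposition leaves the source model untouched and instead moves the target text closer to the source vocabulary.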


Part-of-speech tagging, Low-resource languages, Picard, Occitan, Alsatian



This work was supported by the French National Research Agency (ANR) under the RESTAURE project (ANR-14-CE24-0003-01). We also thank our colleagues from the RESTAURE project for their help in describing the languages and corpora.



Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. LIMSI, CNRS, Université Paris-Saclay, Orsay, France
  2. LIMSI, CNRS, ENSIIE, Université Paris-Saclay, Orsay, France
