Language Resources and Evaluation, Volume 51, Issue 4, pp 989–1017

Assisting non-expert speakers of under-resourced languages in assigning stems and inflectional paradigms to new word entries of morphological dictionaries

  • Miquel Esplà-Gomis
  • Rafael C. Carrasco
  • Víctor M. Sánchez-Cartagena
  • Mikel L. Forcada
  • Felipe Sánchez-Martínez
  • Juan Antonio Pérez-Ortiz
Original Paper


This paper presents a new method to help individuals with no background in linguistics create monolingual dictionaries such as those used by the morphological analysers of many natural language processing applications. The involvement of non-expert users is especially critical for under-resourced languages, which either lack a skilled workforce or cannot afford to recruit one. Adding a word to a morphological dictionary usually requires identifying its stem along with the inflection paradigm that generates all the word forms of the new entry. Our method works under the assumption that average speakers of a language can successfully answer the polar question “is x a valid form of the word w to be inserted?”, where x represents tentative alternative (inflected) forms of the new word w. The experiments show that, with a small number of polar questions, the correct stem and paradigm can be obtained from non-experts with high success rates. We also study how different heuristic and probabilistic approaches affect the actual number of questions.
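The idea of narrowing down stem and paradigm candidates through polar questions can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy suffix-only paradigms in `PARADIGMS`, the helper names `candidates`, `forms` and `elicit`, and the balanced-split rule used to pick the next question. It is not the heuristic or probabilistic strategy evaluated in the paper, only one simple way such a questioning loop could look.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical toy paradigms: name -> suffixes that are appended to a stem.
PARADIGMS: Dict[str, FrozenSet[str]] = {
    "regular_noun": frozenset({"", "s"}),               # cat, cats
    "noun_es": frozenset({"", "es"}),                   # box, boxes
    "regular_verb": frozenset({"", "s", "ed", "ing"}),  # walk, walks, walked, walking
    "verb_e": frozenset({"e", "es", "ed", "ing"}),      # bake, bakes, baked, baking
}

Candidate = Tuple[str, str]  # (stem, paradigm name)


def candidates(word: str) -> List[Candidate]:
    """Return every (stem, paradigm) pair that can generate `word`."""
    found = set()
    for name, suffixes in PARADIGMS.items():
        for suffix in suffixes:
            if word.endswith(suffix):
                stem = word[: len(word) - len(suffix)]
                if stem:  # ignore empty stems
                    found.add((stem, name))
    return sorted(found)


def forms(candidate: Candidate) -> FrozenSet[str]:
    """Return all word forms generated by a (stem, paradigm) pair."""
    stem, name = candidate
    return frozenset(stem + suffix for suffix in PARADIGMS[name])


def elicit(word: str, oracle: Callable[[str], bool]) -> Tuple[List[Candidate], int]:
    """Ask polar questions until the remaining candidates are indistinguishable.

    `oracle(form)` stands in for the speaker answering
    "is `form` a valid form of the word to be inserted?".
    """
    remaining = candidates(word)
    asked = 0
    while True:
        all_forms = set().union(*(forms(c) for c in remaining))
        # Keep only forms that some, but not all, remaining candidates generate.
        splitting = [
            f for f in sorted(all_forms)
            if sum(f in forms(c) for c in remaining) < len(remaining)
        ]
        if not splitting:
            return remaining, asked
        # Assumed heuristic: ask about the form with the most balanced split.
        best = min(
            splitting,
            key=lambda f: abs(2 * sum(f in forms(c) for c in remaining) - len(remaining)),
        )
        answer = oracle(best)
        asked += 1
        remaining = [c for c in remaining if (best in forms(c)) == answer]
```

Calling `elicit("bakes", oracle)` with an oracle that accepts only the forms of the verb "bake" leaves a single candidate, the stem "bak" with the toy paradigm `verb_e`. Each question removes at least one candidate, so the loop never asks more questions than there are initial candidates.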


Keywords: Enlargement of morphological dictionaries; Knowledge elicitation; Resource development for under-resourced languages; Machine translation



This work has been partially funded by the Spanish Ministry of Science & Innovation through project TIN2009-14009-C02-01, by the Spanish Ministry of Economy & Competitiveness through project TIN2012-32615, by the Generalitat Valenciana through grant ACIF/2010/174 from the VALi+d programme, and by the European Commission through project PIAP-GA-2012-324414 (Abu-MaTran).



Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Miquel Esplà-Gomis (1)
  • Rafael C. Carrasco (1)
  • Víctor M. Sánchez-Cartagena (1)
  • Mikel L. Forcada (1)
  • Felipe Sánchez-Martínez (1)
  • Juan Antonio Pérez-Ortiz (1)
  1. Dep. de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Alacant, Spain
