This paper presents a new method to help individuals with no background in linguistics create monolingual dictionaries such as those used by the morphological analysers of many natural language processing applications. The involvement of non-expert users is especially critical for under-resourced languages, which either lack a skilled workforce or cannot afford to recruit one. Adding a word to a morphological dictionary usually requires identifying its stem along with the inflection paradigm that can be used to generate all the word forms of the new entry. Our method works under the assumption that average speakers of a language can successfully answer the polar question “is x a valid form of the word w to be inserted?”, where x represents tentative alternative (inflected) forms of the new word w. The experiments show that with a small number of polar questions the correct stem and paradigm can be obtained from non-experts with high success rates. We study the impact of different heuristic and probabilistic approaches on the actual number of questions.
It could also occur that the word form is not completely unknown but is not analysed as it should be, because it is a homograph and has more than one possible morphological analysis.
In rule-based MT, the SL morphological dictionary contains mappings between SL word forms (also called surface forms), that is, words as they are found in texts, and SL lexical forms (comprising lemma, part-of-speech and inflection information). The same applies to the TL dictionary. Bilingual dictionaries contain mappings between SL lexical forms and TL lexical forms.
A Spanish morphological dictionary, for example, may contain around 500 inflection paradigms. Other morphologically rich languages may exceed this number.
Paradigms ease dictionary management by reducing the quantity of information that needs to be stored, and by simplifying revision and validation because of the explicit encoding of regularities in the dictionary.
Actually, for most European languages, Indo-European or not.
Note that the concept of suffix here obviously refers to the phonological order, regardless of writing direction (left-to-right or right-to-left).
Automatic acquisition of paradigms from monolingual corpora has already been explored (Monson 2009), but this task is out of the scope of this work.
As discussed in the introduction, although our approach has been evaluated with languages that generate word forms by adding suffixes to stems (most European languages), it could straightforwardly be adapted to languages that inflect by changing prefixes, and with a little more effort to languages that show nonconcatenative inflection.
We consider that a stem/paradigm combination \(c_n\) is compatible with a word form w if the stem of \(c_n\) concatenated to one of the suffixes in the paradigm results in the surface form w.
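The compatibility condition above can be illustrated with a minimal sketch. It assumes that a paradigm is simply a list of suffixes; the paradigm names and suffix lists below are hypothetical toy data, not taken from any real dictionary, and the function names are ours.

```python
from typing import Dict, List, Tuple

# Hypothetical toy paradigms (name -> suffixes); illustrative only.
PARADIGMS: Dict[str, List[str]] = {
    "noun_s": ["", "s"],
    "noun_es": ["", "es"],
    "verb_ar": ["ar", "o", "as", "a"],
}


def compatible_candidates(
    word: str, paradigms: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Return every (stem, paradigm) pair that can generate `word`.

    A pair is compatible when stem + suffix == word for at least one
    suffix of the paradigm. Empty stems are discarded.
    """
    candidates: List[Tuple[str, str]] = []
    for name, suffixes in paradigms.items():
        for suffix in suffixes:
            if word.endswith(suffix):
                stem = word[: len(word) - len(suffix)]
                if stem and (stem, name) not in candidates:
                    candidates.append((stem, name))
    return candidates


def expand(stem: str, name: str, paradigms: Dict[str, List[str]]) -> List[str]:
    """Generate all word forms of a stem/paradigm combination."""
    return [stem + suffix for suffix in paradigms[name]]


if __name__ == "__main__":
    for stem, name in compatible_candidates("casas", PARADIGMS):
        print(stem, name, expand(stem, name, PARADIGMS))
```

Each candidate returned for a word form can then be expanded into the tentative forms that a user would be asked to confirm or reject.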
Sánchez-Cartagena et al. (2012b) propose a model based on an n-gram language model of lexical categories and morphological inflection information to perform the disambiguation.
Note that the language pair is indicated here because Apertium has slightly different monolingual dictionaries depending on the particular MT system in which they are used. As expected, the evaluation only uses the monolingual dictionary.
This property is not guaranteed by the ID3 algorithm and it depends on the particular membership relation between the different word forms and the candidate stem/paradigm pairs. The tree becomes less balanced as the differences between the feasibility scores of the different candidates grow.
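To illustrate why balance depends on both the membership relation and the feasibility scores, the following sketch selects the word form to ask about using an ID3-style criterion: the form whose yes/no answer minimises the expected entropy over the remaining candidates. This is an assumption-laden illustration rather than the exact procedure evaluated here; the function names, the candidate labels and the representation of scores as a plain dictionary are ours.

```python
import math
from typing import Dict, Iterable, Set


def entropy(weights: Iterable[float]) -> float:
    """Shannon entropy (in bits) of the distribution obtained by normalising `weights`."""
    weights = [w for w in weights if w > 0]
    total = sum(weights)
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights)


def best_question(forms: Dict[str, Set[str]], scores: Dict[str, float]) -> str:
    """Pick the word form whose yes/no answer leaves the least expected entropy.

    forms:  candidate label -> set of word forms it generates (hypothetical input)
    scores: candidate label -> feasibility score (assumed non-negative)
    """
    total = sum(scores.values())
    all_forms = sorted(set().union(*forms.values()))
    if not all_forms or total <= 0:
        raise ValueError("no word forms or no positive scores to choose from")
    best_form, best_expected = all_forms[0], float("inf")
    for form in all_forms:
        yes = [scores[c] for c in forms if form in forms[c]]
        no = [scores[c] for c in forms if form not in forms[c]]
        expected = (sum(yes) / total) * entropy(yes) + (sum(no) / total) * entropy(no)
        if expected < best_expected:
            best_form, best_expected = form, expected
    return best_form
```

With uniform scores the chosen form tends to split the candidates evenly; when one candidate has a much higher score than the rest, the split that isolates it becomes more attractive, which is one way the resulting tree can lose balance.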
Recall that an entry is made of a stem and a paradigm. An entry can generate multiple word forms when it is expanded. Each word form also has morphological information attached, although the morphological information is not included in most of the examples presented in this paper for the sake of simplicity.
Parallel corpora were chosen, instead of monolingual ones, simply because they are already segmented into sentences.
This number was chosen since it corresponds to the size of the smallest corpus: that used for Spanish.
This number of iterations was optimal in our preliminary experiments for Spanish; it was therefore used for the remaining languages in order to ensure the same experimental conditions.
Preliminary experiments showed that this value of \(\Theta \) had the desirable effect of excluding the most infrequent inflected forms (such as unusual combinations of enclitic pronouns in Spanish).
Statistical significance tests were performed with the randomisation version of the paired sample t test described by Yeh (2000) using the software available at http://www.nlpado.de/~sebastian/software/sigf.shtml.
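For readers unfamiliar with this kind of test, the idea can be sketched as a sign-flipping randomisation test on paired differences. This is a generic illustration and not the sigf software itself; the function name, the default number of trials and the fixed seed are assumptions made for the example.

```python
import random
from typing import Sequence


def paired_randomisation_test(
    a: Sequence[float], b: Sequence[float], trials: int = 10000, seed: int = 0
) -> float:
    """Two-sided p-value for the difference between paired samples `a` and `b`.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each paired difference can be flipped at random. The p-value is
    the proportion of random sign assignments whose absolute summed
    difference is at least as large as the observed one.
    """
    if len(a) != len(b):
        raise ValueError("samples must be paired (same length)")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            hits += 1
    # Add-one smoothing so that the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

The add-one correction in the last line is a common convention for randomisation tests; other implementations may report the raw proportion instead.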
In this analysis, we have chosen Spanish as a representative of the four languages evaluated and omitted the discussion for the remainder of them as the results are consistent across the different languages.
Revision 33,900 in the repository https://svn.code.sf.net/p/apertium/svn/trunk/apertium-es-ca.
Repeated paradigms are excluded; for instance, if the first two entries with the highest frequency belong to the same paradigm, the second one is replaced with another paradigm.
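The exclusion of repeated paradigms can be sketched as follows. The representation of entries as (stem, paradigm, frequency) triples and the function name are assumptions made for illustration; the example data are invented.

```python
from typing import List, Tuple

Entry = Tuple[str, str, int]  # (stem, paradigm name, frequency); hypothetical layout


def top_entries_distinct_paradigms(entries: List[Entry], k: int) -> List[Tuple[str, str]]:
    """Return up to `k` most frequent entries, keeping at most one entry per paradigm."""
    chosen: List[Tuple[str, str]] = []
    seen = set()
    for stem, paradigm, _freq in sorted(entries, key=lambda e: e[2], reverse=True):
        if len(chosen) >= k:
            break
        if paradigm in seen:
            # A more frequent entry already represents this paradigm: skip it
            # and move on to the next most frequent entry.
            continue
        seen.add(paradigm)
        chosen.append((stem, paradigm))
    return chosen


if __name__ == "__main__":
    toy = [("cas", "p1", 50), ("mes", "p1", 40), ("cant", "p2", 30), ("viv", "p3", 10)]
    print(top_entries_distinct_paradigms(toy, 2))
```

In the toy data the second most frequent entry shares its paradigm with the first, so it is skipped and the third entry is selected instead.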
These results are compatible with those obtained in a previous evaluation (Esplà-Gomis et al. 2011), in which the test set was obtained by randomly picking a pair of words from 166 different paradigms; in that case, 10 non-expert human evaluators took part, and annotator agreement metrics were not computed. In those experiments, the average precision and recall were slightly below 90 %, although the recall of the non-interactive baseline was much lower.
Ahlberg, M., Forsberg, M., & Hulden, M. (2014). Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th conference of the European chapter of the association for computational linguistics, Gothenburg, Sweden, pp. 569–578.
Ambati, V., Vogel, S., & Carbonell, J. (2010). Active learning and crowd-sourcing for machine translation. In Proceedings of the 7th international conference on language resources and evaluation, Valletta, Malta, LREC’10, pp. 2169–2174.
Bartusková, D., & Sedlácek, R. (2002). Tools for semi-automatic assignment of Czech nouns to declination patterns. In Proceedings of the 5th international conference on text, speech and dialogue, Brno, Czech Republic, pp. 159–164.
Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities, 3, 1–8.
Belz, A. (2000). Multi-syllable phonotactic modelling. In Finite-state phonology: Proceedings of the 5th workshop of the ACL special interest group in computational phonology, Luxembourg, pp. 569–578.
Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., et al. (2013). Findings of the 2013 workshop on statistical machine translation. In Proceedings of the 8th workshop on statistical machine translation, pp. 1–44.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A practical part-of-speech tagger. In Proceedings of the 3rd conference on applied natural language processing, pp. 133–140.
Desai, S., Pawar, J., & Bhattacharyya, P. (2012). Automated paradigm selection for FSA based Konkani verb morphological analyzer. In 24th International conference on computational linguistics: Demonstration papers, pp. 103–110.
Détrez, G., & Ranta, A. (2012). Smart paradigms and the predictability and complexity of inflectional morphology. In Proceedings of EACL, pp. 645–653.
Esplà-Gomis, M., Sánchez-Cartagena, V. M., & Pérez-Ortiz, J. A. (2011). Enlarging monolingual dictionaries for machine translation with active learning and non-expert users. In Proceedings of the international conference recent advances in natural language processing 2011, Hissar, Bulgaria, pp. 339–346.
Esplà-Gomis, M., Sánchez-Cartagena, V. M., Sánchez-Martínez, F., Carrasco, R. C., Forcada, M. L., & Pérez-Ortiz, J. A. (2014). An efficient method to assist non-expert users in extending dictionaries by assigning stems and inflectional paradigms to unknown words. In Proceedings of the 17th annual conference of the European association for machine translation, pp. 19–26.
Font-Llitjós, A. (2007). Automatic improvement of machine translation systems. Ph.D. thesis, Carnegie Mellon University.
Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O’Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G., & Tyers, F. M. (2011). Apertium: A free/open-source platform for rule-based machine translation. Machine Translation, 25(2), 127–144. (Special issue: Free/open-source machine translation).
Haffari, G., Roy, M., & Sarkar, A. (2009). Active learning for statistical phrase-based machine translation. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics, Boulder, Colorado, NAACL ’09, pp. 415–423.
Hutchins, W. J., & Somers, H. L. (1992). An introduction to machine translation (Vol. 362). New York: Academic Press.
Kilbury, J., Naerger, P., & Renz, I. (1992). New lexical entries for unknown words. Theorie des Lexikons: Arbeiten des Sonderforschungsbereichs, 282(29).
Koehn, P. (2010). Statistical machine translation. Cambridge: Cambridge University Press.
McCreight, E. M. (1976). A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2), 262–272.
McShane, M., Nirenburg, S., Cowie, J., & Zacharski, R. (2002). Embedding knowledge elicitation and MT systems within a single architecture. Machine Translation, 17, 271–305.
Monson, C. (2009). ParaMor: From paradigm structure to natural language morphology induction. Ph.D. thesis, Carnegie Mellon University.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the Institute of Electrical and Electronics Engineers (IEEE), 77(2), 257–286.
Rehm, G., & Uszkoreit, H. (2013). META-NET strategic research agenda for multilingual Europe 2020. Berlin, Heidelberg: Springer.
Sánchez-Cartagena, V. M., Esplà-Gomis, M., & Pérez-Ortiz, J. A. (2012a). Source-language dictionaries help non-expert users to enlarge target-language dictionaries for machine translation. In Proceedings of the 8th international conference on language resources and evaluation, Istanbul, Turkey, LREC’12, pp. 3422–3429.
Sánchez-Cartagena, V. M., Esplà-Gomis, M., Sánchez-Martínez, F., & Pérez-Ortiz, J. A. (2012b). Choosing the correct paradigm for unknown words in rule-based machine translation systems. In Proceedings of the 3rd international workshop on free/open-source rule-based machine translation, Gothenburg, Sweden, pp. 27–39.
Šnajder, J. (2013). Models for predicting the inflectional paradigm of Croatian words. Slovenščina 20: Empirical, Applied and Interdisciplinary Research, 1(2), 1–34.
Steinberger, R., Ebrahim, M., Poulis, A., Carrasco-Benitez, M., Schlüter, P., Przybyszewski, M., et al. (2014). An overview of the European Union’s highly multilingual parallel corpora. Language Resources and Evaluation, 48(4), 679–707.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th international conference on language resources and evaluation, Istanbul, Turkey, LREC’12, pp. 2214–2218.
Wang, A., Hoang, C., & Kan, M. (2013). Perspectives on crowdsourcing annotations for natural language processing. Language Resources and Evaluation, 47(1), 9–31.
Yeh, A. (2000). More accurate tests for the statistical significance of result differences. In Proceedings of the 18th conference on computational linguistics—volume 2, Stroudsburg, USA, COLING ’00, pp. 947–953.
This work has been partially funded by the Spanish Ministry of Science & Innovation through project TIN2009-14009-C02-01, by the Spanish Ministry of Economy & Competitiveness through project TIN2012-32615, by the Generalitat Valenciana through grant ACIF/2010/174 from the VALi+d programme, and by the European Commission through project PIAP-GA-2012-324414 (Abu-MaTran).
About this article
Cite this article
Esplà-Gomis, M., Carrasco, R.C., Sánchez-Cartagena, V.M. et al. Assisting non-expert speakers of under-resourced languages in assigning stems and inflectional paradigms to new word entries of morphological dictionaries. Lang Resources & Evaluation 51, 989–1017 (2017). https://doi.org/10.1007/s10579-016-9360-9
- Enlargement of morphological dictionaries
- Knowledge elicitation
- Resource development for under-resourced languages
- Machine translation