Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages

  • Project Notes
  • Language Resources and Evaluation

Abstract

Most efforts at automatically creating multilingual lexicons require input lexical resources with rich content (e.g. semantic networks, domain codes, semantic categories) or large corpora. Such material is often unavailable and difficult to construct for under-resourced languages. In some cases, particularly for some ethnic languages, even unannotated corpora are still in the process of collection. We show how multilingual lexicons with under-resourced languages can be constructed using simple bilingual translation lists, which are more readily available. The prototype multilingual lexicon developed comprises six member languages: English, Malay, Chinese, French, Thai and Iban, the last of which is an under-resourced language in Borneo. Quick evaluations showed that 91.2% of 500 random multilingual entries in the generated lexicon require minimal or no human correction.

Notes

  1. Based on https://en.wiktionary.org/wiki/Wiktionary:Statistics (June 2013).

  2. http://www.casta-net.jp/~kuribayashi/multi/.

  3. http://lcl.uniroma1.it/babelnet/.

  4. http://packages.debian.org/sid/text/dict-xdict.

  5. http://cc-cedict.org/wiki/start.

  6. http://www-clips.imag.fr/cgi-bin/geta/fem/fem.pl.

  7. https://github.com/veer66/Yaitron.

  8. According to https://en.wiktionary.org/wiki/Wiktionary:Statistics, Wiktionary contains only 39 Iban entries at the time of writing (June 2013).

  9. Due to limitations of the evaluator’s linguistic capabilities, only the English, Chinese and Malay members of each translation set are considered.

  10. Five evaluators for each language pair.

Acknowledgments

The authors would like to thank volunteers who took part in evaluating the OTIC filtering results on Malay–Chinese and Iban–Malay. We also thank the three anonymous reviewers for their extremely useful comments in improving this paper.

Author information

Correspondence to Lian Tze Lim.

About this article

Cite this article

Lim, L.T., Soon, LK., Lim, T.Y. et al. Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages. Lang Resources & Evaluation 48, 479–492 (2014). https://doi.org/10.1007/s10579-013-9253-0

  • DOI: https://doi.org/10.1007/s10579-013-9253-0
