Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English

Hamed, Injy; Zhu, Moritz; Elmahdy, Mohamed; Abdennadher, Slim; Vu, Ngoc Thang

doi:10.1007/978-3-030-26061-3_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Code-switching (CS) is a widespread phenomenon in bilingual and multilingual societies. The lack of CS resources hinders the performance of many NLP tasks. In this work, we explore the potential use of bilingual word embeddings for CS language modeling (LM) in the low-resource Egyptian Arabic-English language pair. We evaluate different state-of-the-art bilingual word embedding approaches that require cross-lingual resources at different levels, and we propose an innovative but simple approach that jointly learns bilingual word representations without any parallel data, relying only on monolingual data and a small amount of CS data. While all representations improve CS LM, ours performs best, improving perplexity by 33.5% relative to the baseline.
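
As a reading aid for the perplexity figure above, the following minimal sketch shows how perplexity and a relative improvement over a baseline are commonly computed. It is not the authors' evaluation code: the function names `perplexity` and `relative_improvement` are hypothetical, and the token probabilities are made-up values chosen for illustration, not results from the paper.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one token")
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_improvement(baseline_ppl, model_ppl):
    """Relative perplexity reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl


# Hypothetical per-token probabilities (assumption, for illustration only).
baseline_ppl = perplexity([math.log(0.01)] * 4)  # every token gets p = 0.01
model_ppl = perplexity([math.log(0.02)] * 4)     # every token gets p = 0.02

print(round(baseline_ppl, 2), round(model_ppl, 2))
print(round(relative_improvement(baseline_ppl, model_ppl), 1))
```

Under this definition, a 33.5% relative improvement means the model's perplexity is 66.5% of the baseline's perplexity on the same test data.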

Notes

1. The bilingual word embeddings and the compiled Egyptian Arabic-English dictionary and thesaurus can be obtained by contacting the authors.
2. http://eg.lisaanmasry.com/intro_en/index.html.
3. https://github.com/senarvi/theanolm.

Author information

Authors and Affiliations

Computer Science Department, The German University in Cairo, Cairo, Egypt
Injy Hamed & Slim Abdennadher
Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany
Moritz Zhu & Ngoc Thang Vu
Data Science Department, Raisa Energy LLC, Cairo, Egypt
Mohamed Elmahdy

Corresponding author

Correspondence to Injy Hamed.

Editor information

Editors and Affiliations

Utrecht University, Utrecht, The Netherlands
Albert Ali Salah
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Hamed, I., Zhu, M., Elmahdy, M., Abdennadher, S., Vu, N.T. (2019). Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_17

DOI: https://doi.org/10.1007/978-3-030-26061-3_17
Published: 24 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3
eBook Packages: Computer Science, Computer Science (R0)
