Abstract
Code-switching (CS) is a widespread phenomenon in bilingual and multilingual societies. The lack of CS resources hinders the performance of many NLP tasks. In this work, we explore the potential use of bilingual word embeddings for CS language modeling (LM) for the low-resource Egyptian Arabic-English language pair. We evaluate several state-of-the-art bilingual word embedding approaches that require cross-lingual resources at different levels, and we propose a simple approach that jointly learns bilingual word representations without any parallel data, relying only on monolingual data and a small amount of CS data. While all representations improve CS LM, ours performs best, reducing perplexity by 33.5% relative to the baseline.
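For readers unfamiliar with the metric, the following minimal sketch shows how perplexity and a relative perplexity reduction are conventionally computed. The function names and the numeric values are hypothetical and chosen only for illustration; they are not the paper's results or code.

```python
import math


def perplexity(log_probs):
    """Perplexity: exp of the mean negative natural-log probability per token."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(baseline_ppl, model_ppl):
    """Relative perplexity reduction over the baseline, in percent."""
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl


# Hypothetical perplexities for illustration only; not taken from the paper.
baseline_ppl = 200.0
model_ppl = 133.0
print(f"{relative_reduction(baseline_ppl, model_ppl):.1f}% relative reduction")
```

Under this convention, a lower perplexity is better, so a positive relative reduction means the model assigns higher probability to the held-out text than the baseline does.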
Notes
1. The bilingual word embeddings and the compiled Egyptian Arabic-English dictionary and thesaurus can be obtained by contacting the authors.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hamed, I., Zhu, M., Elmahdy, M., Abdennadher, S., Vu, N.T. (2019). Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol. 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_17
DOI: https://doi.org/10.1007/978-3-030-26061-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3
eBook Packages: Computer Science, Computer Science (R0)