Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English

  • Conference paper
  • In: Speech and Computer (SPECOM 2019)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Abstract

Code-switching (CS) is a widespread phenomenon in bilingual and multilingual societies. The lack of CS resources hinders the performance of many NLP tasks. In this work, we explore the potential use of bilingual word embeddings for CS language modeling (LM) for the low-resource Egyptian Arabic-English language pair. We evaluate different state-of-the-art bilingual word embedding approaches that require cross-lingual resources at different levels, and we propose an innovative but simple approach that jointly learns bilingual word representations without any parallel data, relying only on monolingual data and a small amount of CS data. While all representations improve CS LM, ours performs best, improving perplexity by 33.5% relative to the baseline.
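To make the reported metric concrete, the following is a minimal sketch of how perplexity and a relative perplexity improvement are commonly computed. It is not the authors' evaluation code. The function names `perplexity` and `relative_reduction` are hypothetical, and the two perplexity values in the example are placeholders chosen only so that the arithmetic yields 33.5; they are not the paper's measurements.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp of the negative mean log-probability.
    """
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(baseline_ppl, model_ppl):
    """Relative perplexity reduction over the baseline, in percent."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl


# A model that assigns probability 0.25 to every token has perplexity
# of about 4, i.e. it is as uncertain as a uniform choice among 4 words.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Placeholder perplexities (assumed, NOT taken from the paper).
baseline_ppl = 200.0
improved_ppl = 133.0
reduction = relative_reduction(baseline_ppl, improved_ppl)
print(f"relative reduction: {reduction:.1f}%")
```

Under this definition, a relative improvement says nothing about the absolute perplexities involved: the same percentage can result from very different baseline values.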


Notes

  1. The bilingual word embeddings and the compiled Egyptian Arabic-English dictionary and thesaurus can be obtained by contacting the authors.

  2. http://eg.lisaanmasry.com/intro_en/index.html

  3. https://github.com/senarvi/theanolm



Corresponding author

Correspondence to Injy Hamed.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Hamed, I., Zhu, M., Elmahdy, M., Abdennadher, S., Vu, N.T. (2019). Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_17

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science (R0)
