Abstract
In this paper, we address the problem of Language Identification (LID) of user-generated content in Social Media Communication (SMC). Existing LID solutions are very accurate on standard languages and conventional texts. However, for non-standard language use such as SMC, comparable accuracy has not yet been reached. To help resolve this problem, we present a language-independent LID solution for non-standard language use that combines linguistic tools (morphological analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non-standard languages present in SMC texts. Our solution also handles the Code Switching phenomenon between standard languages and dialects, the normalization of SMC-specific expressions and dialect, and finally the spelling correction of out-of-vocabulary (OOV) words.
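To make the hybrid idea concrete, here is a minimal sketch of token-level LID that tries a linguistic resource first and falls back to a statistical model. It is not the authors' implementation: a plain word set stands in for each morphological analyzer, the statistical part is a simple add-one smoothed bag of character trigrams, and all names (`char_ngrams`, `CharNgramLM`, `identify_tokens`) and the toy English/French word lists are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class CharNgramLM:
    """Add-one smoothed bag-of-character-n-grams model for one language."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        """Average log-probability of the word's n-grams under this model."""
        grams = char_ngrams(word, self.n)
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def identify_tokens(tokens, lexicons, models):
    """Label each token with a language code.

    A token found in exactly one lexicon (the stand-in for a morphological
    analyzer) takes that language. Ambiguous or unknown tokens are scored
    by the character n-gram models instead.
    """
    labels = []
    for tok in tokens:
        hits = [lang for lang, lex in lexicons.items() if tok.lower() in lex]
        if len(hits) == 1:
            labels.append(hits[0])
        else:
            candidates = hits or list(models)
            labels.append(max(candidates, key=lambda lang: models[lang].log_prob(tok)))
    return labels


# Toy data (assumption): tiny word lists used both as lexicons and as
# training material for the n-gram models.
lexicons = {
    "en": {"the", "weather", "is", "nice", "today"},
    "fr": {"le", "temps", "est", "beau", "aujourd'hui"},
}
models = {lang: CharNgramLM(words) for lang, words in lexicons.items()}

print(identify_tokens(["the", "temps", "is", "beau"], lexicons, models))
print(identify_tokens(["weathr"], lexicons, models))  # misspelled, out of lexicon
```

Labelling per token rather than per message is what lets a sketch like this cope with code-switched text, and the fallback path is where a real system would plug in normalization and spelling correction for OOV words.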
Notes
- 1. Common words: conjunctions, prepositions, determiners, etc.
- 2. N-gram: sequence of characters or words.
- 3. Token: lexical unit.
- 4. RNN: Recurrent Neural Network.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zarnoufi, R., Jaafar, H., Abik, M. (2019). Language Identification for User Generated Content in Social Media. In: Rocha, Á., Serrhini, M. (eds) Information Systems and Technologies to Support Learning. EMENA-ISTL 2018. Smart Innovation, Systems and Technologies, vol 111. Springer, Cham. https://doi.org/10.1007/978-3-030-03577-8_73
DOI: https://doi.org/10.1007/978-3-030-03577-8_73
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03576-1
Online ISBN: 978-3-030-03577-8
eBook Packages: Intelligent Technologies and Robotics (R0)