Language Identification for User Generated Content in Social Media

Zarnoufi, Randa; Jaafar, Hamid; Abik, Mounia

doi:10.1007/978-3-030-03577-8_73

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 111)

Included in the following conference series:

International Conference Europe Middle East & North Africa Information Systems and Technologies to Support Learning

Abstract

In this paper we address the problem of Language Identification (LID) of user-generated content in Social Media Communication (SMC). Existing LID solutions are very accurate on standard languages and well-formed texts. However, for non-standard language use (i.e. SMC), comparable accuracy is still out of reach. To help resolve this problem, we present a language-independent LID solution for non-standard language use, in which we combine linguistic tools (morphological analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non-standard languages present in SMC texts. Our solution also handles the code-switching phenomenon between standard languages and dialect, the normalization of SMC-specific expressions and dialect, and finally the spelling correction of out-of-vocabulary (OOV) words.
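
The hybrid idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain word-list lookup stands in for the morphological analyzer, a smoothed character trigram frequency model stands in for the language model, and the toy word lists and the names `LEXICONS`, `CharNgramModel` and `identify` are assumptions made up for this example. Each token is labelled by the lexicon when exactly one language recognizes it, and by the statistical model otherwise.

```python
import math
import re
from collections import Counter

# Toy word lists (hypothetical data); a real system would use morphological analyzers.
LEXICONS = {
    "en": {"the", "house", "is", "very", "beautiful", "and", "with", "weather"},
    "fr": {"la", "maison", "est", "très", "belle", "et", "avec", "temps"},
}


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with start/end markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one smoothed frequency model over character n-grams."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom)
                   for g in char_ngrams(word, self.n))


# One statistical model per language, trained here on the toy lexicons only.
MODELS = {lang: CharNgramModel(words) for lang, words in LEXICONS.items()}


def identify(text, lexicons=LEXICONS, models=MODELS):
    """Label each token with a language: lexicon first, n-gram model as fallback."""
    labels = []
    for token in re.findall(r"\w+", text.lower()):
        candidates = [lang for lang, lex in lexicons.items() if token in lex]
        if len(candidates) == 1:
            labels.append((token, candidates[0]))
            continue
        # Unknown or ambiguous token: let the statistical models decide.
        pool = candidates or list(models)
        best = max(pool, key=lambda lang: models[lang].log_prob(token))
        labels.append((token, best))
    return labels


if __name__ == "__main__":
    print(identify("the maison belles weathers"))
```

Labelling token by token rather than per message is what lets a sketch like this cope with code-switched text, where a single message mixes languages. Normalization and spelling correction, which the abstract also mentions, are left out here.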

Notes

1. Common words: conjunctions, prepositions, determiners, etc.
2. N-gram: sequence of characters or words.
3. Token: lexical unit.
4. RNN: Recurrent Neural Network.
5. http://people.eng.unimelb.edu.au/tbaldwin/etc./emnlp2012-lexnorm.tgz.
6. https://norvig.com/spell-correct.html.
7. https://en.wikipedia.org/wiki/Soundex#cite_note-8.

Author information

Authors and Affiliations

IPSS Research Team, FSR, Mohammed V University, Rabat, Morocco
Randa Zarnoufi
Polydisciplinary Faculty of Safi, Cadi Ayyad University, Safi, Morocco
Hamid Jaafar
IPSS Research Team, ENSIAS, Mohammed V University, Rabat, Morocco
Mounia Abik

Corresponding author

Correspondence to Randa Zarnoufi.

Editor information

Editors and Affiliations

Departamento de Engenharia Informática, Faculdade de Ciências e Tecnologia, Universidade de Coimbra, Coimbra, Portugal
Álvaro Rocha
Departement informatique, Faculté des Sciences, Université Mohammed Ier, Oujda, Morocco
Mohammed Serrhini

Cite this paper

Zarnoufi, R., Jaafar, H., Abik, M. (2019). Language Identification for User Generated Content in Social Media. In: Rocha, Á., Serrhini, M. (eds) Information Systems and Technologies to Support Learning. EMENA-ISTL 2018. Smart Innovation, Systems and Technologies, vol 111. Springer, Cham. https://doi.org/10.1007/978-3-030-03577-8_73

DOI: https://doi.org/10.1007/978-3-030-03577-8_73
Published: 25 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03576-1
Online ISBN: 978-3-030-03577-8
eBook Packages: Intelligent Technologies and Robotics (R0)
