Language Identification for User Generated Content in Social Media

  • Conference paper
Information Systems and Technologies to Support Learning (EMENA-ISTL 2018)

Abstract

In this paper we address the problem of language identification (LID) for user-generated content in social media communication (SMC). Existing LID solutions are very accurate on standard languages and well-formed text. For non-standard language use (i.e., SMC), however, comparable accuracy has not yet been reached. To help resolve this problem, we present a language-independent LID solution for non-standard language use that combines linguistic tools (morphological analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non-standard languages present in SMC texts. Our solution also handles code switching between standard languages and dialects, the normalization of SMC-specific expressions and dialect, and the spelling correction of out-of-vocabulary (OOV) words.
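To make the hybrid idea concrete, the following is a minimal sketch of token-level identification, not the authors' implementation. It assumes that a plain word-list lookup stands in for the morphological analyzer and that a smoothed bag of character trigrams stands in for the language model. The names `char_ngrams`, `CharNgramModel`, `identify`, `LEXICONS`, and `MODELS`, as well as the toy English and French word lists, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class CharNgramModel:
    """Add-one smoothed bag of character n-grams for one language (assumed stand-in for a language model)."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        """Average smoothed log-probability of the word's n-grams."""
        grams = char_ngrams(word, self.n)
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def identify(tokens, lexicons, models):
    """Label each token: lexicon lookup first, statistical fallback otherwise."""
    labels = []
    for tok in tokens:
        # Linguistic step (assumption: a word list replaces the morphological analyzer).
        hits = [lang for lang, lex in lexicons.items() if tok.lower() in lex]
        if len(hits) == 1:
            labels.append(hits[0])
        else:
            # Statistical step: choose among ambiguous hits, or all languages for OOV tokens.
            candidates = hits or list(models)
            labels.append(max(candidates, key=lambda lang: models[lang].log_prob(tok)))
    return labels


# Hypothetical toy resources; a real system would use full lexicons and corpora.
LEXICONS = {
    "en": {"the", "house", "is", "very", "nice", "this"},
    "fr": {"la", "maison", "est", "très", "belle", "cette"},
}
MODELS = {lang: CharNgramModel(words) for lang, words in LEXICONS.items()}

# A code-switched message whose tokens are all covered by the toy lexicons.
print(identify(["this", "maison", "est", "nice"], LEXICONS, MODELS))
# prints ['en', 'fr', 'fr', 'en']
```

In this sketch, tokens found in exactly one lexicon are labeled directly, and only ambiguous or out-of-vocabulary tokens fall back to the trigram scores. Normalization and spelling correction, which the abstract also mentions, are left out.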

Notes

  1. Common words: conjunctions, prepositions, determiners, etc.

  2. N-gram: a sequence of characters or words.

  3. Token: a lexical unit.

  4. RNN: Recurrent Neural Network.

  5. http://people.eng.unimelb.edu.au/tbaldwin/etc./emnlp2012-lexnorm.tgz.

  6. https://norvig.com/spell-correct.html.

  7. https://en.wikipedia.org/wiki/Soundex#cite_note-8.

Author information

Corresponding author

Correspondence to Randa Zarnoufi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zarnoufi, R., Jaafar, H., Abik, M. (2019). Language Identification for User Generated Content in Social Media. In: Rocha, Á., Serrhini, M. (eds) Information Systems and Technologies to Support Learning. EMENA-ISTL 2018. Smart Innovation, Systems and Technologies, vol 111. Springer, Cham. https://doi.org/10.1007/978-3-030-03577-8_73
