Normalization of Vietnamese Tweets on Twitter

  • Conference paper
Intelligent Data Analysis and Applications

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 370)

Abstract

We study the task of normalizing noisy text, focusing on Vietnamese tweets. Normalization aims to improve the performance of applications that mine or analyze the semantics of social media content, as well as other social network analysis applications. Because tweets are noisy, irregular, and short, and contain acronyms and spelling errors, processing them is more challenging than processing news or other formal text. In this paper, we propose a method that normalizes Vietnamese tweets by detecting non-standard words and spelling errors and then correcting them. The method combines a language model with dictionaries and Vietnamese vocabulary structures. We build a dataset of 1,360 Vietnamese tweets to evaluate the proposed method. Experimental results show that the method achieves encouraging performance, with an F1-score of 89%.
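The abstract gives no implementation details, so the following is only a minimal sketch of how a dictionary lookup and a language model might be combined for this kind of normalization. It is not the authors' method. The word lists, the abbreviation entries, the counts, and the function names `bigram_logprob` and `normalize` are all hypothetical and chosen for illustration; a real system would use full dictionaries and an n-gram model trained on a large corpus.

```python
from math import log

# Hypothetical toy resources (assumptions, not taken from the paper).
VOCAB = {"tôi", "không", "biết", "được", "chào", "anh", "à"}
# Non-standard token -> candidate standard forms (illustrative entries only).
CANDIDATES = {"ko": ["không"], "bit": ["biết"], "dc": ["được"], "a": ["anh", "à"]}
# Toy counts standing in for a trained bigram language model.
UNIGRAM_COUNTS = {"tôi": 6, "không": 9, "chào": 4}
BIGRAM_COUNTS = {("tôi", "không"): 5, ("không", "biết"): 4, ("chào", "anh"): 3}


def bigram_logprob(prev, word):
    """Add-one smoothed bigram log-probability computed from the toy counts."""
    numerator = BIGRAM_COUNTS.get((prev, word), 0) + 1
    denominator = UNIGRAM_COUNTS.get(prev, 0) + len(VOCAB)
    return log(numerator / denominator)


def normalize(tokens):
    """Replace known non-standard tokens with the candidate the bigram model prefers."""
    output = []
    for token in tokens:
        word = token.lower()
        if word in VOCAB or word not in CANDIDATES:
            # In-vocabulary, or no candidate is known: keep the token as it is.
            output.append(word)
            continue
        prev = output[-1] if output else "<s>"
        # Pick the candidate with the highest score given the previous word.
        best = max(CANDIDATES[word], key=lambda cand: bigram_logprob(prev, cand))
        output.append(best)
    return output


print(normalize(["tôi", "ko", "bit"]))
print(normalize(["chào", "a"]))
```

In this sketch the dictionary handles detection (a token is treated as non-standard when it is out of vocabulary and has known candidates), while the language model is used only to choose among several candidates, as with the ambiguous toy entry `a`.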

Notes

  1. https://blog.twitter.com/2011/numbers

  2. http://vlsp.vietlp.org:8080/demo/

  3. http://www.speech.sri.com/projects/srilm/

Acknowledgments

This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, and by Project SP2015/146 “Parallel processing of Big data 2” of the Student Grant System, VŠB – Technical University of Ostrava.

Author information

Corresponding author

Correspondence to Vaclav Snasel.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Nguyen, V.H., Nguyen, H.T., Snasel, V. (2015). Normalization of Vietnamese Tweets on Twitter. In: Abraham, A., Jiang, X., Snášel, V., Pan, JS. (eds) Intelligent Data Analysis and Applications. Advances in Intelligent Systems and Computing, vol 370. Springer, Cham. https://doi.org/10.1007/978-3-319-21206-7_16

  • DOI: https://doi.org/10.1007/978-3-319-21206-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21205-0

  • Online ISBN: 978-3-319-21206-7

  • eBook Packages: Engineering, Engineering (R0)
