Normalization of Vietnamese Tweets on Twitter

Nguyen, Vu H.; Nguyen, Hien T.; Snasel, Vaclav

doi:10.1007/978-3-319-21206-7_16

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 370)

Abstract

We study the task of normalizing noisy text, focusing on Vietnamese tweets. The task aims to improve the performance of applications that mine or analyze the semantics of social media content, as well as other social network analysis applications. Because tweets are noisy, irregular, and short, and contain acronyms and spelling errors, processing them is more challenging than processing news or other formal text. In this paper, we propose a method that normalizes Vietnamese tweets by detecting non-standard words and spelling errors and correcting them. The method combines a language model with dictionaries and Vietnamese vocabulary structures. We build a dataset of 1,360 Vietnamese tweets to evaluate the proposed method. Experimental results show that the method achieves an encouraging F1 score of 89%.
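
The general idea in the abstract (flag tokens that are missing from a dictionary, then pick a correction with a language model) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the syllable list, the abbreviation table, the three-sentence corpus, the add-one-smoothed bigram model, and the edit-distance threshold are all hypothetical stand-ins chosen for illustration, and the Vietnamese vocabulary structures mentioned in the abstract are not modeled here.

```python
import unicodedata
from collections import Counter


def nfc(text):
    # Vietnamese diacritics can be encoded in several ways; use one canonical form.
    return unicodedata.normalize("NFC", text)


# Hypothetical toy resources (assumptions, not the paper's dictionaries or corpus).
SYLLABLES = {nfc(s) for s in ["không", "biết", "được", "tôi", "đi", "học"]}
ABBREVIATIONS = {"ko": nfc("không"), "dc": nfc("được"), "bit": nfc("biết")}
CORPUS = [nfc(s) for s in ["tôi không biết", "tôi không đi học", "tôi được đi học"]]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with a sentence-start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, prev_row[j - 1] + (ca != cb)))
        prev_row = row
    return prev_row[-1]


def candidates(token, max_dist=2):
    """Abbreviation lookup first, then dictionary syllables within max_dist edits."""
    if token in ABBREVIATIONS:
        return [ABBREVIATIONS[token]]
    return sorted(s for s in SYLLABLES if edit_distance(token, s) <= max_dist)


def normalize(tweet):
    """Replace out-of-dictionary tokens with the candidate the bigram model prefers."""
    unigrams, bigrams = train_bigrams(CORPUS)
    vocab_size = len(SYLLABLES) + 1  # +1 for the sentence-start marker
    output, prev = [], "<s>"
    for token in nfc(tweet.lower()).split():
        best = token
        if token not in SYLLABLES:
            options = candidates(token)
            if options:
                best = max(
                    options,
                    key=lambda c: bigram_prob(prev, c, unigrams, bigrams, vocab_size),
                )
        output.append(best)
        prev = best
    return " ".join(output)


if __name__ == "__main__":
    print(normalize("tôi ko đii hok"))
```

In this sketch the left context alone decides between competing candidates; a fuller system would likely also weigh the right context and candidate similarity, which are omitted here for brevity.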

Acknowledgments

This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, and by Project SP2015/146 “Parallel processing of Big data” 2 of the Student Grant System, VŠB-Technical University of Ostrava.

Author information

Authors and Affiliations

Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Vu H. Nguyen & Hien T. Nguyen
Faculty of Electrical Engineering and Computer Science, Department of Computer Science and IT4Innovations, VŠB-Technical University of Ostrava, Ostrava, Czech Republic
Vaclav Snasel

Corresponding author

Correspondence to Vaclav Snasel.

Editor information

Editors and Affiliations

Machine Intelligence Research Labs (MIR Labs), Auburn, Washington, USA
Ajith Abraham
Fujian University of Technology, Fujian, China
Xin Hua Jiang
Department of Computer Science, VSB-Technical University of Ostrava, Ostrava-Poruba, Czech Republic
Václav Snášel
Fujian University of Technology, Fujian, China
Jeng-Shyang Pan

About this paper

Cite this paper

Nguyen, V.H., Nguyen, H.T., Snasel, V. (2015). Normalization of Vietnamese Tweets on Twitter. In: Abraham, A., Jiang, X., Snášel, V., Pan, JS. (eds) Intelligent Data Analysis and Applications. Advances in Intelligent Systems and Computing, vol 370. Springer, Cham. https://doi.org/10.1007/978-3-319-21206-7_16

DOI: https://doi.org/10.1007/978-3-319-21206-7_16
Published: 26 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21205-0
Online ISBN: 978-3-319-21206-7
eBook Packages: Engineering, Engineering (R0)
