Abstract
We study the task of noisy text normalization, focusing on Vietnamese tweets. The task aims to improve the performance of applications that mine or analyze the semantics of social media content, as well as other social network analysis applications. Because tweets are noisy, irregular, short, and full of acronyms and spelling errors, processing them is more challenging than processing news or other formal texts. In this paper, we propose a method that normalizes Vietnamese tweets by detecting non-standard words and spelling errors and correcting them. The method combines a language model with dictionaries and Vietnamese vocabulary structures. We build a dataset of 1,360 Vietnamese tweets to evaluate the proposed method. Experimental results show that the method achieves an encouraging F1-score of 89%.
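To make the idea of combining dictionaries with a language model concrete, the following is a minimal sketch and not the method evaluated in the paper. Everything in it is an assumption made for illustration: the toy abbreviation table, vocabulary, and corpus, the add-one-smoothed bigram model, the Dice similarity over character bigrams used to propose candidates, the 0.5 threshold, and all function names.

```python
from collections import Counter

# Hypothetical toy resources (assumptions, not the paper's dictionaries or corpus).
ABBREVIATIONS = {"ko": "không", "dc": "được", "j": "gì"}
VOCABULARY = {"tôi", "không", "được", "đi", "học", "gì", "hôm", "nay"}
CORPUS = ["tôi không đi học", "tôi không được đi", "hôm nay tôi đi học"]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with a sentence-start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams, vocab_size):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def char_bigrams(word):
    """Set of character bigrams of a word padded with boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice coefficient between the character-bigram sets of two words."""
    x, y = char_bigrams(a), char_bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def candidates(token, threshold=0.5):
    """Propose corrections: keep known words, expand abbreviations,
    otherwise fall back to similar vocabulary words (or the token itself)."""
    if token in VOCABULARY:
        return [token]
    if token in ABBREVIATIONS:
        return [ABBREVIATIONS[token]]
    close = [w for w in VOCABULARY if dice(token, w) >= threshold]
    return close or [token]


def normalize(tweet):
    """Replace each token by its best candidate given the previous output word."""
    unigrams, bigrams = train_bigrams(CORPUS)
    prev, output = "<s>", []
    for token in tweet.lower().split():
        # sorted() makes tie-breaking deterministic
        best = max(
            sorted(candidates(token)),
            key=lambda w: bigram_score(prev, w, unigrams, bigrams, len(VOCABULARY)),
        )
        output.append(best)
        prev = best
    return " ".join(output)


if __name__ == "__main__":
    print(normalize("tôi ko dc đi"))
    print(normalize("tôi khôngg đi học"))
```

In this sketch the dictionaries decide which tokens need attention and which replacements are allowed, while the language model only ranks the allowed replacements in context. A realistic system would need far larger resources and a treatment of Vietnamese syllable structure, which the toy example does not attempt.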
Acknowledgments
This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, and by Project SP2015/146 “Parallel processing of Big data” 2 of the Student Grant System, VŠB – Technical University of Ostrava.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Nguyen, V.H., Nguyen, H.T., Snasel, V. (2015). Normalization of Vietnamese Tweets on Twitter. In: Abraham, A., Jiang, X., Snášel, V., Pan, JS. (eds) Intelligent Data Analysis and Applications. Advances in Intelligent Systems and Computing, vol 370. Springer, Cham. https://doi.org/10.1007/978-3-319-21206-7_16
DOI: https://doi.org/10.1007/978-3-319-21206-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21205-0
Online ISBN: 978-3-319-21206-7
eBook Packages: Engineering, Engineering (R0)