Abstract
Vietnamese spelling error detection and correction is a crucial task in natural language processing and plays an important role in many real-world applications. Although the task has been studied extensively, handling the diverse types of errors in Vietnamese remains a challenge. In this paper, we propose a model that detects and corrects specific Vietnamese spelling errors by combining a pre-trained neural network-based Vietnamese language model with an N-gram language model. We also provide a clear definition of the error types the model can handle and the rules used to generate errors in the training set, and we evaluate the proposed model on a Vietnamese benchmark dataset at the word level. The experimental results show that, in detection, our model achieves an F1-score 1% to 14% higher than other neural network-based pre-trained language models. For correction, we compare bigram, trigram, and 4-gram language models to choose the best one.
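The correction step described in the abstract, in which an N-gram language model chooses among candidate replacements for a word flagged by the detector, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the detection step has already produced a flagged position, uses a toy bigram model with add-one smoothing, and takes a hypothetical hand-written candidate list; the function names, the corpus, and the candidates are all illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence (assumed smoothing)."""
    vocab = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )

def correct(tokens, flagged_index, candidates, unigrams, bigrams):
    """Return the candidate that maximizes the sentence score at the flagged position."""
    def score(candidate):
        trial = tokens[:flagged_index] + [candidate] + tokens[flagged_index + 1:]
        return log_prob(trial, unigrams, bigrams)
    return max(candidates, key=score)

# Toy corpus and hypothetical candidate list, for illustration only.
corpus = [
    ["tôi", "đi", "học"],
    ["tôi", "đi", "làm"],
    ["em", "đi", "học"],
]
unigrams, bigrams = train_bigram(corpus)

# Assume a detector flagged index 2 ("hoc", missing its diacritics).
best = correct(["tôi", "đi", "hoc"], 2, ["học", "hộc", "hóc"], unigrams, bigrams)
print(best)
```

Swapping the bigram counts for trigram or 4-gram counts changes only the context window used in scoring, which is the kind of comparison the abstract mentions for the correction stage.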
Acknowledgment
We gratefully acknowledge the support from the CMC Institute of Science Technology for funding the research project.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tien, D.N., Minh, T.T.T., Vu, L.L., Minh, T.D. (2022). Vietnamese Spelling Error Detection and Correction Using BERT and N-gram Language Model. In: Anh, N.L., Koh, SJ., Nguyen, T.D.L., Lloret, J., Nguyen, T.T. (eds) Intelligent Systems and Networks. Lecture Notes in Networks and Systems, vol 471. Springer, Singapore. https://doi.org/10.1007/978-981-19-3394-3_49
DOI: https://doi.org/10.1007/978-981-19-3394-3_49
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-3393-6
Online ISBN: 978-981-19-3394-3
eBook Packages: Intelligent Technologies and Robotics (R0)