Vietnamese Spelling Error Detection and Correction Using BERT and N-gram Language Model

  • Conference paper
Intelligent Systems and Networks

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 471)

Abstract

Vietnamese spelling error detection and correction is a crucial task in natural language processing and plays an important role in many real-world applications. Although the task has been studied extensively, handling the diverse types of errors in Vietnamese remains a challenge. In this paper, we propose a model that detects and corrects specific types of Vietnamese spelling errors by combining a pre-trained neural network-based Vietnamese language model with an N-gram language model. We also provide a clear definition of the error types the model can handle and of the error generation rules used for the training set, and we evaluate the proposed model on a Vietnamese benchmark dataset at the word level. The experimental results show that, in detection, our model achieves an F1-score 1% to 14% higher than other neural network-based pre-trained language models. For correction, we compare bigram, trigram, and 4-gram language models to choose the best one.
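
As a rough illustration of the correction side described in the abstract, the following is a minimal sketch of ranking candidate replacements with a bigram language model. It is not the authors' implementation: the toy corpus, the candidate list, the add-one smoothing, and the function names (train_bigram, log_prob, correct) are all assumptions made for this example, and the position of the misspelled word is taken as given, standing in for the output of a separate detection step.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def correct(tokens, position, candidates, unigrams, bigrams):
    """Replace tokens[position] with the candidate that maximises the sentence score."""
    def score(word):
        trial = tokens[:position] + [word] + tokens[position + 1:]
        return log_prob(trial, unigrams, bigrams)
    best = max(candidates, key=score)
    return tokens[:position] + [best] + tokens[position + 1:]

# Toy training corpus and candidate list (hypothetical, for illustration only).
corpus = ["tôi đi học", "tôi đi làm", "em đi học", "tôi học bài"]
unigrams, bigrams = train_bigram(corpus)

sentence = "tôi đi hoc".split()        # word at index 2 is missing its diacritic
candidates = ["học", "hóc", "hộc"]     # assumed output of a candidate generator
fixed = correct(sentence, 2, candidates, unigrams, bigrams)
print(" ".join(fixed))
```

In this sketch the whole sentence is rescored for each candidate, which keeps the code short; a real system would more likely score only the n-grams that overlap the edited position and would use a higher-order model with better smoothing.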

Acknowledgment

We gratefully acknowledge the support from the CMC Institute of Science Technology for funding the research project.

Author information

Corresponding author

Correspondence to Tuan Dang Minh.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Tien, D.N., Minh, T.T.T., Vu, L.L., Minh, T.D. (2022). Vietnamese Spelling Error Detection and Correction Using BERT and N-gram Language Model. In: Anh, N.L., Koh, SJ., Nguyen, T.D.L., Lloret, J., Nguyen, T.T. (eds) Intelligent Systems and Networks. Lecture Notes in Networks and Systems, vol 471. Springer, Singapore. https://doi.org/10.1007/978-981-19-3394-3_49
