Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network

  • Original Article
Neural Computing and Applications

Abstract

Chinese spell checking is the task of detecting and correcting Chinese spelling errors, and it is an important component of natural language understanding. Existing studies on Chinese spell checking are mainly based on n-gram language models and neural network models. However, the effectiveness of an n-gram model depends on a trade-off between the value of n and the storage resources required, and most neural networks cannot efficiently handle cases in which correct and incorrect characters are severely unevenly distributed. These limitations restrict spell checking in text application scenarios that contain many oral expressions. To address these issues, a confusionset-guided decision network for spoken Chinese spell checking is proposed. By using a confusionset to generate the candidate set, the model can locate wrong characters with a decision network, which ensures that the bidirectional long short-term memory (LSTM) network pays more attention to the characteristics of the wrong characters. To verify the correctness and effectiveness of the model, extensive experiments were carried out on a logistics question-and-answer corpus and the SIGHAN Bake-off dataset. The experimental results show that the model is efficient and effective in spell checking for spoken Chinese, that it outperforms all competitor models, and that it can efficiently correct wrong characters in real scenarios.
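As a rough illustration of how a confusionset can be used to generate a candidate set, the following Python sketch substitutes one character of a sentence with its confusable characters. The confusionset entries, the function names, and the example sentence are assumptions made for illustration only; they are not taken from the paper or from the SIGHAN confusionset, and the decision network and bidirectional LSTM are not modeled here.

```python
# Toy confusionset (hypothetical entries): each character maps to characters
# with a similar shape or pronunciation.
CONFUSIONSET = {
    "在": {"再"},
    "再": {"在"},
    "他": {"她", "它"},
}


def candidate_characters(ch):
    """Return the character itself followed by its sorted confusable characters."""
    return [ch] + sorted(CONFUSIONSET.get(ch, set()))


def candidate_sentences(sentence, position):
    """Build one candidate sentence per candidate character at `position`."""
    return [
        sentence[:position] + c + sentence[position + 1:]
        for c in candidate_characters(sentence[position])
    ]


# "我在说一遍" contains the homophone error 在 for 再 ("I will say it again").
print(candidate_sentences("我在说一遍", 1))  # ['我在说一遍', '我再说一遍']
```

In a complete system, a scoring model would then choose among these candidates; restricting the candidates to confusable characters keeps that choice small compared with searching the whole character vocabulary.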

Availability of data and materials

In this work, we have used two publicly available datasets. The SBDS dataset can be downloaded from “http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html”, “http://ir.itc.ntnu.edu.tw/lre/clp14csc.html”, and “http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html”. The LCQMC dataset can be found at “https://aclanthology.org/C18-1166”. The LDDS dataset is provided by YTO Express and cannot be released publicly. The code of our model is available at “https://github.com/JackMacs/CGDN-SC”.

Notes

  1. The Pinyin input method uses Chinese Pinyin as its encoding scheme; it includes the full-spelling and double-spelling input methods.

  2. The shape code input method encodes Chinese characters based on their shape, such as their strokes or components.

  3. A confusionset is a set of Chinese characters with similar shapes, the same pronunciation, or similar pronunciations. In this paper, the confusionset provided by the SIGHAN Bake-off is used.

  4. In Chinese, there is no explicit delimiter between words, and a word can consist of one or more characters. For example, ‘唱歌’ (sing) is a word that consists of the characters ‘唱’ and ‘歌’. However, in some cases ‘唱’ (sing) acts as a verb and ‘歌’ (song) as a noun. Segmentation errors are therefore hard to avoid, especially when the text being segmented contains wrong characters.

  5. Jieba is a Chinese word segmentation tool that currently performs well. It supports word segmentation for both simplified and traditional Chinese, and it also supports customized dictionaries. https://github.com/fxsjy/jieba/tree/jieba3k.
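The segmentation problem described in note 4 can be sketched with a simple dictionary-based segmenter. The example below uses greedy forward maximum matching, which is a stand-in chosen for brevity and is not the algorithm used by Jieba; the toy vocabulary, the function name, and the example sentences are hypothetical.

```python
def forward_max_match(sentence, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each step."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest window first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


VOCAB = {"我", "喜欢", "唱歌"}  # hypothetical toy dictionary

# Correct sentence: '唱歌' is kept together as one word.
print(forward_max_match("我喜欢唱歌", VOCAB))  # ['我', '喜欢', '唱', ...] is avoided: ['我', '喜欢', '唱歌']

# With the similar-sounding wrong character '常' in place of '唱',
# the word is no longer in the dictionary and falls apart into single characters.
print(forward_max_match("我喜欢常歌", VOCAB))  # ['我', '喜欢', '常', '歌']
```

The second call shows why a wrong character tends to produce a segmentation error as well: the misspelled word no longer matches any dictionary entry, so the segmenter splits it.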

Acknowledgements

We gratefully acknowledge the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), and we thank the High Performance Computing Center of Shanghai University and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600) for providing the computing resources. We also thank YTO Express for providing the data and industry knowledge.

Funding

This study was supported by the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), the High Performance Computing Center of Shanghai University, and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600).

Author information

Contributions

Conceptualization contributed by JP; methodology contributed by CM and MH; formal analysis and investigation contributed by MH; writing—original draft preparation contributed by CM; writing—review and editing contributed by CM and JP; software contributed by CZ; visualization contributed by QX.

Corresponding author

Correspondence to Junjie Peng.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethics approval

This article has never been submitted to more than one journal for simultaneous consideration. This article is original.

Consent to participate

The authors have approved this article before submission, including the names and order of authors.

Consent for publication

The authors agreed with the content and gave explicit consent to submit.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, C., Hu, M., Peng, J. et al. Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network. Neural Comput & Applic 35, 15679–15692 (2023). https://doi.org/10.1007/s00521-023-08570-5

  • DOI: https://doi.org/10.1007/s00521-023-08570-5
