Abstract
Chinese spell checking is the task of detecting and correcting Chinese spelling errors, and it is important for natural language understanding. Existing studies are mainly based on n-gram language models and neural network models. However, an n-gram model must trade off the value of n against storage resources, and most neural networks cannot efficiently handle cases in which correct and incorrect characters are severely unevenly distributed. These weaknesses limit spell checking in text application scenarios that contain many oral expressions. To address these issues, we propose a confusionset-guided decision network for spoken Chinese spell checking. The model uses a confusionset to generate a candidate set, and a decision network then locates the wrong characters so that the bidirectional long short-term memory network pays more attention to their characteristics. To verify the correctness and effectiveness of the model, extensive experiments were carried out on a logistics question-and-answer corpus and the SIGHAN Bake-off dataset. The experimental results show that the model is efficient and highly effective for spell checking of spoken Chinese, that it outperforms all competitor models, and that it can efficiently correct wrong characters in real scenarios.
Availability of data and materials
In this work, we used two publicly available datasets. The SBDS dataset can be downloaded from “http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html”, “http://ir.itc.ntnu.edu.tw/lre/clp14csc.html”, and “http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html”. The LCQMC dataset can be found at “https://aclanthology.org/C18-1166”. The LDDS dataset was provided by YTO Express and cannot be released publicly. The code of our model is available at “https://github.com/JackMacs/CGDN-SC”.
Notes
The Pinyin input method uses Chinese Pinyin as its coding scheme; it includes the full-spelling and double-spelling input methods.
The shape-code input method encodes Chinese characters based on their shape, such as their strokes or components.
A confusionset is a set of Chinese characters with similar shapes, the same pronunciation, or similar pronunciations. In this paper, we use the confusionset shared by the SIGHAN Bake-off.
In Chinese, there is no explicit delimiter between words, and one or more characters can form a word. For example, ‘唱歌’ (sing) is a word that consists of the characters ‘唱’ and ‘歌’. In some contexts, however, ‘唱’ (sing) acts as a verb and ‘歌’ (song) as a noun. Segmentation errors are therefore hard to avoid, especially when the text being segmented contains wrong characters.
Jieba is a Chinese word segmentation tool that performs well at present. It supports both simplified and traditional Chinese word segmentation, as well as customized dictionaries. https://github.com/fxsjy/jieba/tree/jieba3k.
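As a rough illustration of how a confusionset can generate a candidate set, the following Python sketch substitutes each character with its confusable characters. It is a minimal sketch, not the authors' implementation: the dictionary entries and the names `CONFUSION_SET`, `candidate_set`, and `single_substitutions` are assumptions made for this example, and the entries are not taken from the SIGHAN Bake-off confusionset.

```python
# Toy confusionset: each character maps to characters with a similar shape
# or pronunciation. These entries are illustrative assumptions only.
CONFUSION_SET = {
    "哥": {"歌", "割"},
    "唱": {"倡", "昌"},
}


def candidate_set(sentence, confusion_set):
    """For every position, list the original character followed by its confusable characters."""
    candidates = []
    for ch in sentence:
        options = [ch] + sorted(confusion_set.get(ch, set()))
        candidates.append(options)
    return candidates


def single_substitutions(sentence, confusion_set):
    """Enumerate sentences that differ from the input by exactly one confusable character."""
    results = []
    for i, ch in enumerate(sentence):
        for alt in sorted(confusion_set.get(ch, set())):
            results.append(sentence[:i] + alt + sentence[i + 1:])
    return results


if __name__ == "__main__":
    wrong = "唱哥"  # a misspelling of '唱歌' used only for this example
    print(candidate_set(wrong, CONFUSION_SET))
    print(single_substitutions(wrong, CONFUSION_SET))
```

In this sketch the correct sentence ‘唱歌’ appears among the generated substitutions. Choosing which candidate to accept is the job of a downstream model, which the sketch does not attempt to reproduce.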
Acknowledgements
We gratefully acknowledge the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), and we thank the High Performance Computing Center of Shanghai University and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600) for providing the computing resources. We also thank YTO Express for providing the data and industry knowledge.
Funding
This study was supported by the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), the High Performance Computing Center of Shanghai University, and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600).
Author information
Contributions
Conceptualization contributed by JP; methodology contributed by CM and MH; formal analysis and investigation contributed by MH; writing—original draft preparation contributed by CM; writing—review and editing contributed by CM and JP; software contributed by CZ; visualization contributed by QX.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval
This article has never been submitted to more than one journal for simultaneous consideration. This article is original.
Consent to participate
The authors have approved this article before submission, including the names and order of authors.
Consent for publication
The authors agreed with the content and gave explicit consent to submit.
About this article
Cite this article
Ma, C., Hu, M., Peng, J. et al. Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network. Neural Comput & Applic 35, 15679–15692 (2023). https://doi.org/10.1007/s00521-023-08570-5
DOI: https://doi.org/10.1007/s00521-023-08570-5