Abstract
Electronic health records (EHRs) in hospital information systems contain patients’ diagnoses and treatments, which makes them essential to clinical data mining. Among the tasks in the mining process, Chinese word segmentation (CWS) is a fundamental one, and most state-of-the-art methods rely heavily on large amounts of manually annotated data. Because annotation is time-consuming and expensive, techniques such as active learning have been developed to locate the most informative samples for modeling. In this paper, we present an active learning method for CWS in EHRs. Specifically, we propose a new sampling strategy that combines normalized entropy with loss prediction (NE–LP) to select the most valuable data. To minimize the computational cost of learning, we propose a joint model comprising a word segmenter and a loss prediction model. To capture interactions between adjacent characters, bigram features are also applied in the joint model. To illustrate the effectiveness of NE–LP, we conducted experiments on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine. The results demonstrate that NE–LP consistently outperforms conventional uncertainty-based sampling strategies for active learning in CWS.
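As a rough illustration of the sampling idea summarized above, the following Python sketch scores unlabeled sentences by combining a normalized entropy term with a predicted loss. The abstract does not state the NE–LP formula, so everything here is an assumption made for illustration: the per-character averaging, the division by the logarithm of the tag count, the weighted sum, and the names `normalized_entropy`, `ne_lp_score`, `select_batch`, and `weight` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalized_entropy(tag_probs: Sequence[Sequence[float]]) -> float:
    """Mean per-character entropy of the tag distributions, scaled to [0, 1].

    tag_probs[i] is the predicted probability distribution over segmentation
    tags for the i-th character. Dividing by log(number of tags) and by the
    sentence length is an assumed normalization, not the paper's stated one.
    """
    if not tag_probs:
        return 0.0
    num_tags = len(tag_probs[0])
    if num_tags < 2:
        return 0.0
    max_entropy = math.log(num_tags)
    total = 0.0
    for dist in tag_probs:
        # Skip zero probabilities, since p * log(p) tends to 0 as p -> 0.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / (len(tag_probs) * max_entropy)


def ne_lp_score(
    tag_probs: Sequence[Sequence[float]],
    predicted_loss: float,
    weight: float = 0.5,
) -> float:
    """Hypothetical combination: weighted sum of uncertainty and predicted loss."""
    return weight * normalized_entropy(tag_probs) + (1.0 - weight) * predicted_loss


def select_batch(
    pool: Sequence[Tuple[Sequence[Sequence[float]], float]],
    k: int,
) -> List[int]:
    """Return the indices of the k highest-scoring sentences in the pool."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: ne_lp_score(pool[i][0], pool[i][1]),
        reverse=True,
    )
    return ranked[:k]
```

In an active learning loop, the selected indices would be sent for manual annotation and the model retrained on the enlarged labeled set. Dividing by sentence length is one plausible way to keep long sentences from dominating the ranking.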
Acknowledgements
We would like to thank the reviewers for their useful comments and suggestions, which helped us to considerably improve the work. We also thank Ju Gao from Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine for providing us with clinical datasets, and Ping He from Shanghai Hospital Development Center for her help. This work was supported by the Zhejiang Lab (No. 2019ND0AB01), the National Natural Science Foundation of China (No. 61903144) and the National Key R&D Program of China for “Precision medical research” (No. 2018YFC0910550).
Ethics declarations
Conflicts of interest
No conflict of interest exists in the submission of this manuscript.
About this article
Cite this article
Cai, T., Ma, Z., Zheng, H. et al. NE–LP: Normalized entropy- and loss prediction-based sampling for active learning in Chinese word segmentation on EHRs. Neural Comput & Applic 33, 12535–12549 (2021). https://doi.org/10.1007/s00521-021-05896-w
DOI: https://doi.org/10.1007/s00521-021-05896-w