
NE–LP: Normalized entropy- and loss prediction-based sampling for active learning in Chinese word segmentation on EHRs

Original Article
Neural Computing and Applications

Abstract

Electronic health records (EHRs) in hospital information systems contain patients’ diagnoses and treatments, making them essential to clinical data mining. Among the tasks in the mining process, Chinese word segmentation (CWS) is a fundamental and important one, and most state-of-the-art methods rely heavily on large-scale manually annotated data. Since annotation is time-consuming and expensive, efforts have been devoted to techniques, such as active learning, that locate the most informative samples for modeling. In this paper, we follow this line of work and present an active learning method for CWS in EHRs. Specifically, a new sampling strategy combining normalized entropy with loss prediction (NE–LP) is proposed to select the most valuable data. Meanwhile, to minimize the computational cost of learning, we propose a joint model comprising a word segmenter and a loss prediction model. Furthermore, to capture interactions between adjacent characters, bigram features are also applied in the joint model. To illustrate the effectiveness of NE–LP, we conducted experiments on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine. The results demonstrate that NE–LP consistently outperforms conventional uncertainty-based sampling strategies for active learning in CWS.
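To make the sampling idea concrete, the sketch below shows one way a normalized-entropy score and a predicted loss could be combined to rank unlabeled sentences. This is an illustrative sketch only, not the authors' implementation: the input format (per-character probability distributions over segmentation tags), the weighting parameter `alpha`, and the linear combination are all assumptions; the paper's exact formulation may differ.

```python
import math


def normalized_entropy(char_probs):
    """Length-normalized entropy of a sentence.

    char_probs: one probability distribution per character, e.g. over
    BMES segmentation tags (an assumed input format). Normalizing by
    sentence length keeps long sentences from dominating the ranking.
    """
    total = 0.0
    for dist in char_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(char_probs)


def ne_lp_score(char_probs, predicted_loss, alpha=0.5):
    """Combine normalized entropy with a predicted loss.

    alpha is a hypothetical weighting hyperparameter; the paper may
    combine the two signals differently.
    """
    return alpha * normalized_entropy(char_probs) + (1 - alpha) * predicted_loss


def select_samples(pool, k):
    """Pick the k highest-scoring sentences for annotation.

    pool: list of (sentence, char_probs, predicted_loss) tuples,
    where predicted_loss comes from a separate loss prediction model.
    """
    ranked = sorted(pool, key=lambda s: ne_lp_score(s[1], s[2]), reverse=True)
    return [s[0] for s in ranked[:k]]
```

Under this sketch, a sentence scores highly either because the segmenter is uncertain about its tags (high normalized entropy) or because the loss prediction model expects it to be hard (high predicted loss), which is the intuition behind combining the two signals.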




Acknowledgements

We would like to thank the reviewers for their useful comments and suggestions, which helped us to considerably improve the work. We also kindly thank Ju Gao from Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine for providing us with clinical datasets, and Ping He from Shanghai Hospital Development Center for her help. This work was supported by the Zhejiang Lab (No. 2019ND0AB01), the National Natural Science Foundation of China (No. 61903144) and the National Key R&D Program of China for “Precision medical research” (No. 2018YFC0910550).

Author information


Corresponding author

Correspondence to Yangming Zhou.

Ethics declarations

Conflicts of interest

No conflict of interest exists in the submission of this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cai, T., Ma, Z., Zheng, H. et al. NE–LP: Normalized entropy- and loss prediction-based sampling for active learning in Chinese word segmentation on EHRs. Neural Comput & Applic 33, 12535–12549 (2021). https://doi.org/10.1007/s00521-021-05896-w

