
NE–LP: Normalized entropy- and loss prediction-based sampling for active learning in Chinese word segmentation on EHRs

Original Article
Neural Computing and Applications

Abstract

Electronic health records (EHRs) in hospital information systems contain patients’ diagnoses and treatments, making them essential to clinical data mining. Among the tasks in the mining process, Chinese word segmentation (CWS) is a fundamental and important one, and most state-of-the-art methods rely heavily on large-scale manually annotated data. Since annotation is time-consuming and expensive, efforts have been devoted to techniques, such as active learning, that locate the most informative samples for modeling. In this paper, we follow this line of work and present an active learning method for CWS in EHRs. Specifically, a new sampling strategy combining normalized entropy with loss prediction (NE–LP) is proposed to select the most valuable data. Meanwhile, to minimize the computational cost of learning, we propose a joint model comprising a word segmenter and a loss prediction model. Furthermore, to capture interactions between adjacent characters, bigram features are also applied in the joint model. To illustrate the effectiveness of NE–LP, we conducted experiments on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine. The results demonstrate that NE–LP consistently outperforms conventional uncertainty-based sampling strategies for active learning in CWS.
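To make the sampling idea concrete, the sketch below shows one way a normalized-entropy score and a predicted loss could be combined to rank unlabeled sentences. This is an illustrative sketch only, not the authors' implementation: the input format (per-character probability distributions over segmentation tags), the weighting parameter `alpha`, and the linear combination are all assumptions; the paper's exact formulation may differ.

```python
import math


def normalized_entropy(char_probs):
    """Length-normalized entropy of a sentence.

    char_probs: one probability distribution per character, e.g. over
    BMES segmentation tags (an assumed input format). Normalizing by
    sentence length keeps long sentences from dominating the ranking.
    """
    total = 0.0
    for dist in char_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(char_probs)


def ne_lp_score(char_probs, predicted_loss, alpha=0.5):
    """Combine normalized entropy with a predicted loss.

    alpha is a hypothetical weighting hyperparameter; the paper may
    combine the two signals differently.
    """
    return alpha * normalized_entropy(char_probs) + (1 - alpha) * predicted_loss


def select_samples(pool, k):
    """Pick the k highest-scoring sentences for annotation.

    pool: list of (sentence, char_probs, predicted_loss) tuples,
    where predicted_loss comes from a separate loss prediction model.
    """
    ranked = sorted(pool, key=lambda s: ne_lp_score(s[1], s[2]), reverse=True)
    return [s[0] for s in ranked[:k]]
```

Under this sketch, a sentence scores highly either because the segmenter is uncertain about its tags (high normalized entropy) or because the loss prediction model expects it to be hard (high predicted loss), which is the intuition behind combining the two signals.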




Acknowledgements

We would like to thank the reviewers for their useful comments and suggestions, which helped us to considerably improve the work. We also kindly thank Ju Gao from Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine for providing us with clinical datasets, and Ping He from Shanghai Hospital Development Center for her help. This work was supported by the Zhejiang Lab (No. 2019ND0AB01), the National Natural Science Foundation of China (No. 61903144) and the National Key R&D Program of China for “Precision medical research” (No. 2018YFC0910550).

Author information


Corresponding author

Correspondence to Yangming Zhou.

Ethics declarations

Conflicts of interest

No conflict of interest exists in the submission of this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cai, T., Ma, Z., Zheng, H. et al. NE–LP: Normalized entropy- and loss prediction-based sampling for active learning in Chinese word segmentation on EHRs. Neural Comput & Applic 33, 12535–12549 (2021). https://doi.org/10.1007/s00521-021-05896-w

