Word Embedding Bootstrapped Deep Active Learning Method to Information Extraction on Chinese Electronic Medical Record


Electronic medical records (EMRs) contain rich biomedical information and have great potential in disease diagnosis and biomedical research. However, EMR information usually takes the form of unstructured text, which raises the cost of using it and hinders its application. This work presents an effective named entity recognition (NER) method for information extraction from Chinese EMRs, based on word embedding bootstrapped deep active learning, to ease the acquisition of medical information from Chinese EMRs and unlock its value. Deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) captures the characteristics of different types of information from a labeled corpus, while the continuous bag-of-words (CBOW) and skip-gram word embedding models are each combined with this model to capture the text features of Chinese EMRs from an unlabeled corpus. The method is evaluated on NER tasks over the “medical history” content of Chinese EMRs. Experimental results show that the word embedding bootstrapped deep active learning method, using an unlabeled medical corpus, achieves better performance than the other models compared.
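As a rough illustration of the two ingredients named in the abstract, the sketch below builds continuous bag-of-words (CBOW) and skip-gram training pairs from a tokenized sentence, and picks the sentences a model is least sure about for annotation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the window size, and the least-confidence query strategy are all hypothetical choices, since the abstract does not say which selection criterion or hyperparameters were used.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 1) -> List[Tuple[List[str], str]]:
    """CBOW view: the surrounding context words predict the centre word."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of position i.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Skip-gram view: the centre word predicts each context word in turn."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def least_confidence_select(confidences: List[float], k: int) -> List[int]:
    """Assumed query strategy: return the indices of the k unlabeled
    sentences whose best tag sequence has the lowest model confidence."""
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical, already-segmented sentence standing in for EMR text.
    sentence = ["patient", "has", "hypertension", "history"]
    print(cbow_pairs(sentence))
    print(skipgram_pairs(sentence))
    # Hypothetical per-sentence confidences from a tagger such as Bi-LSTM+CRF.
    print(least_confidence_select([0.9, 0.4, 0.7, 0.2], k=2))
```

In an active learning loop of this kind, the selected sentences would be labeled by an annotator, added to the labeled corpus, and the tagger retrained, while the embeddings themselves are trained once on the unlabeled corpus.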


Author information



Corresponding author

Correspondence to Xumin Hou.

Additional information

Foundation item: the Artificial Intelligence Innovation and Development Project of Shanghai Municipal Commission of Economy and Information (No. 2019-RGZN-01081)

About this article

Cite this article

Ma, Q., Cen, X., Yuan, J. et al. Word Embedding Bootstrapped Deep Active Learning Method to Information Extraction on Chinese Electronic Medical Record. J. Shanghai Jiaotong Univ. (Sci.) 26, 494–502 (2021).

Key words

  • deep active learning
  • named entity recognition (NER)
  • information extraction
  • word embedding
  • Chinese electronic medical record (EMR)

CLC number

  • R 319

Document code

  • A