
Automatic Text Classification for Label Imputation of Medical Diagnosis Notes Based on Random Forest

  • Conference paper
Health Information Science (HIS 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11148)

Abstract

Electronic medical records (EMRs) contain a great deal of patient information, which is of great value to data mining for various clinical applications. However, missing information, including missing labels, is pervasive in real-world EMRs and creates many obstacles for processing medical text. The aim of this study is to use automatic text classification to recover missing medical text labels in EMRs and so support downstream analyses. A combination of word embedding and random forest classifiers is applied to identify multiple medical note labels, including disease types and examination types, from the short texts of medical imaging diagnosis notes. The results show an average binary classification accuracy of 91%. These results indicate that applying advanced NLP techniques to EMRs can achieve high classification accuracy.
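The pipeline the abstract describes (embed each short note, then train one binary random forest per label and average the per-label accuracies) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the 2-dimensional `TOY_EMBEDDINGS` table, the label names `chest_exam` and `abdominal_exam`, the six toy notes, and the use of depth-1 trees (decision stumps) in place of full decision trees. Accuracy is measured on the training notes only, so it says nothing about the reported 91%.

```python
import random
from collections import Counter

# Hypothetical 2-d word vectors; a real system would load trained embeddings.
TOY_EMBEDDINGS = {
    "lung": (1.0, 0.1), "nodule": (0.9, 0.2), "chest": (0.8, 0.0),
    "liver": (0.1, 1.0), "cyst": (0.2, 0.9), "abdomen": (0.0, 0.8),
}

def embed_note(text, table=TOY_EMBEDDINGS):
    """Represent a note as the mean vector of its known words."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    if not vecs:  # no known word: fall back to the zero vector
        return (0.0,) * len(next(iter(table.values())))
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))

def fit_stump(X, y, feature):
    """Depth-1 tree on one feature: pick the midpoint split with fewest errors."""
    values = sorted({x[feature] for x in X})
    majority = Counter(y).most_common(1)[0][0]
    # Fallback is a constant prediction (every value is >= -inf).
    best = (sum(t != majority for t in y), float("-inf"), majority)
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        for high_label in (0, 1):
            preds = [high_label if x[feature] >= thr else 1 - high_label for x in X]
            errors = sum(p != t for p, t in zip(preds, y))
            if errors < best[0]:
                best = (errors, thr, high_label)
    return (feature, best[1], best[2])

def predict_stump(stump, x):
    feature, thr, high_label = stump
    return high_label if x[feature] >= thr else 1 - high_label

def fit_forest(X, y, n_trees=25, seed=0):
    """Bagging plus a random feature per tree, the two ingredients of a random forest."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample
        feature = rng.randrange(len(X[0]))         # random feature choice
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]

def accuracy(forest, X, y):
    return sum(predict_forest(forest, x) == t for x, t in zip(X, y)) / len(y)

# Toy notes with two hypothetical binary labels (one classifier per label).
notes = ["lung nodule", "chest lung", "chest nodule",
         "liver cyst", "abdomen liver", "abdomen cyst"]
labels = {"chest_exam": [1, 1, 1, 0, 0, 0], "abdominal_exam": [0, 0, 0, 1, 1, 1]}

X = [embed_note(n) for n in notes]
forests = {name: fit_forest(X, y) for name, y in labels.items()}
per_label = {name: accuracy(forests[name], X, y) for name, y in labels.items()}
mean_accuracy = sum(per_label.values()) / len(per_label)  # average binary accuracy

# Impute the missing labels of an unlabeled note.
imputed = {name: predict_forest(f, embed_note("lung chest nodule"))
           for name, f in forests.items()}
print(per_label, mean_accuracy, imputed)
```

Training one binary classifier per label keeps each decision independent, which is one plausible reading of the "average binary classification accuracy" reported above; a single multi-class model would be an equally reasonable design under different assumptions.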

Acknowledgment

This work is supported in part by NSFC funding 11471313, the Science and Technology Planning Project of Guangdong Province (2015B010129012), and Shenzhen Basic Research Funding JCYJ20150630114942270.

Author information

Corresponding authors

Correspondence to Jing Zheng or Yunpeng Cai.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, B. et al. (2018). Automatic Text Classification for Label Imputation of Medical Diagnosis Notes Based on Random Forest. In: Siuly, S., Lee, I., Huang, Z., Zhou, R., Wang, H., Xiang, W. (eds) Health Information Science. HIS 2018. Lecture Notes in Computer Science, vol 11148. Springer, Cham. https://doi.org/10.1007/978-3-030-01078-2_8

  • DOI: https://doi.org/10.1007/978-3-030-01078-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01077-5

  • Online ISBN: 978-3-030-01078-2

  • eBook Packages: Computer Science, Computer Science (R0)
