Abstract
Electronic medical records (EMRs) contain a wealth of patient information that is of great value for data mining in various clinical applications. However, missing information, including missing labels, is pervasive in real-world EMRs and creates obstacles for processing medical text content. The aim of this study is to adopt automatic text classification technologies to recover missing medical text labels in EMRs and support downstream analyses. A combination of word-embedding technology and random forest classifiers is applied to identify multiple medical note labels, including disease types and examination types, from the short texts of medical imaging diagnosis notes. The results show that the average binary classification accuracy is 91%. These results indicate that applying advanced NLP techniques to EMRs can achieve high classification accuracy.
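The pipeline summarized in the abstract (word embeddings feeding a random forest, with one binary decision per label) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny embedding table, the toy notes, the "chest" label, and the helper names `note_vector` and `predict_label` are all hypothetical, and averaging word vectors into one note vector is an assumed way of turning a short text into a fixed-length feature. In practice the vectors would come from an embedding model such as word2vec trained on the note corpus.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 2-dimensional word vectors (stand-in for trained embeddings).
EMBEDDINGS = {
    "lung": np.array([0.9, 0.1]),
    "nodule": np.array([0.8, 0.2]),
    "pleural": np.array([0.9, 0.0]),
    "effusion": np.array([0.7, 0.1]),
    "liver": np.array([0.1, 0.9]),
    "cyst": np.array([0.2, 0.8]),
    "gallbladder": np.array([0.0, 0.9]),
    "stone": np.array([0.1, 0.7]),
}
DIM = 2


def note_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)


# Toy labelled notes for one binary label: 1 = "chest" applies, 0 = it does not.
notes = [
    (["lung", "nodule"], 1),
    (["pleural", "effusion"], 1),
    (["lung", "effusion"], 1),
    (["pleural", "nodule"], 1),
    (["liver", "cyst"], 0),
    (["gallbladder", "stone"], 0),
    (["liver", "stone"], 0),
    (["gallbladder", "cyst"], 0),
]
X = np.stack([note_vector(tokens) for tokens, _ in notes])
y = np.array([label for _, label in notes])

# One random forest per label; only a single label is shown here.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)


def predict_label(tokens):
    """Impute the binary label for a tokenized note with a missing label."""
    return int(clf.predict(note_vector(tokens).reshape(1, -1))[0])
```

Calling `predict_label(["lung", "pleural"])` would impute the hypothetical "chest" label for an unlabelled note. Handling several labels (for example disease type and examination type) would repeat the same fit with a separate classifier per label.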
Acknowledgment
The work is supported in part by the NSFC funding 11471313, the Science and Technology Planning Project of Guangdong Province (2015B010129012) and the Shenzhen Basic Research Funding JCYJ20150630114942270.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, B. et al. (2018). Automatic Text Classification for Label Imputation of Medical Diagnosis Notes Based on Random Forest. In: Siuly, S., Lee, I., Huang, Z., Zhou, R., Wang, H., Xiang, W. (eds) Health Information Science. HIS 2018. Lecture Notes in Computer Science, vol 11148. Springer, Cham. https://doi.org/10.1007/978-3-030-01078-2_8
DOI: https://doi.org/10.1007/978-3-030-01078-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01077-5
Online ISBN: 978-3-030-01078-2
eBook Packages: Computer Science, Computer Science (R0)