Abstract
ICD-10 is the 10th revision of the International Classification of Diseases, a medical ontology for encoding diseases and related health problems, provided by the World Health Organization. Physicians use this encoding to describe diseases in a standardized way. Since coding is currently performed manually by medical professionals, automating it would save time and allow doctors to focus more on patient care. Automatically associating ICD-10 codes with a textual description is an extreme-scale multi-class, multi-label classification task because of the huge number of classes (11,000) and the possibility of assigning multiple valid ICD-10 codes to a single diagnosis. Moreover, applying machine learning algorithms to this task requires a large training data set, which makes the task an even bigger challenge for low-resource languages such as Bulgarian. We semi-automatically created a data set from linked open data and public documents. The corpus contains about 350,000 diagnoses in Bulgarian labeled with 3-sign and 4-sign ICD-10 codes. The paper presents a cascading approach for the automatic association of ICD-10 codes with diagnoses, which exploits the hierarchical nature of the ICD-10 classification to improve classification accuracy. This approach is tested and compared with the flat classification approach on the above-mentioned data set. Different machine learning algorithms are tested, including deep learning models based on transformers, such as BERT. The results of the experiments provide evidence that the proposed approach, which takes into account the hierarchical structure of the ICD-10 codes, outperforms approaches that ignore it.
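The cascading idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `TokenOverlapClassifier` and `CascadingClassifier` are hypothetical, the token-overlap scorer is a toy stand-in for the machine learning models mentioned above, the four English training strings are invented for illustration, and the sketch predicts a single label per diagnosis rather than handling the multi-label case. It assumes that the 3-sign code is the part of a 4-sign code before the dot.

```python
from collections import Counter, defaultdict


class TokenOverlapClassifier:
    """Toy stand-in for a real text classifier (assumption, not the paper's model).

    Each label gets a bag-of-words profile; prediction picks the label whose
    profile overlaps most with the query tokens.
    """

    def fit(self, texts, labels):
        self.profiles = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.profiles[label].update(text.lower().split())
        return self

    def predict(self, text):
        tokens = text.lower().split()
        # Labels are sorted so that ties are broken deterministically.
        return max(
            sorted(self.profiles),
            key=lambda label: sum(self.profiles[label][t] for t in tokens),
        )


class CascadingClassifier:
    """Stage 1 predicts the 3-sign code; stage 2 picks a 4-sign code under it."""

    def fit(self, texts, four_sign_codes):
        # Assumption: the 3-sign parent is the prefix before the dot.
        parents = [code.split(".")[0] for code in four_sign_codes]
        self.top = TokenOverlapClassifier().fit(texts, parents)

        # Train one child classifier per 3-sign code, on that code's examples only.
        grouped = defaultdict(lambda: ([], []))
        for text, code, parent in zip(texts, four_sign_codes, parents):
            grouped[parent][0].append(text)
            grouped[parent][1].append(code)
        self.children = {
            parent: TokenOverlapClassifier().fit(ts, cs)
            for parent, (ts, cs) in grouped.items()
        }
        return self

    def predict(self, text):
        parent = self.top.predict(text)
        return parent, self.children[parent].predict(text)


# Invented toy examples; the real corpus is in Bulgarian and far larger.
texts = [
    "type 2 diabetes mellitus without complications",
    "type 2 diabetes mellitus with renal complications",
    "predominantly allergic asthma",
    "asthma unspecified",
]
codes = ["E11.9", "E11.2", "J45.0", "J45.9"]

model = CascadingClassifier().fit(texts, codes)
print(model.predict("diabetes mellitus with renal complications"))
print(model.predict("allergic asthma"))
```

A flat classifier would instead choose among all 4-sign codes in a single step. In the cascade, the second stage only has to discriminate among the children of the predicted 3-sign code, at the cost of propagating any first-stage error.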
Notes
- 7. Scikit-Learn: https://scikit-learn.org/stable/
- 8. nlpaug library: https://pypi.org/project/nlpaug/
- 10. BTB stopword list in Bulgarian: http://bultreebank.org/wp-content/uploads/2017/04/BTB-StopWordList.zip
- 11. BulStem stemmer: https://github.com/mhardalov/bulstem-py
- 13. Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md
Acknowledgments
This research is partially funded by the Bulgarian Ministry of Education and Science, grant DO1-200/2018 ‘Electronic health care in Bulgaria’ (e-Zdrave) and the Bulgarian National Science Fund, grant DN-02/4-2016 ‘Specialized Data Mining Methods Based on Semantic Attributes’ (IZIDA).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Velichkov, B. et al. (2022). Cascading Approach for Automatic ICD-10 Codes Association To Diseases in Bulgarian. In: Sotirov, S.S., Pencheva, T., Kacprzyk, J., Atanassov, K.T., Sotirova, E., Staneva, G. (eds) Contemporary Methods in Bioinformatics and Biomedicine and Their Applications. BioInfoMed 2020. Lecture Notes in Networks and Systems, vol 374. Springer, Cham. https://doi.org/10.1007/978-3-030-96638-6_27
DOI: https://doi.org/10.1007/978-3-030-96638-6_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96637-9
Online ISBN: 978-3-030-96638-6
eBook Packages: Engineering, Engineering (R0)