Effective Identification of Similar Patients Through Sequential Matching over ICD Code Embedding

  • Dang Nguyen
  • Wei Luo
  • Svetha Venkatesh
  • Dinh Phung
Part of the following topical collections:
  1. Patient Facing Systems


Evidence-based medicine often involves identifying patients with similar conditions, which are commonly captured as sequences of ICD (International Classification of Diseases; World Health Organization 2013) codes. With no satisfactory prior solutions for matching ICD-10 code sequences, this paper presents a method that effectively captures the clinical similarity among routine patients who have multiple comorbidities and complex care needs. Our method leverages recent progress in representation learning of individual ICD-10 codes and explicitly uses the sequential order of codes for matching. Empirical evaluation on a state-wide cancer data collection shows that the proposed method achieves significantly higher matching performance than state-of-the-art methods that ignore the sequential order. It also better identifies similar patients with respect to a number of clinical outcomes, including readmission and mortality outlook. Although this paper focuses on ICD-10 diagnosis code sequences, the method can be adapted to other codified sequence data.


Code embedding, Word2Vec, Sequential matching, Patient similarity matching, Cancer



This work is partially supported by the Telstra-Deakin Centre of Excellence (CoE) in Big Data and Machine Learning. Dinh Phung gratefully acknowledges the partial support from the Australian Research Council (ARC).

Compliance with Ethical Standards

Conflict of Interest

The authors have no conflict of interest to declare.

Ethical Approval

Ethics approval was obtained from the New South Wales Population and Health Services Research Ethics Committee (AU RED Reference: HREC/15/CIPHS/1).

Informed Consent

This study is a secondary analysis of routinely collected data, and the consent had been obtained by the original data guarantor.


References

  1. World Health Organization: International Classification of Diseases (ICD), 2013.
  2. World Health Organization: International statistical classification of diseases and related health problems 10th revision. [Online], 2010.
  3. Australian Consortium for Classification Development: ICD-10-AM. [Online], 2017.
  4. O’Malley, K., Cook, K., Price, M., Wildes, K. R., Hurdle, J., and Ashton, C., Measuring diagnoses: ICD code accuracy. Health Serv. Res. 40:1620–1639, 2005.
  5. Wang, F., Hu, J., and Sun, J.: Medical prognosis based on patient similarity and expert feedback. In: The 21st International Conference on Pattern Recognition, pp. 1799–1802, IEEE, 2012.
  6. Choi, E., Schuetz, A., Stewart, W. F., and Sun, J.: Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv:1602.03686, 2016.
  7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119, 2013.
  8. Lee, J., Maslove, D. M., and Dubin, J., Personalized mortality prediction driven by electronic medical data and a patient similarity metric. PloS One 10(5):e0127428, 2015.
  9. Carnaby-Mann, G., and Crary, M., McNeill dysphagia therapy program: a case-control study. Arch. Phys. Med. Rehabil. 91(5):743–749, 2010.
  10. Hielscher, T., Spiliopoulou, M., Völzke, H., and Kühn, J.-P.: Using participant similarity for the classification of epidemiological data on hepatic steatosis. In: The 27th International Symposium on Computer-Based Medical Systems, pp. 1–7, IEEE, 2014.
  11. Le, Q., and Mikolov, T.: Distributed representations of sentences and documents. In: ICML, pp. 1188–1196, 2014.
  12. Levy, O., Goldberg, Y., and Dagan, I., Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3:211–225, 2015.
  13. Grover, A., and Leskovec, J.: node2vec: scalable feature learning for networks. In: KDD, pp. 855–864, ACM, 2016.
  14. Nguyen, D., Luo, W., Nguyen, T. D., Venkatesh, S., and Phung, D.: Learning graph representation via frequent subgraphs. In: SDM. Accepted, SIAM, 2018.
  15. Moen, H., Ginter, F., Marsi, E., Peltonen, L.-M., Salakoski, T., and Salanterä, S., Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. BMC Med. Inform. Decis. Mak. 15(2):1, 2015.
  16. Nguyen, P., Tran, T., Wickramasinghe, N., and Venkatesh, S., Deepr: a convolutional net for medical records. IEEE J. Biomed. Health Inform. 21(1):22–30, 2017.
  17. Choi, E., Bahadori, M. T., Searles, E., Coffey, C., Thompson, M., Bost, J., Tejedor-Sojo, J., and Sun, J.: Multi-layer representation learning for medical concepts. In: KDD, pp. 1495–1504, ACM, 2016.
  18. Choi, Y., Chiu, C. Y.-I., and Sontag, D.: Learning low-dimensional representations of medical concepts. In: AMIA Summits on Translational Science Proceedings, pp. 41–51, 2016.
  19. Mikolov, T., Chen, K., Corrado, G., and Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
  20. Pearce, N., Analysis of matched case-control studies. BMJ 352:i969, 2016.
  21. Nguyen, D., Luo, W., Phung, D., and Venkatesh, S.: Exceptional contrast set mining: moving beyond the deluge of the obvious. In: Australasian Joint Conference on Artificial Intelligence, pp. 455–468. Springer, Berlin, 2016.
  22. Bigus, J., Campbell, M., Carmeli, B., Cefkin, M., Chang, H., Chen-Ritzo, C.-H., Cody, W., Ebadollahi, S., Evfimievski, A., Farkash, A., et al., Information technology for healthcare transformation. IBM Journal of Research and Development 55(5):6–20, 2011.
  23. Thomas, K., Rahman, M., Mor, V., and Intrator, O., Influence of hospital and nursing home quality on hospital readmissions. The American Journal of Managed Care 20(11):e523, 2014.
  24. Håkonsen, S., Pedersen, P., Bjerrum, M., Bygholm, A., and Peters, M., Nursing minimum data sets for documenting nutritional care for adults in primary healthcare: a scoping review. JBI Database of Systematic Reviews and Implementation Reports 16(1):117–139, 2018.
  25. Maaten, L. V. D., and Hinton, G., Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605, 2008.
  26. Futoma, J., Morris, J., and Lucas, J., A comparison of models for predicting early hospital readmissions. Journal of Biomedical Informatics 56:229–238, 2015.
  27. Pham, T., Tran, T., Phung, D., and Venkatesh, S.: Deepcare: a deep dynamic memory model for predictive medicine. In: PAKDD, pp. 30–41. Springer, Berlin, 2016.
  28. Turgeman, L., May, J., and Sciulli, R., Insights from a machine learning model for predicting the hospital length of stay (LOS) at the time of admission. Expert Systems with Applications 78:376–385, 2017.
  29. Chaou, C.-H., Chen, H.-H., Chang, S.-H., Tang, P., Pan, S.-L., Yen, A. M.-F., and Chiu, T.-F., Predicting length of stay among patients discharged from the emergency department using an accelerated failure time model. PloS One 12(1):e0165756, 2017.
  30. Nguyen, D., Nguyen, T. D., Luo, W., and Venkatesh, S.: Trans2vec: learning transaction embedding via items and frequent itemsets. In: PAKDD. Accepted. Springer, Berlin, 2018.
  31. Pobiedina, N., and Ichise, R., Citation count prediction as a link prediction problem. Applied Intelligence 44(2):252–268, 2016.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Dang Nguyen (1, corresponding author)
  • Wei Luo (1)
  • Svetha Venkatesh (1)
  • Dinh Phung (1)

  1. Centre for Pattern Recognition and Data Analytics, School of Information Technology, Deakin University, Geelong, Australia
