A relative similarity based method for interactive patient risk prediction

Data Mining and Knowledge Discovery

Abstract

This paper investigates the patient risk prediction problem in the context of active learning with relative similarities. Active learning has been extensively studied and successfully applied to real problems. Active learning methods typically pose absolute questions. In a medical application where the goal is to predict patients' risk of a certain disease from Electronic Health Records (EHR), absolute questions take the form of “Will this patient suffer from Alzheimer’s later in his/her life?” or “Are these two patients similar or not?”. Because they demand extensive domain knowledge, such absolute questions are usually difficult to answer, even for experienced medical experts. In addition, active learning methods built on absolute questions tend to be less stable, since incorrect answers occur often and can be detrimental to the risk prediction model. In this paper, we instead focus on designing relative questions that domain experts can answer easily. The proposed relative queries take the form of “Is patient A or patient B more similar to patient C?”, which medical experts can answer with more confidence. These questions elicit relative rather than absolute information, and in some cases can even be answered by non-experts. We propose an interactive patient risk prediction method that actively queries medical experts about the relative similarity of patients. We evaluate our method on both benchmark and real clinical datasets and make several interesting discoveries, including that querying relative similarities is effective for patient risk prediction and can sometimes yield better prediction accuracy than asking absolute questions.
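To make the form of a relative query concrete, the following is a minimal illustrative sketch, not the method proposed in the paper. It assumes patients are represented by numeric feature vectors and replaces the medical expert with a simulated oracle that compares Euclidean distances. The names `RelativeQuery`, `simulated_expert`, and `to_constraint` are hypothetical, and the feature values are made up.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class RelativeQuery:
    """Asks: is patient `a` or patient `b` more similar to patient `c`?"""
    a: int
    b: int
    c: int


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def simulated_expert(query, features):
    """Stand-in for a medical expert (an assumption for illustration):
    returns whichever of `a` or `b` lies closer to `c` in feature space."""
    d_a = euclidean(features[query.a], features[query.c])
    d_b = euclidean(features[query.b], features[query.c])
    return query.a if d_a <= d_b else query.b


def to_constraint(query, answer):
    """Turn an answer into an ordering constraint (anchor, closer, farther)."""
    farther = query.b if answer == query.a else query.a
    return (query.c, answer, farther)


# Toy feature vectors keyed by patient id (made-up numbers, not real data).
features = {
    0: [1.0, 0.0],
    1: [5.0, 5.0],
    2: [1.5, 0.5],
}

query = RelativeQuery(a=0, b=1, c=2)
answer = simulated_expert(query, features)
constraint = to_constraint(query, answer)
print(constraint)  # (2, 0, 1): patient 0 is closer to patient 2 than patient 1 is
```

Each answered query yields only an ordering among three patients rather than a label or an absolute similarity judgment, which is the kind of relative information the abstract describes.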

Author information

Correspondence to Buyue Qian or Nan Cao.

Additional information

Responsible editors: Fei Wang, Gregor Stiglic, Ian Davidson and Zoran Obradovic.

About this article

Cite this article

Qian, B., Wang, X., Cao, N. et al. A relative similarity based method for interactive patient risk prediction. Data Min Knowl Disc 29, 1070–1093 (2015). https://doi.org/10.1007/s10618-014-0379-5

  • DOI: https://doi.org/10.1007/s10618-014-0379-5
