Abstract
Clinical information, stored over time and increasingly linked to other types of information such as environmental and social determinants of health and healthcare claims, is a potentially rich data source for clinical research. Knowledge discovery in databases (KDD) is a process for pattern discovery and predictive modeling in large databases. KDD encompasses and makes extensive use of data-mining methods, that is, automated processes and algorithms that enable pattern recognition and classification. Characteristically, KDD involves the use of machine learning methods developed in the domains of artificial intelligence and information retrieval. These methods, which include both structure learning and parameter learning, have been applied successfully to healthcare and biomedical data for various purposes, with potential or realized clinical translation. We introduce the Fayyad model of knowledge discovery in databases and describe the steps of the process, from initial data selection and preparation through interpretation and evaluation, providing select examples from clinical research informatics. We survey commonly used data-mining methods: artificial neural networks, decision-tree induction, support vector machines (kernel methods), association-rule induction, k-nearest neighbor, and probabilistic methods such as Bayesian networks. We link methods for evaluating the models that result from the KDD process to methods used in diagnostic medicine, spotlighting measures derived from a confusion matrix and receiver operating characteristic curve analysis and, more recently, uncertainty quantification and conformal prediction. Throughout the chapter, we discuss salient aspects of biomedical data management and use, including applications, the use of FAIR principles, pipelines and infrastructure for KDD, and future directions.
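The evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes a binary outcome coded 1 (condition present) or 0 (absent), a model that outputs a risk score, and a decision threshold of 0.5. The function names, the threshold, and the toy data are hypothetical and are not taken from the chapter. The area under the ROC curve is computed here with the rank-based (Mann-Whitney) formulation, which is one common way to obtain it.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded 1 = present, 0 = absent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def diagnostic_measures(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Measures shared with diagnostic medicine; assumes every denominator is non-zero."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case outscores a random negative case.

    Ties count as one half. Assumes at least one positive and one negative case.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 cases with the condition, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold of 0.5

measures = diagnostic_measures(*confusion_counts(y_true, y_pred))
print(measures["sensitivity"])  # 0.75
print(roc_auc(y_true, scores))  # 0.875
```

Sensitivity, specificity, and the predictive values depend on the chosen threshold, whereas the area under the ROC curve summarizes discrimination across all thresholds, which is why the two kinds of measure are usually reported together.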
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Cummins, M.R., Nachimuthu, S.K., Abdelrahman, S.E., Facelli, J.C., Gouripeddi, R. (2023). Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery. In: Richesson, R.L., Andrews, J.E., Fultz Hollis, K. (eds) Clinical Research Informatics. Health Informatics. Springer, Cham. https://doi.org/10.1007/978-3-031-27173-1_20
DOI: https://doi.org/10.1007/978-3-031-27173-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27172-4
Online ISBN: 978-3-031-27173-1
eBook Packages: Medicine, Medicine (R0)