Abstract
Clinical information, stored over time, is a potentially rich source of data for clinical research. Knowledge discovery in databases (KDD), commonly known as data mining, is a process for pattern discovery and predictive modeling in large databases. KDD makes extensive use of data mining methods: automated processes and algorithms that enable pattern recognition. Characteristically, data mining involves the use of machine learning methods developed in the domain of artificial intelligence. These methods have been applied successfully to healthcare and biomedical data for a variety of purposes, with potential or realized clinical translation. Herein, the Fayyad model of knowledge discovery in databases is introduced. The steps of the process, which range from initial data selection to interpretation and evaluation, are described with select examples from clinical research informatics. Commonly used data mining methods are surveyed: artificial neural networks, decision tree induction, support vector machines (kernel methods), association rule induction, and k-nearest neighbor. Methods for evaluating the models that result from the KDD process are closely linked to methods used in diagnostic medicine. These include measures derived from a confusion matrix and receiver operating characteristic curve analysis. Data partitioning and model validation are critical aspects of evaluation. International efforts to develop and refine clinical data repositories are critically linked to the potential of these methods for developing new knowledge.
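The evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes binary labels (1 = condition present), a hypothetical 0.5 decision threshold, and a small set of made-up scores standing in for model output on a held-out test partition. The function names are illustrative only and are not taken from the chapter. The area under the ROC curve is computed here through its rank-based interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def diagnostic_measures(tp, fp, tn, fn):
    """Measures shared with diagnostic test evaluation.

    Assumes every denominator is non-zero (both classes are present and
    the model predicts both classes at least once).
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up held-out labels and model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

measures = diagnostic_measures(*confusion_counts(y_true, y_pred))
print(measures["sensitivity"], measures["specificity"])  # 0.75 0.75
print(auc(y_true, scores))                               # 0.9375
```

The confusion-matrix measures depend on the chosen threshold, whereas the AUC summarizes discrimination across all thresholds. In either case the figures are only meaningful when computed on data that were not used to fit the model.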
Copyright information
© 2019 Springer International Publishing
About this chapter
Cite this chapter
Cummins, M.R. (2019). Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery. In: Richesson, R., Andrews, J. (eds) Clinical Research Informatics. Health Informatics. Springer, Cham. https://doi.org/10.1007/978-3-319-98779-8_16
DOI: https://doi.org/10.1007/978-3-319-98779-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98778-1
Online ISBN: 978-3-319-98779-8
eBook Packages: Medicine, Medicine (R0)