Abstract
Clinical information, stored over time, is a potentially rich source of data for clinical research. Knowledge discovery in databases (KDD), commonly known as data mining, is a process for pattern discovery and predictive modeling in large databases. KDD makes extensive use of data mining methods, that is, automated processes and algorithms that enable pattern recognition. Characteristically, data mining involves the use of machine learning methods developed in the domain of artificial intelligence. These methods have been applied successfully to healthcare and biomedical data for a variety of purposes, in some cases with potential or realized clinical translation. This chapter introduces the Fayyad model of knowledge discovery in databases and describes the steps of the process, from initial data selection to interpretation and evaluation, with select examples from clinical research informatics. Commonly used data mining methods are surveyed: artificial neural networks, decision tree induction, support vector machines (kernel methods), association rule induction, and k-nearest neighbor. Methods for evaluating the models that result from the KDD process are closely linked to methods used in diagnostic medicine. These include measures derived from a confusion matrix and receiver operating characteristic (ROC) curve analysis. Data partitioning and model validation are critical aspects of evaluation. International efforts to develop and refine clinical data repositories are critically linked to the potential of these methods for developing new knowledge.
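As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch in plain Python, not code from the chapter. It derives sensitivity, specificity, and predictive values from a confusion matrix and estimates the area under the ROC curve as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. The function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an undefined measure is reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_measures(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Measures derived from the confusion matrix, as used for diagnostic tests."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Toy data (hypothetical): true labels and model scores for ten held-out cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

measures = diagnostic_measures(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(measures)
print(round(auc, 3))
```

The confusion-matrix measures depend on the chosen threshold, whereas the area under the ROC curve summarizes discrimination across all thresholds. In keeping with the abstract's point about data partitioning, such measures are meaningful only when computed on cases that were not used to build the model.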
Copyright information
© 2012 Springer-Verlag London Limited
About this chapter
Cite this chapter
Cummins, M.R. (2012). Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery. In: Richesson, R., Andrews, J. (eds) Clinical Research Informatics. Health Informatics. Springer, London. https://doi.org/10.1007/978-1-84882-448-5_15
DOI: https://doi.org/10.1007/978-1-84882-448-5_15
Publisher Name: Springer, London
Print ISBN: 978-1-84882-447-8
Online ISBN: 978-1-84882-448-5
eBook Packages: Medicine (R0)