Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery

Clinical Research Informatics

Part of the book series: Health Informatics (HI)

Abstract

Clinical information, stored over time, is a potentially rich source of data for clinical research. Knowledge discovery in databases (KDD), commonly known as data mining, is a process for pattern discovery and predictive modeling in large databases. KDD makes extensive use of data mining methods, automated processes, and algorithms that enable pattern recognition. Characteristically, data mining involves the use of machine learning methods developed in the domain of artificial intelligence. These methods have been applied to healthcare and biomedical data for a variety of purposes with good success and potential or realized clinical translation. Herein, the Fayyad model of knowledge discovery in databases is introduced. The steps of the process are described with select examples from clinical research informatics. These steps range from initial data selection to interpretation and evaluation. Commonly used data mining methods are surveyed: artificial neural networks, decision tree induction, support vector machines (kernel methods), association rule induction, and k-nearest neighbor. Methods for evaluating the models that result from the KDD process are closely linked to methods used in diagnostic medicine. These include the use of measures derived from a confusion matrix and receiver operating characteristic curve analysis. Data partitioning and model validation are critical aspects of evaluation. International efforts to develop and refine clinical data repositories are critically linked to the potential of these methods for developing new knowledge.
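
As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch in Python. It is not taken from the chapter: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions chosen for the example. It computes confusion matrix counts, sensitivity, and specificity for a binary classifier, and estimates the area under the ROC curve as the fraction of positive/negative pairs in which the positive case receives the higher score (ties count as one half).

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: proportion of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: proportion of actual negatives that were correctly ruled out."""
    return tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Estimate the area under the ROC curve by pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A positive that outscores a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out test data: true labels and model scores (assumed values).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
model_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
predictions = [1 if s >= 0.5 else 0 for s in model_scores]

tp, fp, tn, fn = confusion_counts(labels, predictions)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("AUC:", roc_auc(labels, model_scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes discrimination across all thresholds. In practice these measures would be computed on a data partition that was not used to train the model.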

Author information

Corresponding author

Correspondence to Mollie R. Cummins Ph.D., APRN.

Copyright information

© 2012 Springer-Verlag London Limited

About this chapter

Cite this chapter

Cummins, M.R. (2012). Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery. In: Richesson, R., Andrews, J. (eds) Clinical Research Informatics. Health Informatics. Springer, London. https://doi.org/10.1007/978-1-84882-448-5_15

  • DOI: https://doi.org/10.1007/978-1-84882-448-5_15

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-447-8

  • Online ISBN: 978-1-84882-448-5

  • eBook Packages: Medicine (R0)
