Data mining, a concept that emerged in the mid-1990s, can help researchers gain novel and deep insights into large biomedical datasets. It can uncover new biomedical and healthcare knowledge for clinical and administrative decision making and can generate scientific hypotheses from large experimental data, clinical databases, and biomedical literature. This review first introduces data mining in general (its background, definition, and process), discusses the major differences between statistics and data mining, and then addresses what makes data mining unique in the biomedical and healthcare fields. It briefly summarizes the data mining algorithms used for classification, clustering, and association, along with their respective advantages and drawbacks. It offers guidelines on how to use data mining algorithms in each of these three areas, together with three examples of how data mining has been used in the healthcare industry. Health-related organizations have successfully applied data mining to predict health insurance fraud and under-diagnosed patients, and to identify and classify at-risk people with the goal of reducing healthcare costs. We therefore describe how data mining technologies in each area (classification, clustering, and association) have been used for a multitude of purposes, including research in the biomedical and healthcare fields.
The review also discusses the technologies that enable the prediction of healthcare costs (including length of hospital stay), disease diagnosis and prognosis, and the discovery of hidden biomedical and healthcare patterns from related databases. It further covers the use of data mining to discover relationships between health conditions and a disease, relationships among diseases, and relationships among drugs. The article concludes with a discussion of the problems that hamper the clinical use of data mining by health professionals.
MeSH is the National Library of Medicine (NLM)'s controlled vocabulary used for indexing MEDLINE articles.
For example, if a hierarchical algorithm takes 60 s to cluster 1000 objects (records), it takes 1620 s (= (3000/1000)^3 × 60 s) to cluster 3000 objects, provided there is enough system memory.
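The arithmetic in this note can be checked with a short sketch. It assumes, as the example implies, that running time grows with the cube of the number of objects; the function name `estimate_runtime` and its parameters are hypothetical and do not come from the article.

```python
def estimate_runtime(base_seconds, base_n, target_n, exponent=3):
    """Scale a measured runtime to a new input size.

    Assumes time grows as n**exponent (cubic by default, matching the
    hierarchical clustering example in the note above).
    """
    if base_n <= 0 or target_n <= 0:
        raise ValueError("object counts must be positive")
    return base_seconds * (target_n / base_n) ** exponent


# 60 s for 1000 objects scales to (3000/1000)^3 * 60 s for 3000 objects.
print(estimate_runtime(60, 1000, 3000))
```

Tripling the number of objects multiplies the estimated time by 27, which is why hierarchical algorithms become impractical on large record sets even when memory is not the limiting factor.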
Some classification algorithms can mine only numeric data or only categorical data.
Clustering accuracy can be measured only if a class label (i.e., a dependent variable) is available.
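To illustrate why a class label is needed, the sketch below computes cluster purity, one common label-based measure. The article does not state which accuracy measure it has in mind, so the choice of purity, the function name `purity`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of objects that carry the majority class of their cluster.

    Requires a known class label per object; without labels this
    quantity cannot be computed.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one object is required")

    # Count how often each class occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Sum the size of the majority class in every cluster.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six records, two clusters, two known classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
print(purity(clusters, labels))
```

Here each cluster holds two records of its majority class and one of the other, so four of the six records count as correctly grouped.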
About this article
Cite this article
Yoo, I., Alafaireet, P., Marinov, M. et al. Data Mining in Healthcare and Biomedicine: A Survey of the Literature. J Med Syst 36, 2431–2448 (2012). https://doi.org/10.1007/s10916-011-9710-5
- Data mining