Big-Data Analysis, Cluster Analysis, and Machine-Learning Approaches

  • Amparo Alonso-Betanzos
  • Verónica Bolón-Canedo
Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 1065)


Medicine will undergo many changes in the coming years, because the so-called “medicine of the future” will be increasingly proactive, built on four basic elements: predictive, personalized, preventive, and participatory. Drivers for these changes include the digitization of medical data and the availability of computational tools able to deal with massive volumes of data. The need to apply machine-learning methods in medicine has therefore grown dramatically in recent years, while facing challenges posed by an unprecedentedly large number of clinically relevant features and highly specific diagnostic tests. Advances in data-storage technology and progress in genome studies have made it possible to collect vast amounts of clinical detail per patient, permitting the extraction of valuable information. As a consequence, big-data analytics is becoming a mandatory technology in the clinical domain.

Machine learning and big-data analytics can be applied in cardiology, for example, to predict individual risk factors for cardiovascular disease, to support clinical decisions, and to practice precision medicine using genomic information. Several projects employ machine-learning techniques to classify and predict heart failure (HF) subtypes, and to identify phenotypically distinct HF categories through unbiased cluster analysis of dense phenotypic data (“phenomapping”). In this chapter, these ideas are presented further, and a computerized model that distinguishes between two major HF phenotypes on the basis of ventricular-volume data analysis is discussed in detail.
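To make the phenotype-classification idea above concrete, the following is a minimal sketch of separating two hypothetical HF phenotypes (e.g., reduced vs. preserved ejection fraction) in the ventricular-volume domain. The end-diastolic and end-systolic volumes below are invented for illustration, not clinical data, and a simple nearest-centroid rule stands in for the more sophisticated classifiers (e.g., support vector machines) discussed in the chapter.

```python
# Illustrative sketch: nearest-centroid classification of two
# hypothetical HF phenotypes from synthetic (EDV, ESV) volume
# pairs in mL. Not clinical data; values are invented.
from math import dist

# Synthetic training examples: (EDV, ESV) -> phenotype label
train = [
    ((220.0, 150.0), "HFrEF"),  # dilated ventricle, low ejection fraction
    ((240.0, 170.0), "HFrEF"),
    ((210.0, 145.0), "HFrEF"),
    ((120.0, 50.0), "HFpEF"),   # near-normal volumes, preserved EF
    ((110.0, 45.0), "HFpEF"),
    ((130.0, 55.0), "HFpEF"),
]

def centroids(samples):
    """Mean (EDV, ESV) per phenotype label."""
    sums, counts = {}, {}
    for (edv, esv), label in samples:
        s = sums.setdefault(label, [0.0, 0.0])
        s[0] += edv
        s[1] += esv
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (s[0] / counts[lbl], s[1] / counts[lbl])
            for lbl, s in sums.items()}

def classify(volumes, cents):
    """Assign the phenotype whose centroid is nearest in volume space."""
    return min(cents, key=lambda lbl: dist(volumes, cents[lbl]))

cents = centroids(train)
print(classify((230.0, 160.0), cents))  # large volumes -> "HFrEF"
print(classify((115.0, 48.0), cents))   # near-normal -> "HFpEF"
```

The same volume-domain representation supports unsupervised approaches as well: replacing the labeled centroids with k-means cluster centers would mirror the unbiased clustering (phenomapping) strategy mentioned above.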


Keywords: Machine learning · Big-data analysis · Cluster analysis · Precision medicine · Heart failure phenotyping · Support vector machine



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Amparo Alonso-Betanzos
  • Verónica Bolón-Canedo
  1. Department of Computer Science, University of A Coruña, A Coruña, Spain
