Clustering Clinical Data in R

Mass Spectrometry Data Analysis in Proteomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 2051)

Abstract

We are currently witnessing a paradigm shift from evidence-based medicine to precision medicine, made possible by rapid technological development. Advances in data mining algorithms will allow us to integrate trans-omics with clinical data, contributing to our understanding of pathological mechanisms and substantially affecting the clinical sciences. Cluster analysis is one of the main data mining techniques and allows for the exploration of data patterns that the human mind cannot capture.

This chapter focuses on the cluster analysis of clinical data using the statistical software R. We outline the cluster analysis process, underlining some characteristics of clinical data. Starting with the data preprocessing step, we then discuss the advantages and disadvantages of the most commonly used clustering algorithms and point to examples of their applications in clinical work. Finally, we briefly discuss how to validate clusters. Throughout the chapter we highlight R packages suitable for each computational step of cluster analysis.

Author information

Corresponding author

Correspondence to Ana Pina.

Copyright information

© 2020 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Pina, A., Macedo, M.P., Henriques, R. (2020). Clustering Clinical Data in R. In: Matthiesen, R. (eds) Mass Spectrometry Data Analysis in Proteomics. Methods in Molecular Biology, vol 2051. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9744-2_14

  • DOI: https://doi.org/10.1007/978-1-4939-9744-2_14

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-4939-9743-5

  • Online ISBN: 978-1-4939-9744-2

  • eBook Packages: Springer Protocols
