Abstract
We are currently witnessing a paradigm shift from evidence-based medicine to precision medicine, which has been made possible by the enormous development of technology. The advances in data mining algorithms will allow us to integrate trans-omics with clinical data, contributing to our understanding of pathological mechanisms and massively impacting on the clinical sciences. Cluster analysis is one of the main data mining techniques and allows for the exploration of data patterns that the human mind cannot capture.
This chapter focuses on the cluster analysis of clinical data, using the statistical software, R. We outline the cluster analysis process, underlining some clinical data characteristics. Starting with the data preprocessing step, we then discuss the advantages and disadvantages of the most commonly used clustering algorithms and point to examples of their applications in clinical work. Finally, we briefly discuss how to perform validation of clusters. Throughout the chapter we highlight R packages suitable for each computational step of cluster analysis.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Raghupathi W, Raghupathi V (2014) Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2:1–10. https://doi.org/10.1186/2047-2501-2-3
Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Appl Deliv Strat 949:4. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Accessed 21 Jan 2019
Huang S, Chaudhary K, Garmire LX (2017) More is better: recent progress in multi-omics data integration methods. Front Genet 8:1–12. https://doi.org/10.3389/fgene.2017.00084
Larose DT, Larose CD (2015) Clustering. In: Data mining and predictive analytics, 2nd edn. Wiley, Chichester, UK
Islam S, Hasan M, Wang X et al (2018) A systematic review on healthcare analytics: application and theoretical perspective of data mining. Healthcare 6(54):1–43. https://doi.org/10.3390/healthcare6020054
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed 21 Jan 2019
Walkowiak S (2016) Big data analytics with R: utilize R to uncover hidden patterns in your big data. Packt Publishing Limited, Birmingham, UK
RStudio Team (2016) RStudio: integrated development environment for R. RStudio, Inc, Boston, MA. http://www.rstudio.com/. Accessed 21 Jan 2019
Kuhn M et al (2018) caret: classification and regression training. R package version 6.0-80. https://CRAN.R-project.org/package=caret. Accessed 21 Jan 2019
Wickham H (2017) tidyverse: easily install and load the “tidyverse”. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse. Accessed 21 Jan 2019
Wickham H, Henry L (2018) tidyr: easily tidy data with “spread()” and “gather()” functions. R package version 0.8.1. https://CRAN.R-project.org/package=tidyr. Accessed 21 Jan 2019
Wickham H, François R, Henry L, Müller K (2018). dplyr: a grammar of data manipulation. R package version 0.7.6. https://CRAN.R-project.org/package=dplyr. Accessed 21 Jan 2019
Bennett DA (2001) How can I deal with missing data in my study? Aust N Z J Public Health 25:464–469
Wehrens R, Buydens L (2007) Self- and super-organizing maps in R: the kohonen package. J Stat Softw 21(5):1–19
Fox J (2018) RcmdrMisc: R commander miscellaneous functions. R package version 2.5-1. https://CRAN.R-project.org/package=RcmdrMisc. Accessed 21 Jan 2019
Wickham H (2007) Reshaping data with the reshape package. J Stat Softw 21(12):1–20
Bellman R (1957) Dynamic programming. Princeton University Press, Princeton, NJ
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507–2517. https://doi.org/10.1093/bioinformatics/btm344
Alelyani S, Tang J, Liu H (2013) Feature selection for clustering: a review. In: Data clustering: algorithms and applications. CRC Press, Boca Raton, FL, pp 110–121
Dy J, Brodley C (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Anukrishna PR, Paul V (2017) A review on feature selection for high dimensional data. In: I2017 International conference on inventive systems and control (ICISC), pp 1–4
Pacheco E (2015) Unsupervised learning with R: work with over 40 packages to draw inferences from complex datasets and find hidden patterns in raw unstructured data. Packt Publishing, Birmingham, UK
Romanski P, Kotthoff L (2018) FSelector: selecting attributes. R package version 0.31. https://CRAN.R-project.org/package=FSelector. Accessed 21 Jan 2019
Raftery LS, Raftery AE (2018) clustvarsel: a package implementing variable selection for Gaussian model-based clustering in R. J Stat Softw 84:1–28. https://doi.org/10.18637/jss.v084.i01
Williams G, Huang J, Chen X, Wang Q, Xiao L(2015) wskm: weighted k-means clustering. R package version 1.4.28. http://CRAN.R-project.org/package=wskm. Accessed 21 Jan 2019
Jolliffe IT (2010) Principal component analysis. Springer, New York
Le S, Josse J, Husson F (2008) FactoMineR: an R package for multivariate analysis. J Stat Softw 25:1–18. https://doi.org/10.18637/jss.v025.i01
Maechler M et al (2018) cluster: cluster analysis basics and extensions. R package version 2.0.7-1. https://cran.r-project.org/web/packages/cluster/cluster.pdf. Accessed 21 Jan 2019
Kassambara A, Mundt F (2017) factoextra: extract and visualize the results of multivariate data analyses. R package version 1.0.5. https://CRAN.R-project.org/package=factoextra. Accessed 21 Jan 2019
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
Kaufman L, Rousseeuw PJ (2005) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques. Elsevier, Amsterdam
Xu R, Wunsch DC (2010) Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng 3:120–154. https://doi.org/10.1109/RBME.2010.2083647
Abdullah Z, Hamdan AR (2015) Hierarchical clustering algorithms in data mining. Int J Comput Elect Autom Control Inf Eng 9(10)
Williams E, Colasanti R, Wolffs K et al (2018) Classification of tidal breathing airflow profiles using statistical hierarchal cluster analysis in idiopathic pulmonary fibrosis. Med Sci 6:75. https://doi.org/10.3390/medsci6030075
Vincent A, Hoskin TL, Whipple MO et al (2014) OMERACT-based fibromyalgia symptom subgroups: an exploratory cluster analysis. Arthritis Res Ther 16:1–11. https://doi.org/10.1186/s13075-014-0463-7
Ahlqvist E, Storm P, Karajamaki A et al (2018) Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol 6:361–369. https://doi.org/10.1016/S2213-8587(18)30051-2
Toppila I (2016) Identifying novel phenotype profiles of diabetic complications and their genetic components using machine learning approaches. Aalto University, Helsinki, Finland
Burgel P-R, Paillasseur J-L, Roche N (2014) Identification of clinical phenotypes using cluster analyses in COPD patients with multiple comorbidities. BioMed Res Int 2014:420134, 9 pages. https://doi.org/10.1155/2014/420134
Galili T (2015) dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. https://doi.org/10.1093/bioinformatics/btv428
Swarndeep Saket J, Pandya S (2016) An overview of partitioning algorithms in clustering techniques. Int J Adv Res Comput Eng Technol 5:2278–1323
Berkin P (2006) Grouping multidimensional data. In: Grouping multidimensional data. Springer, Berlin, pp 25–71
Boomija MD (2008) Comparison of partition based clustering algorithms. J Comput Appl 1:18–21
Lewis SJG, Foltynie T, Blackwell AD et al (2005) Heterogeneity of Parkinson’s disease in the early clinical stages using a data driven approach. J Neurol Neurosurg Psychiatry 76:343–348. https://doi.org/10.1136/jnnp.2003.033530
Ha NT, Harris M, Preen D et al (2018) Identifying patterns of general practitioner service utilisation and their relationship with potentially preventable hospitalisations in people with diabetes: the utility of a cluster analysis approach. Diabetes Res Clin Pract 138:201–210. https://doi.org/10.1016/j.diabres.2018.01.027
Ahmad T, Lund LH, Rao P et al (2018) Machine learning methods improve prognostication, identify clinically distinct phenotypes, and detect heterogeneity in response to therapy in a large cohort of heart failure patients. J Am Heart Assoc 7:1–15. https://doi.org/10.1161/JAHA.117.008081
Lucas A (2018) amap: another multidimensional analysis package. R package version 0.8-16. https://CRAN.R-project.org/package=amap. Accessed 21 Jan 2019
Szepannek G (2018) clustMixType: k-prototypes clustering for mixed variable-type data. R package version 0.1-36. https://CRAN.R-project.org/package=clustMixType. Accessed 21 Jan 2019
Velmurugan T (2015) Clustering lung cancer data by k-means and k-medoids algorithms. In: International conference on information and convergence technology for smart society, pp 17–21
Kohonen T (1990) The self-organizing map. Proc IEEE 78:1464–1480. https://doi.org/10.1109/5.58325
Hajek P, Henriques R, Hajkova V (2014) Visualising components of regional innovation systems using self-organizing maps-evidence from European regions. Technol Forecast Soc Change 84:197–214. https://doi.org/10.1016/j.techfore.2013.07.013
Paul M, Shaw CK, David W (1996) A comparison of SOM neural network and hierarchical clustering methods. Eur J Oper Res 93:402–417
Cabanes G, Bennani Y (2010) Learning the number of clusters in self organizing map. In: Self-organizing maps. Intech, Croatia. https://doi.org/10.5772/9164
Kohonen T (2001) Self-organizing maps. Springer, Berlin
Henriques R, Bacao F, Lobo V (2012) Exploratory geospatial data analysis using the GeoSOM suite. Comput Environ Urban Syst 36:218–232. https://doi.org/10.1016/j.compenvurbsys.2011.11.003
Ultsch A (2007) Emergence in self organizing feature maps. WSOM 2007 - 6th Int work self-organizing maps
Wehrens M, Kruisselbrink J (2018) Self- and super-organising maps in R: the kohonen package. J Stat Softw 21(5)
Markey MK, Lo JY, Tourassi GD, Floyd CE (2003) Self-organizing map for cluster analysis of a breast cancer database. Artif Intell Med 27:113–127. https://doi.org/10.1016/S0933-3657(03)00003-4
Vanfleteren LEGW, Spruit MA, Groenen M et al (2013) Clusters of comorbidities based on validated objective measurements and systemic inflammation in patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med 187:728–735. https://doi.org/10.1164/rccm.201209-1665OC
Pina AF, Patarrão RS, Ribeiro RT, Penha-Gonçalves C, Raposo JF, de Oliveira RM, Gardete-Correia L, Duarte R, Boavida JM, Andrade R, Correia I, Medina JL, Henriques R, Macedo MP (2018) Are the normal glucose tolerance individuals totally outside of the diabetes spectrum? Diabetologia 61:S143
Bhuyan R, Borah S (2013) A survey of some density based clustering techniques. In: Conf. advancements in information, computer and communication
Ankerst M, Breunig M, Kriegel H-P, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: ACM SIGMOD international conference on Management of data. ACM Press, New York, pp 49–60
Hennig C (2018) fpc: flexible procedures for clustering. R package version 2.1-11.1. https://CRAN.R-project.org/package=fpc. Accessed 21 Jan 2019
Hahsler M, Piekenbrock M (2018) dbscan: density based clustering of applications with noise (DBSCAN) and related algorithms. R package version 1.1-3. https://CRAN.R-project.org/package=dbscan. Accessed 21 Jan 2019
Celebi ME, Aslandogan YA, Bergstresser PR (2005) Mining biomedical images with density-based clustering. In: International conference on information technology: coding and computing (ITCC‘05) - Volume II. IEEE, Washington, DC, pp 163–168
Bouveyron C, Brunet C (2013) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:1–47. https://doi.org/10.1016/j.csda.2012.12.008
Rubin DB, Dempster AP, Laird N (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39:1–38
Couvreur C (1997) The EM algorithm: a guided tour. Comput Intens Methods Control Signal Process 1997:209–222. https://doi.org/10.1007/978-1-4612-1996-5
Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588. https://doi.org/10.1093/comjnl/41.8.578
Scrucca L, Fop M, Murphy TB, Raftery A (2017) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):205–233
Hwang S, Oh J, Cox J et al (2006) Blood detection in Wireless Capsule Endoscopy using expectation maximization clustering. In: Progress in Biomedical Optics and Imaging - Proceedings of SPIE
Saini S, Rani P (2017) A survey on STING and CLIQUE grid based clustering methods. Int J Adv Res Comput Sci 8:2015–2017
Mann AK, Kaur N (2013) Grid density based clustering algorithm. Int J Adv Res Comput Eng Technol 2:2143–2147
Ilango M, Mohan V (2010) A survey of grid based clustering algorithms. Int J Eng Sci Technol 2:3441–3446
Yue S, Shi T, Wang J, Wang P (2012) Application of grid-based K-means clustering algorithm for optimal image processing. Comput Sci Inf Syst 9:1679–1696. https://doi.org/10.2298/CSIS120126052S
Waller L, Gotway C (2004) Applied spatial statistics for public health data. Wiley, Hoboken, NJ
Kassambara A (2017) Practical guide to cluster analysis in R: unsupervised machine learning. STHDA
Charrad M, Ghazzali N, Boiteau V, Niknafs A (2014) NbClust: an R Package for determining the relevant number of clusters in a data set. J Stat Softw 61:1–36
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17:107–145. https://doi.org/10.1023/A:1012801612483
Nieweglowski L (2013) clv: cluster validation techniques. R package version 0.3-2.1. https://CRAN.R-project.org/package=clv. Accessed 21 Jan 2019
Brock G, Pihur V, Datta S (2008) clValid: an R package for cluster validation. J Stat Softw 25:1–22
Wolpert D, Macready G (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput. https://doi.org/10.1109/4235.585893
Alqurashi T, Wang W (2018) Clustering ensemble method. Int J Mach Learn Cyber. https://doi.org/10.1007/s13042-017-0756-7
Chiu DS, Talhouk A (2018) DiceR: an R package for class discovery using an ensemble driven approach. BMC Bioinformatics 19:17–20. https://doi.org/10.1186/s12859-017-1996-y
Wickham H (2016) ggplot2: elegant graphics for data analysis. Springer, New York
Wilkinson L (2005) The grammar or grammar of graphics. Springer, New York
Harrell FE Jr, with contributions from CD and many others (2018) Hmisc: Harrell miscellaneous. R package version 4.1-1. https://CRAN.R-project.org/package=Hmisc. Accessed 22 Jan 2019
van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45:1–67
Honaker J, King G, Blackwell M (2011) Amelia II: a program for missing data. J Stat Softw 45:1–47
Stekhoven DJ, Bühlmann P (2012) Missforest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28:112–118. https://doi.org/10.1093/bioinformatics/btr597
Ilango V, Subramanian R, Vasudevan V (2012) A five step procedure for outlier analysis in data mining. Eur J Sci Res 75:327–339
Steinbach M, Ertöz L, Kumar V (2004) New directions in statistical physics. In: The challenges of clustering high dimensional data. Springer, Berlin, pp 273–309
Rendón E, Abundez I, Arizmendi A, Quiroz E (2011) Internal versus external cluster validation indexes. Int J Comput Commun 5(1):27–34
Davis DL, Bouldin DW (1998) A cluster separation measure. IEEE Trans Pattern Anal MachIntel PAMI 1(2):224–227
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comp Appl Math 20:53–65
Dunn JC (1973) A Fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cyber 3:32–57
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Pina, A., Macedo, M.P., Henriques, R. (2020). Clustering Clinical Data in R. In: Matthiesen, R. (eds) Mass Spectrometry Data Analysis in Proteomics. Methods in Molecular Biology, vol 2051. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9744-2_14
Download citation
DOI: https://doi.org/10.1007/978-1-4939-9744-2_14
Published:
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9743-5
Online ISBN: 978-1-4939-9744-2
eBook Packages: Springer Protocols