Clustering Clinical Data in R

Pina, Ana; Macedo, Maria Paula; Henriques, Roberto

doi:10.1007/978-1-4939-9744-2_14

Part of the book series: Methods in Molecular Biology (MIMB, volume 2051)

Abstract

We are currently witnessing a paradigm shift from evidence-based medicine to precision medicine, made possible by rapid advances in technology. Advances in data mining algorithms will allow us to integrate trans-omics with clinical data, contributing to our understanding of pathological mechanisms and strongly impacting the clinical sciences. Cluster analysis is one of the main data mining techniques and allows for the exploration of data patterns that the human mind cannot capture.

This chapter focuses on the cluster analysis of clinical data using the statistical software R. We outline the cluster analysis process, underlining some characteristics of clinical data. We start with the data preprocessing step, then discuss the advantages and disadvantages of the most commonly used clustering algorithms and point to examples of their applications in clinical work. Finally, we briefly discuss how to validate clusters. Throughout the chapter, we highlight R packages suitable for each computational step of cluster analysis.
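The workflow summarised above (preprocessing, clustering, validation) can be illustrated with a minimal R sketch. This is not taken from the chapter: the built-in iris measurements stand in for a clinical table, the choice of three clusters is an assumption made for illustration, and only base R plus the recommended cluster package are used.

```r
# Minimal sketch of a cluster analysis workflow: preprocess, cluster, validate.
# Assumptions: iris stands in for clinical data; k = 3 is chosen for illustration.
library(cluster)  # provides silhouette()

set.seed(42)  # k-means uses random starts; fix the seed for reproducibility

# Hypothetical stand-in for a clinical table with numeric variables only
x <- iris[, 1:4]

# 1. Preprocessing: drop incomplete rows and standardise each variable
x <- x[complete.cases(x), ]
x_scaled <- scale(x)

# 2a. Partitioning approach: k-means with 25 random starts
km <- kmeans(x_scaled, centers = 3, nstart = 25)

# 2b. Hierarchical approach: Ward linkage on Euclidean distances, cut into 3 groups
d <- dist(x_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# 3. Internal validation: average silhouette width of each solution
sil_km <- silhouette(km$cluster, d)
sil_hc <- silhouette(hc_groups, d)
avg_sil_km <- mean(sil_km[, "sil_width"])
avg_sil_hc <- mean(sil_hc[, "sil_width"])

print(c(kmeans = avg_sil_km, hclust = avg_sil_hc))

# Cross-tabulate the two partitions to see how far they agree
print(table(kmeans = km$cluster, hclust = hc_groups))
```

Average silhouette width ranges from -1 to 1, with higher values suggesting more compact and better separated clusters. In practice the number of clusters would be chosen by comparing such indices over several candidate values, not fixed in advance as it is here.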

Author information

Authors and Affiliations

Centro de Estudos de Doenças Crónicas (CEDOC), NOVA Medical School-Faculdade de Ciências Médicas, Universidade NOVA de Lisboa, Lisbon, Portugal
Ana Pina & Maria Paula Macedo
ProRegeM PhD Programme, NOVA Medical School/Faculdade de Ciências Médicas, Universidade NOVA de Lisboa, Lisbon, Portugal
Ana Pina
Department of Medical Sciences, Institute of Biomedicine, University of Aveiro, Aveiro, Portugal
Ana Pina & Maria Paula Macedo
APDP-Diabetes Portugal Education and Research Center (APDP-ERC), Lisbon, Portugal
Maria Paula Macedo
NOVA Information Management School (NOVA IMS), Universidade NOVA de Lisboa, Lisbon, Portugal
Roberto Henriques

Corresponding author

Correspondence to Ana Pina.

Editor information

Editors and Affiliations

Computational and Experimental Biology Group, CEDOC, Chronic Diseases Research Centre, NOVA Medical School, Faculdade de Ciências Médicas, Universidade NOVA de Lisboa, Lisboa, Portugal
Rune Matthiesen

Cite this protocol

Pina, A., Macedo, M.P., Henriques, R. (2020). Clustering Clinical Data in R. In: Matthiesen, R. (eds) Mass Spectrometry Data Analysis in Proteomics. Methods in Molecular Biology, vol 2051. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9744-2_14

DOI: https://doi.org/10.1007/978-1-4939-9744-2_14
Published: 25 September 2019
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9743-5
Online ISBN: 978-1-4939-9744-2
eBook Packages: Springer Protocols
