A Hierarchical Clustering algorithm based on Silhouette Index for cancer subtype discovery from genomic data

Nidheesh, N.; Nazeer, K. A. Abdul; Ameer, P. M.

doi:10.1007/s00521-019-04636-5

Original Article
Published: 30 November 2019

Volume 32, pages 11459–11476 (2020)
Neural Computing and Applications

Abstract

Identifying potential novel subtypes of cancers from genomic data requires techniques to estimate the number of natural clusters in the data. Determining the number of natural clusters in a dataset has been a challenging problem in Machine Learning. A widely used technique for estimating this number is to employ an internal cluster validity index, such as the Silhouette Index, together with a clustering algorithm, but this technique has limitations. We propose a Hierarchical Agglomerative Clustering algorithm that automatically estimates the numbers of natural clusters and gives the associated clustering solutions, along with dendrograms for visualizing the clustering structure. In the process of building the clustering hierarchy, the algorithm uses a Silhouette Index-based criterion to select the pair of clusters to merge. When applied to a collection of cancer gene expression datasets and general datasets, the proposed algorithm found reasonable estimates of the number of natural clusters and the associated clustering solutions. The proposed method showed better overall performance than a set of prominent methods for estimating the number of natural clusters that are used for cancer subtype discovery from genomic data. The proposed method is deterministic. It can be a better alternative to contemporary approaches for identifying potential novel subtypes of cancers from genomic data.
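
The idea of letting the Silhouette Index steer the merges of an agglomerative hierarchy can be illustrated with a small sketch. The abstract does not state the exact merge criterion, so the rule used here (at each step, merge the pair of clusters whose union gives the highest mean silhouette, then report the level of the hierarchy with the highest score) is an assumption made for illustration and should not be read as the authors' algorithm. The function names `silhouette_index` and `silhouette_guided_agglomeration`, the Euclidean distance, and the toy points are likewise hypothetical. The silhouette definition itself follows Rousseeuw (1987), with singleton clusters scored as 0.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_index(points, clusters):
    """Mean silhouette width of a partition.

    `clusters` is a list of lists of point indices. For point i,
    a(i) is its mean distance to the other members of its own cluster,
    b(i) is the smallest mean distance to the members of any other cluster,
    and s(i) = (b - a) / max(a, b). Singleton members get s(i) = 0.
    """
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total, n = 0.0, 0
    for ci, members in enumerate(clusters):
        for i in members:
            n += 1
            if len(members) == 1:
                continue  # convention: s(i) = 0 for singletons
            a = sum(euclidean(points[i], points[j])
                    for j in members if j != i) / (len(members) - 1)
            b = min(
                sum(euclidean(points[i], points[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            denom = max(a, b)
            total += (b - a) / denom if denom > 0 else 0.0
    return total / n


def silhouette_guided_agglomeration(points):
    """Greedy bottom-up merging guided by the mean silhouette (assumed rule).

    Returns (best_k, best_clusters, history), where history lists the
    (number_of_clusters, score) pair recorded after every merge.
    """
    if len(points) < 3:
        raise ValueError("need at least three points")
    clusters = [[i] for i in range(len(points))]
    history = []
    best_score, best_clusters = -2.0, None  # silhouette is always >= -1
    while len(clusters) > 2:
        candidate = None
        # Try every pair; ties keep the first pair found, so the run is deterministic.
        for x, y in combinations(range(len(clusters)), 2):
            merged = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
            merged.append(clusters[x] + clusters[y])
            score = silhouette_index(points, merged)
            if candidate is None or score > candidate[0]:
                candidate = (score, merged)
        score, clusters = candidate
        history.append((len(clusters), score))
        if score > best_score:
            best_score, best_clusters = score, clusters
    return len(best_clusters), best_clusters, history


if __name__ == "__main__":
    # Toy data: three compact, well-separated groups (illustrative only).
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
           (20, 0), (20, 1), (21, 0)]
    k, found, trace = silhouette_guided_agglomeration(pts)
    print("estimated number of clusters:", k)
    print("clusters:", found)
```

This brute-force version re-evaluates the silhouette for every candidate pair, so it is only practical for small inputs. It is meant to show how the estimate of the number of clusters and the clustering solution can come out of the same hierarchy, not to reproduce the performance reported in the article.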

Notes

https://www.ncbi.nlm.nih.gov/geo/.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology Calicut, Calicut, Kerala, 673601, India
N. Nidheesh & P. M. Ameer
Department of Computer Science and Engineering, National Institute of Technology Calicut, Calicut, Kerala, 673601, India
K. A. Abdul Nazeer

Authors

N. Nidheesh
K. A. Abdul Nazeer
P. M. Ameer

Corresponding author

Correspondence to N. Nidheesh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary material 1 (pdf 462 KB)

About this article

Cite this article

Nidheesh, N., Nazeer, K.A.A. & Ameer, P.M. A Hierarchical Clustering algorithm based on Silhouette Index for cancer subtype discovery from genomic data. Neural Comput & Applic 32, 11459–11476 (2020). https://doi.org/10.1007/s00521-019-04636-5

Received: 07 October 2018
Accepted: 22 November 2019
Published: 30 November 2019
Issue Date: August 2020
DOI: https://doi.org/10.1007/s00521-019-04636-5
