Abstract
Clustering cancer patients into homogeneous subgroups can facilitate the development of subgroup-specific therapies, which is a fundamental principle of personalised medicine. However, the task is complex because of the great variation in the phenotypic and genotypic characteristics of patients, even within the same cancer type. Consequently, most proposed methods fail to guarantee that the resulting subtypes are separable in terms of their Kaplan–Meier survival plots. In this study, we propose a novel clustering framework for patient subtyping based on ideas from algebraic topology and manifold learning. The proposed method discovers subtypes with statistically significant differences in survival outcome. The methodology is tested on three cancer datasets obtained from The Cancer Genome Atlas, and the results are quantified in terms of Restricted Life Expectancy Difference and the Cox log-rank p value. The novelty of our methodology is that it is independent of the notion of similarity used and can discover subtypes that differ significantly in their Kaplan–Meier survival plots even when only a single omics profile of the patients is used.
Abbreviations
- TCGA: The Cancer Genome Atlas
- CoD: Curse of Dimensionality
- RLED: Restricted Life Expectancy Difference
- RMST: Restricted Mean Survival Time
- UMAP: Uniform Manifold Approximation and Projection
- t-SNE: t-Distributed Stochastic Neighbor Embedding
- SNF: Similarity Network Fusion
- RSC: Robust and Sparse Correlation
- PCA: Principal Component Analysis
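RMST and RLED, listed above, are the quantities used to report survival separation between subtypes. As an illustration only, the following minimal sketch computes the restricted mean survival time as the area under a Kaplan–Meier step curve up to a horizon `tau`, and the RLED as the difference in RMST between two groups. The function names, the horizon `tau`, and the toy follow-up times are assumptions made for this example; they are not taken from the article, whose own implementation may differ.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the Kaplan-Meier survival estimate at each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = np.empty(event_times.size)
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(time >= t)              # patients still under observation at t
        deaths = np.sum((time == t) & event)     # events observed exactly at t
        s *= 1.0 - deaths / at_risk
        surv[i] = s
    return event_times, surv

def rmst(time, event, tau):
    """Restricted mean survival time: area under the KM step curve on [0, tau]."""
    t, s = kaplan_meier(time, event)
    keep = t < tau
    grid = np.concatenate(([0.0], t[keep], [tau]))
    heights = np.concatenate(([1.0], s[keep]))   # curve height on each interval of the grid
    return float(np.sum(np.diff(grid) * heights))

def rled(time_a, event_a, time_b, event_b, tau):
    """Restricted life expectancy difference: RMST of group A minus RMST of group B."""
    return rmst(time_a, event_a, tau) - rmst(time_b, event_b, tau)

# Hypothetical toy data: follow-up times (event = 1, censored = 0) for two subtypes.
subtype_a = ([2, 4, 6], [1, 1, 1])
subtype_b = ([1, 2, 3], [1, 1, 1])
difference = rled(*subtype_a, *subtype_b, tau=6)
```

In this sketch a larger absolute RLED indicates a larger gap in expected survival time up to `tau` between the two subtypes. The log-rank p value mentioned in the abstract is a separate test and is not computed here.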
About this article
Cite this article
Rather, A.A., Chachoo, M.A. UMAP guided topological analysis of transcriptomic data for cancer subtyping. Int. j. inf. tecnol. 14, 2855–2865 (2022). https://doi.org/10.1007/s41870-022-01048-y
DOI: https://doi.org/10.1007/s41870-022-01048-y