UMAP guided topological analysis of transcriptomic data for cancer subtyping

  • Original Research
International Journal of Information Technology

Abstract

Clustering cancer patients into homogeneous subgroups can facilitate the development of subgroup-specific therapies, which is the fundamental principle of personalised medicine. However, the process is complex because patients vary widely in their phenotypic and genotypic characteristics, even within the same cancer type. Consequently, most proposed methods fail to guarantee that the resulting subtypes are separable in terms of subtype-specific Kaplan–Meier survival plots. In this study, we propose a novel clustering framework for patient subtyping based on ideas from algebraic topology and manifold learning. The proposed method discovers subtypes whose survival outcomes differ to a statistically significant degree. The methodology is tested on three cancer datasets obtained from The Cancer Genome Atlas, and the results are quantified in terms of the Restricted Life Expectancy Difference and the Cox log-rank p value. The novelty of our methodology is that it is independent of the notion of similarity used and can discover subtypes with significantly different Kaplan–Meier survival plots even when only a single omics profile of the patients is used.
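
To make the survival-based evaluation concrete, the following is a minimal sketch of how a Restricted Life Expectancy Difference could be computed for two subgroups. It assumes, since the abstract does not spell this out, that the quantity is the difference in Restricted Mean Survival Time (the area under the Kaplan–Meier curve up to a horizon tau) between the subgroups. The function names `kaplan_meier`, `rmst` and `rled` and the toy data are illustrative and are not taken from the article.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return the Kaplan-Meier step function as (event_time, survival) pairs.

    events[i] is 1 if the event was observed at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both observed events and censored subjects leave the risk set.
        at_risk -= removed
    return curve


def rmst(times: Sequence[float], events: Sequence[int], tau: float) -> float:
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    # Last flat segment up to the restriction time tau.
    area += prev_s * (tau - prev_t)
    return area


def rled(group_a, group_b, tau: float) -> float:
    """Assumed definition: RMST of subgroup A minus RMST of subgroup B."""
    (times_a, events_a), (times_b, events_b) = group_a, group_b
    return rmst(times_a, events_a, tau) - rmst(times_b, events_b, tau)


# Toy subgroups (hypothetical data): (follow-up times, event indicators).
subtype_a = ([2, 4, 6, 8], [1, 1, 0, 1])
subtype_b = ([1, 2, 3, 4], [1, 1, 1, 1])
difference = rled(subtype_a, subtype_b, tau=10)
```

A larger absolute difference indicates better-separated survival curves over the chosen horizon. In practice it would be reported alongside a log-rank test p value, which this sketch does not compute.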

Figs. 1 to 7 (images not reproduced here). Fig. 2 is marked as reprinted from https://www.nature.com/articles/srep01236.

Abbreviations

TCGA: The Cancer Genome Atlas
CoD: Curse of Dimensionality
RLED: Restricted Life Expectancy Difference
RMST: Restricted Mean Survival Time
UMAP: Uniform Manifold Approximation and Projection
t-SNE: t-Distributed Stochastic Neighbour Embedding
SNF: Similarity Network Fusion
RSC: Robust and Sparse Correlation
PCA: Principal Component Analysis

Author information

Corresponding authors

Correspondence to Arif Ahmad Rather or Manzoor Ahmad Chachoo.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rather, A.A., Chachoo, M.A. UMAP guided topological analysis of transcriptomic data for cancer subtyping. Int. j. inf. tecnol. 14, 2855–2865 (2022). https://doi.org/10.1007/s41870-022-01048-y

  • DOI: https://doi.org/10.1007/s41870-022-01048-y
