Abstract
Background
A defining characteristic of bioinformatics data is the combination of a very large number of features with a relatively small number of samples. The sheer number of features makes it difficult to understand a target domain intuitively. Dimensionality reduction, or manifold learning, has the potential to circumvent this obstacle, but in practice only a restricted set of methods has been preferred.
Objective
The objective of this study is to observe the characteristics of several dimensionality reduction methods, namely locally linear embedding (LLE), multidimensional scaling (MDS), principal component analysis (PCA), spectral embedding (SE), and t-distributed stochastic neighbor embedding (t-SNE), on the RNA-Seq dataset from the Genotype-Tissue Expression (GTEx) project.
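As a rough illustration of what applying these five methods to one data matrix can look like, the sketch below uses scikit-learn, which is an assumption here since the abstract does not name the software used. The data is a synthetic stand-in generated with `make_blobs`, not the GTEx RNA-Seq dataset, and the helper name `make_reducers` and all hyperparameters (`n_neighbors`, `perplexity`, `cluster_std`) are hypothetical choices made only for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE, LocallyLinearEmbedding, SpectralEmbedding

# Synthetic stand-in for a (samples x features) expression matrix.
# This is NOT GTEx data; it only mimics "many features, few samples".
X, y = make_blobs(n_samples=90, n_features=200, centers=3,
                  cluster_std=8.0, random_state=0)


def make_reducers(n_components, random_state=0):
    """Hypothetical helper: one estimator per method named in the abstract."""
    return {
        "LLE": LocallyLinearEmbedding(n_components=n_components,
                                      n_neighbors=15,
                                      random_state=random_state),
        "MDS": MDS(n_components=n_components, random_state=random_state),
        "PCA": PCA(n_components=n_components, random_state=random_state),
        "SE": SpectralEmbedding(n_components=n_components,
                                n_neighbors=15,
                                random_state=random_state),
        "t-SNE": TSNE(n_components=n_components, perplexity=20,
                      random_state=random_state),
    }


# Project the same matrix into two dimensions with every method.
embeddings = {name: reducer.fit_transform(X)
              for name, reducer in make_reducers(2).items()}

for name, Z in embeddings.items():
    print(name, Z.shape)
```

Each entry of `embeddings` can then be passed to a scatter plot for visual comparison. Note that scikit-learn's default Barnes-Hut t-SNE only supports fewer than four output dimensions, so a four-dimensional embedding would need `method="exact"`.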
Results
The characteristics of the dimensionality reduction methods are observed on nine groups of three different tissues, in reduced spaces of two, three, and four dimensions. The visualizations show that each dimensionality reduction method produces a very distinct reduced space. Quantitative results are obtained as the performance of k-means clustering in each reduced space. Clustering in the reduced spaces produced by non-linear methods such as LLE, t-SNE, and SE achieved better results than clustering in those produced by linear methods such as PCA and MDS.
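The quantitative evaluation described above can be sketched as follows: reduce the data, run k-means in the reduced space, and score the cluster assignments against the known group labels. This is a minimal sketch under stated assumptions, not the study's actual pipeline. The data is synthetic, the adjusted Rand index is an assumed choice of score (the abstract does not say which measure was used), and setting the number of clusters to nine is likewise an assumption. Only one linear method (PCA) and one non-linear method (SE) are shown, and `score_reduction` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in with nine labeled groups; NOT the GTEx data.
X, y = make_blobs(n_samples=270, n_features=100, centers=9,
                  cluster_std=8.0, random_state=0)


def score_reduction(reducer, X, y, n_clusters):
    """Hypothetical helper: embed X, cluster with k-means, compare to labels y."""
    Z = reducer.fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(Z)
    # Adjusted Rand index is an assumed metric; 1.0 means perfect agreement.
    return adjusted_rand_score(y, labels)


scores = {}
for dim in (2, 3, 4):  # reduced dimensionalities named in the abstract
    scores[("PCA", dim)] = score_reduction(
        PCA(n_components=dim), X, y, n_clusters=9)
    scores[("SE", dim)] = score_reduction(
        SpectralEmbedding(n_components=dim, n_neighbors=15, random_state=0),
        X, y, n_clusters=9)

for (name, dim), ari in sorted(scores.items()):
    print(f"{name} d={dim}: ARI={ari:.3f}")
```

Scores obtained on synthetic blobs say nothing about the relative ranking of the methods on real expression data; the sketch only shows the shape of the comparison.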
Conclusions
The experimental results suggest applying both linear and non-linear dimensionality reduction methods to the target data in order to grasp the underlying characteristics of a dataset intuitively.
Acknowledgements
This study was supported by a 2017 Research Grant from Kangwon National University and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2016R1D1A1B03931615 and 2018R1D1A1B07047156).
Author information
Contributions
H-SS conceived the study, carried out data analysis, and drafted the manuscript.
Ethics declarations
Conflict of interest
The author declares that he has no competing interests.
About this article
Cite this article
Seok, HS. Performance comparison of dimensionality reduction methods on RNA-Seq data from the GTEx project. Genes Genom 42, 225–234 (2020). https://doi.org/10.1007/s13258-019-00896-6
DOI: https://doi.org/10.1007/s13258-019-00896-6