Abstract
Uniform Manifold Approximation and Projection (UMAP) is an effective non-linear dimensionality reduction (DR) method that has recently been applied in biomedical informatics analysis. However, UMAP’s data transformation process is complicated and lacks transparency. Principal component analysis (PCA) is a conventional and essential DR method for analysing single-cell datasets, and its projection is linear and easy to interpret. UMAP is more scalable and accurate, but its complex algorithm makes it challenging to earn users’ trust. Another challenge is that some single-cell data have too many dimensions, which makes computation inefficient and less accurate. This paper uses linked, interactive visualisations to help users understand UMAP results by comparing them with PCA results. An explainable machine learning method, SHapley Additive exPlanations (SHAP) run on a Random Forest (RF) model, is used to optimise the input single-cell data so that the UMAP and PCA processes are more efficient. We demonstrate that this approach can be applied to high-dimensional omics data exploration to visually validate informative molecular markers and cell populations identified from the UMAP-reduced dimensionality space.
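The feature-selection idea behind SHAP rests on Shapley values, which split a model's output for one sample among its input features. The following is a minimal pure-Python sketch of that idea, not the authors' pipeline (which runs SHAP on a Random Forest). The names `toy_model`, `cell`, `baseline`, and the cutoff of two retained markers are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a value function defined on feature subsets."""
    phi = [0.0] * n_features
    n_fact = factorial(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n_features - size - 1) / n_fact
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_model(x):
    # Hypothetical stand-in for a trained model: markers 0 and 1 interact,
    # marker 2 has a weak effect, marker 3 is ignored.
    return 2.0 * x[0] + 1.0 * x[1] + 4.0 * x[0] * x[1] + 0.5 * x[2]


def make_value_fn(model, x, baseline):
    """Value of a coalition: model output with absent features set to the baseline."""
    def value_fn(subset):
        masked = [x[j] if j in subset else baseline[j] for j in range(len(x))]
        return model(masked)
    return value_fn


cell = [1.0, 1.0, 1.0, 1.0]      # hypothetical marker values for one cell
baseline = [0.0, 0.0, 0.0, 0.0]  # hypothetical reference values

phi = shapley_values(make_value_fn(toy_model, cell, baseline), len(cell))

# Rank markers by absolute contribution and keep the strongest ones
# as the reduced input for a later projection step.
ranked = sorted(range(len(phi)), key=lambda j: abs(phi[j]), reverse=True)
top_k = ranked[:2]
print(phi, top_k)
```

Exact enumeration grows exponentially with the number of features, so it is only practical for toy cases like this one. For real omics data, a library implementation that exploits the tree structure of the Random Forest would be the usual choice.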
Acknowledgement
We thank Yu “Max” Qian of the J. Craig Venter Institute and BioLegend for providing the TotalSeq dataset used in the paper. All datasets used in the study are fully de-identified and do not contain any protected health information.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qu, Z., Tegegne, Y., Simoff, S.J., Kennedy, P.J., Catchpoole, D.R., Nguyen, Q.V. (2022). Enhancing Understandability of Omics Data with SHAP, Embedding Projections and Interactive Visualisations. In: Park, L.A.F., et al. Data Mining. AusDM 2022. Communications in Computer and Information Science, vol 1741. Springer, Singapore. https://doi.org/10.1007/978-981-19-8746-5_5
DOI: https://doi.org/10.1007/978-981-19-8746-5_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8745-8
Online ISBN: 978-981-19-8746-5
eBook Packages: Computer Science, Computer Science (R0)