Enhancing Understandability of Omics Data with SHAP, Embedding Projections and Interactive Visualisations

  • Conference paper
Data Mining (AusDM 2022)

Abstract

Uniform Manifold Approximation and Projection (UMAP) is a relatively new and effective non-linear dimensionality reduction (DR) method that has recently been applied in biomedical informatics analysis. However, UMAP’s data transformation process is complicated and lacks transparency. Principal component analysis (PCA), by contrast, is a conventional and essential DR method for analysing single-cell datasets, and its projection is linear and easy to interpret. UMAP is more scalable and accurate, but its complex algorithm makes it challenging to earn users’ trust. Another challenge is that some single-cell data have too many dimensions, which makes computation inefficient and less accurate. This paper uses linked, interactive visualisations to help users understand UMAP results by comparing them with PCA results. An explainable machine learning method, SHapley Additive exPlanations (SHAP) run on a Random Forest (RF) model, is used to optimise the input single-cell data so that the UMAP and PCA processes become more efficient. We demonstrate that this approach can be applied to high-dimensional omics data exploration to visually validate informative molecule markers and cell populations identified from the UMAP-reduced dimensionality space.
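
The idea of ranking input features by their Shapley contributions before dimensionality reduction can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract describes SHAP run on a Random Forest, whereas this example computes exact Shapley values by brute force for a hypothetical hand-written model (`toy_model`) on made-up rows, using only the Python standard library. The function names, the all-zero baseline, and the choice of keeping two features are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline row.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features, so this
    is only practical for small toy inputs.
    """
    n = len(x)

    def value(coalition):
        row = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(row)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def rank_features(predict, rows, baseline):
    """Rank features by mean absolute Shapley value over all rows."""
    n = len(baseline)
    totals = [0.0] * n
    for x in rows:
        for i, p in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(p)
    importance = [t / len(rows) for t in totals]
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    return order, importance


def toy_model(row):
    # Hypothetical model: feature 0 dominates, features 1 and 2 interact,
    # and feature 3 is ignored entirely.
    return 3.0 * row[0] + row[1] * row[2] + 0.0 * row[3]


# Made-up data standing in for a (much larger) expression matrix.
rows = [
    [1.0, 2.0, 1.0, 5.0],
    [2.0, 0.0, 3.0, -4.0],
    [0.5, 1.0, 1.0, 9.0],
]
baseline = [0.0, 0.0, 0.0, 0.0]

order, importance = rank_features(toy_model, rows, baseline)

# Keep only the top-ranked features; the reduced rows would then be the
# input to a projection method such as PCA or UMAP.
top_k = order[:2]
reduced_rows = [[x[i] for i in top_k] for x in rows]
```

In this toy setting the ignored feature receives zero importance and is dropped, which mirrors the purpose described in the abstract: fewer, more informative input dimensions for the later projection step. A practical implementation would rely on an approximation suited to tree ensembles rather than brute-force enumeration.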

Acknowledgement

We thank Yu “Max” Qian of the J. Craig Venter Institute and the BioLegend company for providing the TotalSeq dataset used in the paper. All datasets used in the study are fully de-identified and do not contain any protected health information.

Author information

Corresponding author

Correspondence to Zhonglin Qu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Qu, Z., Tegegne, Y., Simoff, S.J., Kennedy, P.J., Catchpoole, D.R., Nguyen, Q.V. (2022). Enhancing Understandability of Omics Data with SHAP, Embedding Projections and Interactive Visualisations. In: Park, L.A.F., et al. Data Mining. AusDM 2022. Communications in Computer and Information Science, vol 1741. Springer, Singapore. https://doi.org/10.1007/978-981-19-8746-5_5

  • DOI: https://doi.org/10.1007/978-981-19-8746-5_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8745-8

  • Online ISBN: 978-981-19-8746-5

  • eBook Packages: Computer Science (R0)
