Clustering Based Identification of SARS-CoV-2 Subtypes

  • Conference paper
Computational Advances in Bio and Medical Sciences (ICCABS 2020)

Abstract

With more than half a million SARS-CoV-2 sequences available and counting, many approaches have recently appeared that aim to leverage this information to understand the genomic diversity and dynamics of the virus. Early approaches involved building transmission networks or phylogenetic trees; for the latter, scalability becomes more of an issue each day because of the high computational complexity of tree construction.

In this work, we propose an alternative approach: clustering sequences to identify novel subtypes of SARS-CoV-2, using methods designed for haplotyping intra-host viral populations. We assess this approach using cluster entropy, a notion that naturally captures the underlying process of viral mutation; this is the first use of entropy in this context. Using our approach, we identified the well-known B.1.1.7 subtype from the sequences of the EMBL-EBI (UK) database, and we show that the associated cluster is consistent with a measure of fitness. This demonstrates that our approach is a viable and scalable alternative for unveiling trends in the spread of SARS-CoV-2.
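To make the notion of cluster entropy concrete, here is a minimal sketch in Python. The abstract does not give the paper's exact definition, so this rests on assumptions: the entropy of a cluster is taken as the sum of per-position Shannon entropies over aligned, equal-length sequences, and a clustering is scored by the size-weighted average of its cluster entropies. The function names `column_entropy`, `cluster_entropy`, and `clustering_entropy` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter
from typing import Sequence


def column_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the symbols seen at one alignment position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cluster_entropy(sequences: Sequence[str]) -> float:
    """Assumed definition: sum of per-position entropies over aligned sequences."""
    if not sequences:
        return 0.0
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return sum(column_entropy([s[i] for s in sequences]) for i in range(length))


def clustering_entropy(clusters: Sequence[Sequence[str]]) -> float:
    """Assumed score: size-weighted average of cluster entropies (lower is more homogeneous)."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data.
    one_cluster = [["AA", "AA", "AC", "AG"]]
    two_clusters = [["AA", "AA"], ["AC", "AG"]]
    print(clustering_entropy(one_cluster))
    print(clustering_entropy(two_clusters))
```

Under these assumptions, a clustering that groups sequences sharing the same mutations scores lower than one that mixes them, which is the sense in which entropy tracks the accumulation of mutations within a cluster.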

Acknowledgments

AM, SK, BS, RH and AZ were partially supported by NSF grant 16119110 and NIH grant 1R01EB025022-01. FM and PS were partially supported by NIH grant 1R01EB025022-01. AM, BS and SK were partially supported by a GSU Molecular Basis of Disease Fellowship. MP was supported by a GSU startup grant.

Author information

Corresponding authors

Correspondence to Alex Zelikovsky or Murray Patterson.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Melnyk, A. et al. (2021). Clustering Based Identification of SARS-CoV-2 Subtypes. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_11

  • DOI: https://doi.org/10.1007/978-3-030-79290-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-79289-3

  • Online ISBN: 978-3-030-79290-9

  • eBook Packages: Computer Science, Computer Science (R0)
