Clustering Based Identification of SARS-CoV-2 Subtypes

Melnyk, Andrew; Mohebbi, Fatemeh; Knyazev, Sergey; Sahoo, Bikram; Hosseini, Roya; Skums, Pavel; Zelikovsky, Alex; Patterson, Murray

doi:10.1007/978-3-030-79290-9_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12686)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences

Abstract

With more than half a million SARS-CoV-2 sequences available and counting, many approaches have recently appeared that aim to leverage this information to understand the genomic diversity and dynamics of the virus. Early approaches involved building transmission networks or phylogenetic trees; the latter scale poorly as the number of sequences grows each day, owing to their high computational complexity.

In this work, we propose an alternative approach based on clustering sequences to identify novel subtypes of SARS-CoV-2 using methods designed for haplotyping intra-host viral populations. We assess this approach using cluster entropy, a notion that naturally captures the underlying process of viral mutation and that is used here for the first time in this context. Using our approach, we were able to identify the well-known B.1.1.7 subtype from the sequences of the EMBL-EBI (UK) database, and we also show that the associated cluster is consistent with a measure of fitness. This demonstrates that our approach is a viable and scalable alternative for unveiling trends in the spread of SARS-CoV-2.
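
As a rough illustration of the kind of quantity cluster entropy refers to, the following minimal sketch computes a per-cluster entropy for aligned sequences and a size-weighted average over a clustering. The abstract does not give the exact definition used in the paper, so the formulation here (sum of per-position Shannon entropies, weighted by cluster size) is an assumption, and the function names `column_entropy`, `cluster_entropy` and `expected_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed at one aligned position."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cluster_entropy(seqs):
    """Sum of per-position entropies over a cluster of equal-length aligned sequences.

    Assumption: positions are treated as independent, so a cluster whose
    members are identical at every position has entropy 0.
    """
    if not seqs:
        return 0.0
    # zip(*seqs) iterates over alignment columns.
    return sum(column_entropy(col) for col in zip(*seqs))


def expected_entropy(clusters):
    """Average cluster entropy, weighted by the fraction of sequences in each cluster."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


if __name__ == "__main__":
    # Toy aligned sequences, for illustration only (not real SARS-CoV-2 data).
    clustering = [["ACGT", "ACGA"], ["TTGT", "TTGT", "TTGT"]]
    print(expected_entropy(clustering))
```

Under this formulation, a lower weighted entropy indicates clusters whose members agree at more positions, which is the sense in which such a score can be used to compare alternative clusterings of the same sequences.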

Acknowledgments

AM, SK, BS, RH and AZ were partially supported by NSF grant 16119110 and NIH grant 1R01EB025022-01. FM and PS were partially supported by NIH grant 1R01EB025022-01. AM, BS and SK were partially supported by a GSU Molecular Basis of Disease Fellowship. MP was supported by a GSU startup grant.

Author information

Authors and Affiliations

Department of Computer Science, Georgia State University, Atlanta, GA, USA
Andrew Melnyk, Fatemeh Mohebbi, Sergey Knyazev, Bikram Sahoo, Roya Hosseini, Pavel Skums, Alex Zelikovsky & Murray Patterson

Corresponding authors

Correspondence to Alex Zelikovsky or Murray Patterson.

Editor information

Editors and Affiliations

The University of Texas at San Antonio, San Antonio, TX, USA
Sumit Kumar Jha
University of Connecticut, Storrs, CT, USA
Ion Măndoiu
University of Connecticut, Storrs Mansfield, CT, USA
Sanguthevar Rajasekaran
Department of Computer Science, Georgia State University, Roswell, GA, USA
Pavel Skums
Department of Computer Science, Georgia State University, Atlanta, GA, USA
Alex Zelikovsky

About this paper

Cite this paper

Melnyk, A. et al. (2021). Clustering Based Identification of SARS-CoV-2 Subtypes. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_11

DOI: https://doi.org/10.1007/978-3-030-79290-9_11
Published: 03 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79289-3
Online ISBN: 978-3-030-79290-9
eBook Packages: Computer Science, Computer Science (R0)
