Abstract
With more than half a million SARS-CoV-2 sequences available and counting, many approaches have recently appeared that aim to leverage this information to understand the genomic diversity and dynamics of the virus. Early approaches involved building transmission networks or phylogenetic trees; for the latter, scalability becomes more of an issue with each day because of the high computational complexity of tree construction.
In this work, we propose an alternative approach that clusters sequences to identify novel subtypes of SARS-CoV-2, using methods designed for haplotyping intra-host viral populations. We assess this approach using cluster entropy, a notion that naturally captures the underlying process of viral mutation; this is the first time entropy has been used in this context. Using our approach, we were able to identify the well-known B.1.1.7 subtype from the sequences of the EMBL-EBI (UK) database, and we also show that the associated cluster is consistent with a measure of fitness. This demonstrates that our approach is a viable and scalable alternative for unveiling trends in the spread of SARS-CoV-2.
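To make the notion of cluster entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that sequences are aligned and of equal length, that the entropy of a cluster is the sum of per-position Shannon entropies, and that a clustering is scored by the size-weighted average over its clusters. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def cluster_entropy(sequences):
    """Sum of per-position entropies over equal-length aligned sequences."""
    return sum(column_entropy(col) for col in zip(*sequences))


def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies of a clustering."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Toy aligned sequences (hypothetical data, not taken from the paper).
seqs = ["ACGT", "ACGT", "ACGA", "TCGA"]

# One candidate clustering: identical sequences grouped together.
split = [["ACGT", "ACGT"], ["ACGA", "TCGA"]]

# Baseline: all sequences in a single cluster.
single = [seqs]

print(expected_entropy(split), expected_entropy(single))
```

Under these assumptions, a lower score means that sequences within each cluster agree at more positions, so the split clustering scores lower than the single-cluster baseline.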
Acknowledgments
AM, SK, BS, RH and AZ were partially supported by NSF grant 16119110 and NIH grant 1R01EB025022-01. FM and PS were partially supported by NIH grant 1R01EB025022-01. AM, BS and SK were partially supported by a GSU Molecular Basis of Disease Fellowship. MP was supported by a GSU startup grant.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Melnyk, A. et al. (2021). Clustering Based Identification of SARS-CoV-2 Subtypes. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2020. Lecture Notes in Computer Science, vol 12686. Springer, Cham. https://doi.org/10.1007/978-3-030-79290-9_11
DOI: https://doi.org/10.1007/978-3-030-79290-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79289-3
Online ISBN: 978-3-030-79290-9
eBook Packages: Computer Science, Computer Science (R0)