Clustering SARS-CoV-2 Variants from Raw High-Throughput Sequencing Reads Data

Chourasia, Prakash; Ali, Sarwan; Ciccolella, Simone; Della Vedova, Gianluca; Patterson, Murray

doi:10.1007/978-3-031-17531-2_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13254)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences

Abstract

The massive amount of genomic data appearing over the past two years for SARS-CoV-2 has challenged traditional methods for studying the dynamics of the COVID-19 pandemic. As a result, new methods, such as the Pangolin tool, have appeared that can scale to the millions of SARS-CoV-2 samples currently available. Pangolin is tailored to take assembled, aligned and curated full-length sequences, such as those provided by GISAID, as input. As high-throughput sequencing technologies continue to advance, such assembly, alignment and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly.

In this paper, we propose several alignment-free embedding approaches, which can generate a fixed-length feature vector representation directly from the raw sequencing reads, without the need for assembly. Because such an embedding is a numerical representation, it can be passed to already highly optimized clustering methods such as k-means. We show that the clusterings we obtain with the proposed embeddings are better suited to this setting than those of the Pangolin tool, based on several internal clustering evaluation metrics. We also show that a disproportionate number of positions in the spike region of the SARS-CoV-2 genome inform such clusterings (in terms of information gain), which is consistent with current biological knowledge of SARS-CoV-2.
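
The general idea of an alignment-free embedding followed by k-means can be illustrated with a minimal sketch. The abstract does not specify which embeddings the paper proposes, so the k-mer frequency vector used here is an assumption chosen for illustration, not the authors' method. The function names (`kmer_embedding`, `kmeans`), the toy reads and the deterministic farthest-first initialization are likewise hypothetical; a real pipeline would more plausibly rely on an optimized library implementation of k-means.

```python
import itertools
from collections import Counter

ALPHABET = "ACGT"


def kmer_embedding(reads, k=3):
    """Map one sample (a list of raw reads) to a fixed-length k-mer frequency vector.

    Assumption for illustration: a plain k-mer spectrum, with no assembly
    or alignment. The vector always has len(ALPHABET) ** k entries.
    """
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(ALPHABET, repeat=k))}
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in index:  # skip k-mers with N or other ambiguity codes
                counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    vec = [0.0] * len(index)
    for kmer, c in counts.items():
        vec[index[kmer]] = c / total
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    centers = [list(vectors[0])]
    while len(centers) < n_clusters:
        farthest = max(vectors,
                       key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [min(range(n_clusters), key=lambda c: sq_dist(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy samples (made-up reads, not real SARS-CoV-2 data).
samples = {
    "A1": ["AAAAAAAA", "AAAAAAAT"],
    "A2": ["AAAAAAAA", "TAAAAAAA"],
    "B1": ["GCGCGCGC", "GCGCGCGA"],
    "B2": ["CGCGCGCG", "GCGCGCGC"],
}
vectors = [kmer_embedding(reads, k=3) for reads in samples.values()]
labels = kmeans(vectors, n_clusters=2)
print(dict(zip(samples, labels)))
```

Every sample maps to a vector of the same length (64 for k = 3) regardless of how many reads it contains or how long they are, which is what allows a generic clustering method to be applied directly. On the toy input, the two A samples and the two B samples should fall into separate clusters.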

Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, 30303, USA
Prakash Chourasia, Sarwan Ali & Murray Patterson
University of Milano-Bicocca, Milan, Italy
Simone Ciccolella & Gianluca Della Vedova

Corresponding author

Correspondence to Murray Patterson.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Mukul S. Bansal
University of Connecticut, Storrs, CT, USA
Ion Măndoiu
University of Connecticut Health Center, Farmington, CT, USA
Marmar Moussa
Georgia State University, Atlanta, GA, USA
Murray Patterson
University of Connecticut, Storrs, CT, USA
Sanguthevar Rajasekaran
Georgia State University, Atlanta, GA, USA
Pavel Skums
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

About this paper

Cite this paper

Chourasia, P., Ali, S., Ciccolella, S., Della Vedova, G., Patterson, M. (2022). Clustering SARS-CoV-2 Variants from Raw High-Throughput Sequencing Reads Data. In: Bansal, M.S., et al. Computational Advances in Bio and Medical Sciences. ICCABS 2021. Lecture Notes in Computer Science, vol 13254. Springer, Cham. https://doi.org/10.1007/978-3-031-17531-2_11

DOI: https://doi.org/10.1007/978-3-031-17531-2_11
Published: 19 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17530-5
Online ISBN: 978-3-031-17531-2
eBook Packages: Computer Science, Computer Science (R0)
