Clustering SARS-CoV-2 Variants from Raw High-Throughput Sequencing Reads Data

  • Conference paper
Computational Advances in Bio and Medical Sciences (ICCABS 2021)

Abstract

The massive amount of genomic data that has appeared for SARS-CoV-2 over the past two years has challenged traditional methods for studying the dynamics of the COVID-19 pandemic. As a result, new methods such as the Pangolin tool have appeared, which can scale to the millions of SARS-CoV-2 samples currently available. Pangolin, however, is tailored to take assembled, aligned and curated full-length sequences, such as those provided by GISAID, as input. As high-throughput sequencing technologies continue to advance, such assembly, alignment and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly.

In this paper, we propose several alignment-free embedding approaches that generate a fixed-length feature vector representation directly from the raw sequencing reads, without the need for assembly. Because such an embedding is a numerical representation, it can be passed to already highly optimized clustering methods such as k-means. We show that, according to several internal clustering evaluation metrics, the clusterings obtained with the proposed embeddings are better suited to this setting than those of the Pangolin tool. Moreover, we show that a disproportionate number of positions in the spike region of the SARS-CoV-2 genome inform these clusterings (in terms of information gain), which is consistent with current biological knowledge of SARS-CoV-2.
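To make the idea of a fixed-length, alignment-free embedding concrete, the following is a minimal sketch that assumes a plain k-mer frequency representation. The abstract does not specify the embeddings the paper actually proposes, so the function name `kmer_embedding`, the choice `k=3`, and the toy reads are all illustrative assumptions rather than the authors' method.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_embedding(reads, k=3):
    """Return a normalized k-mer frequency vector of length 4**k for one sample.

    Hypothetical helper: every read of the sample contributes its k-mers to a
    single count table, so no assembly or alignment is required.
    """
    # Fixed vocabulary in lexicographic order, so every sample maps to the
    # same coordinates regardless of how many reads it has.
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers that contain ambiguous bases such as N.
            if all(base in ALPHABET for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Toy samples (made-up reads, not real sequencing data).
sample_a = ["ACGTACGT", "CGTACG"]
sample_b = ["TTTTGGGG", "ACGNNACG", "GG"]

vec_a = kmer_embedding(sample_a, k=3)
vec_b = kmer_embedding(sample_b, k=3)
print(len(vec_a), len(vec_b))
```

Because every sample maps to a vector of the same length, a list of such vectors could be handed to an off-the-shelf clustering routine such as k-means and scored with internal metrics such as the silhouette coefficient.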

Notes

  1. https://www.ncbi.nlm.nih.gov/sars-cov-2/.

  2. https://covid-19.ensembl.org/index.html.

  3. https://github.com/murraypatterson/ncbi-sra-runs-pipeline.

  4. https://drive.google.com/drive/folders/1i4uRrnkjkwUA93EOl8YORBBLb7yIFIm1?usp=sharing.

  5. https://github.com/murraypatterson/ncbi-sra-runs-pipeline.

Author information

Corresponding author

Correspondence to Murray Patterson.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chourasia, P., Ali, S., Ciccolella, S., Della Vedova, G., Patterson, M. (2022). Clustering SARS-CoV-2 Variants from Raw High-Throughput Sequencing Reads Data. In: Bansal, M.S., et al. Computational Advances in Bio and Medical Sciences. ICCABS 2021. Lecture Notes in Computer Science, vol 13254. Springer, Cham. https://doi.org/10.1007/978-3-031-17531-2_11

  • DOI: https://doi.org/10.1007/978-3-031-17531-2_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17530-5

  • Online ISBN: 978-3-031-17531-2

  • eBook Packages: Computer Science, Computer Science (R0)
