ViralVectors: compact and scalable alignment-free virome feature generation

  • Original Article
Medical & Biological Engineering & Computing

Abstract

The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This volume will continue to grow geometrically for SARS-CoV-2 and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a compact feature vector generation method for virome sequencing data that allows effective downstream analysis. The generation is based on minimizers, a type of lightweight “signature” of a sequence traditionally used in assembly and read mapping; to our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.

Graphical Abstract

Graphical abstract showing all steps of the proposed approach. We start by collecting the sequence-based data. Data cleaning and preprocessing are then applied. After that, we generate the feature embeddings using the minimizer-based approach. Finally, classification and clustering algorithms are applied to the resulting data, and predictions are made on the test set.

Notes

  1. https://www.ncbi.nlm.nih.gov/

  2. https://github.com/slundberg/shap

References

  1. Ali S, Ali TE, Khan MA, Khan I, Patterson M (2021) Effective and scalable clustering of SARS-COV-2 sequences. In: International conference on big data research (ICBDR). pp 42–49

  2. Ali S, Bello B, Chourasia P, Punathil RT, Zhou Y, Patterson M (2022) PWM2VEC: An efficient embedding approach for viral host specification from coronavirus spike sequences. Biology 11(3):418

  3. Ali S, Patterson M (2021) Spike2Vec: An efficient and scalable embedding approach for COVID-19 spike sequences. In: IEEE international conference on big data (Big Data). pp 1533–1540

  4. Ali S, Sahoo B, Ullah N, Zelikovskiy A, Patterson M, Khan I (2021) A k-mer based approach for SARS-COV-2 variant identification. In: International symposium on bioinformatics research and applications. pp 153–164

  5. Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Comm Stats-theory Methods 3(1):1–27

  6. Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29(11):987–991

  7. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227

  8. De Silva NH, Bhai J, Chakiachvili M, Contreras-Moreira B, Cummins C, Frankish A, Gall A, Genez T, Howe KL, Hunt SE, et al (2021) The Ensembl COVID-19 resource: Ongoing integration of public SARS-COV-2 data. bioRxiv pp 2020–12

  9. Devijver P, Kittler J (1982) Pattern recognition: A statistical approach. In: London, GB: Prentice-Hall. pp 1–448

  10. Ekim B, Berger B, Chikhi R (2021) Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in min on a PC. Cell Syst 12(10):958-968.e6

  11. ElAbd H, Bromberg Y, Hoarfrost A, Lenz T, Franke A, Wendorff M (2020) Amino acid encoding for deep learning applications. Bioinformatics 21(1):1–14

  12. Farhan M, Tariq J, Zaman A, Shabbir M, Khan I (2017) Efficient approx algorithms for strings kernel based sequence classification. In: Advances in neural info processing sys (NeurIPS). pp 6935–6945

  13. Gardy J, Loman N (2018) Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet 19:9–20

  14. GISAID Website: https://www.gisaid.org/. Accessed 5 Jan 2022

  15. Hadfield J, Megill C, Bell S, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher R (2018) Nextstrain: real-time tracking of pathogen evo. Bioinformatics 34:4121–4123

  16. Hoerl AE, Kannard RW, Baldwin KF (1975) Ridge regression: some simulations. Comm Stat-Theory Methods 4(2):105-123

  17. Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser DM, Cambell L et al (2020) Ensembl genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Res 48(D1):D689–D695

  18. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292(2):195-202

  19. Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser MM, Cambell L et al (2020) Ensembl genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Res 48:D689–D695

  20. Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, Azov AG, Bennett R, Bhai J, Billis K (2021) Ensembl 2021. Nucleic Acids Res 49:D884–D891

  21. Kuzmin K, Adeniyi AE, DaSouza AK Jr, Lim D, Nguyen H, Molina NR, Xiong L, Weber IT, Harrison RW (2020) Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem Biophys Res Commun 533(3):553–558

  22. Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476

  23. Li H (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32:2103-2110

  24. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. pp 4765–4774

  25. Marçais G, DeBlasio D, Kingsford C (2018) Asymptotically optimal minimizers schemes. Bioinformatics 34:i13–i22

  26. Mei H, Liao ZH, Zhou Y, Li SZ (2005) A new set of amino acid descriptors and its application in peptide QSARs. Peptide Sci Original Res Biomol 80(6):775–786

  27. Mölder F, Jablonski KP, Letcher B, et al (2021) Sustainable data analysis with snakemake. F1000Res 10(33)

  28. Ondov B, Treangen T, Melsted P, et al (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17(132)

  29. Phylogenetic assignment of named global outbreak LINeages (Pangolin): https://cov-lineages.org/resources/pangolin.html. Accessed 4 Jan 2022

  30. Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z et al (2012) ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res 40(D1):D593–D598

  31. Rahimi A, Recht B, et al (2007) Random features for large-scale kernel machines. In: NIPS, vol 3. p 5

  32. Roberts M, Haynes W, Hunt B, Mount S, Yorke J (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics 20:3363–9

  33. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  34. Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10):1569–76

  35. Silva NHD, Bhai J, Chakiachvili M, et al (2021) The ensembl COVID-19 resource: ongoing integration of public SARS-COV-2 data. Nucleic Acids Research

  36. Solis-Reyes S, Avino M, Poon A, Kari L (2018) An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PloS ONE

  37. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011

  38. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288

  39. Toussaint NC, Widmer C, Kohlbacher O, Rätsch G (2010) Exploiting physico-chemical properties in string kernels. BMC Bioinforma 11(8):1–9

  40. Van DML, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res (JMLR) 9(11)

  41. Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52

  42. Wood D, Salzberg S (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15

  43. Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15(3):1–12

  44. Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY et al (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265–269

  45. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ et al (2015) Big data: Astronomical or genomical? PLoS Biol 13(7):e1002195

  46. Zheng H, Kingsford C, Marçais G (2020) Lower density selection schemes via small universal hitting sets with short remaining path len. In: ICRCMB. Springer, pp 202–217

Author information

Corresponding author

Correspondence to Sarwan Ali.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Baseline Models

We compare ViralVectors against several baseline and state-of-the-art (SOTA) methods. The baseline model is One Hot Embedding (OHE) [21], while the SOTA methods are Spike2Vec [3], PWM2Vec [2], and the Pango tool [29].

Table 7 Contingency tables of variants vs clusters after applying k-means on the OHE based feature embedding on GISAID data
Table 8 Contingency tables of variants vs clusters after applying k-means on the Spike2Vec based feature embedding on GISAID data
Table 9 Contingency tables of variants vs clusters after applying k-means on the ViralVectors based feature embedding on GISAID data
Table 10 Contingency tables of variants vs clusters after applying k-means on the OHE based feature embedding on ViPR data
Table 11 Contingency tables of variants vs clusters after applying k-means on the Spike2Vec based feature embedding on ViPR data
Table 12 Contingency tables of variants vs clusters after applying k-means on the ViralVectors based feature embedding on ViPR data
Table 13 Contingency tables of variants vs clusters after applying k-means on the OHE based feature embedding on NCBI short read data

A.1 One hot embedding (OHE) [21]

Since most machine learning methods do not work directly on raw biological sequences, it is important to convert them into a numerical representation. A traditional method for converting sequential information into a numerical representation is called one-hot embedding [4, 21]. Given a sequence (e.g., a spike sequence), we call the finite set of symbols that can appear in it the alphabet, denoted by \(\Sigma \). In the GISAID amino acid sequences, for example, we have 21 unique characters “ACDEFGHIKLMNPQRSTVWXY” (i.e., amino acids). To design a fixed-length feature vector representation, we generate a length-21 binary vector for each amino acid, which contains the value 1 at the position of that specific character and zero everywhere else. At the end, we concatenate all these vectors to get the final feature vector representation for a given sequence. In the GISAID amino acid sequences, since the length of each spike amino acid sequence is 1273, the length of each OHE based vector is \(1273 \times 21 = 26,733\) (more detail on the dataset can be found in Section 4.1). For the ViPR data, since the length of each spike amino acid sequence (after alignment) is 3498 and the number of unique characters is 24, the length of the OHE vector is \(3498 \times 24 = 83,952\). In the case of the NCBI raw short reads sequencing data, OHE does not apply, since we have variable-length unmapped reads rather than a single fixed-length sequence. After generating the feature vectors, we can give them as input to machine learning algorithms for classification and clustering purposes.
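The one-hot construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `one_hot_embedding` is hypothetical, and the sketch assumes every character of the input belongs to the alphabet.

```python
import numpy as np

# Alphabet taken from the GISAID example in the text (21 characters).
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def one_hot_embedding(sequence: str, alphabet: str = ALPHABET) -> np.ndarray:
    """Concatenate one length-|alphabet| indicator vector per residue.

    Hypothetical helper for illustration; a character outside the
    alphabet raises KeyError.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    size = len(alphabet)
    vec = np.zeros(len(sequence) * size, dtype=np.uint8)
    for pos, ch in enumerate(sequence):
        # Block `pos` of the vector belongs to the residue at position `pos`.
        vec[pos * size + index[ch]] = 1
    return vec


toy = one_hot_embedding("ACD")                  # 3 residues -> 3 * 21 entries
full_length = one_hot_embedding("A" * 1273)     # spike-length toy sequence
```

Because each residue contributes its own block, two sequences only yield comparable vectors when they have the same length, which is why the vector length is tied to the sequence length of 1273 in the GISAID case.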

Remark 1

Note that one problem with OHE is that it requires all sequences in a dataset to have the same fixed length [1, 4].

A.2 Spike2Vec [3]

Since OHE does not work with variable-length sequences, a popular alignment-free alternative uses k-mers to preserve the order of amino acids and then generates a fixed-length feature vector that contains the frequency of each k-mer in a virome sequence. In this setting, the first step is to compute the substrings of length k (called k-mers), where k is a user-defined parameter. The k-mers are generated using a sliding window approach with an increment of 1 (see Fig. 1). The total number of k-mers that can be generated from a virome sequence is \(N - k + 1\), where N is the length of the sequence.
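As a small illustration of the sliding-window step, the following Python sketch enumerates the k-mers of a toy sequence. The function name `kmers` and the toy input are assumptions made for this example only.

```python
from typing import List


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers, sliding the window by one position."""
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy_kmers = kmers("MFVFLVLL", k=3)  # N = 8, k = 3 -> N - k + 1 = 6 k-mers
```

For this toy input the window produces MFV, FVF, VFL, FLV, LVL, and VLL, matching the count of N - k + 1.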

Table 14 Contingency tables of variants vs clusters after applying k-means on the Spike2Vec based feature embedding on NCBI short read data
Table 15 Contingency tables of variants vs clusters after applying k-means on the ViralVectors based feature embedding on NCBI short read data

A.2.1 Fixed-length representation

Since each virome sequence can have a different number of k-mers, it is important to generate a fixed-length numerical representation so that classification and clustering algorithms can be applied. For this purpose, we design a feature vector of length \(|\Sigma |^{k}\) (where \(\Sigma \) is the alphabet and k is the user-defined k-mer length) that contains the frequency/count of each k-mer within a sequence. In this paper, we take \(k=3\) for all experiments unless specifically mentioned otherwise (decided using the standard validation set approach [9]). In the GISAID dataset, since the alphabet size is 21, the length of the Spike2Vec based feature vector is \(21^{3} = 9261\). For the ViPR dataset, the length of the Spike2Vec based vector is \(25^{3} = 15625\), and for the NCBI short reads data, it is \(24^{3} = 13824\).
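A minimal Python sketch of such a k-mer frequency vector follows. It is not the Spike2Vec code itself: the function name `kmer_frequency_vector` is hypothetical, and ordering the bins lexicographically by alphabet position is an assumption made for illustration.

```python
from itertools import product

import numpy as np

# Alphabet taken from the GISAID example in the text (21 characters).
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def kmer_frequency_vector(sequence, k=3, alphabet=ALPHABET):
    """Count every k-mer of `sequence` in a vector of length |alphabet|**k."""
    # Assumed bin order: all k-mers over the alphabet, last symbol varying fastest.
    # The table is rebuilt on each call for clarity; cache it for real workloads.
    bins = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = np.zeros(len(bins), dtype=np.int64)
    for i in range(len(sequence) - k + 1):
        vec[bins[sequence[i:i + k]]] += 1
    return vec


vec = kmer_frequency_vector("AAAC")  # contains the 3-mers AAA and AAC
```

The output length depends only on the alphabet size and k (here 21 to the power 3, i.e. 9261), not on the sequence length, so sequences of different lengths map to vectors of the same dimension.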

A.3 PWM2Vec [2]

With Spike2Vec, the resulting frequency vectors are comparatively low-dimensional but still high-dimensional in absolute terms. Moreover, while generating the frequency vectors, matching each k-mer to its appropriate location/bin in the vector (bin matching) can be computationally expensive. To address these issues, PWM2Vec [2] can be used. It is a recently proposed method for producing a fixed-length numerical feature vector based on the well-known position-weight matrix (PWM) notion [37]. PWM2Vec creates a PWM from the sequence’s k-mers, and the final feature vector contains the score of each k-mer in the PWM. This enables the method to use the ability of k-mers to capture localization information while also capturing the significance of each amino acid’s position in the sequence (information that is lost when computing a k-mer frequency vector). By combining these pieces of information, a compact and general feature embedding can be created that can be used for a variety of downstream machine learning tasks.
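To make the idea of scoring k-mers against a PWM concrete, here is a rough Python sketch. It should be read as one plausible instantiation and not as the PWM2Vec implementation: the pseudocount, the uniform background distribution, the log2 odds weighting, and the function name `pwm_kmer_scores` are all assumptions introduced for this example.

```python
import numpy as np

# Alphabet taken from the GISAID example in the text (21 characters).
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def pwm_kmer_scores(sequence, k=3, alphabet=ALPHABET, pseudocount=0.1):
    """Score each k-mer of `sequence` against a PWM built from those k-mers."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    kmer_list = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    # Position frequency matrix: rows are symbols, columns are positions 0..k-1.
    counts = np.zeros((len(alphabet), k))
    for kmer in kmer_list:
        for pos, ch in enumerate(kmer):
            counts[index[ch], pos] += 1

    # Assumed weighting: pseudocount smoothing, then log2 odds vs. a uniform background.
    smoothed = counts + pseudocount
    probs = smoothed / smoothed.sum(axis=0, keepdims=True)
    pwm = np.log2(probs * len(alphabet))

    # The score of a k-mer is the sum of its per-position weights.
    return np.array(
        [sum(pwm[index[ch], pos] for pos, ch in enumerate(kmer)) for kmer in kmer_list]
    )


scores = pwm_kmer_scores("ACDACDACD")  # 9 residues -> 7 scores for k = 3
```

In this sketch the output has one score per k-mer, so its length is N - k + 1; it is therefore fixed only when all input sequences share the same length N.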

A.4 Pango tool [29]

For clustering purposes, we also use the state-of-the-art clustering benchmark called the Pango tool [29]. Since the Pango tool takes multiple aligned sequences as input, we needed to align each read set to the reference genome, call (genomic) variants, and introduce these variants into the reference sequence to generate a consensus sequence that represents the particular sample. The pipeline is available as a Snakefile [27] in our shared code repository. The SARS-CoV-2 reference genome sequence (INSDC accession GCA_009858895.3, sequence MN908947) used in this study was obtained from the Ensembl COVID-19 browser database [19, 20]. It is a complete genome of 29903 bp. This genome is the reference assembly of the viral RNA genome isolated from one of the first cases (Wuhan-Hu-1, China) [44] and has reportedly been widely used as the standard reference [35].
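The last step of this pipeline, introducing called variants into the reference to obtain a consensus sequence, can be illustrated with a toy Python sketch. It is not taken from the authors' Snakefile: the function name `apply_variants`, the VCF-style `(position, ref, alt)` tuples with 1-based positions, and the assumption that variants do not overlap are all hypothetical choices for this example.

```python
def apply_variants(reference, variants):
    """Return a consensus sequence with the given variants applied.

    `variants` is an iterable of (pos, ref, alt) tuples, with 1-based
    positions and non-overlapping spans (assumptions of this sketch).
    """
    seq = reference
    # Apply from the rightmost variant so indels do not shift pending positions.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# One substitution (C -> T at position 2) and one deletion (AC -> A at position 5).
consensus = apply_variants("ACGTACGT", [(2, "C", "T"), (5, "AC", "A")])
```

Checking the reference allele before each edit guards against applying variants that were called against a different reference sequence.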

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ali, S., Chourasia, P., Tayebi, Z. et al. ViralVectors: compact and scalable alignment-free virome feature generation. Med Biol Eng Comput 61, 2607–2626 (2023). https://doi.org/10.1007/s11517-023-02837-8

  • DOI: https://doi.org/10.1007/s11517-023-02837-8
