ViralVectors: compact and scalable alignment-free virome feature generation

Ali, Sarwan; Chourasia, Prakash; Tayebi, Zahra; Bello, Babatunde; Patterson, Murray

doi:10.1007/s11517-023-02837-8


Original Article
Published: 03 July 2023

Volume 61, pages 2607–2626, (2023)

Medical & Biological Engineering & Computing

Sarwan Ali ORCID: orcid.org/0000-0001-8121-2168¹,
Prakash Chourasia¹,
Zahra Tayebi¹,
Babatunde Bello¹ &
Murray Patterson¹


Abstract

The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This amount will continue to grow geometrically for SARS-CoV-2, and for other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or to regions of interest (e.g., spike). In this work, we propose ViralVectors, a compact feature vector generation from virome sequencing data that allows effective downstream analysis. Such generation is based on minimizers, a type of lightweight “signature” of a sequence that is traditionally used in assembly and read mapping (to our knowledge, this is the first use of minimizers in this way). We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.

Graphical Abstract

Graphical abstract showing all steps of the proposed approach. We start by collecting the sequence data. Data cleaning and preprocessing are then applied. After that, we generate the feature embeddings using the minimizer-based approach. Finally, classification and clustering algorithms are applied to the resulting data, and predictions are made on the test set.




References

Ali S, Ali TE, Khan MA, Khan I, Patterson M (2021) Effective and scalable clustering of SARS-COV-2 sequences. In: International conference on big data research (ICBDR). pp 42–49
Ali S, Bello B, Chourasia P, Punathil RT, Zhou Y, Patterson M (2022) PWM2VEC: An efficient embedding approach for viral host specification from coronavirus spike sequences. Biology 11(3):418
Ali S, Patterson M (2021) Spike2Vec: An efficient and scalable embedding approach for COVID-19 spike sequences. In: IEEE international conference on big data (Big Data). pp 1533–1540
Ali S, Sahoo B, Ullah N, Zelikovskiy A, Patterson M, Khan I (2021) A k-mer based approach for SARS-COV-2 variant identification. In: International symposium on bioinformatics research and applications. pp 153–164
Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Comm Stats-theory Methods 3(1):1–27
Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29(11):987–991
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
De Silva NH, Bhai J, Chakiachvili M, Contreras-Moreira B, Cummins C, Frankish A, Gall A, Genez T, Howe KL, Hunt SE, et al (2021) The Ensembl COVID-19 resource: Ongoing integration of public SARS-COV-2 data. bioRxiv pp 2020–12
Devijver P, Kittler J (1982) Pattern recognition: A statistical approach. In: London, GB: Prentice-Hall. pp 1–448
Ekim B, Berger B, Chikhi R (2021) Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in min on a PC. Cell Syst 12(10):958-968.e6
ElAbd H, Bromberg Y, Hoarfrost A, Lenz T, Franke A, Wendorff M (2020) Amino acid encoding for deep learning applications. Bioinformatics 21(1):1–14
Farhan M, Tariq J, Zaman A, Shabbir M, Khan I (2017) Efficient approx algorithms for strings kernel based sequence classification. In: Advances in neural info processing sys (NeurIPS). pp 6935–6945
Gardy J, Loman N (2018) Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet 19:9–20
GISAID Website: https://www.gisaid.org/. Accessed 5 Jan 2022
Hadfield J, Megill C, Bell S, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher R (2018) Nextstrain: real-time tracking of pathogen evo. Bioinformatics 34:4121–4123
Hoerl AE, Kennard RW, Baldwin KF (1975) Ridge regression: some simulations. Comm Stat-Theory Methods 4(2):105–123
Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser DM, Cambell L et al (2020) Ensembl genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Res 48(D1):D689–D695
Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292(2):195-202
Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser MM, Cambell L et al (2020) Ensembl genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Res 48:D689–D695
Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, Azov AG, Bennett R, Bhai J, Billis K (2021) Ensembl 2021. Nucleic Acids Res 49:D884–D891
Kuzmin K, Adeniyi AE, DaSouza AK Jr, Lim D, Nguyen H, Molina NR, Xiong L, Weber IT, Harrison RW (2020) Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem Biophys Res Commun 533(3):553–558
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476
Li H (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32:2103-2110
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. pp 4765–4774
Marçais G, DeBlasio D, Kingsford C (2018) Asymptotically optimal minimizers schemes. Bioinformatics 34:i13–i22
Mei H, Liao ZH, Zhou Y, Li SZ (2005) A new set of amino acid descriptors and its application in peptide QSARs. Peptide Sci Original Res Biomol 80(6):775–786
Mölder F, Jablonski KP, Letcher B, et al (2021) Sustainable data analysis with Snakemake. F1000Res 10(33)
Ondov B, Treangen T, Melsted P, et al (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17(132)
Phylogenetic assignment of named global outbreak LINeages (Pangolin): https://cov-lineages.org/resources/pangolin.html. Accessed 4 Jan 2022
Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z et al (2012) ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res 40(D1):D593–D598
Rahimi A, Recht B, et al (2007) Random features for large-scale kernel machines. In: NIPS, vol 3. p 5
Roberts M, Haynes W, Hunt B, Mount S, Yorke J (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics 20:3363–9
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10):1569–76
Silva NHD, Bhai J, Chakiachvili M, et al (2021) The ensembl COVID-19 resource: ongoing integration of public SARS-COV-2 data. Nucleic Acids Research
Solis-Reyes S, Avino M, Poon A, Kari L (2018) An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PloS ONE
Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Toussaint NC, Widmer C, Kohlbacher O, Rätsch G (2010) Exploiting physico-chemical properties in string kernels. BMC Bioinforma 11(8):1–9
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res (JMLR) 9(11)
Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52
Wood D, Salzberg S (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15
Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15(3):1–12
Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY et al (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265–269
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ et al (2015) Big data: Astronomical or genomical? PLoS Biol 13(7):e1002195
Zheng H, Kingsford C, Marçais G (2020) Lower density selection schemes via small universal hitting sets with short remaining path len. In: ICRCMB. Springer, pp 202–217


Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, USA
Sarwan Ali, Prakash Chourasia, Zahra Tayebi, Babatunde Bello & Murray Patterson


Corresponding author

Correspondence to Sarwan Ali.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Baseline Models

We use different baseline and state-of-the-art (SOTA) methods to compare the results with ViralVectors. The baseline model that we are using is One Hot Embedding (OHE) [21] while the SOTA methods are Spike2Vec [3], PWM2Vec [2], and Pango Tool [29].

Table 7 Contingency tables of variants vs clusters after applying k-means on the OHE based feature embedding on GISAID data

Table 8 Contingency tables of variants vs clusters after applying k-means on the Spike2Vec based feature embedding on GISAID data

Table 9 Contingency tables of variants vs clusters after applying k-means on the ViralVectors based feature embedding on GISAID data

Table 10 Contingency tables of variants vs clusters after applying k-means on the OHE based feature embedding on ViPR data

Table 11 Contingency tables of variants vs clusters after applying k-means on the Spike2Vec based feature embedding on ViPR data

Table 12 Contingency tables of variants vs clusters after applying k-means on the ViralVectors based feature embedding on ViPR data

Table 13 Contingency tables of variants vs clusters after applying k-means on the OHE based feature embedding on NCBI short read data

A.1 One hot embedding (OHE) [21]

Since most machine learning methods do not work directly on raw biological sequences, it is important to convert the sequences into a numerical representation. A traditional method for doing so is one-hot embedding [4, 21]. Given a sequence (e.g., a spike sequence), we call the finite set of symbols that can appear in it the alphabet, denoted by \(\Sigma \). In the GISAID amino acid sequences, for example, we have 21 unique characters “ACDEFGHIKLMNPQRSTVWXY” (i.e., amino acids). To design a fixed-length feature vector representation, we generate a length-21 binary vector for each amino acid in the sequence, which contains the value 1 at the position of that specific character and 0 everywhere else. Finally, we concatenate all these vectors to get the feature vector for the given sequence. In the GISAID data, since the length of each spike amino acid sequence is 1273, the length of each OHE-based vector is \(1273 \times 21 = 26,733\) (more detail on the dataset can be found in Section 4.1). For the ViPR data, since the length of each spike amino acid sequence (after alignment) is 3498 and the number of unique characters is 24, the length of the OHE vector is \(3498 \times 24 = 83,952\). In the case of the NCBI raw short-read sequencing data, OHE does not apply, since we have variable-length unmapped reads rather than a single fixed-length sequence. After generating the feature vectors, we give them as input to machine learning algorithms for classification and clustering.
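The encoding described above can be illustrated with a minimal Python sketch. It assumes the 21-character alphabet quoted in the text; the function name `one_hot_encode` and the toy fragment are hypothetical and are not taken from the authors' code.

```python
# Alphabet quoted in the text for the GISAID spike amino acid sequences.
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Concatenate one length-|alphabet| indicator vector per character."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    width = len(alphabet)
    vector = [0] * (len(sequence) * width)
    for position, ch in enumerate(sequence):
        # A character outside the alphabet raises KeyError.
        vector[position * width + index[ch]] = 1
    return vector


# Hypothetical 4-residue fragment, used only for illustration.
toy_vector = one_hot_encode("MFVF")
```

For a sequence of length 1273 this produces a vector of length 26,733, which matches the dimension stated for the GISAID data.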

Remark 1

Note that one problem with OHE is that it requires all sequences in a dataset to have the same fixed length [1, 4].

A.2 Spike2Vec [3]

Since OHE does not work with variable-length sequences, a popular alignment-free alternative is to use k-mers, which preserve the order of amino acids, and then to generate a fixed-length feature vector that contains the frequency of each k-mer in a virome sequence. In this setting, the first step is to compute the substrings of length k (called k-mers), where k is a user-defined parameter. The k-mers are generated using a sliding window with an increment of 1 (see Fig. 1). The total number of k-mers that can be generated from a virome sequence is \(N - k + 1\), where N is the length of the sequence.
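As a small illustration of the sliding-window step, the following Python sketch extracts the k-mers of a toy sequence. The function name `kmers` and the 9-residue example are hypothetical.

```python
def kmers(sequence, k=3):
    """Return all length-k substrings, sliding the window by 1 each step."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 9-residue fragment: N = 9 and k = 3 give 9 - 3 + 1 = 7 k-mers.
toy_kmers = kmers("MFVFLVLLP", k=3)
```

A sequence shorter than k yields an empty list, since no full window fits.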

Table 14 Contingency tables of variants vs clusters after applying k-means on the Spike2Vec based feature embedding on NCBI short read data

Table 15 Contingency tables of variants vs clusters after applying k-means on the ViralVectors based feature embedding on NCBI short read data

A.2.1 Fixed-length representation

Since each virome sequence can have a different number of k-mers, it is important to generate a fixed-length numerical representation so that classification and clustering algorithms can be applied. For this purpose, we design a feature vector of length \(|\Sigma |^{k}\) (where \(\Sigma \) is the alphabet and k is the user-defined parameter for k-mers) that contains the frequency/count of each k-mer within a sequence. In this paper, we take \(k=3\) for all experiments unless specifically mentioned otherwise (decided using the standard validation set approach [9]). In the GISAID dataset, since the alphabet contains 21 characters, the length of the Spike2Vec-based feature vector is \(21^{3} = 9261\). For the ViPR dataset, the length of the Spike2Vec-based vector is \(25^{3} = 15625\), and for the NCBI short reads data, it is \(24^{3} = 13824\).
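A minimal Python sketch of such a fixed-length count vector follows. It assumes the 21-character GISAID alphabet and assigns bins in lexicographic order of the k-mers; the bin ordering and the function name `kmer_frequency_vector` are illustrative choices, not details taken from Spike2Vec itself.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def kmer_frequency_vector(sequence, k=3, alphabet=ALPHABET):
    """Count each possible k-mer over the alphabet in a |alphabet|**k vector."""
    # Every possible k-mer gets one fixed bin (assumed lexicographic order).
    bins = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(bins)
    for i in range(len(sequence) - k + 1):
        vector[bins[sequence[i:i + k]]] += 1
    return vector


# Hypothetical fragment: 7 k-mers spread over 21**3 = 9261 bins.
toy_counts = kmer_frequency_vector("MFVFLVLLP")
```

The vector length depends only on the alphabet size and k, so sequences of different lengths map to vectors of the same dimension.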

A.3 PWM2Vec [2]

When using the Spike2Vec method, the frequency vectors obtained are of comparatively low dimension but are still high-dimensional. Moreover, while generating the frequency vectors, matching the k-mers to the appropriate location/bin in the vector (bin matching) can be computationally expensive. To solve these issues, PWM2Vec [2] can be used. It is a recently proposed method for producing a fixed-length numerical feature vector based on the well-known notion of a position-weight matrix (PWM) [37]. PWM2Vec creates a PWM from the sequence’s k-mers, and the final feature vector contains the score of each k-mer in the PWM. This enables the method to use the ability of k-mers to collect localization information while also capturing the significance of each amino acid’s position in the sequence (information that is lost when computing a k-mer frequency vector). By combining these pieces of information in this way, a compact and broad feature embedding can be created that can be used for a variety of downstream machine learning tasks.
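The idea of scoring each k-mer against a PWM built from the same sequence can be sketched in Python as follows. This is only an illustration under stated assumptions: the pseudocount of 1, the uniform background distribution, and the log-odds form of the PWM entries are common textbook choices and are not confirmed by the text; the function name `pwm_kmer_scores` is hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def pwm_kmer_scores(sequence, k=3, alphabet=ALPHABET, pseudocount=1.0):
    """Score every k-mer of the sequence against a PWM built from those k-mers."""
    kmer_list = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Position frequency matrix: one count table per k-mer position.
    counts = [{ch: pseudocount for ch in alphabet} for _ in range(k)]
    for kmer in kmer_list:
        for pos, ch in enumerate(kmer):
            counts[pos][ch] += 1
    background = 1.0 / len(alphabet)  # assumed uniform background
    pwm = []
    for pos_counts in counts:
        total = sum(pos_counts.values())
        # Log-odds weight of each character at this position (assumed form).
        pwm.append({ch: math.log2((c / total) / background)
                    for ch, c in pos_counts.items()})
    # One score per k-mer: the sum of its position-specific weights.
    return [sum(pwm[pos][ch] for pos, ch in enumerate(kmer))
            for kmer in kmer_list]


toy_scores = pwm_kmer_scores("MFVFLVLLP")  # hypothetical fragment
```

In this sketch the output has one entry per k-mer, so its length is N - k + 1; it is fixed across sequences only when the sequences have equal length.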

A.4 Pango tool [29]

For clustering purposes, we also use the state-of-the-art clustering benchmark called the Pango tool [29]. Since the Pango tool takes multiple aligned sequences as input, we needed to align each read set to the reference genome, call (genomic) variants, and introduce these variants into the reference sequence to generate a consensus sequence that represents the particular sample. The pipeline is available as a Snakefile [27] in our shared code repository above. The SARS-CoV-2 reference genome sequence (INSDC accession GCA_009858895.3, sequence MN908947.3) used in this study is obtained from the Ensembl COVID-19 browser database [19, 20]. It is a complete genome of 29903 bp. The genome is the reference assembly of the viral RNA genome isolated from one of the first cases in Wuhan, China (isolate Wuhan-Hu-1) [44], and it has reportedly been widely used as the standard reference [35].
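The last step of this preprocessing, introducing called variants into the reference to obtain a consensus sequence, can be illustrated with a short Python sketch. It is a simplification that handles only single-nucleotide substitutions with 1-based positions; the function name `apply_substitutions`, the toy reference, and the variant tuples are hypothetical, and the sketch is not the Snakefile pipeline referred to above.

```python
def apply_substitutions(reference, variants):
    """Apply (position, ref_base, alt_base) substitutions; positions are 1-based."""
    consensus = list(reference)
    for position, ref_base, alt_base in variants:
        # Guard against variants that do not match the reference sequence.
        if consensus[position - 1] != ref_base:
            raise ValueError(f"reference mismatch at position {position}")
        consensus[position - 1] = alt_base
    return "".join(consensus)


# Toy 10-base reference and two made-up substitutions, for illustration only.
toy_reference = "ATTAAAGGTT"
toy_consensus = apply_substitutions(toy_reference, [(3, "T", "C"), (8, "G", "A")])
```

Insertions and deletions would change the sequence length and need coordinate bookkeeping that this sketch leaves out.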

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Ali, S., Chourasia, P., Tayebi, Z. et al. ViralVectors: compact and scalable alignment-free virome feature generation. Med Biol Eng Comput 61, 2607–2626 (2023). https://doi.org/10.1007/s11517-023-02837-8


Received: 23 June 2022
Accepted: 29 March 2023
Published: 03 July 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11517-023-02837-8
