Abstract
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a compact feature vector generation method for virome sequencing data that allows effective downstream analysis. The generation is based on minimizers, a type of lightweight “signature” of a sequence traditionally used in assembly and read mapping; to our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.
Graphical Abstract
Graphical abstract showing all the steps of the proposed approach. We start by collecting the sequence-based data. Data cleaning and preprocessing are then applied. After that, we generate the feature embeddings using the minimizer-based approach. Classification and clustering algorithms are then applied to the resulting data, and predictions are made on the test set.
References
Ali S, Ali TE, Khan MA, Khan I, Patterson M (2021) Effective and scalable clustering of SARS-COV-2 sequences. In: International conference on big data research (ICBDR). pp 42–49
Ali S, Bello B, Chourasia P, Punathil RT, Zhou Y, Patterson M (2022) PWM2VEC: An efficient embedding approach for viral host specification from coronavirus spike sequences. Biology 11(3):418
Ali S, Patterson M (2021) Spike2Vec: An efficient and scalable embedding approach for COVID-19 spike sequences. In: IEEE international conference on big data (Big Data). pp 1533–1540
Ali S, Sahoo B, Ullah N, Zelikovskiy A, Patterson M, Khan I (2021) A k-mer based approach for SARS-COV-2 variant identification. In: International symposium on bioinformatics research and applications. pp 153–164
Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Comm Stats-theory Methods 3(1):1–27
Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29(11):987–991
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
De Silva NH, Bhai J, Chakiachvili M, Contreras-Moreira B, Cummins C, Frankish A, Gall A, Genez T, Howe KL, Hunt SE, et al (2021) The Ensembl COVID-19 resource: Ongoing integration of public SARS-COV-2 data. bioRxiv pp 2020–12
Devijver P, Kittler J (1982) Pattern recognition: A statistical approach. In: London, GB: Prentice-Hall. pp 1–448
Ekim B, Berger B, Chikhi R (2021) Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in min on a PC. Cell Syst 12(10):958-968.e6
ElAbd H, Bromberg Y, Hoarfrost A, Lenz T, Franke A, Wendorff M (2020) Amino acid encoding for deep learning applications. Bioinformatics 21(1):1–14
Farhan M, Tariq J, Zaman A, Shabbir M, Khan I (2017) Efficient approx algorithms for strings kernel based sequence classification. In: Advances in neural info processing sys (NeurIPS). pp 6935–6945
Gardy J, Loman N (2018) Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet 19:9–20
GISAID Website: https://www.gisaid.org/. Accessed 5 Jan 2022
Hadfield J, Megill C, Bell S, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher R (2018) Nextstrain: real-time tracking of pathogen evo. Bioinformatics 34:4121–4123
Hoerl AE, Kannard RW, Baldwin KF (1975) Ridge regression: some simulations. Comm Stat-Theory Methods 4(2):105-123
Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser DM, Cambell L et al (2020) Ensembl genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Res 48(D1):D689–D695
Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292(2):195-202
Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser MM, Cambell L et al (2020) Ensembl genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Res 48:D689–D695
Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, Azov AG, Bennett R, Bhai J, Billis K (2021) Ensembl 2021. Nucleic Acids Res 49:D884–D891
Kuzmin K, Adeniyi AE, DaSouza AK Jr, Lim D, Nguyen H, Molina NR, Xiong L, Weber IT, Harrison RW (2020) Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem Biophys Res Commun 533(3):553–558
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476
Li H (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32:2103-2110
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. pp 4765–4774
Marçais G, DeBlasio D, Kingsford C (2018) Asymptotically optimal minimizers schemes. Bioinformatics 34:i13–i22
Mei H, Liao ZH, Zhou Y, Li SZ (2005) A new set of amino acid descriptors and its application in peptide QSARs. Peptide Sci Original Res Biomol 80(6):775–786
Mölder F, Jablonski KP, Letcher B, et al (2021) Sustainable data analysis with snakemake. F1000Res 10(33)
Ondov B, Treangen T, Melsted P, et al (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17(132)
Phylogenetic assignment of named global outbreak LINeages (Pangolin): https://cov-lineages.org/resources/pangolin.html. Accessed 4 Jan 2022
Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z et al (2012) ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res 40(D1):D593–D598
Rahimi A, Recht B, et al (2007) Random features for large-scale kernel machines. In: NIPS, vol 3. p 5
Roberts M, Haynes W, Hunt B, Mount S, Yorke J (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics 20:3363–9
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10):1569–76
Silva NHD, Bhai J, Chakiachvili M, et al (2021) The ensembl COVID-19 resource: ongoing integration of public SARS-COV-2 data. Nucleic Acids Research
Solis-Reyes S, Avino M, Poon A, Kari L (2018) An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PloS ONE
Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Toussaint NC, Widmer C, Kohlbacher O, Rätsch G (2010) Exploiting physico-chemical properties in string kernels. BMC Bioinforma 11(8):1–9
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res (JMLR) 9(11)
Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52
Wood D, Salzberg S (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15
Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15(3):1–12
Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY et al (2020) A new coronavirus associated with human respiratory disease. Nature 579(7798):265–269
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ et al (2015) Big data: Astronomical or genomical? PLoS Biol 13(7):e1002195
Zheng H, Kingsford C, Marçais G (2020) Lower density selection schemes via small universal hitting sets with short remaining path len. In: ICRCMB. Springer, pp 202–217
Appendix A: Baseline Models
We compare ViralVectors against several baseline and state-of-the-art (SOTA) methods. The baseline is One-Hot Embedding (OHE) [21], while the SOTA methods are Spike2Vec [3], PWM2Vec [2], and the Pango tool [29].
A.1 One-hot embedding (OHE) [21]
Since most machine learning methods do not work directly with biological sequences, it is important to convert the sequences into a numerical representation. A traditional method for converting sequential information into a numerical representation is called one-hot embedding [4, 21]. Given a sequence (e.g., a spike sequence), we call the finite set of symbols that can appear in it the alphabet, denoted by \(\Sigma \). In the GISAID amino acid sequences, for example, we have 21 unique characters “ACDEFGHIKLMNPQRSTVWXY” (i.e., amino acids). To design a fixed-length feature vector representation, we generate a binary vector of length 21 for each amino acid, which contains the value 1 at the position of that specific character and 0 everywhere else. Finally, we concatenate all these vectors to get the feature vector representation for a given sequence. In the GISAID amino acid sequences, since the length of each spike amino acid sequence is 1273, the length of each OHE-based vector is \(1273 \times 21 = 26,733\) (more detail on the dataset can be found in Section 4.1). For the ViPR data, since the length of each spike amino acid sequence (after alignment) is 3498 and the number of unique characters is 24, the length of the OHE vector is \(3498 \times 24 = 83,952\). In the case of the NCBI raw short-read sequencing data, OHE does not apply, since we have variable-length unmapped reads rather than a single fixed-length sequence. After generating the feature vectors, we can give them as input to machine learning algorithms for classification and clustering.
Remark 1
Note that one problem with OHE is that it requires all sequences in a dataset to be of the same fixed length [1, 4].
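The OHE construction described above can be illustrated with a minimal Python sketch. The function name `one_hot_embedding` and the toy fragment are hypothetical and chosen only for illustration; the alphabet is the 21-character GISAID alphabet quoted in the text. This is a sketch under those assumptions, not the implementation used in the experiments.

```python
from typing import List

# Alphabet quoted in the text for the GISAID spike sequences (21 symbols).
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def one_hot_embedding(sequence: str, alphabet: str = ALPHABET) -> List[int]:
    """Concatenate one binary vector of length len(alphabet) per character."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    vector: List[int] = []
    for symbol in sequence:
        block = [0] * len(alphabet)
        # A symbol outside the alphabet raises KeyError here.
        block[index[symbol]] = 1
        vector.extend(block)
    return vector


toy = "MFVF"  # hypothetical 4-residue fragment, not a real spike sequence
ohe_vector = one_hot_embedding(toy)
print(len(ohe_vector))  # 4 * 21 = 84
```

Because the output length is the sequence length times \(|\Sigma |\), two sequences of different lengths yield vectors of different lengths, which is the limitation noted in Remark 1.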
A.2 Spike2Vec [3]
Since OHE does not work with variable-length sequences, a popular alignment-free alternative uses k-mers to preserve the order of amino acids and then generates a fixed-length feature vector that contains the frequency of each k-mer in a virome sequence. In this setting, the first step is to compute the substrings (called mers) of length k, where k is a user-defined parameter. The k-mers are generated using a sliding window with an increment of 1 (see Fig. 1). The total number of k-mers generated from a virome sequence is \(N - k + 1\), where N is the length of the sequence.
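As a small illustration of the sliding-window step, the following Python sketch extracts the k-mers of a sequence. The helper name `kmers` and the toy fragment are hypothetical.

```python
from typing import List


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all length-k substrings using a sliding window with step 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy = "MFVFLVLL"  # hypothetical fragment of length N = 8
toy_kmers = kmers(toy, k=3)
print(toy_kmers)       # ['MFV', 'FVF', 'VFL', 'FLV', 'LVL', 'VLL']
print(len(toy_kmers))  # N - k + 1 = 6
```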
A.2.1 Fixed-length representation
Since each virome sequence can have a different number of k-mers, it is important to generate a fixed-length numerical representation so that classification and clustering algorithms can be applied. For this purpose, we design a feature vector of length \(|\Sigma |^{k}\) (where \(\Sigma \) is the alphabet and k is the user-defined parameter for k-mers) that contains the frequency/count of each k-mer within a sequence. In this paper, we take \(k=3\) for all experiments unless specifically mentioned otherwise (decided using the standard validation set approach [9]). In the GISAID dataset, since the alphabet has 21 symbols, the length of the Spike2Vec-based feature vector is \(21^{3} = 9261\). For the ViPR dataset, the length of the Spike2Vec-based vector is \(25^{3} = 15625\), and for the NCBI short-read data, it is \(24^{3} = 13824\).
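A k-mer frequency vector of length \(|\Sigma |^{k}\) can be sketched in Python as follows. The function name `kmer_frequency_vector` and the lexicographic ordering of bins are assumptions made for illustration; the text does not specify how bins are ordered.

```python
from itertools import product
from typing import Dict, List

# 21-character GISAID alphabet quoted in the text.
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def kmer_frequency_vector(
    sequence: str, k: int = 3, alphabet: str = ALPHABET
) -> List[int]:
    """Count every k-mer of `sequence` into a vector of length len(alphabet)**k."""
    # One bin per possible k-mer; the bin order (an assumption here) follows
    # the order of symbols in `alphabet`.
    bins: Dict[str, int] = {
        "".join(p): i for i, p in enumerate(product(alphabet, repeat=k))
    }
    vector = [0] * len(bins)
    for i in range(len(sequence) - k + 1):
        vector[bins[sequence[i:i + k]]] += 1
    return vector


toy = "MFVFLVLL"  # hypothetical fragment
freq = kmer_frequency_vector(toy, k=3)
print(len(freq))  # 21**3 = 9261
print(sum(freq))  # number of k-mers, N - k + 1 = 6
```

The vector length depends only on the alphabet and k, so sequences of different lengths map to vectors of the same dimension.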
A.3 PWM2Vec [2]
With Spike2Vec, the resulting frequency vectors are lower dimensional than those of OHE but are still high dimensional. Moreover, while generating the frequency vectors, matching the k-mers to the appropriate location/bin in the vector (bin matching) can be computationally expensive. To address these issues, PWM2Vec [2] can be used. It is a recently proposed method for producing a fixed-length numerical feature vector based on the well-known notion of a position-weight matrix (PWM) [37]. PWM2Vec creates a PWM from the sequence’s k-mers, and the final feature vector contains the score of each k-mer in the PWM. This enables the method to use the ability of k-mers to capture localization information while also capturing the significance of each amino acid’s position in the sequence (information that is lost when computing the k-mer frequency vector). By combining these pieces of information, a compact and general feature embedding is created that can be used for a variety of downstream machine learning tasks.
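The idea of scoring each k-mer against a PWM built from the sequence's own k-mers can be sketched as below. This is a generic log-odds PWM construction, not the published PWM2Vec implementation: the function name `pwm_kmer_scores`, the pseudocount of 1, and the uniform background distribution are all assumptions made for illustration.

```python
import math
from typing import List

# 21-character GISAID alphabet quoted in the text.
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def pwm_kmer_scores(
    sequence: str,
    k: int = 3,
    alphabet: str = ALPHABET,
    pseudocount: float = 1.0,  # assumed smoothing value
) -> List[float]:
    """Score every k-mer of `sequence` with a PWM built from those k-mers."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return []
    row = {symbol: i for i, symbol in enumerate(alphabet)}

    # Position frequency matrix: counts[symbol][position within the k-mer].
    counts = [[0.0] * k for _ in alphabet]
    for kmer in kmers:
        for pos, symbol in enumerate(kmer):
            counts[row[symbol]][pos] += 1

    # Log-odds weights against a uniform background (an assumption).
    background = 1.0 / len(alphabet)
    total = len(kmers) + pseudocount * len(alphabet)
    weights = [
        [math.log2(((c + pseudocount) / total) / background) for c in counts_row]
        for counts_row in counts
    ]

    # The score of a k-mer is the sum of its per-position weights.
    return [
        sum(weights[row[symbol]][pos] for pos, symbol in enumerate(kmer))
        for kmer in kmers
    ]


toy = "MFVFLVLL"  # hypothetical fragment
scores = pwm_kmer_scores(toy, k=3)
print(len(scores))  # one score per k-mer: N - k + 1 = 6
```

In this sketch the output has one entry per k-mer, so its length is \(N - k + 1\) rather than \(|\Sigma |^{k}\), and no lookup into a table of all possible k-mers is needed.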
A.4 Pango tool [29]
For clustering purposes, we also use the state-of-the-art clustering benchmark called the Pango tool [29]. Since the Pango tool takes multiple aligned sequences as input, we needed to align each read set to the reference genome, call (genomic) variants, and introduce these variants into the reference sequence to generate a consensus sequence that represents the particular sample. The pipeline is available as a Snakefile [27] in our shared code repository. The SARS-CoV-2 reference genome sequence (INSDC accession \(GCA\_009858895.3\), sequence MN908947) used in this study was obtained from the Ensembl COVID-19 browser database [19, 20]. It is a complete genome of 29,903 bp. It is the reference assembly of the viral RNA genome of the Wuhan-Hu-1 isolate, one of the first cases in China [44], and is reportedly widely used as the standard reference [35].
About this article
Cite this article
Ali, S., Chourasia, P., Tayebi, Z. et al. ViralVectors: compact and scalable alignment-free virome feature generation. Med Biol Eng Comput 61, 2607–2626 (2023). https://doi.org/10.1007/s11517-023-02837-8
DOI: https://doi.org/10.1007/s11517-023-02837-8