Abstract
With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system that identifies the different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and in modeling the likelihood of breakthrough infections. It is well known that most of the variation that distinguishes coronavirus variants lies in the spike protein.
In this paper, we use spike sequences to classify different variants of human SARS-CoV-2. We show that preserving the order information of the amino acids helps the underlying classifiers achieve better performance. We also show that our model can be trained to outperform the baseline algorithms using only a small number of training samples (\(1\%\) of the data). Finally, we show which amino acids play a key role in identifying variants and how they coincide with those reported by the U.S. Centers for Disease Control and Prevention (CDC).
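The paper's title refers to a k-mer based approach. As a rough illustration of why k-mers retain order information, the sketch below compares a plain amino-acid count vector with a k-mer frequency vector on a toy sequence and its reversal. The choice of \(k = 3\), the 20-letter alphabet, the toy sequence, and all function names are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino-acid letters. k-mers containing any
# other character are simply not counted in the fixed-length vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(sequence, k=3):
    """Return the overlapping substrings of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_frequency_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count each possible k-mer over the alphabet (vector length |alphabet|**k)."""
    counts = Counter(kmers(sequence, k))
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[word] for word in vocabulary]


def composition_vector(sequence, alphabet=AMINO_ACIDS):
    """Count single amino acids only; this discards all order information."""
    counts = Counter(sequence)
    return [counts[letter] for letter in alphabet]


# Toy example (hypothetical sequence): a short string and its reversal
# contain the same residues in a different order.
seq_a = "MFVFLVLLPL"
seq_b = seq_a[::-1]

same_composition = composition_vector(seq_a) == composition_vector(seq_b)
same_kmers = kmer_frequency_vector(seq_a) == kmer_frequency_vector(seq_b)
```

Here `same_composition` is true while `same_kmers` is false: the two sequences are indistinguishable by residue counts alone, but their 3-mer vectors differ because each 3-mer records a short, ordered stretch of the sequence. Vectors of this kind could then be passed to any standard classifier.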
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ali, S., Sahoo, B., Ullah, N., Zelikovskiy, A., Patterson, M., Khan, I. (2021). A k-mer Based Approach for SARS-CoV-2 Variant Identification. In: Wei, Y., Li, M., Skums, P., Cai, Z. (eds) Bioinformatics Research and Applications. ISBRA 2021. Lecture Notes in Computer Science, vol 13064. Springer, Cham. https://doi.org/10.1007/978-3-030-91415-8_14
DOI: https://doi.org/10.1007/978-3-030-91415-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91414-1
Online ISBN: 978-3-030-91415-8
eBook Packages: Computer Science, Computer Science (R0)