A k-mer Based Approach for SARS-CoV-2 Variant Identification

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2021)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13064)

Abstract

As SARS-CoV-2, the novel coronavirus that causes COVID-19, continues to spread across the globe and mutate, it is important to design a system that identifies both known and unknown variants of the virus. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and in modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information and variation pertaining to coronavirus variants.

In this paper, we use spike sequences to classify different variants of human SARS-CoV-2. We show that preserving the order information of the amino acids helps the underlying classifiers achieve better performance. We also show that our model can be trained to outperform the baseline algorithms using only a small number of training samples (1% of the data). Finally, we show the importance of the different amino acids that play a key role in identifying variants, and how they coincide with those reported by the U.S. Centers for Disease Control and Prevention (CDC).
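To illustrate how k-mers preserve amino-acid order, the following is a minimal sketch of a k-mer frequency vector for a protein sequence. It is not the authors' implementation: the choice k = 3, the 20-letter alphabet, the function names, and the short toy sequence are all assumptions made for illustration, and this excerpt does not state the parameters or classifiers used in the paper.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids; real data may contain extra symbols.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k (sliding window)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Fixed-length feature vector: one count per possible k-mer, in
    lexicographic order of the alphabet (length len(alphabet) ** k)."""
    counts = kmer_counts(sequence, k)
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


# Hypothetical toy input: a short amino-acid string, not a full spike sequence.
toy = "MFVFLVLLPLVSSQ"
counts = kmer_counts(toy, k=3)
vec = kmer_vector(toy, k=3)

# Two sequences with identical composition but different order yield
# different k-mer counts, which is the order information a plain
# per-residue count would lose.
same_composition_differs = kmer_counts("ACD", 3) != kmer_counts("DCA", 3)
```

A vector of this kind can be passed to any standard classifier. Note that its length grows as 20^k, so small values of k keep the representation manageable.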

Notes

  1. https://www.gisaid.org/

  2. https://github.com/sarwanpasha/covid_variant_classification

  3. https://www.gisaid.org/

  4. https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html

Author information

Correspondence to Murray Patterson or Imdadullah Khan.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ali, S., Sahoo, B., Ullah, N., Zelikovskiy, A., Patterson, M., Khan, I. (2021). A k-mer Based Approach for SARS-CoV-2 Variant Identification. In: Wei, Y., Li, M., Skums, P., Cai, Z. (eds) Bioinformatics Research and Applications. ISBRA 2021. Lecture Notes in Computer Science, vol 13064. Springer, Cham. https://doi.org/10.1007/978-3-030-91415-8_14

  • DOI: https://doi.org/10.1007/978-3-030-91415-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91414-1

  • Online ISBN: 978-3-030-91415-8

  • eBook Packages: Computer Science, Computer Science (R0)
