PSSM2Vec: A Compact Alignment-Free Embedding Approach for Coronavirus Spike Sequence Classification

  • Conference paper
Neural Information Processing (ICONIP 2022)

Abstract

The coronavirus SARS-CoV-2 causes COVID-19 in humans. Like many coronaviruses, it can adapt to different hosts and evolve into new variants. The major SARS-CoV-2 variants are known to be characterized by mutations that occur predominantly in the spike protein. Understanding the structure of the spike protein and determining how it is perturbed are therefore vital for predicting coronavirus host specificity and for deciding whether a variant is of concern. Both tasks are crucial for identifying and controlling current outbreaks and for preventing future pandemics. Machine learning (ML) methods are well suited to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. To be applicable, however, such methods require fixed-length numerical feature vectors in Euclidean space. For this purpose, we design two feature embedding techniques that convert spike sequences into a compact representation, which then serves as input to various ML classifiers. Unlike some previous approaches, these embeddings are alignment-free, which avoids computationally expensive alignment and assembly pipelines. The proposed embeddings, PSSM2Vec and PSSMFreq2Vec, combine the position weight matrix, which makes them compact, with k-mers, which make them alignment-free. Experiments on both SARS-CoV-2 and more general coronavirus sequence data show that, in most cases, the proposed embeddings yield better predictive performance than the baseline and state-of-the-art methods, and that they scale to millions of spike sequences. We also show that PSSM2Vec is efficient in terms of runtime, which makes it applicable to datasets of millions of sequences. Finally, using statistical analysis, we compare the compactness of the proposed feature embeddings with that of existing methods.
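As a rough illustration of how k-mers and a position weight matrix can together produce a fixed-length, alignment-free vector, the following minimal Python sketch may help. It is not the authors' implementation (their code is in the repository listed under Notes). The function names, the 20-letter amino acid alphabet, the uniform background distribution, the pseudocount value, and the choice to flatten the matrix into a vector of length 20 × k are all assumptions made for this example.

```python
import math
from collections import Counter

# Assumed alphabet: the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(sequence, k):
    """Return all overlapping substrings of length k (no alignment needed)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pwm_embedding(sequence, k=3, pseudocount=0.1):
    """Flatten a per-sequence position weight matrix into a 20 * k vector.

    Hypothetical helper: column `pos` of the matrix holds the log-odds of
    each amino acid at position `pos` across all k-mers of the sequence.
    """
    windows = kmers(sequence, k)
    if not windows:
        raise ValueError("sequence is shorter than k")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    vector = []
    for pos in range(k):
        counts = Counter(window[pos] for window in windows)
        # Only residues in the assumed alphabet contribute to the total.
        total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
        total += pseudocount * len(AMINO_ACIDS)
        for aa in AMINO_ACIDS:
            prob = (counts.get(aa, 0) + pseudocount) / total
            vector.append(math.log2(prob / background))
    return vector


# Sequences of different lengths map to vectors of the same length.
short_vec = pwm_embedding("MFVFLVLLPLVSSQ", k=3)
long_vec = pwm_embedding("MFVFLVLLPLVSSQCVNLTTRTQLPPAYTN", k=3)
print(len(short_vec), len(long_vec))
```

In this sketch the vector length depends only on k and the alphabet size, not on the sequence length, which is the property that lets unaligned sequences be fed directly to standard classifiers.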

Notes

  1. https://www.gisaid.org/

  2. https://github.com/sarwanpasha/PSSM2Vec

  3. https://www.gisaid.org/

Acknowledgements

The authors would like to acknowledge funding from an MBD fellowship to Sarwan Ali and a Georgia State University Computer Science Startup Grant to Murray Patterson.

Corresponding author

Correspondence to Sarwan Ali.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ali, S., Murad, T., Patterson, M. (2023). PSSM2Vec: A Compact Alignment-Free Embedding Approach for Coronavirus Spike Sequence Classification. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1794. Springer, Singapore. https://doi.org/10.1007/978-981-99-1648-1_35

  • DOI: https://doi.org/10.1007/978-981-99-1648-1_35

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-1647-4

  • Online ISBN: 978-981-99-1648-1

  • eBook Packages: Computer Science (R0)
