Abstract
The coronavirus SARS-CoV-2 is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to various hosts and evolve into different variants. The major SARS-CoV-2 variants are characterized by mutations that occur predominantly in the spike protein. Understanding the spike protein structure and determining its perturbations are vital for predicting coronavirus host specificity and for deciding whether a variant is of concern. Both tasks are crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are well suited to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space. For this purpose, we design two feature embedding techniques that convert spike sequences into a compact representation, which serves as input to various ML classifiers. Unlike some previous approaches, these embeddings are alignment-free, avoiding computationally expensive alignment and assembly pipelines. Our proposed embeddings, PSSM2Vec and PSSMFreq2Vec, combine the position weight matrix, for compactness, with k-mers, for alignment-free operation. Experiments on both SARS-CoV-2 and more general coronavirus sequence data show that the proposed embeddings yield better predictive performance, in most cases, than the baseline and state-of-the-art methods. We also show that PSSM2Vec is extremely efficient in terms of runtime, which makes it applicable to datasets composed of millions of spike sequences. Finally, using statistical analysis, we show that the proposed feature embeddings are more compact than those of existing methods.
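To make the idea of a compact, alignment-free, fixed-length representation concrete, the following is a minimal sketch of how a position weight matrix built from the k-mers of a single sequence could be flattened into a feature vector. It is an illustration under stated assumptions, not necessarily the construction used for PSSM2Vec or PSSMFreq2Vec: the function names, the 20-letter amino acid alphabet, the uniform background distribution, the pseudocount of 1, and k = 3 are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# Assumption: the 20 standard amino acids; other symbols (e.g. X) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(seq, k):
    """Return all overlapping k-mers of seq made only of known residues."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if all(c in AMINO_ACIDS for c in seq[i:i + k])]


def pwm_embedding(seq, k=3, pseudocount=1.0):
    """Flatten a log-odds position weight matrix over the k-mers of seq.

    The result always has len(AMINO_ACIDS) * k entries, regardless of the
    sequence length, so no alignment is needed to obtain a fixed length.
    """
    words = kmers(seq, k)
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(words) + pseudocount * len(AMINO_ACIDS)
    vector = []
    for pos in range(k):
        # Count how often each residue occurs at this position of a k-mer.
        counts = Counter(w[pos] for w in words)
        for aa in AMINO_ACIDS:
            prob = (counts[aa] + pseudocount) / total
            vector.append(math.log2(prob / background))
    return vector


# Two sequences of different length map to vectors of the same length.
short = pwm_embedding("MFVFLVLLPLVSSQCVNLT")
longer = pwm_embedding("MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS")
print(len(short), len(longer))  # 60 60
```

In this sketch the vector length depends only on the alphabet size and k, which is what makes the representation both compact and independent of sequence length; the resulting vectors could then be passed to any standard classifier.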
Acknowledgements
The authors would like to acknowledge funding from an MBD fellowship to Sarwan Ali and a Georgia State University Computer Science Startup Grant to Murray Patterson.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ali, S., Murad, T., Patterson, M. (2023). PSSM2Vec: A Compact Alignment-Free Embedding Approach for Coronavirus Spike Sequence Classification. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1794. Springer, Singapore. https://doi.org/10.1007/978-981-99-1648-1_35
DOI: https://doi.org/10.1007/978-981-99-1648-1_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1647-4
Online ISBN: 978-981-99-1648-1
eBook Packages: Computer Science, Computer Science (R0)