Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences

Ali, Sarwan; Chen, Pin-Yu; Patterson, Murray

doi:10.1007/978-981-99-7074-2_1

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 14248)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

The global COVID-19 pandemic has made a wealth of data available to researchers, presenting a unique opportunity to investigate the behavior of the virus. Such research aims to facilitate the design of effective vaccines and proactive measures to prevent future pandemics by using machine learning (ML) models in decision-making processes. Ensuring the reliability of ML predictions in these critical and rapidly evolving scenarios is therefore of utmost importance. Studies of the genomic sequences of individuals infected with the coronavirus have shown that the majority of variations occur within a specific region known as the spike (or S) protein. Previous research has analyzed spike proteins using various ML techniques, including classification and clustering of variants. However, spike protein sequences may contain errors, which could lead to misleading outcomes and misguide decision-making authorities. Hence, a comprehensive examination of the robustness of ML and deep learning models in classifying spike sequences is essential. In this paper, we propose a framework for evaluating and benchmarking the robustness of diverse ML methods in spike sequence classification. Through extensive evaluation of a wide range of ML algorithms, from classical methods such as naive Bayes and logistic regression to advanced approaches such as deep neural networks, we demonstrate that using k-mers to create the feature vector representation of spike proteins is more effective than traditional one-hot encoding-based embedding methods. Our findings also indicate that deep neural networks exhibit superior accuracy and robustness compared to non-deep-learning baselines.
To the best of our knowledge, this study is the first to benchmark the accuracy and robustness of machine-learning classification models against various types of random corruptions in COVID-19 spike protein sequences. The benchmarking framework established here may help future researchers gain a deeper understanding of the behavior of the coronavirus, enabling proactive measures and the prevention of similar pandemics.
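
The abstract contrasts k-mer feature vectors with one-hot encoding and evaluates models under random corruptions of the input. The following is a minimal sketch of those three ideas, not the authors' implementation: the function names, the 20-letter amino acid alphabet, k = 3, the substitution-only corruption model, and the toy sequence are all assumptions made for illustration.

```python
import random
from itertools import product

# Assumed alphabet: the 20 standard amino acids (the paper may use a different one).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Count each length-k substring of seq in a fixed-order frequency vector."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing symbols outside the alphabet
            vec[j] += 1
    return vec


def one_hot_vector(seq, alphabet=AMINO_ACIDS):
    """Concatenate one indicator vector per position (length = len(seq) * len(alphabet))."""
    pos = {a: i for i, a in enumerate(alphabet)}
    vec = [0] * (len(seq) * len(alphabet))
    for i, ch in enumerate(seq):
        if ch in pos:
            vec[i * len(alphabet) + pos[ch]] = 1
    return vec


def corrupt(seq, rate, rng, alphabet=AMINO_ACIDS):
    """Hypothetical corruption: substitute each residue with probability `rate`."""
    out = []
    for ch in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in alphabet if a != ch]))
        else:
            out.append(ch)
    return "".join(out)


# Usage: compare how many k-mer counts change after a light corruption.
rng = random.Random(0)
clean = "MFVFLVLLPLVSSQ"  # toy amino acid string, for illustration only
noisy = corrupt(clean, rate=0.1, rng=rng)
changed = sum(a != b for a, b in zip(kmer_vector(clean), kmer_vector(noisy)))
print(noisy, changed)
```

In a robustness benchmark of this kind, a classifier trained on vectors from clean sequences would be evaluated on vectors from corrupted ones at increasing values of `rate`. Note that the k-mer vector has a fixed length regardless of sequence length, whereas the one-hot vector grows with the sequence and depends on position.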

Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, USA
Sarwan Ali & Murray Patterson
IBM Research, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Pin-Yu Chen

Corresponding author

Correspondence to Murray Patterson.

Editor information

Editors and Affiliations

University of North Texas, Denton, TX, USA
Xuan Guo
University of Southern California, Los Angeles, CA, USA
Serghei Mangul
Georgia State University, Atlanta, GA, USA
Murray Patterson
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

About this paper

Cite this paper

Ali, S., Chen, PY., Patterson, M. (2023). Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science, vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_1

DOI: https://doi.org/10.1007/978-981-99-7074-2_1
Published: 08 October 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7073-5
Online ISBN: 978-981-99-7074-2
eBook Packages: Computer Science, Computer Science (R0)
