Coronavirus Disease - COVID-19 pp 839-857
Retrieval and Investigation of Data on SARS-CoV-2 and COVID-19 Using Bioinformatics Approach
Abstract
Sudden emergence and a rapid outbreak of SARS-CoV-2 accompanied by a devastating impact on the economy and public health has driven extensive scientific mobilization to study and elucidate the various associated concerns about SARS-CoV-2. Bioinformatics plays a crucial role in addressing and providing solutions to questions about SARS-CoV-2. It helps shorten the duration for the vaccine development process and the discovery of potential clinical interventions through the simulation and information retrieval, and the development of well-ordered information hubs and resources, which are essential to derive data and meaningful findings from the current massive information about SARS-CoV-2. Advanced algorithms in this field also provide approaches that are essential to elucidate the relationship, origin, and evolutionary process of SARS-CoV-2. Here, we report essential bioinformatics entities, such as database and platform development, molecular evolution and phylogenetic analyses, and vaccine designs, that are useful to solve the SARS-CoV-2 conundrum.
Keywords
COVID-19, Databases, Information hubs, Molecular evolution, Phylogenetic inference, SARS-CoV-2, Vaccine design
47.1 Introduction
Viruses have been characterized as distinct biological entities for more than 100 years. They were initially discovered in 1892 when Dmitri Ivanovsky observed that the tobacco mosaic disease was caused by tiny rod-shaped infectious particles smaller than any known bacteria; later, these particles were named “virus” by Martinus Beijerinck in 1898. Since then, viruses have been isolated from a wide range of organisms and even from other viruses. In this manner, viruses have been assigned to a unique taxonomic position. The abundance of viruses in the biosphere is estimated to be markedly high and approximately ten times greater than that of bacteria (Breitbart and Rohwer 2005; Suttle 2005). They have been studied extensively owing to their infectivity in humans and other organisms, which causes several diseases, as well as their ability to maintain ecosystem homeostasis and exhibit pathogen control (Flint et al. 2015). Several attributes are considered for the characterization and identification of viruses, including the clinical attributes, pathogenic properties, measurement of the physical structures, and the comparison of genetic material (Flint et al. 2015). Viral genomic sequences were some of the first available genomic information. The development of a more efficient method for virus attribute retrieval requires sophisticated computational tools and techniques for analyzing the extensive data available. Bioinformatics plays a crucial role in meeting this challenge, thereby providing substantial support in addressing several common questions pertinent to virology (Chang 2015; Marz et al. 2014).
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the seventh coronavirus that has crossed the species barrier to infect the human population. It is a rapidly spreading virus that has posed a significant public health threat and imposed a considerable burden on the global economy and human health. A form of severe respiratory disease caused by this agent, known as the coronavirus disease (COVID-19), was first detected in late December 2019 (Zhu et al. 2020). Later on, it attained a pandemic status, as declared by the World Health Organization (WHO) in early March 2020, and the virus had spread to more than 150 countries and infected over 5.8 million people by the latter part of May 2020 (World Health Organization 2020a, b). The devastating effects exerted by this virus have motivated multiple researchers to study and elucidate the various associated concerns that may help strategize the process of obtaining a solution. Here, we discuss several state-of-the-art computational approaches that have been used to address questions on SARS-CoV-2, including the development of databases and platforms, studies on structural and evolutionary relationships, and designing potential interventions through virtual screening.
47.2 SARS-CoV-2-Related Databases and Platforms
Essential databases and platforms for SARS-CoV-2 literature
| Platform and tool | Description | Web site | References |
|---|---|---|---|
| LitCovid | A curated literature hub for tracking recent scientific information about SARS-CoV-2 | https://www.ncbi.nlm.nih.gov/research/coronavirus/ | Chen et al. (2020) |
| COVID-19: living map of the evidence | Map of the current evidence on COVID-19 by categorizing publications based on the study type | http://eppi.ioe.ac.uk/COVID19_MAP/covid_map_v11.html | Lorenc et al. (2020) |
| Novel Coronavirus Research Compendium (NCRC) | A centralized, publicly available resource that rapidly curates and reviews the emerging scientific evidence about SARS-CoV-2 and COVID-19 | https://ncrc.jhsph.edu/ | Johns Hopkins Bloomberg School of Public Health (2020) |
| COVID-evidence | Continuously updated database of the available worldwide evidence on interventions for COVID-19 | covid-evidence.org | COVID-evidence (2020) |
| Global Coronavirus COVID-19 Clinical Trial Tracker | Map of COVID-19 trials according to geographical, trial, patient, and intervention characteristics | covid-trials.org | Thorlund et al. (2020) |
| ClinicalTrials.gov | Database of privately and publicly funded clinical studies conducted around the world | clinicaltrials.gov | ClinicalTrials.gov (2000) |
| International Clinical Trials Registry Platform | Clinical trials registry database by WHO | www.who.int/ictrp/en/ | World Health Organization (2005) |
| COVID-19 Open Research Dataset (CORD-19) | Resource of more than 130,000 scholarly articles about the novel coronavirus for use by the global research community | www.semanticscholar.org/cord19 | Wang et al. (2020) |
| SciSight | A tool for exploring the evolving network of science in the CORD-19 | scisight.apps.allenai.org | SciSight (2020) |
| SPIKE-CORD | A powerful set of tools for effectively searching and interacting with the CORD-19 data | spike.covid-19.apps.allenai.org/search/covid19 | SPIKE-COVID (2020) |
| ViralZone | A knowledge resource to understand virus diversity | www.expasy.org/viralzone/ | Hulo et al. (2011) |
47.2.1 Platforms and Tools for SARS-CoV-2 Literature
Scientists are overburdened with the vast array of papers on COVID-19. In the 5 months since the outbreak, more than 23,000 papers on SARS-CoV-2 and COVID-19 have been published in public domains. This body of literature is growing rapidly and has doubled approximately every 20 days within this period (Brainard 2020). Journals and scholarly publishers have supported the rapid dissemination of scientific contributions by accelerating their editorial assessment, peer review, and publication processes, which has raised concerns about the quality of the resulting publications (Horbach 2020). As a result, curated bodies of literature are integral to dealing with information overload and deriving meaningful patterns from findings across published papers. Currently, several groups have begun to use state-of-the-art computational approaches, such as artificial intelligence, for curating hubs for easy and effective access or simply filtering papers with specific focus on high-quality publications (Chen et al. 2020; Lorenc et al. 2020; Johns Hopkins Bloomberg School of Public Health 2020; Balakrishnan 2020).
LitCovid is one such prominent hub of curated literature for tracking recent scientific information on COVID-19 (https://www.ncbi.nlm.nih.gov/research/coronavirus/). This open-resource literature hub was developed by researchers under the aegis of the National Center for Biotechnology Information (NCBI). It assists the retrieval of relevant literature using sophisticated search functions and categorized topics, including an overview, disease mechanism, transmission dynamics, treatment, case reports, and epidemic forecasting (Chen et al. 2020). The articles are also categorized based on geographic location with visualization for better access (Chen et al. 2020). Additionally, a platform developed by the Evidence for Policy and Practice Information and Co-ordinating Centre (EPPI-Centre) at University College London is an alternative resource for accessing the overview and distribution of articles related to COVID-19 (http://eppi.ioe.ac.uk/COVID19_MAP/covid_map_v11.html) (Lorenc et al. 2020). This platform has been built from the previously developed Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies (CAMARADES) infrastructure. The infrastructure includes a systematic review facility, crowdsourcing for screening, annotation and automation techniques, and a web-based portal. This platform provides an updated map of the current evidence on COVID-19 by categorizing publications based on the study type (Lorenc et al. 2020).
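The doubling figure quoted above can be turned into a small worked calculation. The following is a minimal sketch that assumes a constant doubling time, which is a simplification and not a forecast; the function names are hypothetical and chosen only for illustration.

```python
import math

def projected_count(initial_count, days, doubling_days=20.0):
    """Exponential growth with a fixed doubling time: N(t) = N0 * 2 ** (t / T)."""
    return initial_count * 2 ** (days / doubling_days)

def days_to_reach(initial_count, target_count, doubling_days=20.0):
    """Invert the growth formula: t = T * log2(N_target / N0)."""
    return doubling_days * math.log2(target_count / initial_count)

# Starting from the ~23,000 papers cited in the text (assumed starting point).
after_20_days = projected_count(23_000, 20)      # one doubling
after_40_days = projected_count(23_000, 40)      # two doublings
time_to_quadruple = days_to_reach(23_000, 92_000)
```

Under this assumption, each additional 20-day period doubles the number of papers a reader would have to screen, which is the practical motivation for curated hubs.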
The 2019 Novel Coronavirus Research Compendium (NCRC) is an alternative centralized platform available to the public that rapidly curates and reviews emerging scientific evidence on SARS-CoV-2 and COVID-19, with a specific focus on providing accurate and relevant information from original, high-quality, and trending research (https://ncrc.jhsph.edu/) (Johns Hopkins Bloomberg School of Public Health 2020). The demand for reliable and rapidly curated evidence from the public, programs, policies, and researchers amid the fast growth in the body of literature on COVID-19 has driven numerous researchers from Johns Hopkins Schools of Public Health, Johns Hopkins School of Medicine, and other institutions worldwide to develop this platform (Johns Hopkins Bloomberg School of Public Health 2020). This group rapidly evaluates COVID-19-related publications and strictly selects high-quality articles to be categorized under several topics listed in this platform, including diagnostics, modeling, epidemiology, pharmaceutical interventions, non-pharmaceutical interventions, clinical presentation and prognostic risk factors, vaccines, and ecology and spillover (Johns Hopkins Bloomberg School of Public Health 2020).
The WHO initiated an international clinical trial called SOLIDARITY a week after the declaration of the COVID-19 pandemic, and over 100 countries have joined the SOLIDARITY trial and are working simultaneously to solve the issue at hand (Balakrishnan 2020). A clinical trial is a type of research methodology in which tests and treatments for the disease are evaluated by assessing the safety and efficacy of clinical candidate interventions (Friedman et al. 2010). It relies on a foundation of evidence to improve the quality of health care and help stakeholders control the costs involved through careful comparison with alternative interventions (Friedman et al. 2010). Amid the rapid growth in the number of COVID-19 patients globally and the absence of a confirmed drug or treatment method, the urgency to develop or discover an efficient therapeutic strategy for COVID-19 has triggered a spike in clinical trial research and unprecedented growth in the number of findings from studies on a specific disease within a brief period (Thorlund et al. 2020). Hence, an easily accessible platform that summarizes findings from COVID-19 clinical trials is necessary for convenient tracking of relevant information without including irrelevant data in the findings. Several groups of researchers have taken the initiative to develop resources focused on the mapping and summarization of clinical trial research instead of providing an overview of all articles related to COVID-19. These include COVID-evidence (covid-evidence.org) and the Global Coronavirus COVID-19 Clinical Trial Tracker (covid-trials.org) (Thorlund et al. 2020; COVID-evidence 2020; Ruano et al. 2020). Both platforms use automatic search and expert manual extraction strategies for retrieving data from several sources, including published articles and the International Clinical Trials Registry Platform, to minimize duplicated entries (Thorlund et al. 2020; COVID-evidence 2020).
Additionally, ClinicalTrials.gov (clinicaltrials.gov) and the International Clinical Trials Registry Platform (ICTRP; www.who.int/ictrp/en/) by WHO are relevant resources for tracking clinical trials, although these platforms provide data without recapitulation to avoid unnecessary duplication (Thorlund et al. 2020; World Health Organization 2005; ClinicalTrials.gov 2000).
Additionally, the development of text mining and information retrieval systems is key to the dynamic discovery and extraction of relevant, nontrivial information from the massive collection of literature. To accommodate such advancements for use in SARS-CoV-2- and COVID-19-related studies, the Allen Institute for AI (AI2) released the COVID-19 Open Research Dataset (CORD-19; www.semanticscholar.org/cord19) in collaboration with government agencies and several leading institutions (Wang et al. 2020). This resource contains more than 128,000 scholarly articles on COVID-19, SARS-CoV-2, and related historical research on coronaviruses, which are assembled in a machine-readable format to provide an accessible structure and system for the retrieval process using computational methods such as data mining, machine learning, and natural language processing (Wang et al. 2020). This resource is provided to the global community so that recent advances in AI techniques can be applied to develop robust and user-friendly computational tools that help researchers find answers to their questions from published studies. CORD-19 itself is accompanied by several tools, such as SciSight and SPIKE-CORD, for exploring the evolving network of scientific information and for effectively searching data, respectively (scisight.apps.allenai.org; spike.covid-19.apps.allenai.org/search/covid19) (SciSight 2020; SPIKE-COVID 2020). SciSight helps retrieve a comprehensive network of information using an AI-powered visualization framework. This tool has several features for exploring researcher working networks (the network of science), searching key facets (Faceted Search), and exploring the association between proteins, genes, and cells (proteins/genes/cells), as well as between diseases and chemicals (Diseases/Chemicals) (SciSight 2020). Conversely, SPIKE-CORD offers user-friendly text mining options for effective search and interaction with CORD-19 data.
It is an advanced tool comprising three query modes: Boolean queries, sequential queries, and structured queries (SPIKE-COVID 2020).
ViralZone (www.expasy.org/viralzone/) is another important resource that provides a clear view of the biological processes of the complete identified virosphere (Hulo et al. 2011). This web resource provides fact sheets with sequence information, replication cycles, taxonomy, and epidemiology, as well as graphics describing virion organization, genome transcription, and translation strategies for every identified virus family and genus, thereby creating an accurate and concise information platform to improve the understanding of the complications associated with the massive diversity in viruses. The ViralZone platform has a dedicated resource for SARS-CoV-2 with additional information on antiviral drug development and COVID-19 treatment strategies (Hulo et al. 2011).
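To make the idea of a Boolean query over a machine-readable corpus concrete, the following is a minimal sketch of term-based filtering over a handful of in-memory records. It does not reproduce SPIKE-CORD's query language or engine. The field names (`cord_uid`, `title`, `abstract`) are assumptions modelled on a CORD-19-style metadata table, and the three records are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return the set of word-like tokens."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))

def boolean_search(records, all_of=(), any_of=(), none_of=()):
    """Return ids of records that satisfy AND / OR / NOT term constraints.

    Terms are expected in lowercase; matching is on whole tokens only.
    """
    hits = []
    for rec in records:
        tokens = tokenize(rec.get("title", "") + " " + rec.get("abstract", ""))
        if not all(term in tokens for term in all_of):
            continue  # an AND term is missing
        if any_of and not any(term in tokens for term in any_of):
            continue  # none of the OR terms is present
        if any(term in tokens for term in none_of):
            continue  # a NOT term is present
        hits.append(rec["cord_uid"])
    return hits

# Invented demo records (hypothetical ids and text, not real CORD-19 entries).
records = [
    {"cord_uid": "demo0001", "title": "Spike protein binding to ACE2",
     "abstract": "Structural analysis of the SARS-CoV-2 receptor-binding domain."},
    {"cord_uid": "demo0002", "title": "Clinical trial of an antiviral",
     "abstract": "A randomized trial in COVID-19 patients."},
    {"cord_uid": "demo0003", "title": "Bat coronaviruses and ACE2 usage",
     "abstract": "Survey of bat coronaviruses; no spike data."},
]

ace2_not_bat = boolean_search(records, all_of=("ace2",), none_of=("bat",))
ace2_and_spike = boolean_search(records, all_of=("ace2", "spike"))
```

Whole-token matching is deliberately crude: it ignores word order, synonyms, and syntax, which are the aspects that sequential and structured query modes are meant to address.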
To deal with the devastating effects of SARS-CoV-2 infection, the global scientific community has made an effort to contribute to the current information on the disease, which has created a massive collection of readily accessible literature. A challenge involving the tracking of novel findings and the construction of a meaningful network from such a massive data repository awaits clinicians, researchers, and policymakers. Fortunately, several research groups are addressing the challenge and have attempted to create sophisticated platforms and tools that focus on providing user-friendly media. Given these efforts, the utilization of such media aids active tracking of important findings on COVID-19.
47.2.2 Sequence and Structure Database of SARS-CoV-2
The ability to sequence DNA at higher throughput and lower cost has led to a rapid increase in the available sequence data. Sequence analysis plays a vital role in viral surveillance, host reservoir identification, and public health policy debates (Brister et al. 2015). Sequence databases, which store sequence data and annotations and make them retrievable, are therefore essential to current bioinformatics research and applications. Generally, three primary public nucleotide sequence databases facilitate the availability of DNA sequence data to the public: NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). Over the years, the repository of sequences deposited in these databases has grown at an increasing rate. Since these databases form the International Nucleotide Sequence Database Collaboration and exchange updates daily, they follow certain common principles for the arrangement of sequence data and generally assign the same accession number to a given record (Koonin and Galperin 2003).
Additionally, scientific journals generally require that a paper describing a nucleotide or protein sequence cite the accession number of the corresponding record in these databases (Brister et al. 2015). Following the growth in the number of viral genome sequences and the challenges of maintaining a well-annotated representation of viral genome sequences, NCBI provides a dedicated feature known as the NCBI Viral Genome Resource (www.ncbi.nlm.nih.gov/genome/viruses/) to better serve the needs of the community of virologists. This resource has a central browser for viral and viroid genomes, which lists all viral and viroid species indicated by the corresponding reference sequence and includes links to genome neighbor sequences (Brister et al. 2015).
Specific databases for viral sequences have also been established as an alternative to the general public nucleotide sequence databases. The development of a specific database for viruses is essential to avoid the challenges arising from the sharing of viral sequence information. For example, the Global Initiative on Sharing All Influenza Data (GISAID) EpiFlu (https://www.gisaid.org/) was initially developed to address problems experienced while sharing virus sequences, such as the potential intervention by government agencies in the international exchange of information and the hesitation of researchers to share data on sequences of lethal viruses (Elbe and Buckland-Merrett 2017). This database provides protection and assurances to data contributors through a unique data access agreement. Users must register themselves with official affiliation to access the data in this database and comply with several agreements, including the acknowledgment of the contributor while using their data in publications and refraining from data sharing with third parties outside the GISAID community (Elbe and Buckland-Merrett 2017). This mechanism allows researchers and the government to share their sequence data promptly and is important because the immediate availability of viral genomic data helps expedite the processes involved in tracking and recognizing emergent epidemics or pandemics (Elbe and Buckland-Merrett 2017). As of mid-May 2020, the GISAID EpiFlu system itself contains more than 27,000 partial and complete genomic sequences of SARS-CoV-2 that have been contributed by clinicians and researchers worldwide.
Compared with the extent of available sequence information, the number of structurally characterized proteins, nucleic acids, and other biological macromolecules is relatively low. Acquiring three-dimensional molecular structure information helps characterize molecular functions and interactions. Additionally, elucidating the viral protein structure and interaction is fundamental to comprehending the virus receptor recognition mechanism, which is associated with its infectivity, pathogenesis, and host range, and is a requisite in the effort to develop structure-based drugs. The Protein Data Bank (PDB) is a prominent global repository of experimentally determined three-dimensional structures (Sussman et al. 1998). Established in 1971, it is one of the first open-access digital resources for biological sciences. This database is currently managed by the Worldwide Protein Data Bank (wwPDB; http://www.wwpdb.org/), which includes the RCSB Protein Data Bank (RCSB PDB; https://www.rcsb.org/), the Protein Data Bank Japan (PDBj; https://pdbj.org/), the Protein Data Bank in Europe (PDBe; https://www.ebi.ac.uk/pdbe/), and the Biological Magnetic Resonance Data Bank (BMRB; http://www.bmrb.wisc.edu/) (Burley et al. 2017). This community manages annotated protein structure information and provides convenient access to experimental data to the community of researchers and students of biological sciences. In response to the COVID-19 pandemic, RCSB PDB has developed COVID-19/SARS-CoV-2 Resources, which provide quick access to all PDB structures of SARS-CoV-2.
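Because records are addressed by accession number, retrieval can be scripted. The sketch below builds a request URL for the NCBI E-utilities `efetch` service and parses FASTA text into a dictionary. It is an illustration under stated assumptions rather than a recommended client: no network call is made, the helper names are hypothetical, and the accession AY274119 (SARS-CoV Tor2) is used only because it is cited elsewhere in this chapter.

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accession):
    """Build a URL requesting a nucleotide record in FASTA text format."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)

def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first whitespace-delimited word of each header line.
    """
    chunks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}

url = efetch_fasta_url("AY274119")

# Toy FASTA text standing in for a downloaded response (invented sequences).
demo = ">seq1 example record\nACGT\nAC\n>seq2\nGG\n"
parsed = parse_fasta(demo)
```

In practice the URL could be fetched with `urllib.request.urlopen` and the response text passed to `parse_fasta`; NCBI's usage guidelines for request rates would apply.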
47.3 Molecular Evolution and Phylogenetic Analysis of SARS-CoV-2
The general workflow of molecular evolution and phylogenetic inferences
47.3.1 Comparative Genomics of SARS-CoV-2
Immediately after the initial report on 27 unusual cases of pneumonia in Wuhan on December 31, 2019, the complete genome sequence of the causative agent, SARS-CoV-2, was made available on January 10, 2020 (Wu et al. 2020; Lu et al. 2020). Measuring the genetic distance is a simple computational approach for understanding the origin of biodiversity. For this purpose, the detection of candidate homologous sequences is a fundamental step, which is achieved by searching for sequence similarity in sequence databases such as GenBank (Pearson 2013). Several available programs statistically estimate sequence similarity against existing sequences in databases, such as Basic Local Alignment Search Tool (BLAST), SSEARCH, FASTA, and HMMER (Altschul et al. 1997; Smith and Waterman 1981; Pearson and Lipman 1988; Johnson et al. 2010). Sequence candidates with significant similarity can be obtained using these programs, as the sequence data in GenBank is becoming increasingly comprehensive because scientific journals require that sequences mentioned in an article be registered with this platform (Pearson 2013). Once the homologs are detected, building a more accurate alignment model using multiple sequence alignments forms the basis for the calculation of genetic distance and determining genome organization, as well as developing other evolutionary analyses, such as the construction of phylogenetic trees and recombination detection. As computation has become more accessible recently, utilizing a more rigorous multiple sequence alignment method that employs iterative approaches such as MAFFT and MUSCLE is recommended for acquiring high throughput and high accuracy results (Pearson 2013; Katoh et al. 2002; Edgar 2004).
Through these steps, the pairwise sequence identities of SARS-CoV-2 genomes initially isolated from several patients were analyzed and found to exceed 99.9%, which indicates a recent host shift into the human population (Lu et al. 2020). Several groups have confirmed that SARS-CoV-2 belongs to the genus Betacoronavirus, with the bat SARS-related coronavirus (SARSr-CoV RaTG13) as the most closely related member (96.2% similarity). The human-infecting coronaviruses most closely related to this virus are SARS-CoV (approximately 79% similarity) and MERS-CoV (approximately 50% similarity) (Lu et al. 2020; Fahmi et al. 2020; Zhou et al. 2020; Paraskevis et al. 2020; Chan et al. 2020).
As SARS-CoV-2 was observed to be related to the genus Betacoronavirus and the genome organization of betacoronaviruses has been elucidated previously, the genome organization of SARS-CoV-2 can be determined simply by sequence alignment with the genomes of other members of Betacoronavirus (Kim et al. 2020; Marra et al. 2003; Rota et al. 2003). This was performed by alignment to two representative members of Betacoronavirus, SARS-CoV Tor2 (GenBank accession number AY274119) and bat SL-CoVZC45 (GenBank accession number MG772933), which are associated with humans and bats, respectively (Wu et al. 2020). The SARS-CoV-2 genome is a positive-sense single-stranded RNA with a 5′ cap structure and a poly-A 3′ tail, and it has ~29,900 nt encoding ~9860 amino acids (Marra et al. 2003; Rota et al. 2003). The 5′ end of the SARS-CoV-2 genome contains a predicted RNA leader sequence of ~70 nt that is fused with two open reading frames (ORF1a and ORF1b) through short motif transcription regulatory sequences (TRSs) (Wu et al. 2020). The TRSs precede each structural or accessory gene, which aids gene expression (Fehr and Perlman 2015). Usually, the ORF1a and ORF1b overlap in betacoronaviruses, and the region occupies two-thirds of the genome. This region encodes the viral replicase and a translational read-through comprising a −1 ribosomal frameshift, which allows the translation of this overlapping reading frame into a single polyprotein (Fehr and Perlman 2015; Thiel et al. 2003). Subsequently, upon infection of an appropriate host cell and translation, this large polyprotein commonly undergoes proteolytic processing and is cleaved by virus-encoded proteases into several nonstructural proteins (nsps), including viral papain-like protease (PLpro), main protease (3CLpro, also known as 3-chymotrypsin-like protease), RNA-dependent RNA polymerase (RdRp), and helicase (Hel) (Marra et al. 2003).
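The percent identity figures quoted above are computed from aligned sequences. As a minimal sketch, the function below counts matching columns between two rows of an existing alignment. The gap-handling rule (columns with a gap in either sequence are skipped) is one of several conventions and is an assumption here, and the sequences are toy strings rather than viral genomes.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

# Toy aligned sequences (invented): one mismatch in ten columns.
identity_simple = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
# One gapped column is ignored; the remaining five columns all match.
identity_gapped = percent_identity("ACG-TA", "ACGTTA")
```

Whether gaps count as mismatches or are excluded changes the reported value, so identity figures from different studies are only comparable when the same convention is used.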
These proteins mediate the replication process of the viral genome and generate nested transcripts, which are essential for the synthesis of viral proteins (Marra et al. 2003). There are structural and accessory genes downstream of the ORF1a and ORF1b sequences that encode the spike glycoprotein (S), ORF3a, ORF3b, envelope (E), membrane (M), ORF6, ORF7a, ORF7b, ORF8a, ORF8b, and nucleocapsid (N), which are interspersed with the TRS motifs (Fehr and Perlman 2015; Thiel et al. 2003). The structural proteins are responsible for packaging the viral particles and the entry process in host cells, while the accessory proteins are likely to play vital roles in viral pathogenesis (Marra et al. 2003).
Notable features within specific regions of the SARS-CoV-2 genome can be identified by comparative analysis once the genome organization is known. The S protein is one of the most variable regions within the genomic sequence of SARS-CoV-2. It displays the lowest sequence identity, around 75%, to bat-SL-CoVZC45 and bat-SL-CoVZXC21 (Lu et al. 2020). The trimeric S proteins of CoVs mediate binding to the cell receptor ACE2 and membrane fusion in the viral entry process; understanding this protein is essential in vaccine design as it elicits an antibody response (Li et al. 2005a; Belouzard et al. 2012; Babcock et al. 2004). Binding of the S protein to the claw-like structure of the ACE2 receptor occurs through the receptor-binding domain (RBD), which comprises a core structure and a receptor-binding motif (Li 2015, 2008). There are six amino acids in the RBD that are vital for enhancing the binding of SARS-CoV to human ACE2 receptors. They are Y442, L472, N479, D480, T487, and Y491 based on the S protein sequence (Wan et al. 2020). The corresponding amino acids in SARS-CoV-2 are L455, F486, Q493, S494, N501, and Y505, of which only one is identical to its SARS-CoV counterpart (Wan et al. 2020; Andersen et al. 2020). Another notable feature that has been identified in the S protein of SARS-CoV-2 is the presence of a polybasic cleavage site (RRAR) at the junction of the two subunits of the S protein: S1 and S2 (Walls et al. 2020). This cleavage site is preceded by an inserted proline, which is predicted to result in the O-linked glycosylation of S673, T678, and S686 in the S protein sequence (Andersen et al. 2020). Furin proteases cleave this site during biosynthesis; this process is essential to enable exposure of the fusion sequence for cell membranes, which crucially mediates the cell entry process (Walls et al. 2020). The fact that the cleaving process varies from coronavirus to coronavirus implies that the cleavage site is associated with transmissibility and pathogenesis in the host animal.
CoVs have developed several genetic mechanisms to control replication errors as they have an extraordinarily large RNA genome. Recombination is the capacity to create chimeric molecules from two parental genomes of different origins during coinfection and is one such mechanism that contributes to the genetic stability and diversity of CoVs (Simon-Loriere and Holmes 2011; Lai et al. 1994). Deciphering the recombination events that contributed to the emergence of the virus is hence one way to identify the origins of SARS-CoV-2. This analysis is also important for identifying the expansion of viral host range and the evolution of resistance to antivirals (Brown 1997; Gibbs and Weiller 1999; Nora et al. 2007). Numerous tools for examining recombination have been developed within the last two decades, such as SIMPLOT, RDP, TOPALi, and 3seq, all of which are optimized to detect recombination in different ways (Lole et al. 1999; Milne et al. 2008; Martin and Rybicki 2000; Lam et al. 2017). A comprehensive list of tools for analyzing recombination events can be found at http://bioinf.man.ac.uk/robertson/recombination/programs.shtml. Even though CoVs have been reported to be a highly recombinogenic group of viruses, detecting recombination events in a recently emerged CoV whose ancestors likely underwent multiple recombination events is not trivial (Zhang and Holmes 2020; Forni et al. 2017; Hon et al. 2008). It has been reported that SARS-CoV-2 is not a recombinant of any viruses detected to date (Tang et al. 2020). This indicates that the sampled diversity of CoVs, especially within the subgenus sarbecovirus, is massively insufficient.
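The residue correspondence described above can be checked mechanically. The sketch below encodes the six SARS-CoV and SARS-CoV-2 residue pairs exactly as listed in the text, reports which pairs carry the same amino acid, and locates the RRAR motif in a short peptide. The peptide is an invented illustrative fragment, not a database record, and the function names are hypothetical.

```python
# (SARS-CoV residue, corresponding SARS-CoV-2 residue), as listed in the text.
RBD_CONTACTS = [
    ("Y442", "L455"),
    ("L472", "F486"),
    ("N479", "Q493"),
    ("D480", "S494"),
    ("T487", "N501"),
    ("Y491", "Y505"),
]

def conserved_pairs(pairs):
    """Return the pairs whose one-letter amino acid code is identical."""
    return [(a, b) for a, b in pairs if a[0] == b[0]]

def find_motif(sequence, motif="RRAR"):
    """Return the 0-based index of the first motif occurrence, or -1 if absent."""
    return sequence.upper().find(motif)

conserved = conserved_pairs(RBD_CONTACTS)

# Invented peptide used only to demonstrate the motif search.
demo_peptide = "TNSPRRARSVA"
motif_start = find_motif(demo_peptide)
```

Only the tyrosine pair is conserved, which matches the statement that five of the six contact residues differ between the two viruses.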
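One intuition behind similarity-plot approaches to recombination is that a recombinant query resembles different parental sequences in different genome regions. The following is a deliberately simplified sliding-window sketch of that idea. It is not the algorithm implemented in SIMPLOT, RDP, TOPALi, or 3seq, it performs no statistical test, and the sequences and names are invented for illustration.

```python
def window_identity(a, b, start, size, gap="-"):
    """Fraction of identical ungapped columns within one alignment window."""
    pairs = [(x, y) for x, y in zip(a[start:start + size], b[start:start + size])
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def similarity_scan(query, parents, size, step):
    """Slide a window along aligned sequences and report the closest parent.

    parents maps a name to a sequence aligned to the query.
    Returns a list of (window_start, best_parent_name, scores_by_parent).
    """
    rows = []
    for start in range(0, len(query) - size + 1, step):
        scores = {name: window_identity(query, seq, start, size)
                  for name, seq in parents.items()}
        best = max(scores, key=scores.get)
        rows.append((start, best, scores))
    return rows

# Toy alignment: the query matches parent_a in the first half, parent_b in the second.
parents = {"parent_a": "A" * 20, "parent_b": "C" * 20}
query = "A" * 10 + "C" * 10
scan = similarity_scan(query, parents, size=10, step=10)
closest_by_window = [(start, best) for start, best, _ in scan]
```

A switch in the closest parent between adjacent windows is only a hint of a possible breakpoint. Real analyses also need phylogenetic or statistical support, because rate variation and sparse sampling can produce similar patterns.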
47.3.2 Molecular Phylogenetics
Elucidating the origin, relationships, and transmission routes of emerging infectious agents is a key to conveniently understanding their biological processes and the potential methods of intervention. These functions are related to molecular evolution, which can be elegantly illustrated by molecular phylogenetic analysis (Kühnert et al. 2011; Lemey et al. 2009). The phylogenetic tree has played a fundamental role in the efforts of inferring the evolutionary history of several infectious agents such as HIV, HCV, and SARS-CoV (Kühnert et al. 2011; Gao et al. 1999; Santiago et al. 2002; Pybus et al. 2009; Markov et al. 2009; Li et al. 2005b). The phylogenetic methods are pertinent for detecting orthology and paralogy, estimating divergence times, reconstructing ancient proteins, finding the important residues to natural selection, identifying recombination points, determining the identity on new pathogens, and identifying mutations likely to be associated with disease (Holder and Lewis 2003). Once the predetermined sequences are aligned, a wide range of methods and software packages are available for reconstructing the phylogenetic tree, which can, at times, make it challenging for researchers to choose the well-suited ones. Each method relies on a different algorithm and has its own advantages and limitations. There are two types of reconstruction methods: distance-based methods and character-based methods (Lemey et al. 2009). Distance-based methods reconstruct trees from the calculated distance matrix of every pair of sequences. These techniques can also be classified into two categories: clustering-based (UPGMA and neighbor-joining) and optimality-based (Fitch-Margoliash and minimum evolution) algorithms (Lemey et al. 2009). 
These methods generally have a rapid calculation time and a large number of applicable substitution models for calculating the genetic distance scores; however, their applicability in cases with divergent sequences and high variance of large distance estimates is poor (Lemey et al. 2009; Yang and Rannala 2012). Conversely, character-based methods are based on the mutational event scoring of given aligned characters (sites in alignment), which prevents information loss in pairwise distance calculation (Yang and Rannala 2012; Xiong 2006). Character-based methods include maximum likelihood (ML), maximum parsimony (MP), and Bayesian inference (BI). MP follows the “Ockham’s razor” principle by selecting the proposed tree with a minimum score of discrete changes to the given alignment data (Lemey et al. 2009). The resulting tree in MP generally has the minimum instances of homoplasy (Lemey et al. 2009). Owing to its simplicity, the computations in MP are less extensive than those in other character-based methods, although these lack explicit assumptions. The ML method estimates trees with the highest likelihood of development using discrete change data of alignment based on multiple proposed trees depicting different evolutionary hypotheses (topologies, branch lengths, and sequence substitution models) of taxa in an alignment. Conversely, BI calculates the posterior probability distribution of the proposed trees based on the model, prior probability distribution, and data. When the data are informative, the majority of the posterior probability is typically concentrated in one tree (or a small subset of trees in an ample tree space) (Lemey et al. 2009; Xiong 2006). At present, ML and BI are considered to represent the best accurate methods since they incorporate the most complex statistical models.
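The distance-based route can be illustrated end to end on toy data. The sketch below computes pairwise p-distances from an alignment, applies the Jukes-Cantor correction as one example of a substitution model, and clusters the taxa with UPGMA. It returns only the tree topology as nested tuples and omits branch lengths. The sequences, taxon labels, and function names are invented, and dedicated phylogenetics software would be used for real analyses.

```python
import math
from itertools import combinations

def p_distance(a, b, gap="-"):
    """Proportion of differing sites among ungapped alignment columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    return sum(x != y for x, y in pairs) / len(pairs)

def jc69(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

def upgma(names, dist):
    """Cluster taxa by UPGMA; dist maps frozenset({a, b}) to a distance.

    Returns the topology as nested tuples (branch lengths are not computed).
    """
    sizes = {name: 1 for name in names}   # cluster label -> number of taxa
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

# Toy alignment (invented sequences).
seqs = {
    "A": "AAAAAAAAAA",
    "B": "AAAAAAAAAC",
    "C": "AAAAAACCCC",
    "D": "AAAAACCCCG",
}
names = list(seqs)
dist = {frozenset((x, y)): jc69(p_distance(seqs[x], seqs[y]))
        for x, y in combinations(names, 2)}
tree = upgma(names, dist)
```

UPGMA assumes a constant evolutionary rate across lineages, which is one reason neighbor-joining or character-based methods are usually preferred when rates vary.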
Several authors have constructed phylogenetic trees of SARS-CoV-2 together with previously known CoVs, using whole-genome and region-specific sequences, to understand its evolutionary history and recombination events (Li et al. 2020b; Wu et al. 2020; Lu et al. 2020; Tang et al. 2020). Based on the currently available genome samples of CoVs, the whole-genome phylogenetic tree indicates that SARS-CoV-2 is closest to SARSr-CoV RaTG13, followed by pangolin SARSr-CoVs. Two scenarios can plausibly explain the origin of SARS-CoV-2: natural selection in an animal host prior to zoonotic transfer and natural selection in humans following zoonotic transfer (Andersen et al. 2020). Under the first scenario, however, the genuine progenitor of SARS-CoV-2 remains unknown. This uncertainty is attributed to the absence of an animal coronavirus that has the highest similarity to SARS-CoV-2 across the entire genome; it also indicates that the subgenus Sarbecovirus is likely to harbor massive hidden diversity (Boni et al. 2020; Andersen et al. 2020). Even though the RaTG13 bat coronavirus has been identified as the virus with the highest similarity to SARS-CoV-2, the RBD in the S protein of SARS-CoV-2 is closely related to that of the pangolin (Manis javanica) coronavirus, which belongs to the sister lineage of the RaTG13 bat coronavirus (Wan et al. 2020; Andersen et al. 2020; Zhang et al. 2020). The RBD is a region critical for SARS-CoV-2 transmission in humans because it has a high affinity for human ACE2 (Wan et al. 2020). This binding is likely a result of natural selection on a human or human-like ACE2, as a novel binding pattern distinct from those previously predicted was observed (Wan et al. 2020; Andersen et al. 2020). This complication highlights the uncertainty involved in identifying the direct ancestor of SARS-CoV-2.
Conversely, pangolin SARSr-CoVs are likely to be the progenitor of SARS-CoV-2 if the second scenario applies, owing to the significant similarity of their RBD regions (Andersen et al. 2020). Since pangolin SARSr-CoVs have an RBD that binds with high affinity to human ACE2, it is likely that this virus jumped to the human population and then acquired the genomic features that gave rise to SARS-CoV-2 (Andersen et al. 2020).
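The contrast drawn above, in which one relative is closest across the whole genome while another is closest within a single region such as the RBD, can be examined with a sliding-window identity scan. The following is a minimal sketch under stated assumptions: the sequences are short invented strings standing in for pre-aligned genomes, and the names `window_identity`, `ref_genome_wide`, and `ref_region_like` are hypothetical.

```python
def window_identity(query, ref, window, step):
    """Percent identity between two aligned sequences in sliding windows.

    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(query) != len(ref):
        raise ValueError("sequences must be aligned to equal length")
    result = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = ref[start:start + window]
        matches = sum(x == y for x, y in zip(q, r))
        result.append((start, 100.0 * matches / window))
    return result


# Invented toy data, not real coronavirus sequences.
query = "ACGTACGTACGTACGTACGT"
# Identical to the query except in positions 8-11.
ref_genome_wide = "ACGTACGTTTAAACGTACGT"
# Scattered differences elsewhere, but identical in positions 8-11.
ref_region_like = "GCATATGTACGTATGTGCAT"

for name, ref in (("ref_genome_wide", ref_genome_wide),
                  ("ref_region_like", ref_region_like)):
    overall = window_identity(query, ref, len(query), 1)[0][1]
    print(name, "overall:", overall, "windows:",
          window_identity(query, ref, 4, 4))
```

A single whole-sequence identity value hides the region in which the ranking of the two references is reversed. This is one reason region-specific trees and recombination analyses are run alongside whole-genome trees.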
47.4 Computational Approach for One Step of SARS-CoV-2 Vaccine Design
The spike glycoprotein plays an important role in binding to host cell receptors; it is found on the surface of virions, which makes it accessible to neutralizing antibodies. Vaccines have proven effective in controlling infectious diseases and aim to reduce the harm caused by the etiological agent. Currently, around 90 vaccine candidates against SARS-CoV-2 are being developed toward clinical trials by universities and companies across the world. A variety of distinct approaches are being tried against SARS-CoV-2, such as virus vaccines (weakened or inactivated virus), viral-vector vaccines (using adenovirus), nucleic-acid vaccines (DNA or RNA vaccines), and protein-based vaccines (protein subunits or virus-like particles) (Sun and Zhang 2014). An effective vaccine against SARS-CoV-2 infection is urgently needed, and classical methods of vaccine design are unlikely to be fast enough under current conditions (Doytchinova and Flower 2007; Lamiable et al. 2016). Currently, there are no inactivated or live-attenuated vaccines efficient enough to provide extensive protection against SARS-CoV-2 infection (Kharisma and Ansori 2020). During the current critical period, the immunoinformatics-based approach could help shorten the experimental time needed to discover potential SARS-CoV-2 vaccine candidates. In addition, the accumulating releases of SARS-CoV-2 genomes in NCBI GenBank and GISAID EpiCoV facilitate the immunoinformatics-based approach to the development of virus-based subunit vaccines. A subunit candidate, such as the S1 protein or the RBD of SARS-CoV-2, is one of the most valuable targets for vaccine design (Jespersen et al. 2017).
Immunoinformatics tools of B-cell and T-cell epitope prediction
| Prediction | Tool | URL | References |
|---|---|---|---|
| B-cell epitope | BepiPred | | Jespersen et al. (2017) |
| B-cell epitope | DiscoTope | | Kringelum et al. (2012) |
| B-cell epitope | ElliPro | | Ponomarenko et al. (2008) |
| B-cell epitope | ABCpred | | Saha and Raghava (2006) |
| B-cell epitope | BCPred | | El-Manzalawy et al. (2008) |
| B-cell epitope | COBEpro | | Sweredoski and Baldi (2009) |
| B-cell epitope | SVMTriP | | Yao et al. (2012) |
| B-cell epitope | LBtope | | Singh et al. (2013) |
| B-cell epitope | EpiPred | http://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred/epipred/ | Dunbar et al. (2016) |
| B-cell epitope | Pepitope | | Mayrose et al. (2007) |
| T-cell epitope | IEDB-MHCI | | Nielsen and Andreatta (2016) |
| T-cell epitope | IEDB-MHCII | | Nielsen et al. (2007) |
| T-cell epitope | NetMHC | | Lundegaard et al. (2008) |
| T-cell epitope | NetMHCII | | Jensen et al. (2018) |
| T-cell epitope | NetMHCIIpan | https://services.healthtech.dtu.dk/service.php?NetMHCIIpan-4.0 | Karosiene et al. (2013) |
| T-cell epitope | IL4pred | | Dhanda et al. (2013) |
| T-cell epitope | ProPred | | Singh and Raghava (2001) |
| T-cell epitope | EpiTOP | | Dimitrov et al. (2010) |
| T-cell epitope | MHCPred | | Guan et al. (2003) |
| T-cell epitope | PEPVAC | | Reche and Reinherz (2005) |
Prediction of B-cell epitopes gives insight into the positions on the surface of an antigen that can be recognized by the B-cell receptor (BCR) in the adaptive immune response. The Immune Epitope Database and Analysis Resource (IEDB) (http://tools.iedb.org/main/) is one prominent platform for epitope prediction. Approaches to epitope prediction can be divided into linear and discontinuous predictions (Sun and Zhang 2014). Linear prediction is based on antibody recognition of the primary structure (the amino acid sequence) of the antigen. It is distinct from discontinuous prediction, which uses the 3D form of an epitope for recognition. Methods of linear B-cell epitope prediction use the sequence characteristics of antigens through amino acid scales and hidden Markov models (HMMs) (http://tools.iedb.org/bcell/), and cover the prediction of hydrophilicity, flexibility, accessibility, surface exposure, and antigenic propensity in polypeptide chains (Sun and Zhang 2014; Doytchinova and Flower 2007). BepiPred is one of the methods for predicting B-cell epitopes on the IEDB platform; it is based on a random forest algorithm trained on epitopes annotated from antibody-antigen protein structures (Sun and Zhang 2014). Other methods include DiscoTope and ElliPro, which determine B-cell epitopes in 3D antigen structures based on solvent accessibility and flexibility (Sun and Zhang 2014). Conversely, T-cell epitope prediction aims to identify short peptides in antigens that allow CD4 or CD8 cell stimulation. The prediction of T-cell epitopes involves three steps, namely, antigen processing, binding of peptides to MHC molecules, and recognition by the T-cell receptor.
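The amino acid scale idea behind linear B-cell epitope prediction can be sketched as a sliding-window average of a per-residue propensity. The example below is an illustration under explicit assumptions, not the method of any tool listed above: it uses the Kyte-Doolittle hydropathy index with its sign flipped as a stand-in hydrophilicity scale, a window of seven residues, and hypothetical function names. The only input taken from this chapter is the Pep_4 peptide string.

```python
# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
# Used here with the sign flipped as a simple hydrophilicity proxy;
# this choice of scale is an assumption made for illustration.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(seq, window=7):
    """Mean hydrophilicity proxy in a window centred on each residue.

    Returns (index, residue, score) for every residue that has a full
    window around it; residues near the ends are skipped.
    """
    if window % 2 == 0 or window > len(seq):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    scores = []
    for centre in range(half, len(seq) - half):
        segment = seq[centre - half:centre + half + 1]
        mean = sum(-KYTE_DOOLITTLE[aa] for aa in segment) / window
        scores.append((centre, seq[centre], mean))
    return scores


def candidate_residues(seq, window=7, threshold=0.0):
    """Residues whose window score exceeds an arbitrary threshold."""
    return [(i, aa) for i, aa, score in window_scores(seq, window)
            if score > threshold]


PEP_4 = "ADHQPQTFVNTELH"  # peptide reported in this chapter
for index, residue, score in window_scores(PEP_4):
    print(index, residue, round(score, 2))
```

Scale-based scores of this kind are only a first filter. The structure-based and machine-learning tools in the table above combine several such properties with training data.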
Schematic representation of the epitope-based peptide vaccine candidate (green) from the spike glycoprotein, recognized by the B-cell immune response to produce specific antibodies against SARS-CoV-2 (Kharisma and Ansori 2020)
Previously, we characterized the spike glycoprotein of SARS-CoV-2 to obtain an epitope-based peptide vaccine against SARS-CoV-2. In that study, SARS-CoV-2 isolates were retrieved from GISAID EpiCoV and NCBI and then aligned to obtain the conserved region of the SARS-CoV-2 spike glycoprotein. We identified Pep_4 (ADHQPQTFVNTELH) as a potential B-cell epitope vaccine candidate to help overcome the SARS-CoV-2 outbreak. The Pep_4 vaccine candidate is predicted to trigger a B-cell immune response through direct binding to the BCR/Fab molecule (Kharisma and Ansori 2020).
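The step of aligning isolates and extracting a conserved region can be sketched as a scan for runs of identical, gap-free alignment columns. This is a simplified illustration and not the procedure of the cited study: the function name `conserved_regions` and the minimum run length are assumptions, and the three toy sequences are invented apart from the embedded Pep_4 string, whose flanking residues are made up.

```python
def conserved_regions(alignment, min_length=5):
    """Find runs of fully conserved, gap-free columns.

    alignment is a list of equal-length aligned sequences.
    Returns (start, end, segment) tuples with end exclusive.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    reference = alignment[0]
    regions = []
    start = None
    # Iterate one position past the end so a trailing run is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and reference[i] != "-"
            and all(seq[i] == reference[i] for seq in alignment)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                regions.append((start, i, reference[start:i]))
            start = None
    return regions


# Invented toy alignment; only the shared 14-residue core is from the text.
toy_alignment = [
    "MKADHQPQTFVNTELHGS",
    "MRADHQPQTFVNTELHAS",
    "MKADHQPQTFVNTELH-S",
]
print(conserved_regions(toy_alignment))
```

In practice the alignment would come from a dedicated aligner, and conservation is often relaxed to a per-column identity threshold rather than strict identity across all isolates.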
47.5 Conclusion
Computational and statistical approaches are essential in virology. Here, we reported numerous tools and platforms that have been developed to facilitate global research into the various concerns associated with SARS-CoV-2, which may help in devising solutions to its devastating effects. The use of these resources aids active tracking of important findings and also helps health authorities develop strategies to prevent cross-species transmission and control outbreaks in the future. Additionally, advanced algorithms and databases help meet the urgent need to find potential interventions, such as vaccine designs, through computational simulation.
References
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389
- Andersen KG, Rambaut A, Lipkin WI, Holmes EC, Garry RF (2020) The proximal origin of SARS-CoV-2. Nat. Med. 26(4):450–452. https://doi.org/10.1038/s41591-020-0820-9
- Babcock GJ, Esshaki DJ, Thomas WD, Ambrosino DM (2004) Amino acids 270 to 510 of the severe acute respiratory syndrome coronavirus spike protein are required for interaction with receptor. J. Virol. 78(9):4552–4560. https://doi.org/10.1128/jvi.78.9.4552-4560.2004
- Balakrishnan VS (2020) Increasing accessibility in COVID-19 clinical trials. The Lancet Microbe 1(1):e13
- Belouzard S, Millet JK, Licitra BN, Whittaker GR (2012) Mechanisms of coronavirus cell entry mediated by the viral spike protein. Viruses 4(6):1011–1033. https://doi.org/10.3390/v4061011
- Boni MF, Lemey P, Jiang X, Lam TT-Y, Perry B, Castoe T, Rambaut A, Robertson DL (2020) Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. bioRxiv:2020.2003.2030.015008. https://doi.org/10.1101/2020.03.30.015008
- Brainard J (2020) Scientists are drowning in COVID-19 papers. Can new tools keep them afloat? https://www.sciencemag.org/news/2020/05/scientists-are-drowning-covid-19-papers-can-new-tools-keep-them-afloat. Accessed 15 May 2020
- Breitbart M, Rohwer F (2005) Here a virus, there a virus, everywhere the same virus? Trends Microbiol. 13(6):278–284. https://doi.org/10.1016/j.tim.2005.04.003
- Brister JR, Ako-Adjei D, Bao Y, Blinkova O (2015) NCBI viral genomes resource. Nucleic Acids Res. 43(Database issue):D571–D577. https://doi.org/10.1093/nar/gku1207
- Brown DWG (1997) Threat to humans from virus infections of non-human primates. Rev. Med. Virol. 7(4):239–246. https://doi.org/10.1002/(sici)1099-1654(199712)7:4<239::Aid-rmv210>3.0.Co;2-q
- Burley SK, Berman HM, Kleywegt GJ, Markley JL, Nakamura H, Velankar S (2017) Protein Data Bank (PDB): the single global macromolecular structure archive. Methods Mol. Biol. 1607:627–641. https://doi.org/10.1007/978-1-4939-7000-1_26
- Chan JF, Kok KH, Zhu Z, Chu H, To KK, Yuan S, Yuen KY (2020) Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan. Emerg Microbes Infect 9(1):221–236. https://doi.org/10.1080/22221751.2020.1719902
- Chang J (2015) Core services: reward bioinformaticians. Nature 520(7546):151–152
- Chen Q, Allot A, Lu Z (2020) Keep up with the latest coronavirus research. Nature 579(7798):193. https://doi.org/10.1038/d41586-020-00694-1
- ClinicalTrials.gov (2000). ClinicalTrials.gov. https://clinicaltrials.gov/. Accessed 1 April 2020
- COVID-evidence (2020) Find evidence on interventions for COVID-19. https://covid-evidence.org/. Accessed 1 May 2020
- Delwart EL (2007) Viral metagenomics. Rev. Med. Virol. 17(2):115–131. https://doi.org/10.1002/rmv.532
- Dhanda SK, Gupta S, Vir P, Raghava GPS (2013) Prediction of IL4 inducing peptides. Clin Dev Immunol 2013:263952
- Dimitrov I, Garnev P, Flower DR, Doytchinova I (2010) EpiTOP—a proteochemometric tool for MHC class II binding prediction. Bioinformatics 26(16):2066–2068
- Doytchinova IA, Flower DR (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics 8(1):4. https://doi.org/10.1186/1471-2105-8-4
- Dunbar J, Krawczyk K, Leem J, Marks C, Nowak J, Regep C, Georges G, Kelm S, Popovic B, Deane CM (2016) SAbPred: a structure-based antibody prediction server. Nucleic Acids Res 44(W1):W474–8
- Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797. https://doi.org/10.1093/nar/gkh340
- Elbe S, Buckland-Merrett G (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall 1(1):33–46. https://doi.org/10.1002/gch2.1018
- El-Manzalawy Y, Dobbs D, Honavar V (2008) Predicting linear B-cell epitopes using string kernels. J Mol Recognit 21(4):243–255
- Fahmi M, Kubota Y, Ito M (2020) Nonstructural proteins NS7b and NS8 are likely to be phylogenetically associated with evolution of 2019-nCoV. Infect. Genet. Evol. 81:104272. https://doi.org/10.1016/j.meegid.2020.104272
- Fehr AR, Perlman S (2015) Coronaviruses: an overview of their replication and pathogenesis. Methods Mol. Biol. 1282:1–23. https://doi.org/10.1007/978-1-4939-2438-7_1
- Flint SJ, Racaniello VR, Rall GF, Skalka AM, Enquist LW (2015) Principles of virology, vol 1, 4th edn. ASM Press, Washington, DC
- Forni D, Cagliani R, Clerici M, Sironi M (2017) Molecular evolution of human coronavirus genomes. Trends Microbiol. 25(1):35–48. https://doi.org/10.1016/j.tim.2016.09.001
- Friedman LM, Furberg C, DeMets DL, Reboussin DM, Granger CB (2010) Fundamentals of clinical trials, vol 4. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-18539-2
- Gao F, Bailes E, Robertson DL, Chen Y, Rodenburg CM, Michael SF, Cummins LB, Arthur LO, Peeters M, Shaw GM, Sharp PM, Hahn BH (1999) Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397(6718):436–441. https://doi.org/10.1038/17130
- Gibbs MJ, Weiller GF (1999) Evidence that a plant virus switched hosts to infect a vertebrate and then recombined with a vertebrate-infecting virus. Proc. Natl. Acad. Sci. 96(14):8022–8027. https://doi.org/10.1073/pnas.96.14.8022
- Guan P, Doytchinova IA, Zygouri C, Flower DR (2003) MHCPred: a server for quantitative prediction of peptide–MHC binding. Nucleic Acids Res 31:3621–3624
- Holder M, Lewis PO (2003) Phylogeny estimation: traditional and Bayesian approaches. Nat. Rev. Genet. 4(4):275–284. https://doi.org/10.1038/nrg1044
- Hon C-C, Lam T-Y, Shi Z-L, Drummond AJ, Yip C-W, Zeng F, Lam P-Y, Leung FC-C (2008) Evidence of the recombinant origin of a bat severe acute respiratory syndrome (SARS)-like coronavirus and its implications on the direct ancestor of SARS coronavirus. J. Virol. 82(4):1819–1826. https://doi.org/10.1128/jvi.01926-07
- Horbach SPJM (2020) Pandemic publishing: medical journals drastically speed up their publication process for Covid-19. bioRxiv:2020.2004.2018.045963. https://doi.org/10.1101/2020.04.18.045963
- Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, Zhang L, Fan G, Xu J, Gu X, Cheng Z, Yu T, Xia J, Wei Y, Wu W, Xie X, Yin W, Li H, Liu M, Xiao Y, Gao H, Guo L, Xie J, Wang G, Jiang R, Gao Z, Jin Q, Wang J, Cao B (2020) Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395(10223):497–506. https://doi.org/10.1016/s0140-6736(20)30183-5
- Hulo C, de Castro E, Masson P, Bougueleret L, Bairoch A, Xenarios I, Le Mercier P (2011) ViralZone: a knowledge resource to understand virus diversity. Nucleic Acids Res. 39(Database issue):D576–D582. https://doi.org/10.1093/nar/gkq901
- Jensen KK, Andreatta M, Marcatili P, Buus S, Greenbaum JA, Yan Z, Sette A, Peters B, Nielsen M (2018) Improved methods for predicting peptide binding affinity to MHC class II molecules. Immunology 154(3):394–406
- Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 45(W1):W24–W29. https://doi.org/10.1093/nar/gkx346
- Johns Hopkins Bloomberg School of Public Health (2020) 2019 novel coronavirus research compendium (NCRC). https://ncrc.jhsph.edu. Accessed 8 May 2020
- Johnson LS, Eddy SR, Portugaly E (2010) Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11:431. https://doi.org/10.1186/1471-2105-11-431
- Karosiene E, Rasmussen M, Blicher T, Lund O, Buus S, Nielsen M (2013) NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ. Immunogenetics 65(10):711–724
- Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066. https://doi.org/10.1093/nar/gkf436
- Kharisma V, Ansori A (2020) Construction of epitope-based peptide vaccine against SARS-CoV-2: Immunoinformatics study. J Pure Appl Microbiol 14(suppl 1):999–1005
- Kim D, Lee JY, Yang JS, Kim JW, Kim VN, Chang H (2020) The architecture of SARS-CoV-2 transcriptome. Cell 181(4):914–921.e910. https://doi.org/10.1016/j.cell.2020.04.011
- Koonin EV, Galperin MY (2003) Sequence – evolution – function: computational approaches in comparative genomics. Springer US, Boston. https://doi.org/10.1007/978-1-4757-3783-7
- Kozakov D, Hall DR, Xia B, Porter KA, Padhorny D, Yueh C, Beglov D, Vajda S (2007) The ClusPro web server for protein-protein docking. Nat Protoc 12(2):255–278. https://doi.org/10.1038/nprot.2016.169
- Kringelum JV, Lundegaard C, Lund O, Nielsen M (2012) Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol 8(12):e1002829
- Kühnert D, Wu CH, Drummond AJ (2011) Phylogenetic and epidemic modeling of rapidly evolving infectious diseases. Infect. Genet. Evol. 11(8):1825–1841. https://doi.org/10.1016/j.meegid.2011.08.005
- Lai MM, Liao CL, Lin YJ, Zhang X (1994) Coronavirus: how a large RNA viral genome is replicated and transcribed. Infect. Agents Dis. 3(2–3):98–105
- Lam HM, Ratmann O, Boni MF (2017) Improved algorithmic complexity for the 3SEQ recombination detection algorithm. Mol. Biol. Evol. 35(1):247–251. https://doi.org/10.1093/molbev/msx263
- Lamiable A, Thévenet P, Rey J, Vavrusa M, Derreumaux P, Tufféry P (2016) PEP-FOLD3: faster de novo structure prediction for linear peptides in solution and in complex. Nucleic Acids Res. 44(W1):W449–W454. https://doi.org/10.1093/nar/gkw329
- Lemey P, Salemi M, Vandamme A-M (2009) The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. Cambridge University Press
- Li F (2008) Structural analysis of major species barriers between humans and palm civets for severe acute respiratory syndrome coronavirus infections. J. Virol. 82(14):6984–6991. https://doi.org/10.1128/jvi.00442-08
- Li F (2015) Receptor recognition mechanisms of coronaviruses: a decade of structural studies. J. Virol. 89(4):1954–1964. https://doi.org/10.1128/jvi.02615-14
- Li F, Li W, Farzan M, Harrison SC (2005a) Structure of SARS coronavirus spike receptor-binding domain complexed with receptor. Science 309(5742):1864–1868. https://doi.org/10.1126/science.1116480
- Li W, Shi Z, Yu M, Ren W, Smith C, Epstein JH, Wang H, Crameri G, Hu Z, Zhang H, Zhang J, McEachern J, Field H, Daszak P, Eaton BT, Zhang S, Wang LF (2005b) Bats are natural reservoirs of SARS-like coronaviruses. Science 310(5748):676–679. https://doi.org/10.1126/science.1118391
- Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, Ren R, Leung KSM, Lau EHY, Wong JY, Xing X, Xiang N, Wu Y, Li C, Chen Q, Li D, Liu T, Zhao J, Liu M, Tu W, Chen C, Jin L, Yang R, Wang Q, Zhou S, Wang R, Liu H, Luo Y, Liu Y, Shao G, Li H, Tao Z, Yang Y, Deng Z, Liu B, Ma Z, Zhang Y, Shi G, Lam TTY, Wu JT, Gao GF, Cowling BJ, Yang B, Leung GM, Feng Z (2020a) Early transmission dynamics in Wuhan, China, of novel coronavirus infected pneumonia. N Engl J Med 382(13):1199–1207
- Li X, Giorgi EE, Marichann MH, Foley B, Xiao C, Kong X-P, Chen Y, Korber B, Gao F (2020b) Emergence of SARS-CoV-2 through recombination and strong purifying selection. bioRxiv:2020.2003.2020.000885. https://doi.org/10.1101/2020.03.20.000885
- Lole KS, Bollinger RC, Paranjape RS, Gadkari D, Kulkarni SS, Novak NG, Ingersoll R, Sheppard HW, Ray SC (1999) Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination. J. Virol. 73(1):152–160. https://doi.org/10.1128/jvi.73.1.152-160.1999
- Lorenc T, Khouja C, Raine G, Sutcliffe K, Wright K, Sowden A, Thomas J (2020) COVID-19: living map of the evidence. http://eppi.ioe.ac.uk/COVID19_MAP/covid_map_v11.html. Accessed 15 May 2020
- Lu R, Zhao X, Li J, Niu P, Yang B, Wu H, Wang W, Song H, Huang B, Zhu N, Bi Y, Ma X, Zhan F, Wang L, Hu T, Zhou H, Hu Z, Zhou W, Zhao L, Chen J, Meng Y, Wang J, Lin Y, Yuan J, Xie Z, Ma J, Liu WJ, Wang D, Xu W, Holmes EC, Gao GF, Wu G, Chen W, Shi W, Tan W (2020) Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet 395(10224):565–574. https://doi.org/10.1016/s0140-6736(20)30251-8
- Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O, Nielsen M (2008) NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8–11. Nucleic Acids Res 1:W509–12
- Markov PV, Pepin J, Frost E, Deslandes S, Labbé AC, Pybus OG (2009) Phylogeography and molecular epidemiology of hepatitis C virus genotype 2 in Africa. J. Gen. Virol. 90(Pt 9):2086–2096. https://doi.org/10.1099/vir.0.011569-0
- Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YS, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL (2003) The genome sequence of the SARS-associated coronavirus. Science 300(5624):1399–1404. https://doi.org/10.1126/science.1085953
- Martin D, Rybicki E (2000) RDP: detection of recombination amongst aligned sequences. Bioinformatics 16(6):562–563. https://doi.org/10.1093/bioinformatics/16.6.562
- Marz M, Beerenwinkel N, Drosten C, Fricke M, Frishman D, Hofacker IL, Hoffmann D, Middendorf M, Rattei T, Stadler PF, Töpfer A (2014) Challenges in RNA virus bioinformatics. Bioinformatics 30(13):1793–1799. https://doi.org/10.1093/bioinformatics/btu105
- Mayrose I, Penn O, Erez E, Rubinstein ND, Shlomi T, Freund NT, Bublil EM, Ruppin E, Sharan R, Gershoni JM, Martz E, Pupko T (2007) Pepitope: epitope mapping from affinity-selected peptides. Bioinformatics 23(23):3244–3246
- Milne I, Lindner D, Bayer M, Husmeier D, McGuire G, Marshall DF, Wright F (2008) TOPALi v2: a rich graphical interface for evolutionary analyses of multiple alignments on HPC clusters and multi-core desktops. Bioinformatics 25(1):126–127. https://doi.org/10.1093/bioinformatics/btn575
- Nielsen M, Andreatta M (2016) NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets. Genome Med 8(1):1–9
- Nielsen M, Lundegaard C, Lund O (2007) Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC bioinform 8(1):238
- Nora T, Charpentier C, Tenaillon O, Hoede C, Clavel F, Hance AJ (2007) Contribution of recombination to the evolution of human immunodeficiency viruses expressing resistance to antiretroviral treatment. J. Virol. 81(14):7620–7628. https://doi.org/10.1128/jvi.00083-07
- Paraskevis D, Kostaki EG, Magiorkinis G, Panayiotakopoulos G, Sourvinos G, Tsiodras S (2020) Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infect. Genet. Evol. 79:104212. https://doi.org/10.1016/j.meegid.2020.104212
- Pearson WR (2013) An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinformatics. Chapter 3:Unit3.1. https://doi.org/10.1002/0471250953.bi0301s42
- Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U. S. A. 85(8):2444–2448. https://doi.org/10.1073/pnas.85.8.2444
- Ponomarenko J, Bui HH, Li W, Fusseder N, Bourne PE, Sette A, Peters B (2008) ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinform 9(1):514
- Pybus OG, Barnes E, Taggart R, Lemey P, Markov PV, Rasachak B, Syhavong B, Phetsouvanah R, Sheridan I, Humphreys IS, Lu L, Newton PN, Klenerman P (2009) Genetic history of hepatitis C virus in East Asia. J. Virol. 83(2):1071–1082. https://doi.org/10.1128/jvi.01501-08
- Reche PA, Reinherz EL (2005) PEPVAC: a web server for multi-epitope vaccine development based on the prediction of supertypic MHC ligands. Nucleic Acids Res 33(1):W138–42
- Rota PA, Oberste MS, Monroe SS, Nix WA, Campagnoli R, Icenogle JP, Peñaranda S, Bankamp B, Maher K, Chen MH, Tong S, Tamin A, Lowe L, Frace M, DeRisi JL, Chen Q, Wang D, Erdman DD, Peret TC, Burns C, Ksiazek TG, Rollin PE, Sanchez A, Liffick S, Holloway B, Limor J, McCaustland K, Olsen-Rasmussen M, Fouchier R, Günther S, Osterhaus AD, Drosten C, Pallansch MA, Anderson LJ, Bellini WJ (2003) Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science 300(5624):1394–1399. https://doi.org/10.1126/science.1085952
- Ruano J, Gómez-García F, Pieper D, Puljak L (2020) What evidence-based medicine researchers can do to help clinicians fighting COVID-19? J. Clin. Epidemiol. https://doi.org/10.1016/j.jclinepi.2020.04.015
- Saha S, Raghava GPS (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48
- Santiago ML, Rodenburg CM, Kamenya S, Bibollet-Ruche F, Gao F, Bailes E, Meleth S, Soong SJ, Kilby JM, Moldoveanu Z, Fahey B, Muller MN, Ayouba A, Nerrienet E, McClure HM, Heeney JL, Pusey AE, Collins DA, Boesch C, Wrangham RW, Goodall J, Sharp PM, Shaw GM, Hahn BH (2002) SIVcpz in wild chimpanzees. Science 295(5554):465. https://doi.org/10.1126/science.295.5554.465
- SciSight (2020) SciSight. https://scisight.apps.allenai.org/. Accessed 1 May 2020
- Simon-Loriere E, Holmes EC (2011) Why do RNA viruses recombine? Nat. Rev. Microbiol. 9(8):617–626. https://doi.org/10.1038/nrmicro2614
- Singh H, Raghava GPS (2001) ProPred: prediction of HLA-DR binding sites. Bioinformatics 17(12):1236–1237
- Singh H, Ansari HR, Raghava GPS (2013) Improved method for linear B-cell epitope prediction using antigen’s primary sequence. PLOS ONE 8(5):e62216
- Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147(1):195–197. https://doi.org/10.1016/0022-2836(81)90087-5
- SPIKE-COVID (2020) SPIKE-COVID: Extractive search over CORD 19. https://spike.covid-19.apps.allenai.org/search/covid19. Accessed 2 April 2020
- Sun B, Zhang Y (2014) Overview of orchestration of CD4+ T cell subsets in immune responses. Adv. Exp. Med. Biol. 841:1–13. https://doi.org/10.1007/978-94-017-9487-9_1
- Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE (1998) Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr. D Biol. Crystallogr. 54(Pt 6 Pt 1):1078–1084. https://doi.org/10.1107/s0907444998009378
- Suttle CA (2005) Viruses in the sea. Nature 437(7057):356–361. https://doi.org/10.1038/nature04160
- Sweredoski MJ, Baldi P (2009) COBEpro: a novel system for predicting continuous B-cell epitopes. Protein Eng Des Sel 22(3):113–120
- Tang X, Wu C, Li X, Song Y, Yao X, Wu X, Duan Y, Zhang H, Wang Y, Qian Z, Cui J, Lu J (2020) On the origin and continuing evolution of SARS-CoV-2. Natl. Sci. Rev. https://doi.org/10.1093/nsr/nwaa036
- Thiel V, Ivanov KA, Putics Á, Hertzig T, Schelle B, Bayer S, Weißbrich B, Snijder EJ, Rabenau H, Doerr HW, Gorbalenya AE, Ziebuhr J (2003) Mechanisms and enzymes involved in SARS coronavirus genome expression. J. Gen. Virol. 84(Pt 9):2305–2315. https://doi.org/10.1099/vir.0.19424-0
- Thorlund K, Dron L, Park J, Hsu G, Forrest JI, Mills EJ (2020) A real-time dashboard of clinical trials for COVID-19. Lancet Digit Health 2(6):e286–e287. https://doi.org/10.1016/s2589-7500(20)30086-8
- Walls AC, Park YJ, Tortorici MA, Wall A, McGuire AT, Veesler D (2020) Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell 181(2):281–292.e286. https://doi.org/10.1016/j.cell.2020.02.058
- Wan Y, Shang J, Graham R, Baric RS, Li F (2020) Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus. J. Virol. 94(7):e00127–e00120. https://doi.org/10.1128/jvi.00127-20
- Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Eide D, Funk K, Kinney R, Liu Z, Merrill W (2020) CORD-19: the Covid-19 open research dataset. arXiv preprint arXiv:2004.10706
- World Health Organization (2005) The international clinical trials registry platform – ICTRP. https://www.who.int/ictrp/en/. Accessed 2 April 2020
- World Health Organization (2020a) Coronavirus disease 2019 (COVID-19) situation report – 51, 11 March 2020. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports. Accessed 15 April 2020
- World Health Organization (2020b) WHO coronavirus disease (COVID-19) dashboard. https://covid19.who.int/. Accessed 1 May 2020
- Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265–269. https://doi.org/10.1038/s41586-020-2008-3
- Xiong J (2006) Essential bioinformatics. Cambridge University Press, New York
- Yang Z, Rannala B (2012) Molecular phylogenetics: principles and practice. Nat. Rev. Genet. 13(5):303–314. https://doi.org/10.1038/nrg3186
- Yao B, Zhang L, Liang S, Zhang C (2012) SVMTriP: a method to predict antigenic epitopes using support vector machine to integrate tri-peptide similarity and propensity. PLOS ONE 7(9):e45152
- Zhang Y-Z, Holmes EC (2020) A genomic perspective on the origin and emergence of SARS-CoV-2. Cell 181(2):223–227. https://doi.org/10.1016/j.cell.2020.03.035
- Zhang T, Wu Q, Zhang Z (2020) Pangolin homology associated with 2019-nCoV. bioRxiv:2020.2002.2019.950253. https://doi.org/10.1101/2020.02.19.950253
- Zhou P, Yang XL, Wang XG, Hu B, Zhang L, Zhang W, Si HR, Zhu Y, Li B, Huang CL, Chen HD, Chen J, Luo Y, Guo H, Jiang RD, Liu MQ, Chen Y, Shen XR, Wang X, Zheng XS, Zhao K, Chen QJ, Deng F, Liu LL, Yan B, Zhan FX, Wang YY, Xiao GF, Shi ZL (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798):270–273. https://doi.org/10.1038/s41586-020-2012-7
- Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R, Niu P, Zhan F, Ma X, Wang D, Xu W, Wu G, Gao GF, Tan W (2020) A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med 382(8):727–733

