Introduction

Coronavirus disease 2019 (COVID-19) originated from a seafood market in Wuhan (the capital of Hubei Province in central China) and spread rapidly to more than 200 countries. By 2 Jul 2020, the total number of confirmed cases had exceeded 10.5 million, and 512,000 deaths had been reported. The symptoms of COVID-19 include cough, fever, headache, fatigue, sore throat, and malaise. The disease can lead to complications such as pneumonia and severe acute respiratory syndrome (WHO 2020; Ahmad et al. 2020; Velavan and Meyer 2020). COVID-19 is transmitted through direct or indirect contact with respiratory droplets and biological samples such as urine, saliva, and stool (Shereen and Khan 2020). In addition, some studies have detected the virus in air samples, and one study reported that the virus in air samples remains viable for up to 3 h (Cheng et al. 2019; Ong et al. 2020; Liu et al. 2020; Doremalen et al. 2020).

The coronavirus that causes COVID-19 was named SARS-CoV-2 by the WHO and the International Committee on Taxonomy of Viruses (ICTV) and is grouped in the same species as SARS-CoV (International Committee on Taxonomy of Viruses (ICTV) 2020). The two viruses belong to the family Coronaviridae, subfamily Orthocoronavirinae, genus Betacoronavirus, and subgenus Sarbecovirus, and the species is severe acute respiratory syndrome-related coronavirus. Bat coronavirus (BatCoV RaTG13) was isolated from bats of the species Rhinolophus affinis. Like SARS-CoV-2 and SARS-CoV, BatCoV RaTG13 belongs to the genus Betacoronavirus, and its genome shares 96% sequence identity with that of SARS-CoV-2 (Zhou et al. 2020).

SARS-CoV-2, SARS-CoV, and bat coronavirus BatCoV RaTG13 have the same virion structure. They are RNA viruses with a nucleocapsid protein and an envelope. The viral envelope contains a lipid bilayer and three proteins: the spike protein, the envelope protein, and the membrane protein (Perlman and Netland 2009).

The genomes of the three viruses contain two major regions: orf1ab and orf1a (comprising two-thirds of the genome) and the structural and accessory protein genes (comprising one-third of the genome). The orf1ab and orf1a genes are translated, and the products are proteolytically cleaved to produce 16 nonstructural proteins (nsp1–nsp16), while translation of the second region produces the structural proteins spike (S), envelope (E), membrane (M), and nucleocapsid (N) and the accessory proteins orf3a, orf3b, orf6, orf7a, orf7b, orf8a, orf8b, orf9b and orf10. The number and type of accessory proteins differ according to the virus (Zhou et al. 2020; Yoshimoto 2020; Wang et al. 2020; Khailany et al. 2020; Wong et al. 2019; GenBank 2020).

Regarding NS3, NS6, NS7a, NS7b and NS8 of BatCoV RaTG13, some published articles named them nonstructural proteins, and others named them accessory proteins (GenBank 2020; Fahmi et al. 2020; Tang et al. 2020; Li et al. 2020). These proteins are encoded by genes similar to those of structural and accessory proteins, and because they are comparable to the accessory proteins of SARS-CoV-2, they are considered accessory proteins.

This article investigated the protein sequence identity and similarity percentages of SARS-CoV-2 and compared them to the proteins of SARS-CoV and the BatCoV RaTG13.

Materials and methods

Study proteins

The 1ab polyprotein of SARS-CoV-2, SARS-CoV, and BatCoV RaTG13 was studied. Additionally, the structural and accessory proteins found in SARS-CoV-2 and BatCoV RaTG13 were studied, including the spike protein (S), orf3, envelope protein (E), membrane protein (M), orf6, orf7a, orf7b, orf8, and nucleocapsid protein (N) (Table 2). The amino acid sequences were obtained from the National Center for Biotechnology Information (NCBI) site (https://www.ncbi.nlm.nih.gov/protein) (Table 1).
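Protein sequences downloaded from NCBI are commonly saved in FASTA format. As a minimal sketch of how such records could be read and their lengths checked against Table 1, the following assumes FASTA input; the function name `parse_fasta` and the toy records are hypothetical and are not part of the original study.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; keep the header without the leading ">".
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them piece by piece.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy input: identifiers and residues are placeholders, not real NCBI records.
example = """>toy_protein_A
MKTAYIAK
QRQISFVK
>toy_protein_B
MKTAYLAK
"""

records = parse_fasta(example)
lengths = {name: len(seq) for name, seq in records.items()}
```

Comparing `lengths` with the amino acid counts listed for each protein is a quick way to confirm that the intended records were retrieved before alignment.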

Table 1 The studied proteins of the three viruses

Sequence alignment

The online sequence alignment service of the European Molecular Biology Open Software Suite (EMBOSS) was used to determine the percentages of protein similarity and identity of SARS-CoV-2, SARS-CoV, and RaTG13. The scoring matrix of the sequence alignment was EBLOSUM62, and the gap open and gap extension penalties were 14 and 4, respectively. The sequence alignment service of EMBOSS can be accessed at https://www.bioinformatics.nl/cgi-bin/emboss/matcher. As a confirmatory test, a coloured alignment display was generated for each protein using the multiple sequence alignment service of the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), available at https://www.ebi.ac.uk/Tools/msa/clustalo/.
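To make the two reported quantities concrete, the sketch below computes identity and similarity percentages from a pair of already aligned sequences. It is not the EMBOSS implementation. It assumes that both percentages are taken over the full alignment length, gap columns included, and it uses a small illustrative set of conservative substitutions in place of the full EBLOSUM62 matrix, where a position would count as similar when its matrix score is positive. The names `SIMILAR_PAIRS` and `identity_similarity` are hypothetical.

```python
# Illustrative subset of conservative substitutions (an assumption for this
# sketch); a real analysis would test for a positive score in EBLOSUM62.
SIMILAR_PAIRS = {
    frozenset(pair)
    for pair in ("IL", "IV", "LV", "LM", "FY", "WY", "KR", "DE", "EQ", "ST", "AS", "DN")
}


def identity_similarity(aligned_a, aligned_b, similar_pairs=SIMILAR_PAIRS, gap="-"):
    """Return (identity %, similarity %) for two gapped, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    if length == 0:
        raise ValueError("alignment is empty")
    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns count toward the length only
        if a == b:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif frozenset((a, b)) in similar_pairs:
            similar += 1
    return 100.0 * identical / length, 100.0 * similar / length


# Toy alignment: 7 columns, 4 identical, 2 conservative changes, 1 gap.
identity, similarity = identity_similarity("MKV-LDE", "MRVALDQ")
```

Because similarity includes identical positions, it can never be lower than identity, which matches the pattern of the values reported in Table 2.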

Results and discussion

This study reports differences in the identity and similarity percentage of the proteins of SARS-CoV-2 versus SARS-CoV and of SARS-CoV-2 versus the bat coronavirus RaTG13. The differences suggest a bat origin over a SARS-CoV origin, and these differences were caused by different types of mutations including deletions, insertions and substitutions [Annex 1, Annex 2 in ESM, Figs. 1, 2, 3, 4, 5, 6, 7 and 8].
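The comparison can be summarized numerically. The sketch below collects the identity percentages reported for each protein in the sections of this article and computes, for each protein, how much higher the identity with RaTG13 is than the identity with SARS-CoV. The orf8 protein is left out because SARS-CoV carries two variants (orf8a and orf8b), so there is no single value to compare. The dictionary layout and variable names are illustrative assumptions.

```python
# Identity percentages for SARS-CoV-2 versus (RaTG13, SARS-CoV), as reported
# in the text of this article. orf8 is omitted (two SARS-CoV variants).
IDENTITY = {
    "1ab": (98.5, 86.2),
    "S": (97.4, 76.0),
    "orf3a": (97.8, 72.4),
    "E": (100.0, 94.7),
    "M": (99.5, 90.5),
    "orf6": (100.0, 68.9),
    "orf7a": (97.5, 85.2),
    "orf7b": (97.7, 85.4),
    "N": (99.0, 90.5),
}

# Difference in percentage points between the two comparisons.
gaps = {name: round(bat - sars, 1) for name, (bat, sars) in IDENTITY.items()}

# True when every protein is closer to RaTG13 than to SARS-CoV.
closer_to_bat = all(gap > 0 for gap in gaps.values())

# Proteins with the largest and smallest difference.
most_divergent = max(gaps, key=gaps.get)
least_divergent = min(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:6s} {gap:5.1f}")
```

Sorting by the difference highlights which proteins separate the two comparisons most strongly; it does not replace the alignments themselves.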

Fig. 1 Sequence alignment of the orf3a accessory protein of SARS-CoV-2, SARS-CoV and the bat coronavirus RaTG13

Fig. 2 Sequence alignment of the envelope structural proteins of SARS-CoV-2, SARS-CoV and the bat coronavirus RaTG13

Fig. 3 The membrane proteins of SARS-CoV-2, SARS-CoV and bat coronavirus RaTG13 coloured according to the sequence alignment

Fig. 4 Sequence alignment of the orf6 accessory protein of SARS-CoV-2, SARS-CoV and the bat coronavirus RaTG13

Fig. 5 The orf7a accessory protein of SARS-CoV-2, SARS-CoV and bat coronavirus RaTG13 and its alignment

Fig. 6 Sequence alignment of the orf7b accessory protein of SARS-CoV-2, SARS-CoV and the bat coronavirus RaTG13

Fig. 7 The orf8 accessory protein of SARS-CoV-2, SARS-CoV and bat coronavirus RaTG13 and its alignment

Fig. 8 The nucleocapsid structural protein of SARS-CoV-2, SARS-CoV and bat coronavirus RaTG13 and the coloured sequence alignment

The 1ab polyprotein

The 1ab polyprotein of SARS-CoV-2, SARS-CoV and BatCoV RaTG13 is composed of 7096, 7073, and 7095 amino acids, respectively (Table 1). The amino acid sequence identity and similarity of the 1ab polyprotein of SARS-CoV-2 and BatCoV RaTG13 were 98.5% and 99.1%, respectively. The percentages of identity and similarity of the 1ab polyprotein of SARS-CoV-2 and SARS-CoV were 86.2% and 92.9%, respectively. The results suggest that SARS-CoV-2 most likely originated from the coronavirus of the Rhinolophus affinis bat rather than from SARS-CoV (Table 2). Large-scale mutations were observed among the 1ab polyproteins of SARS-CoV-2, SARS-CoV, and the bat coronavirus RaTG13. However, consistent with the identity percentages, more differences were found between the 1ab polyproteins of SARS-CoV-2 and SARS-CoV than between those of SARS-CoV-2 and bat coronavirus RaTG13 [Annex 1].

Table 2 The percentages of identity and similarity of the SARS-CoV-2 proteins compared to those of SARS-CoV and RaTG13 (bat coronavirus)

After the 1ab polyprotein is produced, endopeptidases cleave it to yield the 1a polyprotein and 16 nonstructural proteins (Snijder et al. 2016). The cleavage products of the 1ab polyprotein carry out a wide range of activities associated with the replication of the virus. These activities include the binding and breakdown of ATP to produce ADP and phosphate, and the activities of different endopeptidases (such as those of nsp3 and nsp5) lead to the formation of the nonstructural proteins. Furthermore, ribose-5-phosphate is produced through exonuclease activity; new nucleotides are synthesized in association with methyltransferase, RNA polymerase and helicase functions, which support viral replication and prevent supertwisting; and transcription is regulated through zinc finger proteins (Snijder et al. 2016).

The spike protein

The spike protein of SARS-CoV-2 contains 1273 amino acids, while the spike protein of SARS-CoV contains 1255 amino acids and that of BatCoV RaTG13 contains 1269 amino acids (Table 1). The spike proteins of SARS-CoV-2 and SARS-CoV have an identity percentage of 76% and a similarity percentage of 86% (Table 2). The identity and similarity percentages of the spike proteins of SARS-CoV-2 and RaTG13 are 97.4% and 98.4%, respectively (Table 2) [Annex 2 in ESM]. Thus, the spike protein of SARS-CoV-2 is considerably closer to that of RaTG13 than to that of SARS-CoV.

The spike protein of coronaviruses consists of three polypeptide chains, each with two domains: S1 and S2. The S1 domain is critical for binding host cell receptors, and the S2 domain is critical for fusing the virus with the membrane of the host cell. There is a hinge region between S1 and S2 that is a target for host cell proteases (Li 2016; Bosch et al. 2003). The spike protein of SARS-CoV-2 has a furin cleavage site in the hinge region. The furin cleavage site is composed of four amino acids (681–684). The presence of the furin cleavage site may be critical for the high transmission rate of SARS-CoV-2 compared to other coronaviruses (Walls et al. 2020).
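Residue positions such as 681–684 are 1-based and inclusive, which differs from the 0-based slicing used in most programming languages. The sketch below shows one way to extract such a range and to scan a sequence for furin-like motifs. It assumes the minimal furin recognition pattern R-X-X-R, which is not stated in this article, and it uses a short toy sequence rather than the real spike protein; the function names are hypothetical.

```python
import re


def residues(seq, start, end):
    """Return residues start..end using 1-based, inclusive positions."""
    if start < 1 or end > len(seq) or start > end:
        raise ValueError("positions out of range")
    return seq[start - 1:end]


# Minimal furin-like pattern R-X-X-R (an assumption for this sketch).
# The lookahead allows overlapping candidate sites to be reported.
FURIN_LIKE = re.compile(r"(?=(R..R))")


def find_furin_like(seq):
    """Return (1-based start position, matched residues) for each candidate."""
    return [(m.start() + 1, m.group(1)) for m in FURIN_LIKE.finditer(seq)]


# Toy sequence, not the real spike protein.
toy = "ACDPRRARSVG"
segment = residues(toy, 4, 7)
candidates = find_furin_like(toy)
```

With a full-length spike sequence loaded into `seq`, `residues(seq, 681, 684)` would return the four residues referred to in the text, provided the numbering of the record matches the numbering used here.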

Orf3a

The accessory protein orf3a of SARS-CoV-2 contains 275 amino acids, and its gene (25393..26220) is located between the spike and E protein genes. The orf3a protein of SARS-CoV contains 274 amino acids, while the NS3 of BatCoV RaTG13 is composed of 275 amino acids (Table 1). The amino acid sequence alignment of orf3a of SARS-CoV-2 and SARS-CoV showed that the sequence identity was 72.4% and that the sequence similarity was 85.1%. The similarity percentage of orf3a in SARS-CoV-2 and SARS-CoV was therefore 85.1%, not 90.2% as reported by Yoshimoto (2020), which may be due to the different software programs used in the two studies (Yoshimoto 2020). Orf3a (SARS-CoV-2) and NS3 (BatCoV RaTG13) were characterized by 97.8% identity and 98.9% similarity (Table 2, Fig. 1).

Orf3a plays different roles in the virus, including (1) viral envelope assembly and (2) host cell binding and infusion by interacting with the structural proteins (M, S, and E) and the accessory protein (7a) of SARS-CoV (Brunn et al. 2007). In host organisms, the N-terminus of orf3a is highly immunogenic and is known to induce a strong protective humoral immune response (Zhong et al. 2006). Orf3a has a cysteine-rich domain that possesses potassium ion channel activity by interacting with the S and E proteins (Brunn et al. 2007; Zeng et al. 2004). The C-terminus of orf3a arrests the host cell cycle by depleting cyclin D3 and facilitates apoptosis of host cells by interacting with the M protein (Yuan et al. 2007; Marra et al. 2003; Law et al. 2005).

Envelope protein (E protein)

The E protein of SARS-CoV-2, SARS-CoV, and BatCoV RaTG13 consists of 75, 76, and 75 amino acids, respectively (Table 1). The percentages of identity and similarity of the E protein in SARS-CoV-2 and SARS-CoV are 94.7% and 96.1%, respectively; these percentages were 94.7% and 97.4% in Yoshimoto (2020) (Table 2, Fig. 2). The E proteins of SARS-CoV-2 and BatCoV RaTG13 show 100% identity and 100% similarity. The results strongly favour a bat origin of SARS-CoV-2 over a SARS-CoV origin. The E protein contains three domains, the C-terminus, N-terminus, and transmembrane, with different functions in the virus and in host cells (Schoeman and Fielding 2019).

The E protein plays different functions in viral replication and the interaction of the virus with host organisms and cells, such as assembly of the virion envelope, suppression of host cell stress responses, facilitation of viral replication and vitality, and as an ion channel to induce the release of virions from host cells (Nieto-Torres et al. 2011; Álvarez et al. 2010; Corse and Machamer 2003; Yuan et al. 2006; Ruch and Machamer 2012).

Membrane protein

The membrane protein (M) of SARS-CoV-2, SARS-CoV, and BatCoV RaTG13 is composed of 222, 221, and 221 amino acids, respectively (Table 1). The percentages of identity and similarity of the amino acid sequences of the M protein of SARS-CoV-2 and SARS-CoV are 90.5 and 96.4, respectively, while those of the M protein of SARS-CoV-2 and BatCoV RaTG13 are both 99.5% (Table 2, Fig. 3). The M protein has three domains, the N-terminus, C-terminus, and transmembrane, with different functions (Neuman et al. 2010).

The M protein is important for the assembly, transport, and release of the virus from host cell organelles (Ma et al. 2008; Siu et al. 2008). The M protein of SARS-CoV inhibits the transcription of interferon-1, which leads to the inhibition of the innate immunity of host organisms (Siu et al. 2009).

Orf6

The orf6 protein of SARS-CoV-2 and BatCoV RaTG13 contains 61 amino acids, while it contains 63 amino acids in SARS-CoV (Table 1). The orf6 protein of SARS-CoV-2 and SARS-CoV is characterized by an identity percentage of 68.9% and a similarity percentage of 88.5%. The percentages of orf6 protein identity and similarity in SARS-CoV-2 and BatCoV RaTG13 are both 100% (Table 2, Fig. 4).

The functions of orf6 include (1) participation in the formation of the replication/transcription complex to facilitate viral replication, (2) induction of an increase in the number of virions during infection, (3) contribution to virus evasion of the host immune system and (4) involvement in the formation of double-membrane vesicles (DMVs) in host cells to ensure virus assembly (Kumar et al. 2007; Narayanan et al. 2008; Gunalan et al. 2011).

Orf7a

Orf7a of SARS-CoV-2 and BatCoV RaTG13 contains 121 amino acids, while orf7a of SARS-CoV contains 122 amino acids (Table 1). The percentages of orf7a identity and similarity in SARS-CoV-2 and SARS-CoV are 85.2 and 90.2, respectively. The orf7a protein of SARS-CoV-2 and the orf7a protein of the BatCoV RaTG13 share an identity percentage of 97.5% and similarity percentage of 99.2% (Table 2, Fig. 5).

Orf7a of SARS-CoV is a transmembrane protein divided into four regions from the N-terminus: (1) the first 15 amino acids, which are cleaved off in infected host cells; (2) amino acids 16–96, which form the intracellular domain; (3) amino acids 97–117, which are collectively hydrophobic and form the transmembrane domain; and (4) the C-terminus, which consists of the last five amino acids (Liu et al. 2014).
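The four regions described above can be written as a simple lookup, which is convenient when a substitution seen in an alignment needs to be assigned to a region. The boundaries are taken from the text, with the last five residues of the 122-amino-acid SARS-CoV protein taken as positions 118–122; the region labels and the function name `region_of` are illustrative assumptions.

```python
# Region boundaries of SARS-CoV orf7a (1-based, inclusive), as given in the text.
ORF7A_REGIONS = [
    (1, 15, "N-terminal segment removed in infected cells"),
    (16, 96, "domain described in the text as intracellular"),
    (97, 117, "transmembrane domain"),
    (118, 122, "C-terminus"),
]


def region_of(position):
    """Return the label of the orf7a region that contains a residue position."""
    for start, end, label in ORF7A_REGIONS:
        if start <= position <= end:
            return label
    raise ValueError(f"position {position} is outside orf7a (1-122)")


# Consistency check: the regions should cover all 122 residues exactly once.
total_length = sum(end - start + 1 for start, end, _ in ORF7A_REGIONS)
example_region = region_of(100)
```

The same table could be shifted by one position for the 121-amino-acid orf7a of SARS-CoV-2 or RaTG13, but where the missing residue falls would have to be read from the alignment in Fig. 5.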

Orf7a plays a role in virus binding to and invasion of host cells by interacting with the S, M, E, and orf3a proteins (Narayanan et al. 2008; Tan et al. 2006). Orf7a does not contribute to the replication of the virus (Liu et al. 2014; Tan et al. 2006; Yount et al. 2005; Schaecher et al. 2007). Orf7a plays some roles in host cells, such as triggering apoptosis, downregulating protein synthesis, arresting the cell cycle at the G0/G1 phase, and activating cytokine production (Narayanan et al. 2008; Liu et al. 2014; Tan et al. 2006; Schaecher et al. 2007).

Orf7b

The orf7b protein in both SARS-CoV-2 and BatCoV RaTG13 contains 43 amino acids, while orf7b in SARS-CoV contains 44 amino acids (Table 1). The orf7b protein of SARS-CoV-2 and the orf7b protein of SARS-CoV are characterized by an identity percentage of 85.4 and a similarity percentage of 90.2. On the other hand, the identity and similarity percentages of the orf7b protein of SARS-CoV-2 and that of BatCoV RaTG13 are 97.7% each (Table 2, Fig. 6).

The orf7b protein contains three domains: an N-terminal domain (external), a C-terminal domain (in the cytoplasm), and a transmembrane hydrophobic domain (Liu et al. 2014).

It has been reported that orf7b is not involved in virus replication (Liu et al. 2014; Tan et al. 2006; Yount et al. 2005; Schaecher et al. 2007). The anti-orf7b antibody concentration is increased in SARS-CoV patients, which shows that orf7b is highly immunogenic and can be used in vaccination trials (Schaecher et al. 2007; Guo et al. 2004).

Orf8

The identity and similarity of orf8 in SARS-CoV-2 and orf8a in SARS-CoV are 38.9% and 77.8%, respectively. The orf8 protein of SARS-CoV-2 is 44.4% identical and 66.7% similar to orf8b of SARS-CoV (Fig. 10). The identity and similarity of SARS-CoV-2 and BatCoV RaTG13 orf8 are 95% and 95.9%, respectively (Table 2, Fig. 7). Notably, the orf8 protein of SARS-CoV-2 has 121 amino acids, compared to 39 and 84 amino acids in the orf8a and orf8b proteins of SARS-CoV, respectively, and 121 amino acids for orf8 of BatCoV RaTG13 (Table 1).

Orf8a and orf8b of SARS-CoV are not needed for viral replication. In host cells, they are localized in vesicle-like structures in the mitochondria, endoplasmic reticulum, cytosol and nucleus. Orf8a and orf8b of SARS-CoV stimulate cellular DNA synthesis and caspase-dependent apoptosis (Keng and Tan 2009).

Nucleocapsid protein (N protein)

The N protein of SARS-CoV-2 and BatCoV RaTG13 consists of 419 amino acids, while that of SARS-CoV consists of 422 amino acids (Table 1). The N protein of SARS-CoV-2 and the N protein of SARS-CoV are 90.5% identical and 94.3% similar, while those of SARS-CoV-2 and BatCoV RaTG13 show 99% identity and 99% similarity (Table 2, Fig. 8). The N protein is an RNA-binding protein with three domains: an N-terminal domain that binds RNA, a C-terminal domain critical for dimerization, and a disordered central region rich in serine and arginine (SR) (Kang et al. 2020).

The N protein is essential for the formation of helical viral RNA, induction of the replication and transcription of the virus, and control of host cell metabolism to ensure the viral replication process and to regulate host cell apoptosis and the cell cycle (Kang et al. 2020; Cong et al. 2020; Surjit et al. 2006). Moreover, the N protein is very immunogenic and induces the host immune system to respond against SARS-CoV (Lin et al. 2003).

Conclusion

The proteins of SARS-CoV-2 and BatCoV RaTG13 share higher identity and similarity than the proteins of SARS-CoV-2 and SARS-CoV. The findings of this study support the usefulness of protein identity and similarity percentages in investigating the origin of viruses.