Introduction

The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has unexpectedly affected global health since it emerged. As of 1 December 2020, more than 68 million cases had been confirmed in 216 countries, with more than 1.5 million deaths. These figures clearly show that a protective vaccine is urgently needed, although knowledge of this virus remains incomplete [1].

The SARS-CoV-2 genome, ~30 kb in size, encodes multiple structural proteins, namely the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins, as well as several non-structural ones. The S protein has pivotal roles in binding to the receptor, angiotensin-converting enzyme 2 (ACE2), and in membrane fusion. It is therefore widely investigated as an attractive antigen in vaccine designs that aim to induce antibodies blocking virus binding and fusion, thereby neutralizing infection. Because SARS-CoV is an RNA virus with error-prone genome replication, which facilitates escape from the host immune response, targeting the full-length S protein in vaccine studies has not brought protective immunity against SARS outbreaks [2,3,4,5]. Although the spike protein is a promising protective immunogen, optimizing antigen design is critical to achieving an optimal immune response. The S1 subunit includes the minimal receptor-binding domain (RBD, 318–510 aa), a conserved target for the induction of neutralizing antibodies [6,7,8,9]. This region could therefore be more practical than the full-length S protein.

The membrane glycoprotein (M) is the most abundant envelope protein; it drives coronavirus assembly by facilitating the sorting of viral components and their incorporation into virions [10,11,12]. Binding of M protein helps to stabilize nucleocapsids and accelerates the completion of viral assembly by stabilizing the N protein-RNA complex [13, 14]. The nucleocapsid protein (N), a multifunctional RNA-binding protein, is essential for viral RNA replication and transcription. It also has vital roles in packaging the viral RNA genome, regulating viral RNA synthesis during replication and transcription, and modulating the metabolism of infected cells. Some studies have demonstrated that N protein regulates host–pathogen interactions, including actin reorganization, cell cycle progression, and apoptosis. This protein is also considered highly immunogenic because it is abundantly expressed during infection [15,16,17].

In response to the critical demand, the development of safe and effective approaches against SARS-CoV-2 is under way, with some candidates in clinical evaluation worldwide [18,19,20]. Any approach that yields a vaccine could be highly valuable in possible outbreaks and probable seasonal re-emergence, which depends mainly on how long-term protection evolves. Progress on MERS-CoV and SARS-CoV-1 vaccines over recent years is a crucial key: given the genetic similarity of these viruses, it provides vital insight for SARS-CoV-2 vaccine development [21,22,23,24,25].

Therefore, multiple platforms have been under development since the virus emerged, including DNA- and RNA-based platforms and recombinant subunit vaccines. Nevertheless, SARS-CoV-2 vaccine development poses challenges even with novel platforms. For instance, preclinical studies of SARS and MERS vaccine candidates have raised concerns about exacerbated lung disease as an outcome of antibody-dependent enhancement or direct impact. Hence, testing in a suitable animal model and rigorous safety monitoring in clinical trials will be critical [26, 27].

Traditional vaccination approaches based on laboratory experiments cannot meet the urgent needs of an outbreak situation, and many therapeutic agents are being investigated [28,29,30,31]. Bioinformatics is a powerful tool for sorting, organizing, and processing the large amounts of available data generated by other experiments, providing a large-scale immunological platform within a limited time. Because the virus genome and its protein sequences are available, the presented epitopes and the virus characteristics can be predicted by in silico analysis, which significantly accelerates vaccine development [32,33,34,35,36].

In this study, we aimed to predict B-cell and T-cell epitopes of the SARS-CoV-2 Spike receptor-binding domain (RBD), M, and N proteins as fusion proteins and to compare their in silico immunogenicity using bioinformatics methods, in order to provide a subunit vaccine candidate against COVID-19.

Materials and Methods

Sequence Retrieval

Viral amino acid sequences of the SARS-CoV-2 Spike (S), Membrane (M), and Nucleocapsid (N) proteins (accession numbers S: YP_009724390.1, QIX12195.1, QJD47706.1, QJD47860.1, QJD25757.1, QIU78767.1, QIX12148.2, QIU80900.1, BCB97891.1; M: YP_009724393.1, QJD47709.1, QJD47863.1, QJD25760.1, QIU78770.1, QIX12151.1, QIX12198.1, QIU80903.1, BCB97894.1; and N: YP_009724397.2, QIU78775.1, QIX12156.1, QIX12203.1, MT186677.1, QIU80910.1, MT186677.1, BCB97898.1) were obtained from GenBank [37]. The overall workflow is shown in Fig. 1.

Fig. 1 Schematic view of the applied methods in the study

T-Cell Epitope and Antigenicity Prediction

The obtained sequences were submitted to the MHCI- and MHCII-binding prediction tools in IEDB (http://tools.iedb.org/mhc/n) using different methods, including Artificial Neural Network (ANN), Stabilized Matrix Method (SMM), and Scoring Matrices derived from Combinatorial Peptide Libraries (Comblib_Sidney2008). The MHC-NP and NetCTLpan 1.1 servers [38,39,40] and the RankPEP server were also applied. The outcomes from all applied tools were in a similar range; therefore, only the IEDB outputs are reported here.

T-cell epitope lengths were set to 9-mer for MHC class I and 15-mer for MHC class II, for BALB/c and human separately. The BALB/c MHC class I alleles were H2-Dd, H2-Kd, and H2-Ld, and the MHC class II alleles were H2-IAd and H2-IEd. Given the diversity of antigens and the extent of their recognition by variable HLA molecules in a population, and considering the most frequent HLA alleles in the Persian population according to the available reports [41,42,43], HLA-A*01, 02, 03, 11, 24, 26, 32 and HLA-B*35, 51, 50, 27, 57 were selected for MHC class I, and HLA-DRB1*15, 11, 13, 03, 04, 07 for MHC class II. Peptides predicted to bind MHC class I and II molecules with a percentile rank ≤ 1 were considered epitopic sequences.
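The peptide-length and cutoff rules described above can be sketched in a few lines. This is a minimal illustration, not the IEDB implementation: `kmers` and `select_binders` are hypothetical helper names, the toy sequence is arbitrary, and the percentile ranks are placeholder values standing in for real tool output.

```python
from typing import Iterator


def kmers(sequence: str, k: int) -> Iterator[tuple[int, str]]:
    """Yield (1-based start position, peptide) for every window of length k."""
    for i in range(len(sequence) - k + 1):
        yield i + 1, sequence[i:i + k]


def select_binders(predictions: dict[str, float], max_rank: float = 1.0) -> list[str]:
    """Keep peptides whose percentile rank is at or below the cutoff.

    A lower percentile rank means a stronger predicted binder.
    """
    return sorted(p for p, rank in predictions.items() if rank <= max_rank)


# Toy sequence (arbitrary); 9-mers for MHC class I, 15-mers for MHC class II.
toy = "MKTFPPTEPKKDKKKKADETQ"
class_i_peptides = [pep for _, pep in kmers(toy, 9)]
class_ii_peptides = [pep for _, pep in kmers(toy, 15)]

# Placeholder percentile ranks, NOT real predictions.
ranks = {"KTFPPTEPK": 0.4, "TFPPTEPKK": 3.2, "FPPTEPKKD": 1.0}
binders = select_binders(ranks)
```

The cutoff is inclusive (≤ 1), matching the criterion stated in the text.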

The VaxiJen v2.0 online antigen prediction tool was applied to assess the antigenicity scores of the predicted epitopes [44, 45]; it classifies antigens according to the physicochemical properties of the protein without using sequence alignment. Epitopes with an antigenic score > 0.5 were considered antigenic.

Toxicity Analysis

We investigated the selected Model 4 for toxicity using ToxinPred [46]. This tool assesses whether epitopes are non-toxic to the host based on their physicochemical parameters.

Population Coverage and Epitope Conservancy

MHC I and MHC II potential binders from the selected fusion form (Model 4) were analyzed for population coverage against the world population, and the Persian population in particular, with the selected human MHC I and MHC II alleles using the IEDB population coverage calculation tool. The population coverage calculation is based on the total HLA hits score obtained from IEDB. These data are derived from the relative frequency of an allele at a particular locus in a population (sequence identity threshold ≥ 100). In addition, we assessed the conservancy level of each potential epitope by searching for identical matches in 10 amino acid sequences of S protein, 12 amino acid sequences of M protein, and 12 amino acid sequences of N protein from different geographical areas retrieved from the database.
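To make the two quantities in this section concrete, the sketch below computes a simplified population coverage and an exact-match conservancy. It is an illustration under stated assumptions, not the IEDB algorithm: it assumes Hardy-Weinberg proportions within a locus and independence between loci, the function names are hypothetical, and the allele frequencies are placeholder values rather than Persian-population data.

```python
def locus_coverage(allele_freqs: dict[str, float], covered: set[str]) -> float:
    """Fraction of individuals carrying at least one covered allele at a locus.

    Assumes Hardy-Weinberg proportions: a diploid individual lacks every
    covered allele with probability (1 - p)^2, where p is the summed
    frequency of the covered alleles.
    """
    p = sum(f for allele, f in allele_freqs.items() if allele in covered)
    return 1.0 - (1.0 - p) ** 2


def combined_coverage(per_locus: list[float]) -> float:
    """Combine per-locus coverages, assuming the loci are independent."""
    missed = 1.0
    for c in per_locus:
        missed *= 1.0 - c
    return 1.0 - missed


def conservancy(epitope: str, sequences: list[str]) -> float:
    """Fraction of sequences containing the epitope as an exact match."""
    if not sequences:
        raise ValueError("no sequences supplied")
    return sum(epitope in s for s in sequences) / len(sequences)


# Placeholder allele frequencies at one locus (NOT measured data).
freqs = {"HLA-A*02": 0.2, "HLA-A*03": 0.1, "HLA-A*68": 0.05}
cov_a = locus_coverage(freqs, {"HLA-A*02", "HLA-A*03"})
cov_total = combined_coverage([cov_a, 0.5])
cons = conservancy("KTFPPTEPK", ["AAKTFPPTEPKAA", "AAKTFPPTEPRAA"])
```

An epitope with `conservancy(...) == 1.0` is present unchanged in every supplied sequence, which corresponds to full conservation at 100% identity.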

B-Cell Epitope Prediction

The BepiPred linear epitope prediction server [47] from the Immune Epitope Database was applied to predict linear B-cell epitopes with a threshold of 0.35; epitope length ranged from six residues upward.
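The thresholding step can be illustrated as follows. This sketch assumes that per-residue scores are already available (for example, exported from the prediction server) and that a residue counts as epitopic when its score is at or above the threshold; whether the server treats the boundary as inclusive is an assumption here, and `linear_epitopes` is a hypothetical helper name. The scores in the example are invented.

```python
def linear_epitopes(sequence, scores, threshold=0.35, min_length=6):
    """Return (start, end, peptide) for each run of residues scoring >= threshold.

    Positions are 1-based and inclusive; runs shorter than min_length are dropped.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    epitopes = []
    start = None
    # A sentinel below any threshold closes a run that reaches the C-terminus.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                epitopes.append((start + 1, i, sequence[start:i]))
            start = None
    return epitopes


# Invented scores for a toy sequence.
toy_seq = "ACDEFGHIKLMNPQ"
toy_scores = [0.1, 0.4, 0.5, 0.6, 0.4, 0.36, 0.35, 0.2, 0.5, 0.5, 0.5, 0.1, 0.9, 0.9]
found = linear_epitopes(toy_seq, toy_scores)
```

In the toy input, only the six-residue run passes; the three-residue and two-residue runs are discarded by the length filter.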

Other physicochemical properties of amino acids, namely antigenicity (Kolaskar and Tongaonkar) [48], surface accessibility [49], flexibility (Karplus and Schulz) [50], hydrophilicity [51], and beta-turns (Chou and Fasman) [52], were also assessed with the tools available at the Immune Epitope Database (IEDB) Analysis Resource (http://tools.iedb.org/bcell). The sequence scanning window length for all methods was set to seven residues. We applied the ElliPro online tool at IEDB [53] for discontinuous B-cell epitope prediction, with the minimum score value set at 0.50. This method predicts epitopes by considering both sequence- and structure-based information.
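The seven-residue scanning window can be pictured as a moving average over a per-residue propensity scale. The sketch below is generic and hedged: `window_profile` is a hypothetical helper, the score is assigned to the central residue of each window (an assumption about the convention), and the two-letter scale in the example is a toy stand-in, not one of the published scales cited above.

```python
def window_profile(sequence, scale, window=7):
    """Average a per-residue scale over a sliding window.

    Returns (1-based position of the window's central residue, mean score)
    for every full window along the sequence.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    half = window // 2
    profile = []
    for i in range(len(sequence) - window + 1):
        segment = sequence[i:i + window]
        mean = sum(scale[aa] for aa in segment) / window
        profile.append((i + half + 1, mean))
    return profile


# Toy scale with only two residue types (NOT a published propensity scale).
toy_scale = {"A": 1.0, "K": 3.0}
profile = window_profile("AAAKAAAK", toy_scale)
```

Residues whose windowed score exceeds a method-specific threshold would then be flagged as candidate epitope positions.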

Structural Analysis

Physicochemical properties of the fragments, including molecular weight, aliphatic index, grand average of hydropathicity (GRAVY), theoretical pI, and atomic composition, were analyzed using Expasy’s ProtParam server [54]. The self-optimized prediction method with alignment (SOPMA) and Jpred tools were applied to predict and evaluate the secondary structure, including the α-helices, β-sheets, and random coils of the proteins [55, 56].
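Three of these quantities are simple enough to reproduce by hand. The sketch below is an independent illustration rather than the ProtParam code: GRAVY is taken as the mean Kyte-Doolittle hydropathy per residue, the aliphatic index uses the usual mole-percent formula with coefficients 2.9 and 3.9, and the function names are hypothetical. It assumes sequences contain only the 20 standard residues.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index from mole percentages of Ala, Val, Ile, and Leu."""
    n = len(sequence)

    def mole_percent(aa: str) -> float:
        return 100.0 * sequence.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_residue_counts(sequence: str) -> tuple[int, int]:
    """Return (Arg + Lys, Asp + Glu) counts."""
    positive = sequence.count("R") + sequence.count("K")
    negative = sequence.count("D") + sequence.count("E")
    return positive, negative
```

A negative GRAVY value indicates a protein that is hydrophilic on average, which is how the value is interpreted later in the text.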

Homology Modeling and Validation

The 3D models were generated using the Iterative Threading ASSEmbly Refinement (I-TASSER) online server [57] and the IntFOLD Integrated Protein Structure and Function Prediction Server [58], which provide 3D models along with a confidence score (C-score) and a model quality score. The models were further evaluated with three indicators: stereochemical quality, C-score, and DFIRE2 energy profile [59]. The stereochemical quality of the 3D model was assessed with PROCHECK, ERRAT, and VERIFY 3D through the Structural Analysis and Verification Server [60,61,62].

Results

The amino acid sequences of chain B of the SARS-CoV-2 Spike receptor-binding domain (RBD) and of the Spike, Membrane, and Nucleocapsid proteins were obtained, and four fusion forms, shown in Fig. 2, were designed to be compared in terms of immunogenicity. A proteasomal linker (AAY) was used to fuse the proteins.
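Assembling such a construct amounts to joining the component sequences with the linker. The following sketch uses hypothetical helper names and short placeholder strings in place of the real protein sequences; the residue range passed to `truncate` is likewise only an example.

```python
def truncate(sequence: str, start: int, end: int) -> str:
    """Return residues start..end of a sequence (1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("residue range outside the sequence")
    return sequence[start - 1:end]


def build_fusion(parts: list[str], linker: str = "AAY") -> str:
    """Join protein fragments in order, inserting the linker between them."""
    if not parts or any(not p for p in parts):
        raise ValueError("every fragment must be a non-empty sequence")
    return linker.join(parts)


# Placeholder fragments (NOT the real S, M, or N sequences).
fragment_s = truncate("MFVFLVLLPLVSSQ", 3, 8)
fusion = build_fusion([fragment_s, "MADSNGT", "MSDNGPQ"])
```

The linker appears only between fragments, so a construct of three parts contains exactly two AAY spacers.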

Fig. 2 Schematic view of predicted constructs with the flexible spacer (AAY)

MHC Class I and II Binding Prediction in BALB/c

We used 9-mer and 15-mer T-cell epitope lengths to design a vaccine model. The Spike, M, and N proteins were submitted to the IEDB MHC I and MHC II binding prediction tools. The IEDB-recommended method and the RANKPEP, NetCTLpan 1.1, MHC-NP, and NetMHCpan 3.0 servers were used to predict the epitopes from the selected proteins. High-affinity peptides (percentile rank ≤ 1) with antigenic features are listed in Tables 1 and 2.

Table 1 BALB/c MHC class I epitopes in predicted models
Table 2 BALB/c MHC class II epitopes in predicted models

According to the generated data, the predicted models contain clearly more MHCI than MHCII epitopes, suggesting that the designed models could elicit cellular immune responses in a mouse model. Moreover, among the four models, the last one, which is composed of truncated Spike, full M, and full N proteins, includes 14 MHCI epitopes with high antigenic scores, more than the other models. For MHCII binders, in contrast, only Models 2 and 4 contain epitopic peptides with a high antigenicity score (> 0.5).

Human T-Cell Epitope Prediction

In the mouse T-cell epitope analysis, Models 2 and 4 had more antigenic epitopes. We therefore extended the T-cell epitope prediction to the most prevalent human HLA I and HLA II alleles. The results are summarized in Tables 3 and 4. Models 2 and 4 contain at least 166 and 300 epitopes, respectively, of which we report only the highly antigenic ones here. We therefore continued the human epitope prediction study with truncated Spike + full M + full N (Model 4) as the best model. This fusion form also has 42 HLA class II epitopes (percentile rank < 1), of which 29 binders were assessed as antigenic and are shown in Table 4.

Table 3 Human class I epitopes in predicted model 4
Table 4 Human MHC class II epitopes in predicted Model 4

Toxicity Analysis

At the final step, Model 4 was tested for toxicity using the ToxinPred tool; the results are shown in Tables 3 and 4.

Population Coverage and Conservancy Analysis

Peptides predicted to interact with MHC I and II molecules in the selected Model 4 were analyzed with the IEDB population coverage tool to assess coverage of most individuals, specifically in the Persian population. Furthermore, we selected North America, Southwest Asia, South Asia, Europe, South America, and Africa. The total population coverage in the Persian and other populations is listed in Table 5. The selected Model 4 has an acceptable coverage of 82.95% for MHC class I and II in the Persian population. To identify the conservancy of the predicted peptides of Model 4, we used the IEDB tool: all peptides with an antigenic score > 0.5 were submitted against the related S, N, and M sequences at a high threshold. All epitopes were found to be fully conserved (100%).

Table 5 Predicted epitopes of Model 4 interacting with combined human MHC class I and II alleles among different populations worldwide

B-Cell Epitopes Recognition

The four predicted models were assessed with the BepiPred server, and the antigenicity of the predicted epitopes was evaluated with VaxiJen. The amino acid sequences, peptide lengths, and positions of these epitopes are shown in Table 6.

Among the predicted models, Model 2 (RBD + M + N) and Model 4 (Truncated Spike + M + N) have a higher number of B-cell epitopes than the other models, in agreement with the T-cell prediction. Moreover, Model 4 includes 14 antigenic B-cell epitopes, suggesting that it has the highest potency for eliciting a humoral response.

Surface accessibility, flexibility, hydrophilicity, and antigenicity are essential B-cell antigenic indexes in vaccine design. The selected Model 4 was assessed with the different prediction methods available alongside the BepiPred Sequential B-Cell Epitope Predictor, as shown in Fig. 3.

Fig. 3 Graphical representation of B-cell epitope prediction by a Parker hydrophilicity prediction (threshold: 1.474), b Emini surface accessibility prediction (threshold: 1.000), c Karplus and Schulz flexibility prediction (threshold: 0.999), d Chou and Fasman beta-turn prediction (threshold: 1.004), and e Kolaskar and Tongaonkar antigenicity (threshold: 1.0). The yellow regions above the threshold (red line) are predicted to be part of a B-cell epitope, whereas the green regions are not (Color figure online)

ElliPro was used to find conformational B-cell epitopes in the 3D structure. ElliPro predicted six discontinuous epitopes for Model 1, with a maximum score of 0.942 and a minimum score of 0.542; eight epitopes for Model 2, with a maximum score of 0.802 and a minimum score of 0.502; and nine epitopes for Model 3, with a maximum score of 0.816 and a minimum score of 0.55 (data not shown). For the chosen Model 4, ElliPro predicted a total of 61 discontinuous epitopes, with a maximum score of 0.994. Epitopes with scores greater than 0.8 were selected (Table 7).

Table 6 B-cell linear epitopes for selected Model 4

Primary and Secondary Structure Analysis

Physicochemical characterization of the selected Model 4 fusion protein was performed with Expasy’s ProtParam server, based on the estimated molecular weight, theoretical isoelectric point, and average hydropathicity, which indicate the solubility and hydrophobicity of the protein. The Model 4 fusion protein comprises 1602 amino acids, with a molecular weight of 176.443 kDa, a pI of 8.77, 157 positively charged residues (Arg + Lys), and 134 negatively charged residues (Asp + Glu). The model is also predicted to be soluble and hydrophilic (grand average of hydropathicity (GRAVY): − 0.234).

The SOPMA tool was used to predict the secondary structure features of Model 4, including alpha helices, beta turns, random coils, and C-score. Higher proportions of random coils and extended strands are correlated with enhanced formation of antigenic epitopes. The model is composed of 31.27% α-helix and 4.99% β-sheet, alongside 42.88% random coil, which gives it the potential to form more antigenic epitopes (Fig. 4a).

Fig. 4 Sequence and structural analysis of Model 4. a Secondary structure by SOPMA tool, b three-dimensional structure by PyMOL, and c Ramachandran plot generated to validate the modeled 3D structure of the Model 4 protein, which indicates that 91.7% of residues are in the favored region

Homology Modeling Prediction and Validation

The three-dimensional structure of Model 4 was predicted using the IntFOLD Integrated Protein Structure and Function Prediction Server, which generates the five top models with a global model quality score. The model with the highest global model quality score represents the best model. For the selected Model 4, this value was in the acceptable range. Chimera version v1.2 was applied to generate the protein image [62] (Fig. 4b). Moreover, the Ramachandran plot generated by PROCHECK described the amino acid positions as well as the overall quality of the protein model. The plot showed that 91.7% of amino acids were in the most favored core regions, 7% in allowed regions, 1.1% in generously allowed regions, and 0.3% in disallowed regions (Fig. 4c). The Z-score for the 3D structure of the model was − 5.62.

Discussion

Apart from the human coronaviruses that circulate continuously in the human population, viruses originating from animals have been shown to become lethal pathogens by crossing species barriers. Effective preventive approaches are urgently needed in the current situation. Potent epitope-based vaccines predicted by bioinformatic analysis make vaccine design straightforward and fast compared with traditional vaccine approaches, and this strategy has recently been used in COVID-19 vaccine design [35, 63,64,65].

In this study, we evaluated four possible fusion forms of structural SARS-CoV-2 proteins in order to identify the most immunogenic protein, one that could elicit both humoral and cellular immune responses. The amino acid sequences were used to predict the probable antigenic T-cell epitopes and the linear and conformational B-cell epitopes.

Among the four analyzed models, we found that Model 4, composed of truncated Spike and the full-length M and N proteins (S: 528–1293 + AAY + M + AAY + N), is the most immunogenic fusion form. The evaluation of murine T-cell epitopes showed that it contains 14 MHCI binders, all of which are antigenic, and 10 MHCII peptides, of which 5 are antigenic epitopes. The human analysis yielded 24 highly immunogenic human MHC class I epitopes and 29 human MHC class II epitopes. Moreover, four epitopes, KTFPPTEPK, VTYVPAQEK, KAYNVTQAF, and KMKDLSPRW, can bind to different HLAs.

B-cell evaluation also showed that this model contains 14 linear and 61 discontinuous B-cell epitopes, with a maximum score of 0.994. The in silico comparative analysis therefore predicted that this model has a high potency for inducing both arms of the immune response. Structural analysis revealed that the selected model is a 176.443 kDa protein composed of 1602 amino acids, with 42.88% random coil. In addition, the Ramachandran plot showed that 91.7% of amino acids were in the most favored core regions. The predicted model is non-toxic and has a high rate of population coverage, especially in Iran and Europe.

In a study by Joshi et al., multiple SARS-CoV-2 proteins were assessed by in silico methods. The results showed that two epitopes, ITLCFTLKR and VYQLRARSV, are highly practical after docking and molecular dynamics simulation. These two epitopes were also subjected to population coverage and toxicity analysis [66]. In our study, KTFPPTEPK from N protein was found to have a high potential to associate with two frequent alleles, HLA-A*03:01 and HLA-A*11:01. It is also part of the B-cell epitope KTFPPTEPKKDKKKKADETQALPQRQKKQQ, which had a high score in the prediction method (Table 7).

Table 7 Discontinuous B-cell epitopes predicted by Ellipro for model 4

Chen et al. conducted another in silico analysis [67]. They predicted 63 sequential B-cell epitopes of the spike protein. They also showed that four Spike peptides, S 315–324, S 333–338, S 648–663, and S 1064–1079, are highly antigenic with optimal surface accessibility. In our study, one of the discontinuous B-cell predictions includes 38 residues (residues: L900, G901, F902, I903, A904, G905, L906, I907, A908, I909, V910, M911, V912, T913, I914, M915, L916, C917, C918, M919, T920, S921, C922, C923, S924, C925, L926, K927, G928, C929, C930, S931, C932, G933, S934, C935, C936, K937) with a high score of 0.886. Chen et al. also assessed HLA-binding peptides of the nucleocapsid protein, which yielded 81 and 64 peptides able to bind to MHC class I and MHC class II molecules, respectively. Fewer HLA I and HLA II binders were predicted in our study because we considered only the antigenic peptides at the high threshold (Tables 3 and 4).

Another bioinformatics-based assessment aimed at a SARS-CoV-2 vaccine, by Sahoo et al., focused on T-cell epitopes of similar targets, including S, M, and N [68]. Their study identified 36 potential T-cell epitopes that interact with MHC-I alleles and 25 T-cell epitopes that interact with MHC-II alleles. Among the predicted peptides, IGYYRRATR and YYRRATRRI from N protein and FRLFARTRS, FIASFRLFA, and FARTRSMWS from M protein are predicted to interact with human alleles. These peptides are also predicted to be BALB/c MHCII binders in our study (Table 2). FVLAAVYRI from M protein is predicted to interact with 31 HLA II and 3 HLA I alleles. In our study, this peptide is also part of seven predicted HLA II epitopes (Table 4).

Immunoinformatics approaches have therefore already been used for the identification of possible epitopes of the novel human coronavirus, SARS-CoV-2. The outbreak of infection caused by this virus has brought great obstacles and challenges to public health. Thus, fast identification of immune epitopes and possible viral immunogenic products would be a superior way to screen candidates for vaccine development compared with other approaches during the pandemic.

Conclusion

This study predicted possible fusion forms of SARS-CoV-2 structural proteins, which could be potential targets of neutralizing antibodies. The in silico evaluation of different fusion models was effective in selecting the best fused model of the S, M, and N proteins. Truncated Spike + M + N contains 24 highly immunogenic human MHC class I epitopes and 29 MHC class II epitopes, with 82.95% population coverage in Iran, along with 14 linear and 61 discontinuous B-cell epitopes.

The selected recombinant protein could elicit strong immune responses and will be evaluated in vitro and in vivo in the next step.