Introduction

Together with the sequencing of the first genome and the development of efficient bioinformatics tools, the vaccine development concept has greatly evolved and, in the last two decades, entered the so-called reverse vaccinology (RV) era. RV is an in silico approach that takes advantage of the many genome sequences available in the public domain. It involves mining these sequences with bioinformatics tools to predict which proteins among thousands are worthy of further laboratory investigation, based on characteristics such as prevalence, cellular localization, role in pathogenesis, or immunogenicity. The first studies using this approach led to the identification of vaccine candidates for several microorganisms such as serogroup B Neisseria meningitidis [1], Streptococcus pneumoniae [2], and Chlamydia pneumoniae [3]. Despite its great success, RV has shown some limitations, especially for microorganisms with high genetic heterogeneity. For example, the analysis of group B streptococcus (GBS) has shown that the genome is divided into a “core genome,” which contains the genes shared by all strains, and a “dispensable genome,” which comprises genes absent in at least one strain and probably reflects the adaptation of a strain to a specific environment [4]. This finding led to the conclusion that RV applied to the genome of only one strain would probably not be enough to identify candidates suitable for the development of a universal vaccine. It opened the door to the second phase of RV, comparative RV, in which the objective is to screen the genomes of multiple strains to identify conserved proteins [5, 6]. Proteins with signature sequence motifs commonly found in known secreted or surface-exposed bacterial proteins are considered ideal, as they have the highest chance of raising an effective immune response. As a consequence, proteins with known cytoplasmic functions are discarded from most RV studies.
However, there is a body of evidence that several typically cytoplasmic proteins can appear on the surface of bacteria and play a role in pathogenesis, for example in adhesion, plasminogen binding, and modulation of the host immune response [7, 8]. The term “moonlighting protein” is now widely used to describe these proteins, whose sequences contain neither known motifs for surface anchoring nor identified secretion signals but which appear on the bacterial surface and take on additional activities. Interestingly, some of them have already been shown to be highly immunogenic and even protective in mice in several disease models [9–11].

In this work, we applied an RV strategy based on conservation, virulence, and nonclassical surface exposure criteria to two microorganisms with a significant degree of genomic plasticity among isolates, which imposes a major limitation on the production of a multistrain component vaccine.

S. pneumoniae is a multiserotype, Gram-positive bacterium that causes invasive infections such as meningitis and bacteremia, as well as many mucosal diseases such as pneumonia, sinusitis, and otitis; the burden of disease that it causes is among the greatest in the world [12]. The current vaccine is a polysaccharide conjugate vaccine that protects only against the serotypes incorporated in its formulation, out of more than 90 existing serotypes. Its effectiveness in preventing invasive pneumococcal disease (IPD) in infants and community-acquired pneumonia (CAP) has been clearly observed. However, it has been shown that the vaccine exerts a selective pressure that favors the serotypes not included in the vaccine [13]. This phenomenon of serotype replacement created a need for new-generation recombinant protein vaccines, especially ones based on virulence factors conserved between strains [14].

Leptospira spp. comprise a heterogeneous group of pathogenic and saprophytic species belonging to the order Spirochaetes. The pathogenic species cause leptospirosis, which presents with fever, chills, headache, and severe myalgia in the early phase of the disease. Progression to multiorgan system complications occurs in 5 to 15 % of cases, with mortality rates of 5 to 40 % [15]. Leptospirosis has a global distribution with a higher incidence in the tropics and subtropics, ranging from 10 to 100 human cases per 100,000 individuals. It also has a great economic impact on the agricultural industry, since the disease affects livestock, inducing abortions, infertility, reduced milk production, and death [16]. Currently available veterinary vaccines, based on inactivated whole bacteria, are of low efficacy and do not confer cross-protective immunity against the large number of pathogenic serovars (>200). A vaccine licensed for human use is still required [17].

In our current study, we identified 37 proteins conserved between 16 genomes of S. pneumoniae and 12 proteins conserved between 5 leptospiral genomes using an in silico analysis. Among them, 19 and 7 proteins, respectively, are potential nonclassically surface-exposed proteins and could represent new antigens for vaccine development.

Methodology

The sequence of steps applied in this work is represented in the following workflow.

Sequence Retrieval

The complete genomes of 16 S. pneumoniae strains representative of 11 different serotypes and 5 Leptospira strains were downloaded from NCBI. The proteomes of all strains used in the analyses were obtained by retrieving the coding sequence (CDS) protein sequences from the genome GenBank files. Tables 1 and 2 show the accession number, the strain, the serotype/serovar, and the number of protein sequences retrieved for each one.

Table 1 Accession numbers and characteristics of the 16 S. pneumoniae genomes
Table 2 Accession numbers and characteristics of the five Leptospira genomes used (split by chromosome)

Subcellular Localization Prediction

PSORTb v.3.0 is the most commonly used software to predict the localization of a protein in prokaryotes. It uses a combination of six analytical modules, each of which analyzes one biological feature known to influence the subcellular localization of a protein, such as the signal peptide for export, transmembrane alpha helices, motifs characteristic of specific functions, and others [18]. Each module returns a likelihood score (between 1 and 10) of a protein being at a specific localization, and a cutoff of 7.5 is considered reliable to assign a single localization. If the score is below this threshold, the protein is considered to have an unknown localization or multiple localizations. In this study, the software was run with “bacteria” selected as the “organism,” and with “Gram positive” selected in the “Gram stain” option for S. pneumoniae and “Gram negative” for Leptospira. The localization was predicted for all CDS proteins from the complete genomes described above. The output for each genome was separated into five files according to the localization.
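The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration of the 7.5 threshold, not PSORTb itself: the function names, the dictionary layout, and the example scores are hypothetical, and the real PSORTb output format is not reproduced here.

```python
# Minimal sketch of the 7.5 cutoff rule; names and data are hypothetical.
CUTOFF = 7.5  # threshold stated in the text


def assign_localization(scores, cutoff=CUTOFF):
    """Return the best-scoring site if it reaches the cutoff, else 'Unknown'."""
    site, best = max(scores.items(), key=lambda item: item[1])
    return site if best >= cutoff else "Unknown"


def split_by_localization(predictions):
    """Group protein identifiers by their assigned localization."""
    groups = {}
    for protein_id, scores in predictions.items():
        groups.setdefault(assign_localization(scores), []).append(protein_id)
    return groups


# Made-up scores for two proteins (not taken from the study).
example = {
    "prot_A": {"Cytoplasmic": 9.7, "CytoplasmicMembrane": 0.1,
               "Cellwall": 0.1, "Extracellular": 0.1},
    "prot_B": {"Cytoplasmic": 2.5, "CytoplasmicMembrane": 2.5,
               "Cellwall": 2.5, "Extracellular": 2.5},
}
groups = split_by_localization(example)
```

Here `prot_A` is assigned to a single site, while `prot_B`, whose best score stays below the cutoff, falls into the "Unknown" group that is later passed on for nonclassical secretion analysis.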

Search for the Most Conserved Proteins Between Strains

OrthoMCL is an algorithm that creates groups of homologous proteins based on their sequence similarity [19]. It relies on the reciprocal best BLAST hits identified by pairwise alignments of proteomes. We decided to use OrthoMCL since it provides a good method for the identification of true orthologs among all genomes of the same genus (groups of 16 proteins for Streptococcus and 5 proteins for Leptospira). To ensure that only highly conserved sequences were selected, we added a BLAST filter that retained only alignments with ≥98 % similarity. The protein groups identified in this way are proposed to be the most conserved among strains and therefore to be vaccine candidates with a broader effect, that is, potentially effective against all strains simultaneously.
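The selection of fully conserved groups can be illustrated with the following Python sketch. It does not call OrthoMCL or BLAST; it assumes the ortholog groups and pairwise similarities have already been computed, and it assumes that "conserved" means one protein per genome with every pair at or above the 98 % threshold. The function name, data layout, and values are hypothetical.

```python
from itertools import combinations


def core_groups(groups, similarity, n_genomes, min_similarity=98.0):
    """Keep groups with exactly one protein per genome and all pairs >= threshold.

    groups: {group_id: [(genome_id, protein_id), ...]}
    similarity: {frozenset({protein_a, protein_b}): percent similarity}
    """
    kept = {}
    for group_id, members in groups.items():
        genomes = {genome for genome, _ in members}
        # Require one representative from each genome and no paralogs.
        if len(members) != n_genomes or len(genomes) != n_genomes:
            continue
        proteins = [protein for _, protein in members]
        # Missing pairs are treated as 0 % similar (an assumption of this sketch).
        if all(similarity.get(frozenset(pair), 0.0) >= min_similarity
               for pair in combinations(proteins, 2)):
            kept[group_id] = proteins
    return kept


# Made-up example with three genomes (s1, s2, s3).
groups = {
    "g1": [("s1", "a1"), ("s2", "a2"), ("s3", "a3")],
    "g2": [("s1", "b1"), ("s2", "b2")],                # absent from s3
    "g3": [("s1", "c1"), ("s2", "c2"), ("s3", "c3")],  # one divergent pair
}
similarity = {
    frozenset(("a1", "a2")): 99.5, frozenset(("a1", "a3")): 98.0,
    frozenset(("a2", "a3")): 99.0, frozenset(("c1", "c2")): 99.0,
    frozenset(("c1", "c3")): 90.0, frozenset(("c2", "c3")): 99.0,
}
core = core_groups(groups, similarity, n_genomes=3)
```

In this toy input only `g1` survives: `g2` is missing from one genome and `g3` contains a pair below the threshold.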

Virulence Factor Prediction

VirulentPred is software that predicts the virulence of bacterial proteins based on a machine learning classification method known as a bilayer cascade Support Vector Machine (SVM) [20]. The first-layer SVM classifiers were trained and optimized with different individual protein sequence features. In addition, a similarity search-based module was developed using a dataset of virulent and nonvirulent proteins as a BLAST database. The results from the first layer (SVM scores and PSI-BLAST result) were cascaded to the second-layer SVM classifier to train and generate the final classifier. In this work, we used the output of the OrthoMCL groups to search for possible virulence factors among the more conserved proteins, using a threshold ≥1 to minimize the occurrence of false positives.
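As a minimal sketch, the thresholding step can be written as a simple filter over per-protein scores. The code assumes the VirulentPred scores have already been collected into a dictionary; the function name, identifiers, and score values are hypothetical and the VirulentPred output format is not reproduced.

```python
def putative_virulence_factors(scores, threshold=1.0):
    """Return the sorted IDs of proteins whose score is >= threshold."""
    return sorted(pid for pid, score in scores.items() if score >= threshold)


# Made-up scores for four conserved proteins.
scores = {"p1": 1.2, "p2": 0.4, "p3": 1.0, "p4": -0.8}
selected = putative_virulence_factors(scores)
```

With these toy values, `p1` and `p3` are retained, while `p2` is rejected even though its score is positive, which is the intended effect of a stricter threshold.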

Nonclassically Secreted Protein Prediction

SecretomeP is software that runs an algorithm based on artificial neural networks to predict nonclassically secreted proteins in bacteria or mammalian cells [21]. The prediction method assigns each protein a score between 0 and 1, where a score above 0.5 is considered indicative of secretion. Nonclassically secreted proteins should obtain a score exceeding the normal threshold of 0.5 and should not simultaneously be predicted to contain a signal peptide. Here, we used the list of virulent proteins predicted to be “cytoplasmic” or “unknown” as input for SecretomeP.
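The decision rule described above (score above 0.5, no predicted signal peptide, and an initial localization of cytoplasmic or unknown) can be sketched as follows. The record layout, field names, and values are hypothetical; the code only combines results that are assumed to have been obtained beforehand.

```python
def is_nonclassically_secreted(score, has_signal_peptide, threshold=0.5):
    """Apply the rule from the text: score > threshold and no signal peptide."""
    return score > threshold and not has_signal_peptide


def select_candidates(records, allowed=("Cytoplasmic", "Unknown")):
    """Return IDs of proteins in the allowed localizations that pass the rule."""
    return [
        rec["id"]
        for rec in records
        if rec["localization"] in allowed
        and is_nonclassically_secreted(rec["score"], rec["signal_peptide"])
    ]


# Made-up records (not taken from the study).
records = [
    {"id": "p1", "localization": "Cytoplasmic", "score": 0.82, "signal_peptide": False},
    {"id": "p2", "localization": "Cytoplasmic", "score": 0.31, "signal_peptide": False},
    {"id": "p3", "localization": "Unknown", "score": 0.77, "signal_peptide": True},
    {"id": "p4", "localization": "Cellwall", "score": 0.90, "signal_peptide": False},
    {"id": "p5", "localization": "Unknown", "score": 0.66, "signal_peptide": False},
]
candidates = select_candidates(records)
```

Only `p1` and `p5` pass: `p2` scores too low, `p3` carries a predicted signal peptide, and `p4` was not in the cytoplasmic or unknown categories to begin with.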

Results

Prediction of Subcellular Localization and Selection of the Most Conserved Proteins

After running PSORTb 3.0 and comparing the degree of conservation between proteins with OrthoMCL according to the localization, 970 and 63 sequences with ≥98 % similarity between strains were selected for S. pneumoniae and Leptospira, respectively (Tables 3 and 4).

Table 3 Number of conserved proteins among S. pneumoniae genomes regarding the localization
Table 4 Number of conserved proteins among Leptospira genomes regarding the localization

Prediction of Virulence and Nonclassically Secreted Proteins

For further investigation, we selected the protein sequences that were conserved among all the genomes analyzed, to narrow the panel of proteins to be screened. All proteins identified in this step were screened for a potential role in virulence using the VirulentPred software. One hundred proteins conserved among the 16 S. pneumoniae strains were found to be putative virulence factors (data not shown). Among them, 75 were predicted to be cytoplasmic by PSORTb 3.0. Such proteins are usually discarded from further studies, as they are not expected to be presented to the host immune system. However, some proteins have already been detected on the surface of microorganisms even though they do not possess a classical signal peptide or surface exposure motifs. For this reason, we searched for nonclassically secreted proteins in the categories unknown and cytoplasmic using SecretomeP. We found 19 putative nonclassically secreted proteins. In total, 37 proteins were considered possible virulence factors, surface-exposed and conserved between the 16 Streptococcus genomes (Table 5).

Table 5 Possible virulence factors, surface-exposed and conserved between the 16 genomes of Streptococcus pneumoniae, according to the initial localization by PSORTb 3.0

Twenty-six proteins were found to be potential virulence factors conserved between the five Leptospira strains, especially flagellar proteins and ribosomal proteins (data not shown). After screening for nonclassically secreted proteins in the categories unknown and cytoplasmic using SecretomeP, we found seven putative nonclassically secreted proteins, all but one of which are ribosomal proteins. In total, 12 proteins were considered possible virulence factors, surface-exposed and conserved between the five Leptospira genomes (Table 6).

Table 6 Possible virulence factors, surface-exposed and conserved between the five genomes of Leptospira, according to the initial localization by PSORTb 3.0

Discussion

RV has already been used for studying Streptococcus and Leptospira spp. However, these studies focused on an individual organism or on proteins with signature sequence motifs commonly found in known secreted bacterial proteins [2, 22]. The novelty of the present study lies in the comparison of entire genomes and the incorporation of nonclassically secreted proteins in the analysis. Using this approach, we identified 37 proteins conserved between 16 genomes of S. pneumoniae and 12 proteins conserved between five leptospiral genomes that are potentially exposed on the surface of the bacteria and have a possible role in virulence.

Among these vaccine candidates, none matched the ones described in the pioneering RV studies [2, 22], owing to our different focus. In fact, as shown in Tables 3 and 4, the number of conserved proteins increased when the number of compared genomes decreased. When we looked at the potential candidate antigens present in fewer than 16 or 5 genomes, respectively, or with less than 98 % similarity, proteins such as choline-binding protein and signal peptidase were identified for Streptococcus, and LipL22 and LipL23 for Leptospira, matching the previous studies (results not shown). In the same way, we also identified pullulanase, which was indicated as a promising vaccine candidate in a recent RV approach based on conservation, immunogenicity, and human proteome similarity of selected streptococcal proteins [23]. Altogether, these observations support the reliability of our results.

Interactions between cells and their environment are critically mediated by surface proteins. In the case of pathogenic bacteria, these molecules are often virulence factors. In both microorganisms, we identified some expected classes of virulence factors, localized mainly in the cytoplasmic membrane or periplasm. For example, the ABC transporter ATP-binding protein and the psr protein transcription regulator have been recognized as Streptococcus pneumoniae virulence factors in lung infection [24, 25]. Inactivation of some flagellar proteins has demonstrated their importance for the motility and virulence of Leptospira interrogans [26, 27]. Furthermore, some flagellins have been shown to be immunogenic and even protective [28, 29]. Mutation in the gene of the flagellar protein FliM has been related to Helicobacter pylori pathogenicity by reducing either the ability of the bacteria to attach to gastric epithelial cells or the intensity of bacteria–host cell interactions [30]. Thus, these studies indicate that these classes of proteins could be evaluated for incorporation into a vaccine formulation.

Nevertheless, proteins with more than one transmembrane domain are generally difficult to obtain in recombinant form and are theoretically more deeply embedded beneath the surface, so their use for vaccine development could be more laborious. Notably, we identified few or no conserved virulence factors that were secreted or localized in the cell wall or outer membrane. This kind of protein is more likely to be variable, as part of the dispensable genome that confers a selective advantage in specific environmental conditions, whereas cytoplasmic proteins are more likely to be part of the core genome, which mainly encodes factors that contribute to the major metabolic pathways.

However, as it has long been known that some cytoplasmic proteins can be displayed at the cell surface, we focused on searching for nonclassically secreted proteins. The so-called moonlighting proteins are proteins that have an additional biological activity when found in a location different from the one they normally occupy. Most of the moonlighting proteins in bacteria have been primarily identified as glycolytic enzymes, other metabolic enzymes, or molecular chaperones, normally classified as cytoplasmic. The most classical examples are GAPDH and enolase, which have been shown to play a role in adhesion to host epithelial components and plasminogen or to modulate the host immune response. Thus, it has been suggested that moonlighting proteins contribute to bacterial virulence [7, 8].

In Streptococcus, we identified proteins such as phosphate ABC transporter, tributyrin esterase, and methyltransferase, already known to be involved in bacterial pathogenesis [24, 31]. Curiously, however, our pipeline returned as nonclassically secreted proteins a high number of ribosomal proteins, which were not, at first glance, the best candidates for vaccine development. Nevertheless, ribosomal proteins have been found in the exoproteome of many bacteria [32–34], and mounting evidence points to an alternative extracellular location where they might perform nonribosomal functions. Interestingly, Ogunniyi and colleagues (2012) [25] demonstrated that the ribosomal proteins L29 and L33 were upregulated in S. pneumoniae after infection in the lung, and the ribosomal protein L33 was identified by our pipeline. Furthermore, ribosomal proteins have already been shown to be immunogenic and sometimes protective. For example, in a previous study, we showed that the pneumococcal ribosomal protein S9 decreased bacteremia in mice 24 h after challenge in a model of sepsis [35]. In another work, Moffitt et al. (2012) [36] identified several pneumococcal antigens from the soluble fraction of a killed whole-cell vaccine. Among them, the ribosomal protein S1 was a potent inducer of Th17 and significantly reduced the colonization of nasal mucosal tissue in immunized mice compared with the control group in a model of nasopharyngeal carriage. It is therefore reasonable to think that these proteins might have a protective role in other bacteria, too.

A new prediction tool, the Jenner-predict server, has recently been developed to predict vaccine candidates in bacteria based on host–pathogen interactions [37]. The performance of the web server was evaluated against known protective antigens from diverse classes of bacteria reported in the Protegen database and against the datasets used for the development of the VaxiJen server; it was then applied to the S. pneumoniae and Escherichia coli proteomes. Although the authors state that its prediction accuracy was better than that of previously developed tools, the method relies on the premise that a candidate should first be classified as “noncytosolic” by PSORTb 3.0. Thus, none of the cytoplasmic candidates predicted by our method in S. pneumoniae were identified by the Jenner server. Several known candidates, such as pneumolysin, surface protein PspA, and choline-binding family proteins, known to be extracellular or cell wall components, were identified by Jaiswal and collaborators but not by our method. The fact that we did not identify these classical candidates could be explained by our more stringent criteria with respect to conservation between strains. Among the proteins classified as “cytoplasmic membrane,” only two, an iron ABC transporter and an adhesion lipoprotein, were identified by both the Jenner server and our method, probably because of the number of transmembrane helices: Jaiswal and collaborators used fewer than three transmembrane domains as a cutoff. Interestingly, the tributyrin esterase classified as unknown was identified by both our method and the Jenner-predict server. Furthermore, this protein has been identified as essential for lung infection [24], indicating that it could be a strong new candidate for vaccine development against S. pneumoniae.

Among the possible limitations of our method is the selection of false-positive vaccine candidates, owing to the suboptimal accuracy of the software used. Thus, biological confirmation is necessary. Furthermore, our very stringent criteria could explain why we found mainly ribosomal proteins. Even if these proteins are indeed exposed and immunogenic, their possible homology with human proteins is a concern. We therefore aligned the sequences of the unknown and cytoplasmic protein categories identified here against the human genome using the Basic Local Alignment Search Tool (BLAST). On average, the sequence identity varied between 22 and 53 %. Notably, the ribosomal proteins L32, L33, and L35 and one hypothetical protein of S. pneumoniae, as well as the ribosomal protein L35 and the hypothetical protein of Leptospira spp., have no homology with the human genome (results not shown).

The results obtained with the Leptospira genomes were more restricted and less interesting than those obtained with Streptococcus. This could be because we compared two different species, L. interrogans and Leptospira borgpetersenii, instead of several serotypes of the same species, which suggests that common antigens for a broad vaccine are less likely to be found across species. In conclusion, our pipeline allowed us to identify, besides the canonical virulence factors, cytoplasmic proteins with a putative extracellular location. They might be new moonlighting proteins, and we think that this kind of protein should not be neglected as a possible subunit vaccine candidate. Some of them are already under investigation in our laboratory to further confirm their surface localization and immunogenicity.