Background

A wide range of critical functions in prokaryotes such as antibiotic resistance, stationary phase transition, competence, sporulation, chemotaxis, nitrogen regulation, virulence, and phosphate regulation are mediated by a particular type of signalling pathway known as a two-component system (TCS) [1]. TCS typically operate through the transfer of phosphoryl groups from a His residue of a histidine kinase (HK) to an Asp residue of a response regulator (RR), in response to an extracellular stimulus. A variant of the TCS, known as a phosphorelay, includes extra receiver and phosphotransfer domains relaying the phosphoryl group between the HK and RR proteins [2].

Genome-wide identification of HK and RR proteins is relatively straightforward [3], with a variety of TCS databases and prediction servers available [4–6]. However, the identification of HK-RR pairs is challenging because TCS pairs are highly specific [7], there are multiple HKs and RRs in most genomes, and their genes are often unpaired (orphan HKs and RRs). Several experimental approaches have been used to identify HK-RR pairs, including phosphotransfer profiling [8–10] and yeast two-hybrid assays [11–14]. Such approaches are costly and labour intensive; it is therefore important to develop computational tools that lessen the burden and complement experimental approaches.

The use of meta-predictors, predictors that combine predictions from individual methods using machine learning algorithms, is a common approach in bioinformatics [15–21]. The advantage of meta-predictors is that they do not rely on single methods, but can integrate a wide range of information under a common probabilistic umbrella without relying on complex scoring functions [22]. In doing so, the strengths and weaknesses of individual predictions are combined to achieve higher levels of accuracy [21]. Examples of meta-predictors include those developed for the prediction of functional sites in proteins [23] or prediction of critical residues in protein interfaces [24].

In this work we present MetaPred2CS, a sequence-based meta-predictor designed specifically to predict protein pairs in TCS. MetaPred2CS is based on a Support Vector Machine (SVM) [25, 26] and combines six independent and orthogonal protein-protein interaction prediction methods: in-silico two-hybrid (i2h) [27], mirror tree (MT) [28], phylogenetic profiling (PP) [29], gene fusion (GF) [30], gene neighbourhood (GN) and gene operon (GO) [31]. The i2h and MT methods are based on co-evolution theory and rely on high quality and complete multiple sequence alignments (MSAs), while PP, GF, GN and GO are genome context methods that use different genomic information, such as chromosomal proximity (GN), operons (GO), fusion events (GF) and inter-genomic profiles (PP), across fully sequenced genomes.

The identification performance of MetaPred2CS was tested using validated experimental data and it achieved a higher accuracy compared to individual prediction methods such as i2h, MT, GF, PP, GN and GO. MetaPred2CS also compared favourably against a Bayesian meta-predictor benchmarked on TCS pairs [32] and a database of protein-protein interactions: STRING [33].

Methods

Datasets: training and testing

A variety of datasets, described in detail below, were used during the development of the predictor to benchmark its performance under different scenarios and to compare it to an independent, competing method and to precomputed scores in the STRING [33] database. File1.xls and File2.pdf in Additional files 1 and 2 provide complete details and a diagrammatic representation of all the sets described below.

The P+ and P- datasets

The P+ and P- sets contain 113 interacting and 1134 non-interacting experimentally validated TCS pairs respectively, and were compiled and manually curated from the current literature. These sets were used to train and test MetaPred2CS using a k-fold cross-validation strategy. Specifically, the P+ set was compiled by mining protein-protein interaction databases, including BioGRID [34], DIP [35], IntAct [36], PSI-MI [37], UniProtKB [38] and MINT [39], using the RefSeq identifiers extracted from the P2CS database [4]. To create P-, experimentally validated non-interacting pairs were mined from publications describing high-throughput, systematic, yeast two-hybrid or phosphotransfer profiling experiments from a number of organisms, including Caulobacter crescentus, Escherichia coli, Mycobacterium tuberculosis, Myxococcus xanthus, Synechocystis sp. and Mesorhizobium loti [8–14].

To test the performance of MetaPred2CS under different scenarios and to compare it to an independent, competing method, we derived interdependent testing sets as described below (NP+, OP+, species-specific, T, SP+ and SP- sets). In each test, MetaPred2CS was trained with the corresponding orthogonal version of P+ and P- (i.e. after removing any proteins present in the testing subset).
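The construction of an orthogonal training set can be illustrated with a minimal sketch. It assumes that pairs are stored as (HK, RR) identifier tuples and that "orthogonal" means no protein is shared between training and testing pairs; the identifiers and the function name `orthogonal_training_set` are hypothetical and are not taken from the MetaPred2CS code.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (histidine kinase ID, response regulator ID)


def orthogonal_training_set(training: Iterable[Pair],
                            testing: Iterable[Pair]) -> List[Pair]:
    """Drop every training pair that shares a protein with the testing set."""
    test_proteins: Set[str] = set()
    for hk, rr in testing:
        test_proteins.update((hk, rr))
    return [(hk, rr) for hk, rr in training
            if hk not in test_proteins and rr not in test_proteins]


# Hypothetical identifiers, for illustration only.
p_plus = [("HK1", "RR1"), ("HK2", "RR2"), ("HK3", "RR3")]
p_minus = [("HK1", "RR2"), ("HK3", "RR4")]
test_set = [("HK2", "RR2")]

train_pos = orthogonal_training_set(p_plus, test_set)
train_neg = orthogonal_training_set(p_minus, test_set)
print(train_pos, train_neg)
```

Filtering at the protein level is stricter than filtering at the pair level: the negative pair (HK1, RR2) is dropped here because RR2 occurs in the testing set, even though that exact pair does not.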

NP+ and OP+ and Species-specific datasets

The NP+ set (for Neighbouring Pairs) contains 56 pairs of TCS that are encoded by neighbouring genes, while the OP+ set (for Orphan Pairs) is composed of 57 pairs that are encoded by genes that are not adjacent in the genome. This distinction is important, as predictions of orphan pairs are usually more challenging [40–42]. To further clarify species-specific and positive-to-negative class ratio bias in the predictions, we also produced four species-specific testing sets: Escherichia coli, Myxococcus xanthus, Synechocystis sp. and Mesorhizobium loti.

T, SP+ and SP- datasets

Datasets T, SP+ and SP- were extracted from the work by Burger and van Nimwegen [32] as testing sets to compare MetaPred2CS performance. For all three testing sets, MetaPred2CS was trained with an orthogonal version of the P+ and P- sets, i.e. any pair present in any of the testing sets was removed from P+ or P- prior to training. The T dataset is composed of 16 experimentally validated interacting pairs and 5 non-interacting pairs, while the SP+ and SP- sets are composed of pairs of TCS extracted from the SwissRegulon database [32]. In addition, the SP+ and SP- sets were also used for the comparison with the STRING [33] database.

The MetaPred2CS prediction method

Individual prediction methods

The selection of individual methodologies was based on their orthogonality, nature (i.e. sequence-based), performance and availability. MetaPred2CS integrates the predictions of six different methods: i2h, MT, PP, GF, GN and GO. Briefly, the i2h method scans for correlated (compensatory) mutations between residues of the two proteins of interest, where a high correlation implies a high probability that the given pair interacts [27]. The MT method relies on the similarity between phylogenetic trees to infer the likelihood of interaction between pairs of proteins [28]. The GN and GO methods are based on the observation that proteins that are functionally related tend to be transcribed and expressed concurrently, i.e. are encoded by adjacent genes, particularly in prokaryotes [31]. The PP method is based on the idea that functionally related genes under strong selective pressure appear or disappear together as units during speciation events [29]. Finally, the GF method is based on fusion events, i.e. if two proteins appear as independent units in one organism but as a joint entity in another organism, then it is likely that the individual units are an interacting pair [30]. Detailed information about these methods, as well as their technical aspects, can be found in the individual references indicated above.
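As an illustration of the tree-similarity idea behind MT, the sketch below scores two proteins by the Pearson correlation between their inter-species distance matrices, a common way of comparing trees without building them explicitly. It is a simplified sketch and not the published implementation [28]; the matrices are invented, and both are assumed to list the same species in the same order.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def upper_triangle(matrix: Matrix) -> List[float]:
    """Flatten the strictly upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mirror_tree_score(dist_a: Matrix, dist_b: Matrix) -> float:
    """Correlation between two distance matrices over the same species."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Invented distances for four species; the RR distances are twice the HK ones,
# i.e. the two families have evolved in perfect proportion.
dist_hk = [[0.0, 0.1, 0.4, 0.5],
           [0.1, 0.0, 0.3, 0.6],
           [0.4, 0.3, 0.0, 0.2],
           [0.5, 0.6, 0.2, 0.0]]
dist_rr = [[2 * d for d in row] for row in dist_hk]

score = mirror_tree_score(dist_hk, dist_rr)
print(round(score, 6))
```

A score close to 1 indicates that the two protein families have diverged in parallel across species, which is the signal MT interprets as evidence of interaction.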

Reference genome dataset

The GN, GO, GF, and PP methods rely on a reference genome dataset, the quality of which, in the form of size and diversity, is central to their performance [29]. To that end, and to maximize the prediction performance of these methodologies, we compiled a diverse, yet relatively small (as a compromise between performance and calculation speed), reference genome dataset based on the most successful genome combinations proposed by Muley and Ranjan [43]. Our dataset comprised 243 individual genomes, belonging to 22 different classes (for further details see Table S1 in the Additional file 3). Genome annotations were downloaded from the NCBI database [44] and operon data from annotated genomes using Moreno-Hagelsieb and Collado-Vides' approach [45] available at http://microbiome.wlu.ca/public/TUpredictions/Predictions/ .

Implementation of individual prediction methods

All methods were implemented as described in their original publications. Co-evolution-based methods (i2h and MT) rely on the quality of the MSAs, i.e. their completeness and diversity. We generated MSAs with ParseBlast, following the approach described in our previous work [46] and using UniProtKB [38]. ParseBlast generates complete and diverse MSAs by filtering out both highly identical and highly dissimilar sequence homologs, also taking into account the coverage of the alignment between query and hit proteins [46]. With the exception of the minimum and maximum number of species represented in the MSAs, set to 25 and 50 respectively, the prediction parameters were set to the default values described in the original works [27, 28]. These two parameters bound the number of sequences shared between the two MSAs, i.e. the sequences from species common to both alignments, where for each species the sequence with the highest identity to the corresponding member of the pair is selected. The minimum and maximum number of common species is therefore an important aspect of these methodologies, as it determines the diversity of the alignments and strongly influences prediction performance.
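The pairing of the two alignments described above can be sketched as follows. This is an illustrative reading of the text rather than the ParseBlast or i2h/MT code: the `Hit` layout, the function names and, in particular, the rule used to trim the list when more than the maximum number of species is shared (keep the species with the highest combined identity) are assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Hit = Tuple[str, str, float]  # (species, sequence ID, % identity to the query)


def best_hit_per_species(hits: Sequence[Hit]) -> Dict[str, Tuple[str, float]]:
    """Keep, for each species, the homolog with the highest identity to the query."""
    best: Dict[str, Tuple[str, float]] = {}
    for species, seq_id, identity in hits:
        if species not in best or identity > best[species][1]:
            best[species] = (seq_id, identity)
    return best


def pair_alignments(hits_a: Sequence[Hit], hits_b: Sequence[Hit],
                    min_species: int = 25,
                    max_species: int = 50) -> Optional[List[Tuple[str, str, str]]]:
    """Return (species, seq A, seq B) rows for the species common to both MSAs.

    Returns None when fewer than min_species are shared. Trimming to
    max_species by combined identity is an assumption made for this sketch.
    """
    best_a = best_hit_per_species(hits_a)
    best_b = best_hit_per_species(hits_b)
    common = sorted(set(best_a) & set(best_b),
                    key=lambda s: (-(best_a[s][1] + best_b[s][1]), s))
    if len(common) < min_species:
        return None
    return [(s, best_a[s][0], best_b[s][0]) for s in common[:max_species]]


# Invented hits; the minimum is lowered to 2 so the toy example returns rows.
hk_hits = [("sp1", "a1", 90.0), ("sp1", "a2", 70.0),
           ("sp2", "a3", 60.0), ("sp3", "a4", 80.0)]
rr_hits = [("sp1", "b1", 85.0), ("sp2", "b2", 65.0), ("sp4", "b3", 99.0)]

paired = pair_alignments(hk_hits, rr_hits, min_species=2)
too_few = pair_alignments(hk_hits, rr_hits)  # default minimum of 25 species
print(paired, too_few)
```

With the default bounds of 25 and 50, a pair whose alignments share too few species yields no co-evolution prediction at all, which is why alignment completeness matters for i2h and MT.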

We used InPrePPI [47], which implements all the genome context-based methods (GF, PP, GN and GO) and requires a reference genome dataset and genome annotation (e.g. operon units) as described above. The parameters required for prediction include: (i) the evolutionary distances between target and reference organisms, which were calculated using 16S rRNA data as described previously [47]; (ii) an e-value cut-off of 1e-5 for BLASTP [48] searches; (iii) a cut-off of 0.35 for mutual information values (required for PP); and (iv) a distance cut-off of 200 bp for the GN/GO predictions, as suggested previously [45, 49, 50]. The GF method identifies fusion events, which in the case of TCS result in hybrid proteins combining an HK and an RR in a single coding unit, by using local alignments based on the Smith-Waterman algorithm [51], as implemented in the ssearch36 program with default parameters [52].
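Two of the cut-offs listed above can be made concrete with a short sketch: the mutual information between two presence/absence profiles (PP) and the intergenic distance between two genes (GN/GO). This is not InPrePPI code. The profiles and coordinates are invented, mutual information is computed in bits (whether the 0.35 cut-off refers to bits is an assumption), and genes are assumed to be given as 1-based inclusive (start, end) coordinates on the same replicon.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

MI_CUTOFF = 0.35       # mutual information cut-off quoted in the text (PP)
DISTANCE_CUTOFF = 200  # intergenic distance cut-off in bp (GN/GO)


def mutual_information(profile_a: Sequence[int], profile_b: Sequence[int]) -> float:
    """Mutual information (bits) between two presence/absence profiles."""
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    count_a, count_b = Counter(profile_a), Counter(profile_b)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def intergenic_distance(gene_a: Tuple[int, int], gene_b: Tuple[int, int]) -> int:
    """Number of bases between two genes; 0 if they overlap or abut."""
    (_, end_first), (start_second, _) = sorted([gene_a, gene_b])
    return max(0, start_second - end_first - 1)


def are_neighbours(gene_a: Tuple[int, int], gene_b: Tuple[int, int]) -> bool:
    return intergenic_distance(gene_a, gene_b) <= DISTANCE_CUTOFF


# Invented profiles over four reference genomes (1 = homolog present).
co_occurring = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
independent = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])

close = are_neighbours((100, 1000), (1101, 2000))  # 100 bp apart
far = are_neighbours((100, 1000), (1500, 2500))    # 499 bp apart
print(co_occurring, independent, close, far)
```

The distance test also shows why GN and GO cannot score orphan pairs: genes that lie far apart in the genome never pass the cut-off, regardless of whether their products interact.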

Integration of individual prediction methods using a SVM: MetaPred2CS

MetaPred2CS is based on a support vector machine (SVM), implemented using the LIBSVM package [53]. The individual prediction methods described above form a six-dimensional vector representing the prediction scores for a given pair of proteins of interest, i.e. an HK and RR pair. The vector is then passed to an SVM trained using the same training set. The -w option in LIBSVM was used to account for the imbalance between positive and negative classes. The optimal values for the error cost (c) and the gamma value (g) were explored using a grid search with 10-fold cross-validation and the radial basis kernel function [54] (see Table S2 in Additional file 3). Finally, decision values were normalized to a range between 0 and 1 (Fig. 1).
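The three configuration steps in this paragraph (class weighting, grid search over c and g, and normalisation of decision values) can be sketched without training a model. Everything specific here is an assumption: the weight is taken as the negative-to-positive ratio of the P+/P- sets, the grid uses the exponent ranges that LIBSVM's grid.py tool applies by default, and the normalisation is plain min-max scaling. The values actually used by MetaPred2CS are not stated in the text.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def class_weights(n_pos: int, n_neg: int) -> Dict[int, float]:
    """Weight the minority (positive) class by the class ratio (assumed scheme)."""
    return {1: n_neg / n_pos, -1: 1.0}


def parameter_grid(log2c: Sequence[int] = range(-5, 16, 2),
                   log2g: Sequence[int] = range(3, -16, -2)) -> List[Tuple[float, float]]:
    """(C, gamma) candidates on a log2 grid, to be scored by cross-validation."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(log2c, log2g)]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Map decision values linearly onto [0, 1] (assumed normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


weights = class_weights(n_pos=113, n_neg=1134)  # sizes of the P+ and P- sets
grid = parameter_grid()
scaled = min_max_scale([-2.0, 0.0, 2.0])        # invented decision values

print(round(weights[1], 2), len(grid), scaled)
```

In a full pipeline each (C, gamma) candidate would be evaluated by 10-fold cross-validation with an RBF-kernel SVM and the best-scoring combination retained.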

Fig. 1 Schematic representation of MetaPred2CS. Individual predictions are performed for given pairs of HK and RR (a). The prediction scores are then used as the input vector for the SVM (b) trained on the P+ and P- sets (c). Finally, prediction scores are scaled from 0 to 1 (d)

Benchmarking and comparison of MetaPred2CS performance

MetaPred2CS was benchmarked and assessed using the different datasets described above. Firstly, to assess the contribution of each individual prediction method to the final classifier, we trained and tested eight different classifiers with different combinations of individual prediction methods (Table 2). Each of these classifiers was assessed using 5-, 10- and 20-fold cross-validation on the P+ and P- datasets. Furthermore, MetaPred2CS was benchmarked against the NP+, OP+ and species-specific testing sets to further discern the performance of predictions on orphan pairs, neighbouring pairs and individual species. Finally, MetaPred2CS was compared against the work by Burger and van Nimwegen [32] (T, SP+ and SP- datasets) and the STRING [33] database (SP+ and SP- datasets).

Table 1 AUC values of predictions by individual methods for the P+/P-, NP+/P- and OP+/P- datasets. The GN and GO methods were not included in the AUC comparison because the large genomic distance between pairs in the P- dataset made the predictions unfeasible

Assessing MetaPred2CS performance

The performance of each classifier was evaluated according to sensitivity (1), specificity (2), accuracy (3), Matthews correlation coefficient (MCC) [55] (4) and Area Under the ROC Curve (AUC) [56] values. Formally,

$$ Sensitivity=\frac{TP}{TP+FN} \tag{1} $$
$$ Specificity=\frac{TN}{TN+FP} \tag{2} $$
$$ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \tag{3} $$
$$ MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}} \tag{4} $$

where TP, FP, TN, and FN represent true positives, false positives, true negatives and false negatives, respectively. The MCC values are particularly important given the imbalance between the positive and negative classes, i.e. the difference in the number of interacting and non-interacting pairs. The statistical analysis of ROC curves was performed using STAR [57].

Results and Discussion

Evaluation of individual prediction methods

Individual methods were tested on the P+/P-, NP+/P-, and OP+/P- datasets. Prediction performance metrics of each method are presented and compared using AUC values (Table 1). Co-evolutionary methods (i2h and MT) performed better than the genomic context methods, and the i2h method outperformed all other methods for each dataset, with MT being the next best method for each dataset.
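For reference, an AUC value can be computed from raw prediction scores as the probability that a randomly chosen interacting pair scores higher than a randomly chosen non-interacting pair, with ties counted as one half. The sketch below uses this pairwise definition with invented scores; it is not the procedure used to produce Table 1.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) score pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Invented scores: one interacting pair (0.4) is ranked below one
# non-interacting pair (0.7), so 8 of the 9 comparisons are correct.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(example_auc, 3))
```

The quadratic loop is adequate for sets of the size used here (113 positive and 1134 negative pairs); a rank-based formulation would be preferable for much larger sets.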

With the exception of PP, the best performance was on the NP+/P- dataset. This was expected because predicting orphan pairs is usually more challenging than predicting neighbouring pairs. PP, however, relies on comparison across genomes, in which interacting pairs either appear or disappear concurrently; hence genomic context does not make a particular contribution. Consequently, PP achieved the best performance of the genomic context methods on the OP+/P- dataset. It also performed similarly to the GF method on the P+/P- dataset. In the case of GN and GO, an intrinsic limitation of these methodologies, i.e. their reliance on genomic distance, prevented their use on the P- and OP+ datasets; AUC and MCC could therefore not be calculated and are not presented in Table 1. Nonetheless, GN and GO are valid strategies for predicting pairing between neighbouring genes (51 pairs out of 57 on the NP+ were predicted correctly), hence GN and GO predictions were included in the input vector for the meta-predictor (see next).

Contribution of individual methods to MetaPred2CS

To understand the contribution of individual methods, several meta-predictors were trained and tested using different K-fold cross-validation strategies on the P+/P- sets. The different combinations of individual predictors are listed in Table 2. The meta-predictor combining all six prediction methods, hereinafter referred to as the default predictor or MetaPred2CS, achieved the highest performance (AUC: 94.79; MCC: 0.51). The largest drop in performance occurred when the i2h method was removed from the input feature vector (AUC: 88.87; MCC: 0.401). Omitting a method had a minimal effect if another method based on similar principles, e.g. genomic context, was retained. For example, when GO was excluded but GN was kept, the decreases in AUC and MCC were very small (AUC: 94.76; MCC: 0.46). However, when both GN and GO were excluded, the decrease in AUC and MCC was larger (AUC: 89.83; MCC: 0.41). Tests performed at different cross-validation levels also showed that the different sizes of the training and test datasets did not result in large differences in the performance of the SVM classifiers with our training dataset, and the best results were obtained at 10-fold cross-validation (Table S2 in Additional file 3).

Table 2 Combinations of prediction methods and prediction performance at 10-fold cross-validation. 1: i2h not included, 2: MT not included, 3: GF not included, 4: PP not included, 5: GN not included, 6: GO not included, 7: GN and GO not included, 8: all methods included. AUC and MCC represent the area under the ROC curve and Matthews correlation coefficient respectively
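The eight feature combinations listed in the Table 2 caption can be generated programmatically, which makes the ablation design explicit. The sketch below is illustrative only; the ordering of scores within the feature vector and the example values are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

METHODS: Tuple[str, ...] = ("i2h", "MT", "GF", "PP", "GN", "GO")


def ablation_feature_sets(methods: Sequence[str] = METHODS) -> Dict[int, Tuple[str, ...]]:
    """Classifier number -> methods kept, numbered as in the Table 2 caption."""
    combos: Dict[int, Tuple[str, ...]] = {}
    for index, removed in enumerate(methods, start=1):  # classifiers 1 to 6
        combos[index] = tuple(m for m in methods if m != removed)
    combos[7] = tuple(m for m in methods if m not in ("GN", "GO"))
    combos[8] = tuple(methods)  # default predictor: all six methods
    return combos


def project(vector: Sequence[float], kept: Sequence[str],
            methods: Sequence[str] = METHODS) -> List[float]:
    """Reduce a six-dimensional score vector to the methods that were kept."""
    return [vector[methods.index(m)] for m in kept]


combos = ablation_feature_sets()
scores = [0.9, 0.7, 0.1, 0.4, 1.0, 1.0]  # invented scores in METHODS order
without_i2h = project(scores, combos[1])
print(combos[7], without_i2h)
```

Each reduced vector would then be used to train and cross-validate a separate SVM, so that the drop in AUC or MCC relative to classifier 8 reflects the contribution of the omitted method or methods.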

Species-specific predictions

To further characterise the performance of MetaPred2CS, we performed species-specific predictions. Four independent testing sets representing Escherichia coli, Myxococcus xanthus, Synechocystis sp. and Mesorhizobium loti were created; these species were chosen for the number of TCS proteins encoded in their genomes and the resulting ratio between interacting and non-interacting pairs. As shown in several works (e.g. [58, 59]), the ratio between positive and negative cases has an important impact on the performance of predictors of protein-protein interactions. Escherichia coli is the organism with the lowest number of TCS proteins (62) and the most balanced ratio of interacting (22) to non-interacting (64) pairs (approximately 1:3), while Myxococcus xanthus had 236 pairs and a ratio of 20:216 interacting to non-interacting pairs. The most challenging cases were Synechocystis sp. and Mesorhizobium loti, with 20 interacting to 319 non-interacting pairs and 20 interacting to 364 non-interacting pairs, respectively. Overall, and as expected, the best prediction performance was achieved for Escherichia coli, although predictions were still accurate even for Synechocystis sp. and Mesorhizobium loti (Table 3).

Table 3 Performance of default predictor on species-specific gene sets. Sensitivity, specificity, accuracy and MCC values are presented, as defined in the text

Predictions of neighbouring and orphan pairs (NP+/P- and OP+/P- sets)

The genes encoding a TCS pair can be located in adjacent (neighbouring) or separate (orphan) positions within the genome. The prediction of interacting pairs would be expected to be more challenging for orphans than for neighbouring proteins. Therefore, to test the capacity and performance of MetaPred2CS under these different scenarios, the P+ dataset was divided into two subsets, NP+ (neighbouring pairs) and OP+ (orphan pairs), and assessed at different levels of K-fold cross-validation. As expected, MetaPred2CS performed better on the NP+/P- set than on the OP+/P- set at every level of K-fold cross-validation (Table 4). The best performance was achieved at the 10-fold cross-validation level. ROC curves for the NP+/P- (AUC = 0.98), P+/P- (AUC = 0.95) and OP+/P- (AUC = 0.89) datasets at 10-fold cross-validation are shown in Fig. 2.

Table 4 Prediction performance of default predictor on neighbouring and orphan pairs. AUC and MCC values for MetaPred2CS trained on the NP+/P- and OP+/P- datasets at different levels of K-fold cross-validation

Fig. 2 ROC curves of predictions on the NP+/P-, P+/P- and OP+/P- datasets using default predictor. Blue, black and red ROC curves represent the performance on the NP+/P-, P+/P- and OP+/P- datasets, respectively

Comparison of MetaPred2CS and a competing machine-learning method and STRING database

MetaPred2CS was compared to a publicly available competing machine-learning method [32] and to the STRING [33] database using common testing sets. In the first instance, both methods were compared using the T set compiled in Burger and van Nimwegen’s original work [32]. The T set is composed of 16 interacting and 5 non-interacting protein pairs. As shown in Table 5, out of 21 pairs, 16 were predicted more accurately by MetaPred2CS (in 4 cases both methods performed at the same level). Moreover, MetaPred2CS correctly predicted all non-interacting pairs, assigning low prediction scores in all cases. The T dataset is, however, a small set composed of protein pairs from a single species: Caulobacter crescentus.

Table 5 Prediction of the T dataset by the Bayesian approach [32] and MetaPred2CS. Non-interacting protein pairs are marked with an asterisk and best predictions are highlighted in bold

A more comprehensive comparison, which also included the STRING [33] database, was carried out on the SP+/SP- dataset, also compiled in Burger and van Nimwegen’s original work [32], which is considerably larger and more diverse. These datasets include protein pairs from 6 different species: Escherichia coli, Bacillus subtilis, Caulobacter crescentus, Mesorhizobium loti, Myxococcus xanthus, and Synechocystis sp. As shown in Fig. 3, MetaPred2CS performed better than the Bayesian approach (AUC: 92.8 vs. 83.5) and the STRING [33] database (AUC: 92.8 vs. 88.4). Statistical analysis of the ROC curves showed a significant improvement in the performance of MetaPred2CS over both the Bayesian approach and STRING (p-value < 0.05).

Fig. 3 ROC curves of predictions on the SP+/SP- datasets. Red, blue and green ROC curves represent predictions by MetaPred2CS, STRING [33], and the Bayesian approach of Burger and van Nimwegen [32], respectively

Conclusion

In this work we present a novel sequence-based prediction method designed specifically for TCS signalling networks: MetaPred2CS. The method was systematically assessed under different benchmarking scenarios, and performed well in all conditions, including using species-specific gene sets and TCS with different genome architecture features (i.e. neighbouring proteins vs. orphans). We show that integration of individual prediction methodologies improves the performance of the predictions, and that MetaPred2CS prediction performance compared favourably to existing methodologies. MetaPred2CS is accessible through a dedicated web-server at http://metapred2cs.ibers.aber.ac.uk.