Background

Intrinsically unstructured/disordered proteins (IUPs/IDPs) contain long disordered regions or are completely disordered [1]. IUPs are abundant in higher organisms and are often involved in key biological processes, such as transcriptional and translational regulation, membrane fusion and transport, cell-signal transduction, protein phosphorylation, the storage of small molecules and the regulation of the self-assembly of large multi-protein complexes [2–11]. The disordered state of IUPs creates larger intermolecular interfaces [12], which increase the speed of interaction with potential binding partners even in the absence of tight binding, and provide flexibility for binding diverse ligands [2, 5, 11, 13–15]. However, long disordered regions in IUPs cause difficulties in protein structure determination by both X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Efficient computational prediction of disordered regions in IUPs can provide valuable information for high-throughput protein structure characterization and reveal useful information on protein function [15].

Many predictors have been developed to predict disordered regions in proteins, such as PONDR [16], RONN [17, 18], VL2, VL3, VL3H and VL3E from DisProt [1, 19, 20], NORSp [21, 22], DISpro [23], FoldIndex [24], DISOPRED and DISOPRED2 [25–27], GlobPlot [28] and DisEMBL [29], IUPred [30], Prelink [31], DRIP-PRED (MacCallum, online publication http://www.forcasp.org/paper2127.html), FoldUnfold [32], Spritz [33], DisPSSMP [34], VSL1 and VSL2 [35, 36], POODLE-L [37], POODLE-S [38], Ucon [39], PrDOS and metaPrDOS [40, 41]. Among these predictors, neural networks and support vector machines (SVMs) are widely used machine learning models.

The accuracy of disorder predictors is generally limited by the existence of various kinds of disorder, which are represented unevenly in the various databases, and by the lack of a unique definition of disorder [30]. Predictors designed for long disordered regions are usually less successful at predicting short disordered regions [36, 42], because long and short disordered regions have different sequence features. As a result, some predictors are specialized for predicting long disordered regions, such as POODLE-L [37], while predictors targeting all types of disordered regions usually have to sacrifice time efficiency in exploiting heterogeneous sequence properties, especially evolutionary information extracted from PSI-BLAST or predicted protein secondary structure [25, 27, 33–36, 38].

In this paper, a new algorithm, IUPforest-L, is proposed for predicting long disordered regions based on the random forest learning model [43] and simple parameters extracted from the amino acid sequences and amino acid indices (AAIs) [44]. 10-fold cross validation tests and blind tests demonstrate that IUPforest-L can achieve significantly higher accuracy than many existing algorithms in predicting long disordered regions. The high efficiency of IUPforest-L makes it a suitable tool for high-throughput comparative proteomics studies.

Methods

Training and test datasets

To train IUPforest-L, a positive training set of disordered regions was constructed based on DisProt [20] (version 3.6); it includes 352 regions of 30 aa or more in length, 47,251 aa in total. The negative training set was extracted from PDBSelect25 [45] (Oct. 2004 version), from which 366 sequences (80,324 aa in total) of at least 80 aa were selected. Each has a high-resolution crystal structure (< 2.0 Å) free from missing backbone or side-chain coordinates and from non-standard amino acid residues.

To assess the prediction performance of IUPforest-L, three datasets were used for blind tests. The first dataset was based on the dataset constructed by Hirose et al. (Hirose-ADS1) as a blind test dataset for POODLE-L [37]. Hirose-ADS1 contains 53 ordered regions of at least 40 aa (11,431 aa in total) from the Protein Data Bank [46] and 63 disordered regions of at least 30 aa (8,700 aa in total) from DisProt (version 3.0). The second test set (Han-ADS1) comprises the 53 ordered regions of Hirose-ADS1 and 33 long disordered regions (5,959 aa in total) from the latest DisProt (version 4.8), after removing disordered regions homologous to those in DisProt (version 3.6) using the CD-HIT algorithm with a threshold of 0.9 sequence identity [47]. The third test set (Peng-DB) was constructed from the blind test dataset of VSL2 [35], from which 56 long disordered regions of at least 30 aa (2,841 aa in total) and 1,965 ordered regions (318,431 aa in total) were used in the assessment. For an objective blind test of IUPforest-L on Hirose-ADS1 (as reported in Table 1), disordered and ordered regions homologous to those in Hirose-ADS1 were removed from our training set using the CD-HIT algorithm with a threshold of 0.9 sequence identity [47], leaving 293 disordered regions and 364 ordered regions for training the predictor. Similarly, for an objective blind test on Han-ADS1 (as reported in Table 2), ordered regions homologous to the 53 ordered regions of Hirose-ADS1 were also removed from the original training set before training the predictor. The final IUPforest-L was still trained on the whole training set. Han-ADS1 is listed in Additional file 1 and is also available online at http://dmg.cs.rmit.edu.au/IUPforest/Han-ADS1.fasta.

The random forest model

A random forest is an ensemble of unpruned decision trees (shown in Figure 1), in which each tree is grown from a bootstrap sample of the training dataset [43]. Bootstrapping is a resampling technique in which a number of bootstrap training sets are drawn randomly, with replacement, from the original training set. Each tree induced from a bootstrap sample is grown to full size, and the number of trees in the forest is adjustable. To classify an instance of unknown class label, each tree casts a unit classification vote, and the forest selects the classification with the most votes over all trees. Compared with a single decision tree classifier [48], random forests have better classification accuracy, are more tolerant to noise and are less dependent on the training datasets.
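To make the bootstrap-and-vote scheme concrete, the following is a minimal sketch in Python using scikit-learn's RandomForestClassifier. The paper does not specify an implementation, so the library, the parameters and the toy data here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the random-forest voting scheme with scikit-learn.
# The classifier, its parameters and the toy data are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))        # 200 windows, 9 features each
y = rng.integers(0, 2, size=200)     # 1 = disordered (P), 0 = ordered (N)

# Each unpruned tree is grown on a bootstrap sample of the training
# windows; an unknown instance is classified by a vote over all trees.
forest = RandomForestClassifier(n_estimators=50, bootstrap=True,
                                random_state=0)
forest.fit(X, y)

# predict_proba averages the trees' leaf probabilities; for fully grown
# trees with pure leaves this equals the fraction of trees voting "D".
print(forest.predict_proba(X[:5])[:, 1])
```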

Figure 1

A sample random forest. In the decision tree on the left, the node at the root tests an attribute, such as the first-order auto-correlation function of the normalized flexibility parameters (see below). If it is higher than a given threshold, the residue is in a disordered state (the right branch, labelled D); otherwise another input attribute is tested, and further tests are performed until a decision is made. A random forest can comprise hundreds of decision trees.

Features used in training and test

As a window of w aa slid along a sequence, six types of features were derived from the residues within the window, as defined and explained below.

1) Auto-correlation function of amino acid indices (AAIs)

Each residue in the training set was replaced with the value of a normalized amino acid index (AAI), a set of 20 numerical values representing a physicochemical or biological property of the 20 amino acids, chosen from the AAindex database (http://www.genome.ad.jp/dbget/aaindex.html) [44]. In this way, a sequence of N amino acids in the training set was first transformed into a numerical sequence [49, 50], denoted as:

$$P_1 P_2 \cdots P_i \cdots P_{i+w} \cdots P_N \qquad (1)$$

The numerical sequences were then smoothed with a Savitzky-Golay filter [51], and the Moreau-Broto auto-correlation function F_d of an AAI was calculated within a window, defined as:

$$F_d = \frac{1}{w-d} \sum_{i=1}^{w-d} p_i \times p_{i+d}, \qquad (d = 1, 2, \ldots, w-1) \qquad (2)$$

where w is the window size, and p_i and p_{i+d} are the AAI values at positions i and i+d, respectively [49, 50]. For example, when d = 1, the numerical value of each residue i in the window is multiplied by the value of the next residue (i+1), and F_1 is the average of these w−1 products. Similarly, F_2 is the average of the w−2 products generated from every other residue. The value of d represents the order of the correlation and was tuned to optimize the prediction performance. The F_d (d = 1, 2, ..., 30) for the 40 sets of AAIs listed in Table A1 in Additional file 2 were calculated and evaluated in training IUPforest-L, as illustrated in the sketch below.
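The sketch below computes F_d for one window in Python, with Savitzky-Golay smoothing via SciPy. The AAI lookup values and the smoothing parameters are placeholders for illustration, not the 40 normalized AAI sets actually used by IUPforest-L.

```python
# Sketch of the Moreau-Broto auto-correlation features of equation (2).
# The AAI values and Savitzky-Golay parameters below are placeholders.
from scipy.signal import savgol_filter

# Hypothetical normalized AAI values for a few residues.
AAI = {'A': 0.36, 'R': 0.53, 'N': 0.46, 'D': 0.51, 'G': 0.54, 'K': 0.47}

def autocorrelation(window_seq, d_max=15):
    """Return [F_1, ..., F_d_max] for one window of w residues."""
    w = len(window_seq)
    p = [AAI[aa] for aa in window_seq]                  # profile p_1..p_w
    p = savgol_filter(p, window_length=5, polyorder=2)  # smoothing [51]
    feats = []
    for d in range(1, d_max + 1):
        # F_d: mean product of index values d residues apart, eq. (2)
        f_d = sum(p[i] * p[i + d] for i in range(w - d)) / (w - d)
        feats.append(f_d)
    return feats

print(autocorrelation('ARNDGK' * 5 + 'A'))              # a 31 aa window
```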

2) The mean hydrophobicity, defined as the average value of Kyte and Doolittle's hydrophobicity [52] within the window.

3) The modified hydrophobic cluster [31], calculated as the length of the longest hydrophobic cluster in the window divided by the window size.

4) The mean net charge within the window and the local mean net charge within a 13 aa fragment centered at the middle residue. Residues K and R were assigned a charge of +1, D and E a charge of −1, and all other residues 0.

5) The mean contact number, defined as the mean expected number of contacts in the globular state of all residues within the window [53].

6) The composition of four reduced amino acid groups [48] and the Shannon entropy (K2) of the amino acid composition within the window. (A sketch computing several of these window features follows this list.)
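For reference, the sketch below computes three of the simpler window features in Python. Only a subset of the published Kyte-Doolittle scale [52] is shown, and the example window is arbitrary; the hydrophobic cluster and contact number features are omitted for brevity.

```python
# Sketch of window features 2), 4) and 6): mean hydrophobicity, mean net
# charge and Shannon entropy. The KD subset and window are illustrative.
import math
from collections import Counter

KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
      'E': -3.5, 'G': -0.4, 'K': -3.9, 'L': 3.8, 'S': -0.8}
CHARGE = {'K': 1, 'R': 1, 'D': -1, 'E': -1}   # all other residues are 0

def mean_hydrophobicity(window):              # feature 2)
    return sum(KD.get(aa, 0.0) for aa in window) / len(window)

def mean_net_charge(window):                  # feature 4), whole window
    return sum(CHARGE.get(aa, 0) for aa in window) / len(window)

def shannon_entropy(window):                  # feature 6), K2
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

w = 'ARNDGK' * 5 + 'A'                        # a 31 aa window
print(mean_hydrophobicity(w), mean_net_charge(w), shannon_entropy(w))
```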

IUPforest-L

A flow chart of IUPforest-L is shown in Figure 2. At the training stage, the features listed above were calculated as a window of w aa slid from the N-terminal end to the C-terminal end of a protein sequence. Each window was tagged with a label of disorder (positive, P) or order (negative, N) according to the label of its central residue, and IUPforest-L models were trained on the six types of features, with each tree in the forest producing a prediction. The final score combined the outcomes of all trees by voting, followed by smoothing [51]. A threshold that best classifies the ordered or disordered state of a residue was then defined from these scores and the optimal evaluated values in the 10-fold cross validation tests.

Figure 2

Flow chart of IUPforest-L. Sequence features were calculated as a window slid along a protein sequence; IUPforest-L models were trained on the six types of features, and each tree in the forest produced a prediction. The final score was the combination of the outcomes of all trees by voting.

During the prediction stage, the features were first calculated as a window slid over a query sequence, and a probability score of being disordered was then assigned to each residue by IUPforest-L. A region was annotated as disordered only when 30 or more consecutive amino acid residues were predicted to be disordered.
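This post-processing step can be sketched as follows; the score list, threshold and minimum region length mirror the description above, but the values are illustrative.

```python
# Sketch of the post-processing step: a region is annotated as disordered
# only if >= 30 consecutive residues score above the decision threshold.
def disordered_regions(scores, threshold=0.5, min_len=30):
    """Yield (start, end) indices (0-based, end exclusive) of long regions."""
    regions, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                       # open a candidate region
        elif s < threshold and start is not None:
            if i - start >= min_len:        # keep only regions of >= 30 aa
                regions.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions

# Example: 40 high-scoring residues flanked by ordered residues.
scores = [0.2] * 10 + [0.9] * 40 + [0.1] * 10
print(disordered_regions(scores))           # [(10, 50)]
```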

Evaluations

To estimate the generalization accuracy, 10-fold cross validation tests were conducted, in which 90% of the sequences in the training set were randomly selected for training and the remaining 10% were used for testing. The process was repeated over the entire dataset, and the final result was the average over the 10 runs. In addition, independent tests were performed on Hirose-ADS1 [37], Han-ADS1 and Peng-DB [35].

During the cross validation test, the confusion matrix, which comprises true positive (TP), false positive (FP), true negative (TN) and false negative (FN), was used to evaluate the prediction performance in terms of the following measures:

1) AUC, the area under the receiver operating characteristic (ROC) curve. Each point of a ROC curve is defined by a pair of values for the true positive rate ($\frac{TP}{TP+FN}$, or sensitivity) and the false positive rate ($\frac{FP}{TN+FP}$, or 1 − specificity).

2) Balanced overall accuracy

$$Bacc \equiv \frac{sensitivity + specificity}{2} \qquad (3)$$

3) S_product

$$S_{product} \equiv sensitivity \times specificity \qquad (4)$$

4) Matthews correlation coefficient (MCC)

$$MCC \equiv \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP) \times (TP+FN) \times (TN+FP) \times (TN+FN)}} \qquad (5)$$

5) S_w

$$S_w \equiv \frac{w_{disorder} \times TP - w_{order} \times FP + w_{order} \times TN - w_{disorder} \times FN}{w_{disorder} \times (TP+FN) + w_{order} \times (TN+FP)} \qquad (6)$$

where w_disorder and w_order are the weights for disorder and order, respectively, which are inversely proportional to the numbers of residues in the disordered and ordered states. S_w is also referred to as the probability excess [34].

The S_product and S_w scores were used to assess the prediction of disordered residues in the Critical Assessment of techniques for protein Structure Prediction (CASP6 and CASP7) [54].
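For reference, measures (3)–(6) can be computed from a confusion matrix as in the Python sketch below; the counts passed in the example are illustrative only. (AUC is computed from the full ROC curve rather than from a single confusion matrix.)

```python
# Sketch of evaluation measures (3)-(6) from a confusion matrix.
import math

def metrics(TP, FP, TN, FN):
    sens = TP / (TP + FN)                    # true positive rate
    spec = TN / (TN + FP)                    # true negative rate
    bacc = (sens + spec) / 2                 # eq. (3)
    sprod = sens * spec                      # eq. (4)
    mcc = (TP * TN - FP * FN) / math.sqrt(   # eq. (5)
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    # Weights inversely proportional to the residues per class, as defined
    # under eq. (6); S_w then reduces to sensitivity + specificity - 1.
    w_dis, w_ord = 1 / (TP + FN), 1 / (TN + FP)
    sw = ((w_dis * TP - w_ord * FP + w_ord * TN - w_dis * FN)
          / (w_dis * (TP + FN) + w_ord * (TN + FP)))
    return bacc, sprod, mcc, sw

# Illustrative counts only (76% sensitivity at 90% specificity).
print(metrics(TP=760, FP=100, TN=900, FN=240))
```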

Results

10-fold cross validation

The 10-fold cross validation test results using a window of 31 aa are shown in Figure 3. With the type 1 features (the auto-correlation functions of AAIs), a forest with more trees has better predictive ability. For example, the AUC increased by 2% when the number of trees increased from 10 to 50. However, the prediction accuracy increased only modestly when the number of trees increased further from 50 to 100, while the training and prediction times increased significantly. Detailed results on time consumption for forests of 10 to 300 trees are given in Additional file 3. The default setting of IUPforest-L is therefore a forest of 50 trees for large-scale applications.

Figure 3

ROC curves of 10-fold cross validation tests. The ROC curves of IUPforest-L in 10-fold cross validation tests are shown. IUPforest-L reached a 76% true positive rate at a 10% false positive rate, with MCC = 0.67, S_product = 0.64 and an area of 89.5% under the ROC curve, on the training dataset with a window of 31 aa.

With a forest of a fixed number of trees, the ROC curve trained with auto-correlation functions of order d between 1 and 15 almost overlaps the ROC curve trained with d between 1 and 30. This result indicates that correlations between residues up to 15 positions apart along the sequence are largely sufficient to determine whether a fragment is involved in a long disordered region.

Figure 3 shows that training with either the type 1 features or the combined type 2–6 features reached true positive rates of 70.5% and 70.0%, respectively, at a 10% false positive rate, while the combination of all type 1–6 features achieved a higher true positive rate of 76% and an area of 89.5% under the ROC curve. This result indicates that the type 1 and type 2–6 features carry redundant but complementary structural information. The type 2–6 features generate only nine parameters in total within a given window, while the type 1 features can generate hundreds of parameters that take into account both order information and physicochemical properties. It has been shown that the random forest model does not risk overfitting as the number of trees grows, even when the number of input parameters increases [43]. As such, using the type 1 features to train the random forest could extract more sequence-structure information [55], and it was thus conjectured that better prediction accuracy could be achieved with the auto-correlation functions generated from AAIs combined with the other features of types 2–6.

The window size and the step size for sliding the window are additional parameters for tuning the performance of IUPforest-L models. The window should be large enough for the AAI-based correlations to be significant, yet small enough to keep training and test times reasonable. Training with small windows increases training time and can introduce noise, whereas training with large windows can lose local information. Our tests indicated that for window sizes from 19 aa to 47 aa the random forest gave more stable results on the blind test set Han-ADS1, but the accuracy in the 10-fold cross validation tests on the training set dropped as the window size grew (details are listed in Additional file 4). For batch prediction of long disordered regions, the window size defaults to 31 aa to balance efficiency and accuracy. The step size for sliding windows also affects the accuracy and the overall time efficiency at both the training and test stages: too small a step introduces redundancy between windows and prolongs model training. Our experiments (details in Additional file 4) show that with a sliding step of 20 aa (the default setting), models achieve stable sensitivity without significantly prolonging the training process.

Blind tests

Figure 4 depicts the ROC curves for IUPforest-L and nine other publicly available predictors on the blind test dataset Hirose-ADS1, including the recently developed POODLE-L [37] and the well-established VSL2 [35]. IUPforest-L outperforms most of the other predictors in terms of the AUC in predicting long disordered regions, and at low false positive rates (< 10%) it achieves the highest sensitivity of all the predictors. In terms of the other performance measures listed in Table 1, IUPforest-L is also comparable to or better than the other predictors. Figure 5 and Table 2 compare IUPforest-L with POODLE-L and other predictors on Han-ADS1, and Figure 6 and Table 3 give the corresponding comparison on Peng-DB; in both cases IUPforest-L performs better than most of the other predictors.

Figure 4

ROC curves on test set Hirose-ADS1. The ROC curves for IUPforest-L and nine publicly available predictors on the blind test dataset Hirose-ADS1 are shown. IUPforest-L has the best performance in terms of the AUC.

Table 1 Comparison of IUPforest-L with other predictors on the test set Hirose-ADS1*
Figure 5

ROC curves on test set Han-ADS1. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Han-ADS1 are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.

Table 2 Comparison of IUPforest-L with other predictors on the test set Han-ADS1*
Figure 6

ROC curves on test set Peng-DB. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Peng-DB are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.

Table 3 Comparison of IUPforest-L with other predictors on the test set Peng-DB*

Discussion

Protein structures are stabilized by numerous intramolecular interactions, such as hydrophobic, electrostatic and van der Waals interactions and hydrogen bonds. The autocorrelation function tests whether the physicochemical property of one residue is independent of those of its neighbouring residues. A group of residues involved in ordered structure that is close in space to other groups of residues will be dynamically constrained by backbone or side-chain interactions with those residues, and hence the residues in both groups will show a higher density in the contact map, i.e. a higher pairwise correlation. On the other hand, a repetitive sequence of amino acids can also give significant positive correlation for all physicochemical properties. Therefore, residues within a fragment exhibiting higher autocorrelation may either be structurally constrained or have low sequence complexity. The random forest learning model employed by the IUPforest-L disorder predictor combines the complementary contributions of the autocorrelation functions (type 1 features) and the other types of features, so that structural information is extracted with a high degree of prediction accuracy.

The random forest model is an ensemble learning model and is known to be more robust to noise than many non-ensemble learning models. However, because a random forest classifier needs to load many decision trees into memory, it is relatively slow at predicting a single instance at a time. As a result, the current IUPforest-L web server is better suited to batch prediction of large numbers of protein sequences, providing a useful alternative tool for large-scale analysis of long disordered regions in proteomics. As an initial application, we provide the IUPforest-L server for batch analysis of protein sequences, with an overall summary and per-sequence details in the output. For convenience in proteomic comparisons, pre-calculated predictions for 62 eukaryotes linked to the European Bioinformatics Institute can also be downloaded from the server.

Conclusion

Studies of IUPs are important because disordered regions are common in proteins and functionally important. The new features, the auto-correlation functions of AAIs within a protein fragment, reflect both residue contact information and sequence complexity. A random forest model based on these new features together with other physicochemical features can effectively detect long disordered regions in proteins. On this basis, a new predictor, IUPforest-L, was developed to predict long disordered regions in proteins. Its high accuracy and high efficiency make it a useful tool for large-scale protein sequence analysis.