Background

Sequence-derived structural and physicochemical descriptors have frequently been used in machine learning prediction of protein structural and functional classes [1–5], protein-protein interactions [6–9], subcellular locations [10–16], peptides containing specific properties [17, 18], microarray data [19] and protein secondary structure [20]. These descriptors serve to represent and distinguish proteins or peptides of different structural, functional and interaction profiles by exploring their distinguishing features in compositions, correlations, and distributions of the constituent amino acids and their structural and physicochemical properties [2, 8, 21, 22]. There is thus a need to comparatively evaluate the effectiveness of these descriptor-sets for predicting different functional problems by using the same machine learning method and parameter optimization algorithm. Moreover, it is of interest to examine whether combined use of these descriptor-sets helps to improve predictive performance.

This work is intended to evaluate the effectiveness of a total of six individual descriptor-sets and four combination-sets (Table 1) in the prediction of several protein functional families using support vector machines (SVM). The six individual descriptor-sets and three of the combination-sets have been separately utilized in machine learning prediction of different protein functional and structural properties, all of which have shown impressive predictive performances [22–24]. The six individual sets are amino acid compositions [23] (Set D1), dipeptide compositions [24] (Set D2), normalized Moreau-Broto autocorrelation [25, 26] (Set D3), Moran autocorrelation [27] (Set D4), Geary autocorrelation [28] (Set D5), and the composition, transition and distribution of structural and physicochemical properties [2–6, 8, 17, 29, 30] (Set D6). The three combination-sets are quasi sequence order formed by weighted sums of amino acid compositions and physicochemical coupling correlations [10, 11, 18, 31] (Set D7), pseudo amino acid composition (PseAA) formed by weighted sums of amino acid compositions and physicochemical square correlations [23, 32] (Set D8), and the combination of amino acid compositions and dipeptide compositions (Set D9) [24, 33]. In this work, we also considered a fourth combination-set that combines descriptor-sets D1 through D8 (Set D10).

Table 1 Protein descriptors commonly used for predicting protein functional families.

The protein functional families studied here include enzyme EC2.4 [34–37], G protein-coupled receptors (GPCR) [38–40], transporter TC8.A [41], chlorophyll proteins [42], proteins involved in lipid synthesis [43], and rRNA-binding proteins. These six protein families were selected for testing the descriptor-sets based on their functional diversity, sample size and the range of reported family member prediction accuracies [2]. The reported prediction accuracies for these families are generally lower than those of other families [3], which makes them ideal for critically evaluating the effectiveness of these descriptor-sets: a lower accuracy should enable a better differentiation of the performance of the various descriptor classes. SVM was used as the machine learning method for predicting these functional families because it is a popular method that has consistently shown better performance than other machine learning methods [44, 45]. As this work is intended as a benchmarking study of the performance of various classes of descriptors, no attempt was made to optimize the prediction performance of any descriptor class or dataset by manually tuning the parameters, beyond the automatic optimization that is an integral part of the SVM programs, such as sigma value scanning. Hence, prediction results reported in this paper might differ from those of previously reported studies.

EC2.4 includes glycosyltransferases that catalyze the synthesis of glycoconjugates and are involved in post-translational modification of proteins (glycosylation). Increased levels of glycosyltransferases have been found in disease states and inflammation [46, 47]. TC8.A consists of auxiliary transport proteins that facilitate transport across membranes and play regulatory and structural roles [48]. GPCR represents G protein-coupled receptors that transduce signals for inducing cellular responses, and members of GPCR are of great pharmacological importance, as 50–60% of approved drugs elicit their therapeutic effect by selectively targeting members of the GPCR family [49–52]. Chlorophyll proteins are essential for harvesting solar energy in photosynthetic antenna systems [53]. Lipid synthesis proteins play central roles in such processes as metabolism, and deficiencies or altered functioning of lipid binding proteins are associated with disease states such as obesity, diabetes, atherosclerosis, hyperlipidemia and insulin resistance [54]. rRNA-binding proteins play central roles in the post-transcriptional regulation of gene expression [55, 56], and their binding capabilities are mediated by certain RNA binding domains and motifs [57–60].

Results and Discussion

The statistics of the six datasets are given in Table 2. Training and prediction statistics for each of the studied descriptor-sets are given in Table 3. Independent validation datasets were used to test the prediction accuracies. Among the 5-fold cross-validation test, independent dataset test and jackknife test, the jackknife is deemed the most rigorous [61]; however, conducting the jackknife test with SVM would have been very time-consuming, so as a compromise we adopted the independent dataset test. The program CDHIT [62–64] was used to remove redundancy at both 90% and 70% sequence identity so as to avoid bias. The datasets were subsequently tested again with the independent evaluation sets, and the statistics are given in Table 4. It should be emphasized that the performance evaluation for the studied descriptor-sets is based only on the datasets studied in this work, and the conclusions from this study might not be readily extended to other datasets.

Table 2 Summary of dataset statistics, including size of training, testing and independent evaluation sets, and average sequence length.
Table 3 Dataset training statistics and prediction accuracies of six protein functional families. DS refers to descriptor set, where D1 = amino acid composition; D2 = dipeptide composition; D3 = Moreau-Broto autocorrelation; D4 = Moran autocorrelation; D5 = Geary autocorrelation; D6 = composition, transition and distribution descriptors; D7 = quasi sequence order; D8 = pseudo amino acid composition; D9 = combination of D1+D2; and D10 = combination of D1-D8. Predicted results given as TP (true positive), FN (false negative), TN (true negative), FP (false positive), Sen (sensitivity), Spec (specificity), Q (overall accuracy) and MCC (Matthews correlation coefficient).
Table 4 Dataset statistics and prediction accuracies after homologous sequences removal (HSR) at 90% and 70% identity. DS refers to descriptor set, where D1 = amino acid composition; D2 = dipeptide composition; D3 = Moreau-Broto autocorrelation; D4 = Moran autocorrelation; D5 = Geary autocorrelation; D6 = composition, transition and distribution descriptors; D7 = quasi sequence order; D8 = pseudo amino acid composition; D9 = combination of D1+D2; and D10 = combination of D1-D8. Predicted results given as TP (true positive), FN (false negative), TN (true negative), FP (false positive), Sen (sensitivity), Spec (specificity), Q (overall accuracy) and MCC (Matthews correlation coefficient).

The performance of the ten descriptor-sets was ranked by the Matthews correlation coefficient (MCC) values of the respective SVM predictions of the six functional families, which are given in Table 5. The computed MCC scores for these descriptor-sets are in the range of 0.64 to 0.97 for all protein families studied. Accordingly, the performance of these descriptor-sets is categorized into two groups based on their MCC values: 'Exceptional' (>0.85) and 'Good' (≤0.85). Moreover, these descriptor-sets are aligned in the order of their MCC values, with "=" denoting equal values and ">" indicating that one is better than the other. It is noted that, as the differences between many of these MCC values are rather small, such alignment is likely superficial to some extent and may not best reflect the real ranking of performance. Overall, the performances of these descriptor-sets are not significantly different: there is no overwhelmingly preferred descriptor-set, and SVM prediction performance appears to be highly dependent on the dataset.

Table 5 Descriptor sets ranked and grouped by MCC (Matthews correlation coefficient), before and after removal of homologous sequences at 90% and 70% identity, respectively.

As shown in Table 3 and Table 4, for many of the studied datasets, the differences in prediction accuracies and MCC values between different descriptor-sets are small. In particular, for GPCR and rRNA-binding proteins, the results of almost all descriptor-sets are in the 'Exceptional' category. Examining the range of MCC values of the descriptor-sets for each of the studied protein families (after removal of 70% homologous sequences), the differences between the largest and smallest MCC values are, in order of increasing magnitude: 0.10, 0.12, 0.14, 0.16, 0.21 and 0.21 for rRNA-binding proteins, GPCR, TC8.A, lipid synthesis proteins, chlorophyll proteins and EC2.4 families respectively. Given that differences of 0.10 and 0.20 in MCC values translate to approximately 4% and 7% differences in overall prediction accuracy, this separation is indeed not large.

Though the dataset is a more important determinant of prediction performance than the choice of descriptor class, a few general trends could be observed. Three out of four of the combination-sets tend to exhibit slightly but consistently higher MCC values for the protein families studied in this work. These sets are Sets D8, D9 and D10. In contrast, only one out of six individual sets, Set D6, tends to exhibit slightly but consistently higher MCC values for the protein families studied in this work. Therefore, statistically speaking, it appears that the use of combination-sets tends to give slightly better prediction performance than the use of individual-sets.

When each class was examined individually in this study, we found that the combination of amino acid composition and dipeptide composition (Set D9) tends to give consistently better results than either of the individual descriptor-sets (Set D1 and Set D2). It has been reported that one drawback of amino acid composition descriptors is that the same amino acid composition may correspond to diverse sequences, as sequence order is lost [24, 33]. This sequence order information can be partially covered by considering dipeptide composition (Set D2). On the other hand, dipeptide composition lacks information concerning the fraction of the individual residues in the sequence; thus, a combination-set is expected to give better prediction results [24, 33, 65, 66].

Using all descriptor-sets (Set D10) generally, but not always, gives the best result, which is consistent with the findings on the use of molecular descriptors for predicting compounds of specific properties [67, 68]. For instance, Xue et al. found that feature selection methods are capable of reducing the noise generated by the use of overlapping and redundant molecular descriptors and, in some cases, improving the accuracy of SVM classification of pharmacokinetic behaviour of chemical agents [69]. In our study, for example, the three autocorrelation descriptor-sets (Sets D3, D4 and D5) all utilize the same physicochemical properties, differing only in the correlation algorithm. The use of all available descriptors likely results in the inclusion of partially redundant information, some of which may to some extent become noise that interferes with the prediction results or obscures relevant information. Based on the results of previous studies [69], it is possible that feature selection methods may be applied for selecting the optimal set of descriptors to improve prediction accuracy as well as computing efficiency for predicting protein functional families.

Conclusion

The effectiveness of ten protein descriptor-sets in the prediction of six protein functional families using SVM was evaluated. Consistent with previous work on chemical descriptors [67, 68, 70–76] and protein descriptors [4, 21, 30, 32, 35, 43, 77, 78], we found that the descriptor-sets evaluated in this paper, which comprise some of the commonly used descriptors, generally return good results and do not differ significantly. In particular, the use of combination descriptor-sets tends to give slightly better prediction performance than the use of individual descriptor-sets. While there seems to be no preferred descriptor-set that could be utilized for all datasets, as prediction results are highly dependent on datasets, the performance of protein classification may be enhanced by selection of optimal combinations of descriptors using established feature-selection methods [79, 80]. Incorporation of appropriate sets of physicochemical properties not covered by some of the existing descriptor-sets may also help improve the prediction performance.

Methods

Datasets

The datasets were obtained from SwissProt [81], except for TC8.A, which was downloaded from the Transport Classification Database (TCDB) [41]. These datasets were chosen for their functional diversity, sample size and the range of reported family member prediction accuracies. As SVM is essentially a statistical method, the datasets cannot be too small; yet it is also convenient for the purposes of this study that they not be so large as to be computationally unwieldy. These downloaded datasets were used to construct the positive dataset for the corresponding SVM classification system. A negative dataset, representing non-class members, was generated by a well-established procedure [2, 3, 21, 30] such that all proteins were grouped into domain families [82] in the PFAM database, and the representative proteins of these families unrelated to the protein family being studied were chosen as negative samples.

These proteins, positive and negative, were further divided into separate training, testing and independent evaluation sets by the following procedure. First, proteins were converted into descriptor vectors and then clustered using hierarchical clustering into groups in the structural and physicochemical feature space [83], where more homologous sequences have shorter distances between them, and the largest separation between clusters was set to a ceiling of 20. One representative protein was randomly selected from each group to form a training set that is sufficiently diverse and broadly distributed in the feature space. Another protein within the group was randomly selected to form the testing set. The selected proteins from each group were further checked to ensure that they are distinct from the proteins in other groups. The remaining proteins were then designated as the independent evaluation set, which was also checked to be at a reasonable level of diversity. Fragments, defined as sequences shorter than 60 residues, were discarded. This selection process ensures that the training, testing and evaluation sets constructed are sufficiently diverse and broadly distributed in the feature space. Although an analysis of the 'similar' proteins in each cluster showed that the majority of the proteins in a cluster are quite non-homologous, the program CDHIT (Cluster Database at High Identity with Tolerance) [62–64] was further used after the SVM model was trained to remove redundancy at both 90% and 70% sequence identity, so as to avoid bias as far as possible. CDHIT removes homologous sequences by clustering the protein dataset at a user-defined sequence identity threshold, for example 90%, and then generating a database of only the cluster representatives, thus eliminating sequences with greater than 90% identity. The statistical details are given in Tables 2 and 3.

Algorithms for generating protein descriptors

Ten sets of commonly used composition and physicochemical descriptors were generated from the protein sequence (see Table 1). These descriptors can be computed via the PROFEAT server [22].

Amino acid composition (Set D1) is defined as the fraction of each amino acid type in a sequence

$$f(r) = \frac{N_r}{N}, \tag{1}$$

where r = 1, 2, ..., 20, $N_r$ is the number of amino acids of type r, and N is the length of the sequence. Dipeptide composition (Set D2) is defined as

$$fr(r, s) = \frac{N_{rs}}{N - 1}, \tag{2}$$

where r, s = 1, 2, ..., 20, and $N_{rs}$ is the number of dipeptides composed of amino acid types r and s.
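Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the PROFEAT implementation; the function names and the alphabetical ordering of the 20 residues are assumptions made here, and residues outside the standard 20 are simply not counted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residue types


def amino_acid_composition(seq):
    """Set D1, Eq. (1): fraction of each amino acid type, N_r / N."""
    n = len(seq)
    if n < 1:
        raise ValueError("sequence must not be empty")
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Set D2, Eq. (2): fraction of each ordered dipeptide, N_rs / (N - 1)."""
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: c / (n - 1) for pair, c in counts.items()}


d1 = amino_acid_composition("ACCA")
d2 = dipeptide_composition("ACCA")
print(d1["A"], d2["AC"], len(d1), len(d2))
```

Concatenating the 20 values of Set D1 with the 400 values of Set D2 gives the 420-dimensional combination of amino acid and dipeptide compositions (Set D9).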

Autocorrelation descriptors are a class of topological descriptors, also known as molecular connectivity indices, that describe the level of correlation between two objects (protein or peptide sequences) in terms of their specific structural or physicochemical property [84]; they are defined based on the distribution of amino acid properties along the sequence [85]. Eight amino acid properties are used for deriving the autocorrelation descriptors: hydrophobicity scale [86]; average flexibility index [87]; polarizability parameter [88]; free energy of amino acid solution in water [88]; residue accessible surface areas [89]; amino acid residue volumes [90]; steric parameters [91]; and relative mutability [92].

These autocorrelation properties are normalized and standardized such that

$$P_r' = \frac{P_r - \bar{P}}{\sigma}, \tag{3}$$

where $\bar{P}$ is the average value of a particular property over the 20 amino acids. $\bar{P}$ and σ are given by

$$\bar{P} = \frac{\sum_{r=1}^{20} P_r}{20}, \tag{4}$$

and

$$\sigma = \sqrt{\frac{1}{20} \sum_{r=1}^{20} \left(P_r - \bar{P}\right)^2}. \tag{5}$$

Moreau-Broto autocorrelation descriptors (Set D3) [84, 93] are defined as

$$AC(d) = \sum_{i=1}^{N-d} P_i P_{i+d}, \tag{6}$$

where d = 1, 2, ..., 30 is the lag of the autocorrelation, and $P_i$ and $P_{i+d}$ are the properties of the amino acids at positions i and i+d respectively. After applying normalization, we get

$$ATS(d) = \frac{AC(d)}{N-d}. \tag{7}$$
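The standardization of Eqs. (3) to (5) and the normalized Moreau-Broto autocorrelation of Eqs. (6) and (7) can be sketched as follows. The Kyte-Doolittle hydropathy values are used only as an illustrative stand-in for one property scale and are an assumption of this sketch; the eight scales actually used are those cited above. The function names are likewise hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values: an illustrative stand-in (assumption),
# not necessarily the hydrophobicity scale used in this work.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Eqs. (3)-(5): subtract the mean and divide by the population
    standard deviation taken over the amino acids in the scale."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / sigma for aa, v in scale.items()}


def moreau_broto(seq, scale, max_lag=30):
    """Eqs. (6)-(7): ATS(d) = AC(d) / (N - d) for d = 1..max_lag."""
    if len(seq) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    p = [scale[aa] for aa in seq]
    n = len(p)
    result = []
    for d in range(1, max_lag + 1):
        ac = sum(p[i] * p[i + d] for i in range(n - d))  # Eq. (6)
        result.append(ac / (n - d))                      # Eq. (7)
    return result


std_scale = standardize(KYTE_DOOLITTLE)
print(moreau_broto("ACDEFGHIKLMNPQRSTVWY" * 2, std_scale)[:3])
```

With eight property scales and 30 lags each, a construction of this kind yields 240 values per sequence.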

Moran autocorrelation descriptors (Set D4) [94] are calculated as

$$I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} \left(P_i - \bar{P}\right)\left(P_{i+d} - \bar{P}\right)}{\frac{1}{N} \sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2}, \tag{8}$$

where d, $P_i$ and $P_{i+d}$ are defined in the same way as for Moreau-Broto autocorrelation and $\bar{P}$ is the average of the considered property P along the sequence:

$$\bar{P} = \frac{\sum_{i=1}^{N} P_i}{N}. \tag{9}$$

Geary autocorrelation descriptors (Set D5) [95] are written as

$$C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} \left(P_i - P_{i+d}\right)^2}{\frac{1}{N-1} \sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2}, \tag{10}$$

where d, $\bar{P}$, $P_i$ and $P_{i+d}$ are defined as above. Comparing the three autocorrelation descriptors: while Moreau-Broto autocorrelation uses the property values as the basis for measurement, Moran autocorrelation utilizes property deviations from the average values, and Geary autocorrelation utilizes the square-difference of property values instead of vector-products (of property values or deviations). The Moran and Geary autocorrelation descriptors measure spatial autocorrelation, which is the correlation of a variable with itself through space.
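The contrast between the Moran and Geary forms is easiest to see on a short numeric example. The sketch below implements Eqs. (8) to (10) for a list of per-residue property values that have already been looked up from a standardized scale; the function names and the toy input are assumptions made for illustration.

```python
def _mean_and_sum_of_squares(p):
    """Return the sequence mean (Eq. 9) and the sum of squared deviations."""
    mean = sum(p) / len(p)
    return mean, sum((x - mean) ** 2 for x in p)


def moran(p, d):
    """Eq. (8): lag-d covariance of deviations over the sequence variance."""
    n = len(p)
    mean, ss = _mean_and_sum_of_squares(p)
    if not 0 < d < n or ss == 0:
        raise ValueError("need 0 < d < len(p) and a non-constant property")
    num = sum((p[i] - mean) * (p[i + d] - mean) for i in range(n - d)) / (n - d)
    return num / (ss / n)


def geary(p, d):
    """Eq. (10): mean squared lag-d difference over the sample variance."""
    n = len(p)
    _, ss = _mean_and_sum_of_squares(p)
    if not 0 < d < n or ss == 0:
        raise ValueError("need 0 < d < len(p) and a non-constant property")
    num = sum((p[i] - p[i + d]) ** 2 for i in range(n - d)) / (2 * (n - d))
    return num / (ss / (n - 1))


# Toy alternating profile: neighbours differ, residues two apart are identical.
profile = [1.0, -1.0, 1.0, -1.0]
print(moran(profile, 1), moran(profile, 2))
print(geary(profile, 1), geary(profile, 2))
```

For this alternating profile the Moran value is negative at lag 1 and positive at lag 2, while the Geary value is large at lag 1 and zero at lag 2, which illustrates that the two descriptors encode related information on different scales.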

The descriptors in Set D6 comprise the composition (C), transition (T) and distribution (D) features of seven structural or physicochemical properties along a protein or peptide sequence [5, 29]. The seven physicochemical properties [2, 5, 29] are hydrophobicity; normalized Van der Waals volume; polarity; polarizability; charge; secondary structures; and solvent accessibility. For each of these properties, the amino acids are divided into three groups such that those in a particular group are regarded as having approximately the same property. For instance, residues can be divided into hydrophobic (CVLIMFW), neutral (GASTPHY), and polar (RKEDQN) groups. C is defined as the number of residues with that particular property divided by the total number of residues in a protein sequence. T characterizes the percent frequency with which residues with a particular property are followed by residues of a different property. D measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids with a particular property are located respectively. There are 21 elements representing these three descriptors: 3 for C, 3 for T and 15 for D, and the protein feature vector is constructed by sequentially combining the 21 elements for all of these properties and the 20 residues, resulting in a total of 188 dimensions.
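The 21 elements for a single property can be sketched as below, using the hydrophobicity grouping quoted above. Several details are assumptions of this sketch and not statements about the original implementation: all values are expressed as fractions (not percentages), a transition is counted in either direction between two groups, and the residue marking a given percentage is chosen by rounding up.

```python
import math

# Three-group division for hydrophobicity, as given in the text.
HYDROPHOBICITY_GROUPS = {
    "hydrophobic": "CVLIMFW",
    "neutral": "GASTPHY",
    "polar": "RKEDQN",
}


def ctd(seq, groups):
    """Return 3 composition, 3 transition and 15 distribution values."""
    lookup = {aa: name for name, members in groups.items() for aa in members}
    labels = [lookup[aa] for aa in seq]
    n = len(labels)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    names = list(groups)

    # C: fraction of residues that fall in each group.
    comp = [labels.count(g) / n for g in names]

    # T: fraction of adjacent pairs that switch between two given groups.
    trans = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            pair = {names[a], names[b]}
            switches = sum(1 for x, y in zip(labels, labels[1:]) if {x, y} == pair)
            trans.append(switches / (n - 1))

    # D: relative chain position of the first, 25%, 50%, 75% and 100%
    # residue of each group (rounding up is an assumption of this sketch).
    dist = []
    for g in names:
        positions = [i + 1 for i, x in enumerate(labels) if x == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
                continue
            k = max(1, math.ceil(frac * len(positions)))
            dist.append(positions[k - 1] / n)
    return comp + trans + dist


features = ctd("CRGC", HYDROPHOBICITY_GROUPS)
print(len(features), features[:6])
```

Repeating this for each property and concatenating the results gives the CTD portion of the feature vector.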

The quasi-sequence order descriptors (Set D7) [96] are derived from both the Schneider-Wrede physicochemical distance matrix [10, 18, 97] and the Grantham chemical distance matrix [31], between each pair of the 20 amino acids. The physicochemical properties computed include hydrophobicity, hydrophilicity, polarity, and side-chain volume. Similar to the descriptors in Set D6, sequence order descriptors can also be used for representing amino acid distribution patterns of a specific physicochemical property along a protein or peptide sequence [18, 31]. For a protein chain of N amino acid residues $R_1 R_2 \ldots R_N$, the sequence order effect can be approximately reflected through a set of sequence order coupling numbers

$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \tag{11}$$

where $\tau_d$ is the d-th rank sequence order coupling number (d = 1, 2, ..., 30) that reflects the coupling mode between all of the most contiguous residues along a protein sequence, and $d_{i,i+d}$ is the distance between the two amino acids at positions i and i+d. For each amino acid type, the type 1 quasi sequence order descriptor can be defined as

$$X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{30} \tau_d}, \tag{12}$$

where r = 1, 2, ..., 20, $f_r$ is the normalized occurrence of amino acid type r and w is a weighting factor (w = 0.1). The type 2 quasi sequence order is defined as

$$X_d = \frac{w\,\tau_{d-20}}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{30} \tau_d}, \tag{13}$$

where d = 21, 22, ..., 50. The combination of these two equations gives us a vector that describes a protein: the first 20 components reflect the effect of the amino acid composition, while the components from 21 to 50 reflect the effect of sequence order.
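A minimal sketch of Eqs. (11) to (13) follows. The Schneider-Wrede and Grantham matrices are not reproduced here, so the distance between two residues is supplied by the caller as a function; the `toy_distance` shown is a hypothetical placeholder, and the function names are assumptions of this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residue types


def quasi_sequence_order(seq, distance, max_lag=30, w=0.1):
    """Return the 20 type 1 values (Eq. 12) followed by the
    max_lag type 2 values (Eq. 13)."""
    n = len(seq)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # Eq. (11): coupling number for each lag d.
    tau = [
        sum(distance(seq[i], seq[i + d]) ** 2 for i in range(n - d))
        for d in range(1, max_lag + 1)
    ]
    # Normalized occurrence of each amino acid type.
    f = [seq.count(aa) / n for aa in AMINO_ACIDS]
    denom = sum(f) + w * sum(tau)
    type1 = [fr / denom for fr in f]
    type2 = [w * t / denom for t in tau]
    return type1 + type2


def toy_distance(a, b):
    """Hypothetical placeholder: 0 for identical residues, 1 otherwise.
    A real run would look the pair up in a published distance matrix."""
    return 0.0 if a == b else 1.0


vector = quasi_sequence_order("ACAC", toy_distance, max_lag=2)
print(len(vector), sum(vector))
```

Because both parts share the same denominator, the components of the resulting vector sum to one; with the default of 30 lags the vector has the 50 components described above.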

Similar to the quasi-sequence order descriptor, the pseudo amino acid descriptor (Set D8) is made up of a 50-dimensional vector in which the first 20 components reflect the effect of the amino acid composition and the remaining 30 components reflect the effect of sequence order, except that the coupling number $\tau_d$ is now replaced by the sequence order correlation factor $\theta_\lambda$ [32]. The set of sequence order correlated factors is defined as follows:

$$\theta_\lambda = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda} \Theta\left(R_i, R_{i+\lambda}\right), \tag{14}$$

where $\theta_\lambda$ is the λ-th tier correlation factor that reflects the sequence order correlation between all of the λ-most contiguous residues along a protein chain (λ = 1, ..., 30) and N is the number of amino acid residues. $\Theta(R_i, R_j)$ is the correlation factor and is given by

Θ ( R i , R j ) = 1 3 { [ H 1 ( R j ) H 1 ( R i ) ] 2 + [ H 2 ( R j ) H 2 ( R i ) ] 2 + [ M ( R j ) M ( R i ) ] 2 } , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqHyoqucqGGOaakcqWGsbGudaWgaaWcbaGaemyAaKgabeaakiabcYcaSiabdkfasnaaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeyypa0ZaaSaaaeaacqaIXaqmaeaacqaIZaWmaaWaaiWaaeaadaWadaqaaiabdIeainaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaemOuai1aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqGHsislcqWGibasdaWgaaWcbaGaeGymaedabeaakiabcIcaOiabdkfasnaaBaaaleaacqWGPbqAaeqaaOGaeiykaKcacaGLBbGaayzxaaWaaWbaaSqabeaacqaIYaGmaaGccqGHRaWkdaWadaqaaiabdIeainaaBaaaleaacqaIYaGmaeqaaOGaeiikaGIaemOuai1aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqGHsislcqWGibasdaWgaaWcbaGaeGOmaidabeaakiabcIcaOiabdkfasnaaBaaaleaacqWGPbqAaeqaaOGaeiykaKcacaGLBbGaayzxaaWaaWbaaSqabeaacqaIYaGmaaGccqGHRaWkdaWadaqaaiabd2eanjabcIcaOiabdkfasnaaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeyOeI0Iaemyta0KaeiikaGIaemOuai1aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkaiaawUfacaGLDbaadaahaaWcbeqaaiabikdaYaaaaOGaay5Eaiaaw2haaiabcYcaSaaa@7006@
(15)

where $H_1(R_i)$, $H_2(R_i)$ and $M(R_i)$ are the hydrophobicity [98], hydrophilicity [99] and side-chain mass of amino acid $R_i$, respectively. Before being substituted into the above equation, each original property value $P^0(i)$ is converted to a standardized value $P(i)$ over the 20 amino acids,

$$P(i) = \frac{P^0(i) - \sum_{i=1}^{20} \frac{P^0(i)}{20}}{\sqrt{\dfrac{\sum_{i=1}^{20} \left[ P^0(i) - \sum_{i=1}^{20} \frac{P^0(i)}{20} \right]^2}{20}}}$$
(16)
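As a minimal sketch of Eqs. (14)-(16), the Python functions below standardize a property scale, evaluate Θ for a residue pair, and average it over all residue pairs that are λ positions apart. The function names, the two-letter toy alphabet and its property values are assumptions made purely for illustration; an actual calculation would use the published hydrophobicity [98], hydrophilicity [99] and side-chain mass values of the 20 amino acids.

```python
import math

def standardize(raw):
    """Eq. (16): subtract the mean and divide by the population standard
    deviation, both taken over every residue type in `raw`."""
    values = list(raw.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / std for aa, v in raw.items()}

def big_theta(a, b, props):
    """Eq. (15): mean squared difference between residues a and b over the
    supplied standardized property scales (three scales in the paper)."""
    return sum((p[b] - p[a]) ** 2 for p in props) / len(props)

def theta(seq, lam, props):
    """Eq. (14): average of big_theta over all residue pairs lam apart."""
    n = len(seq)
    if not 0 < lam < n:
        raise ValueError("lam must satisfy 0 < lam < len(seq)")
    total = sum(big_theta(seq[i], seq[i + lam], props) for i in range(n - lam))
    return total / (n - lam)

# Toy scales over a hypothetical two-letter alphabet (NOT real property values).
toy_scales = [
    {"A": 1.0, "B": 3.0},    # stands in for hydrophobicity
    {"A": 0.0, "B": 4.0},    # stands in for hydrophilicity
    {"A": 10.0, "B": 30.0},  # stands in for side-chain mass
]
props = [standardize(scale) for scale in toy_scales]
thetas = [theta("ABAB", lam, props) for lam in (1, 2)]
print(thetas)
```

In the toy sequence every neighbouring pair differs, so the factor for λ = 1 is large, whereas residues two positions apart are identical and the factor for λ = 2 vanishes.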

This definition of sequence order correlation [Eqs. (14), (15)] introduces more correlation factors of physicochemical effects than the coupling number [Eq. (11)], and has been shown to improve the way sequence order information is represented [32, 35, 100]. Thus, for each amino acid type, the first part of the vector is defined as

$$X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{30} \theta_j},$$
(17)

where r = 1, 2, ..., 20, $f_r$ is the normalized occurrence of amino acid type r and w is a weighting factor (w = 0.1), and the second part is defined as

$$X_d = \frac{w \, \theta_{d-20}}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{30} \theta_j}, \qquad d = 21, \ldots, 50.$$
(18)
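The two parts in Eqs. (17) and (18) share one denominator, so the resulting 50 components sum to one. The sketch below assembles such a vector from 20 normalized occurrences and 30 correlation factors; the function name `pse_aa_vector` and the uniform input values are hypothetical and serve only to make the arithmetic easy to follow.

```python
def pse_aa_vector(freqs, thetas, w=0.1):
    """Assemble the descriptor of Eqs. (17)-(18).

    freqs  : 20 normalized amino acid occurrences f_r
    thetas : 30 sequence-order correlation factors theta_1 .. theta_30
    w      : weighting factor (0.1 in the text)
    """
    denom = sum(freqs) + w * sum(thetas)
    first = [f / denom for f in freqs]        # Eq. (17), components 1..20
    second = [w * t / denom for t in thetas]  # Eq. (18), components 21..50
    return first + second

# Illustrative inputs only: uniform composition, all factors equal to 1.
example = pse_aa_vector([0.05] * 20, [1.0] * 30)
print(len(example), sum(example))
```

With these inputs the denominator is 1 + 0.1 × 30 = 4, so a quarter of the total weight falls on the composition part and three quarters on the correlation part; the weight w controls this balance.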

Support Vector Machines (SVM)

As the SVM algorithms have been extensively described in the literature [2, 3, 101], only a brief description is given here. In the case of a linear SVM, a hyperplane is constructed that separates two classes of feature vectors with a maximum margin. One class represents the positive samples, for example EC2.4 proteins, and the other the negative samples. The hyperplane is constructed by finding a vector $\mathbf{w}$ and a parameter b that minimize $\|\mathbf{w}\|^2$ subject to the following conditions: $\mathbf{w} \cdot \mathbf{x}_i + b \geq 1$ for $y_i = 1$ (positive class) and $\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$ for $y_i = -1$ (negative class). Here $\mathbf{x}_i$ is a feature vector, $y_i$ is the group index, $\mathbf{w}$ is a vector normal to the hyperplane, $|b| / \|\mathbf{w}\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|^2$ is the squared Euclidean norm of $\mathbf{w}$. In the case of a nonlinear SVM, feature vectors are projected into a high-dimensional feature space by using a kernel function such as $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2}$. The linear SVM procedure is then applied to the feature vectors in this feature space.
After the determination of $\mathbf{w}$ and b, a given vector $\mathbf{x}$ can be classified by using $\mathrm{sign}[(\mathbf{w} \cdot \mathbf{x}) + b]$, a positive or negative value indicating that the vector $\mathbf{x}$ belongs to the positive or negative class, respectively.
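To make the two ingredients above concrete, the following sketch evaluates the Gaussian kernel and the linear decision rule sign[(w·x) + b] in plain Python. It does not train an SVM: the values of w, b and σ are hypothetical, and assigning a score of exactly zero to the positive class is an arbitrary choice made here, not something specified in the text.

```python
import math

def rbf_kernel(xi, xj, sigma):
    """Gaussian kernel K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def linear_decision(w, b, x):
    """Return +1 or -1 according to the sign of w.x + b.
    A score of exactly zero is mapped to +1 (arbitrary tie-break)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def satisfies_margin(w, b, x, y):
    """Check the training constraint y * (w.x + b) >= 1 for one sample."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1

# Hypothetical, hand-picked parameters for illustration.
w, b = [1.0, -1.0], 0.0
print(linear_decision(w, b, [2.0, 1.0]))           # sample on the positive side
print(satisfies_margin(w, b, [2.0, 1.0], 1))       # lies on or outside the margin
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0))
```

The kernel equals 1 for identical vectors and decays towards 0 as they move apart, with σ setting how quickly that decay happens.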

As a discriminative method, the performance of SVM classification can be assessed by measuring the true positives TP (correctly predicted positive samples), false negatives FN (positive samples incorrectly predicted as negative), true negatives TN (correctly predicted negative samples), and false positives FP (negative samples incorrectly predicted as positive) [4, 102, 103]. As the numbers of positive and negative samples are imbalanced, the positive prediction accuracy or sensitivity $Q_p$ = TP/(TP+FN) and the negative prediction accuracy or specificity $Q_n$ = TN/(TN+FP) [101] are also introduced. The overall accuracy is defined as Q = (TP+TN)/(TP+FN+TN+FP). However, in some cases, Q, $Q_p$ and $Q_n$ are insufficient to provide a complete assessment of the performance of a discriminative method [102, 104]. Thus the Matthews correlation coefficient (MCC) was used in this work to evaluate the randomness of the prediction:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}},$$
(19)

where MCC ∈ [-1, 1], with a negative value indicating disagreement of the prediction and a positive value indicating agreement. A zero value means the prediction is completely random. Because the MCC uses all four basic elements of the accuracy (TP, TN, FP and FN), it provides a better summary of the prediction performance than the overall accuracy.
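The measures defined above can be computed directly from the four counts, as in the sketch below. The function name and the example counts are invented for illustration and are not results from this work; returning an MCC of 0 when the denominator vanishes is a common convention assumed here, and both classes are assumed to be non-empty.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Sensitivity Q_p, specificity Q_n, overall accuracy Q and MCC (Eq. 19).
    Assumes at least one positive and one negative sample."""
    q_p = tp / (tp + fn)
    q_n = tn / (tn + fp)
    q = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC = 0 if any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Qp": q_p, "Qn": q_n, "Q": q, "MCC": mcc}

# Invented, strongly imbalanced example: 10 positives, 990 negatives.
metrics = classification_metrics(tp=5, fn=5, tn=985, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the overall accuracy Q is 0.99 although only half of the positive samples are recovered; the MCC of about 0.49 reflects that weakness, which is why it is the more informative summary for imbalanced data.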