Background

Discovering the three-dimensional structure of a protein from its amino acid sequence by computational means is a challenging task and an open research problem in biological science and bioinformatics. Deciphering protein structure elucidates protein function. This has a profound impact on understanding the heterogeneity of proteins, protein-protein interactions and protein-peptide interactions, which in turn helps in drug design. A usual way to predict the structure of a protein is to first acquire proteins with known structures (e.g. by crystallography techniques) and then develop recognition techniques from their sequences. Thereafter, the developed techniques can be used to classify unknown protein sequences into one of the classes or folds. The length of a protein sequence (i.e., the number of amino acids in it) usually differs from one protein to another. However, two proteins with different lengths and low sequence similarity can be categorized into the same fold. The identification of protein folds from a protein sequence would bring us one step closer to the recognition of protein structures. A wide range of techniques have been developed over the past two decades to recognize protein folds. Despite numerous contributions and significant enhancements [1, 2], the protein fold recognition problem is yet to be completely solved.

The focus in protein fold recognition can be broadly classified into two categories: 1) the development of classifiers to improve fold recognition, and 2) the development of feature extraction techniques using alphabetical sequence (syntactical-based) and/or using physicochemical properties of the amino acids (attribute-based or physicochemical-based). For the former case, several classifiers have been developed or used including linear discriminant analysis [3], Bayesian classifiers [4], Bayesian decision rule [5], K-Nearest Neighbor [6, 7], Hidden Markov Model [8, 9], Artificial Neural Network [10, 11] and ensemble classifiers [1, 12]-[14]. For the latter case, several feature extraction techniques have been developed including composition, transition and distribution [15], occurrence [16], pairwise frequencies [17], pseudo-amino acid composition [18], bigrams [19], autocorrelation [6, 20, 21] and deriving features by considering more physicochemical properties [22].

Dubchak et al. [15] proposed syntactical and physicochemical-based features for protein fold recognition. To derive the physicochemical-based features, they used the following five attributes of amino acids: hydrophobicity (H), predicted secondary structure based on normalized frequency of α-helix (X), polarity (P), polarizability (Z) and van der Waals volume (V). The features proposed by Dubchak et al. [15] have been widely used in the field of protein fold recognition [4, 12, 22]-[28]. Apart from these 5 attributes, features have also been extracted by incorporating other attributes of the amino acids, such as solvent accessibility [29], flexibility [30], bulkiness [31], first and second order entropy [32] and the size of the side chain of the amino acids [22]. For protein fold recognition, attributes have usually been picked for feature extraction in an arbitrary way. In contrast, Taguchi and Gromiha [16] argued that features derived from attributes of amino acids can be ignored because they carry insufficient information, and that only syntactical-based features should be considered. This shows that the amino acid attributes have not been properly explored. This led us to pose the question: ‘which of the attributes of the amino acids are to be selected for the protein fold recognition problem?’ The answer would open a third category of research apart from 1) the development of classifiers, and 2) the development of feature extraction techniques based on the syntactic and/or physicochemical properties.

In this study, we develop a methodology for selecting the attributes of the amino acids for protein fold recognition in a systematic manner. To do this, a successive feature selection (SFS) technique based on an exhaustive greedy search algorithm can be applied [33, 34]. The SFS technique can find important features from a group of features. However, several features can be extracted from one attribute (e.g. composition, transition and distribution from the hydrophobicity of amino acids) and there can be many attributes, so selecting attributes requires selecting the multi-dimensional features belonging to each attribute as a group. Therefore, we develop a scheme to identify important attributes by investigating the multi-dimensional features corresponding to the attributes. For brevity, we call the proposed technique multi-dimensional SFS (MD-SFS).

We show two schemes of MD-SFS: backward elimination and forward selection. In the backward elimination scheme, the search for the best subset of attributes starts by retaining all the given attributes. Then, at each iteration, the attribute whose removal causes the minimum loss of information is discarded from the subset. This elimination of attributes is performed until all the attributes are ranked. This scheme is useful for finding attributes of low individual importance that could perform well if selected in an appropriate subset. In the forward selection scheme, the best attribute is selected first, and each subsequent attribute is included in the subset such that it improves the performance (e.g., in terms of classification) of the subset. This scheme, however, could be biased towards the highest ranking attribute.

Experiments are carried out using Dubchak’s (DD) dataset [25], Taguchi’s (TG) dataset (Taguchi and Gromiha, [16]) and the extended Ding and Dubchak (EDD) dataset [2]. The selection of physicochemical attributes by the MD-SFS technique improves protein fold recognition by around 18% to 24% on all the datasets when 10-fold cross-validation is applied. The MD-SFS technique is described in the next section and its usefulness is demonstrated in the subsequent sections.

Multi-dimensional successive feature selection

The MD-SFS scheme is illustrated in Figures 1 and 2: Figure 1 shows the backward-elimination procedure and Figure 2 shows the forward-selection procedure. The purpose of MD-SFS is to select the best attributes for protein fold recognition. In the figures, four attributes ($T_a = 4$) are depicted. A feature extraction technique is used to extract d-dimensional features from each attribute. Attributes are represented as $A_j$ (where $j = 1, 2, \ldots, T_a$) and the extracted features of $A_j$ are represented as $f_{1j}, f_{2j}, \ldots, f_{dj}$. In the figures, there are 4 levels in total, including the beginning state. The number of attributes at each level is denoted by $N_A$. The classification accuracy of a subset of attributes, obtained using k-fold cross-validation, is denoted by $H(\cdot)$ (Figure 2). The highest average classification accuracy using k-fold cross-validation at each level is denoted by $\alpha_l$, where $l = 0, 1, \ldots, T_a - 1$. The output is the ranked attributes.

Figure 1. Multi-dimensional successive feature selection: backward elimination scheme.

Figure 2. Multi-dimensional successive feature selection: forward selection scheme.

MD-SFS: backward elimination

For the backward-elimination case of MD-SFS (Figure 1), the group of features belonging to one attribute is dropped at each successive level. This gives subsets of attributes, each containing the corresponding features. The number of features in a subset at level $l$ is $(T_a - l)d$. A classifier is used to compute the average classification accuracy, using the k-fold cross-validation procedure, on each of the subsets. The subset of attributes with the highest average classification accuracy progresses to the next level. The size of the subset is reduced by $d$ features at each level. The process terminates when all the attributes are ranked. In Figure 1, at level 1, the highest average classification accuracy ($\alpha_1$) is obtained by attribute subset {A1, A2, A4}. It is also possible that more than one subset has the same average classification accuracy. In that case, all the subsets with the highest average classification accuracy progress to the next level. In Figure 1, subset {A1, A2, A4} progresses to level 2, and at this level the subset with the highest average classification accuracy ($\alpha_2$) is {A2, A4}. At level 3, the subset with the highest average classification accuracy ($\alpha_3$) is {A2}. In Figure 1, the ranked attributes are {A2, A4, A1, A3}, where A2 is the top-ranked attribute and A3 is the bottom-ranked or least important attribute. Furthermore, attributes can be selected according to two criteria. For instance, if we want to select the best 3 attributes for the design, then we can take {A2, A4, A1} from the ranked attributes. However, a better way would be to find the argument of the maximum of $\alpha_l$, i.e., $r = \arg\max_{l = 0, \ldots, T_a - 1} \alpha_l$. For instance, if $r = 2$, then subset {A2, A4} at level 2 exhibits the maximum accuracy among the selected subsets at all the levels. Therefore, the attributes of subset {A2, A4} can be selected for the design. We refer to the former criterion of selection as brute-n (where n is the number of attributes to be selected) and to the latter as the maximum accuracy (MA) based criterion.
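
The backward-elimination procedure and the two selection criteria can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study. The function names and the `score` callable are hypothetical: `score` stands in for $H(\cdot)$, i.e. the average k-fold cross-validation accuracy of a classifier trained on the features of the given attribute subset. As a simplification, when several subsets tie for the highest accuracy the sketch keeps only the first one, whereas the text lets all tied subsets progress.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical stand-in for H(.): average k-fold CV accuracy of an attribute subset.
Score = Callable[[FrozenSet[str]], float]


def md_sfs_backward(
    attributes: Sequence[str], score: Score
) -> Tuple[List[str], List[float], List[FrozenSet[str]]]:
    """Rank attributes by backward elimination (ties: first best subset only)."""
    current = frozenset(attributes)
    alphas = [score(current)]      # alpha_0: all attributes retained
    best_subsets = [current]       # best subset found at each level
    dropped: List[str] = []        # attributes in order of elimination
    while len(current) > 1:
        # Each candidate drops one attribute, i.e. its whole d-dimensional feature group.
        candidates = [(score(current - {a}), a) for a in sorted(current)]
        best_score, removed = max(candidates, key=lambda c: c[0])
        current = current - {removed}
        dropped.append(removed)
        alphas.append(best_score)
        best_subsets.append(current)
    # The last surviving attribute is top ranked; the first dropped is bottom ranked.
    ranked = list(current) + dropped[::-1]
    return ranked, alphas, best_subsets


def select_brute_n(ranked: Sequence[str], n: int) -> List[str]:
    """brute-n criterion: take the n top-ranked attributes."""
    return list(ranked[:n])


def select_max_accuracy(
    alphas: Sequence[float], best_subsets: Sequence[FrozenSet[str]]
) -> FrozenSet[str]:
    """MA criterion: take the subset at level r = argmax_l alpha_l."""
    r = max(range(len(alphas)), key=lambda l: alphas[l])
    return best_subsets[r]
```

In practice `score` would wrap feature extraction, a classifier and the cross-validation loop; it is kept abstract here so that the search logic can be read on its own.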

The MD-SFS backward elimination procedure would require approximately between ${}^{T_a + 1}C_2$ and $2^{T_a} - 1$ search combinations, where $T_a$ is the total number of attributes and ${}^{m}C_n$ is the number of n-combinations of m elements. If $t_s$ denotes the number of attributes in a subset $s$, then this subset has $t_s d$ features. Therefore, the computational complexity of a classifier doing classification using subset $s$ depends on $t_s d$ features.
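
The two bounds can be checked numerically. The sketch below rests on an assumption about how the lower bound arises: one evaluation of the full attribute set at level 0, followed by $T_a, T_a - 1, \ldots, 2$ candidate subsets at the later levels when no ties occur. The upper bound is the number of non-empty subsets of $T_a$ attributes. The function name is hypothetical.

```python
from math import comb
from typing import Tuple


def backward_search_bounds(t_a: int) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the number of subsets evaluated."""
    # Assumed tie-free case: 1 full set, then T_a, T_a - 1, ..., 2 candidates.
    lower = 1 + sum(range(2, t_a + 1))
    # This sum equals the binomial coefficient C(T_a + 1, 2).
    assert lower == comb(t_a + 1, 2)
    # Worst case: every non-empty subset of the T_a attributes is evaluated.
    upper = 2 ** t_a - 1
    return lower, upper
```

For the four attributes depicted in Figure 1 this gives 10 and 15 evaluations; for 30 attributes the tie-free count is 465, while the worst case exceeds one billion.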

MD-SFS: forward selection

For the forward-selection case of MD-SFS (Figure 2), one attribute, with its corresponding d-dimensional features, is taken at a time for computing the average classification accuracy using the k-fold cross-validation procedure. The attribute corresponding to the highest average classification accuracy is stored, i.e., $r_1 = \arg\max_{j = 1, \ldots, T_a} H(A_j)$. The selected attribute, with its features, goes to the next level. At the next level, the attribute that exhibits the highest average classification accuracy in combination with the attribute selected at the previous level, $A_{r_1}$, is retained. This process continues until all the attributes are ranked. The number of features used in computing the classification accuracy at level $l$ is $(l + 1)d$. Further, we can apply the same two criteria (brute-n and MA-based) for obtaining attributes from the ranked set of attributes, as discussed for the MD-SFS backward elimination approach.
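
A minimal Python sketch of the forward-selection procedure follows. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, `score` is a stand-in for $H(\cdot)$ (average k-fold cross-validation accuracy of a subset), and ties are broken by keeping the first best candidate.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical stand-in for H(.): average k-fold CV accuracy of an attribute subset.
Score = Callable[[FrozenSet[str]], float]


def md_sfs_forward(
    attributes: Sequence[str], score: Score
) -> Tuple[List[str], List[float]]:
    """Rank attributes by forward selection.

    Returns (ranked, alphas): ranked[l] is the attribute added at level l and
    alphas[l] is the accuracy of the subset ranked[:l + 1].
    """
    ranked: List[str] = []
    alphas: List[float] = []
    remaining = sorted(attributes)
    while remaining:
        # Try each remaining attribute together with those already selected.
        candidates = [(score(frozenset(ranked + [a])), a) for a in remaining]
        best_score, best_attr = max(candidates, key=lambda c: c[0])
        ranked.append(best_attr)
        remaining.remove(best_attr)
        alphas.append(best_score)
    return ranked, alphas


def select_max_accuracy_forward(
    ranked: Sequence[str], alphas: Sequence[float]
) -> List[str]:
    """MA criterion: keep the attributes up to the level with the highest accuracy."""
    r = max(range(len(alphas)), key=lambda l: alphas[l])
    return list(ranked[: r + 1])
```

The brute-n criterion is simply `ranked[:n]`. Because level $l$ has $T_a - l$ candidates, this loop calls `score` $T_a + (T_a - 1) + \ldots + 1$ times in total.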

The MD-SFS forward selection would require around $T_a(T_a + 1)/2$ search combinations, where $T_a$ is the total number of attributes. A subset $s$ with $t_s$ attributes would have $t_s d$ features. The computational complexity of a classifier used to compute classification accuracy would depend on these $t_s d$ features.

Methods

Dataset

In this study, three protein sequence datasets have been used: 1) DD-dataset [25], 2) TG-dataset (Taguchi and Gromiha, [16]) and 3) EDD-dataset [2]. The DD-dataset that we have used consists of 311 protein sequences in the training set, where any two proteins have no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set consists of 383 protein sequences with sequence identity of less than 40%. Both sets belong to 27 SCOP folds, which represent all major structural classes: α, β, α/β, and α + β [25]. The training set and test set have been merged into a single set of data in order to perform the k-fold cross-validation process.

The TG-dataset consists of 1612 protein sequences belonging to 30 different folding types of globular proteins. The names of the folds and the number of protein sequences in each of the 30 folds are described in Taguchi and Gromiha [16]. The protein sequences of the TG-dataset have first been transformed into their corresponding PSSM (position-specific scoring matrix) [35] sequences by using PSIBLAST (http://blast.ncbi.nlm.nih.gov/), with the cut-off E-value set to E = 0.001.

The EDD-dataset consists of 3418 proteins with less than 40% sequence similarity belonging to the 27 folds that were originally used in the DD-dataset. We extracted the EDD-dataset from SCOP 1.75 in a similar manner to Dong et al. [2] in order to study our proposed method using a larger number of samples.

Physicochemical attributes

In this study, 30 physicochemical attributes (see Endnote) have been utilized, including the 5 popular attributes used by Dubchak et al. [15]. The attributes with their corresponding symbols are listed in Table 1. The residues of amino acids of these 30 attributes are given in Table 2.

Table 1 Physicochemical attributes used in the study
Table 2 Residues of amino acids of the 30 attributes

Feature extraction

As discussed in the Background Section, there exist several feature extraction techniques. Given a classifier, the features derived from different feature extraction techniques would exhibit different fold recognition performances. Since in this paper the aim is not to find a feature extraction technique for a particular classifier, we use a simple autocorrelation of the residues of protein sequences. The expression for autocorrelation features used in the paper is given as follows:

$$
R_i = \frac{1}{N} \sum_{k=1}^{N-i} (s_k - \mu)(s_{k+i} - \mu), \tag{1}
$$

where $N$ is the length of the protein sequence, $s_k$ is the residue of the kth amino acid in the protein sequence and $\mu$ is the mean (or average) of the $N$ residues. In this work, we use $i = 1, 2, \ldots, 20$. Therefore, each protein sequence gives 20-dimensional autocorrelation features for each attribute.
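
The autocorrelation features can be computed as in the following Python sketch. It assumes the form $R_i = \frac{1}{N} \sum_{k=1}^{N-i} (s_k - \mu)(s_{k+i} - \mu)$, with the normalization by $N$ and the mean-centering read from the definitions above. The function name, the dictionary layout of the attribute table and the length check are illustrative choices, not taken from the paper.

```python
from typing import Dict, List


def autocorrelation_features(
    sequence: str, attribute: Dict[str, float], max_lag: int = 20
) -> List[float]:
    """Return [R_1, ..., R_max_lag] for one attribute of one protein sequence.

    `attribute` maps each one-letter amino acid code to its attribute value
    (one row of an attribute table such as Table 2; the layout is assumed).
    """
    values = [attribute[aa] for aa in sequence]   # s_1, ..., s_N
    n = len(values)
    if n <= max_lag:
        # Assumed guard: lags of N or more would give empty sums.
        raise ValueError("sequence must be longer than max_lag")
    mu = sum(values) / n                          # mean of the N residues
    centered = [v - mu for v in values]
    return [
        sum(centered[k] * centered[k + i] for k in range(n - i)) / n
        for i in range(1, max_lag + 1)
    ]
```

Calling this function once per attribute and concatenating the results for a subset of $t_s$ attributes yields the $t_s d$ features (with $d = 20$) on which a classifier is trained.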

Classifiers

In the literature, several classifiers have been used for the protein fold recognition problem. We used three techniques for classification: support vector machine (SVM), Naïve Bayes (NB) and linear discriminant analysis (LDA) with nearest centroid classifier [57]-[59]. SVM and NB classifiers are used from WEKA environment [60] by using WEKA’s default parameter settings.

Results and discussions

Five attributes used by Ding and Dubchak [25] are used as a benchmark. These attributes are H, P, Z, X and V (see Table 1 for the description of these symbols). In all the experiments we use a 10-fold cross-validation process to obtain the recognition performance. First we present in Table 3 the fold recognition using these 5 attributes on DD, TG and EDD datasets. It can be clearly observed that the highest fold recognition on DD-dataset obtained by HPZXV is 32.8%, on TG-dataset is 28.8% and on EDD-dataset is 38.4%.

Table 3 Protein fold recognition (shown in percentage) on all the datasets using HPZXV attributes used by Ding and Dubchak [25]

Next we apply the MD-SFS backward elimination approach on the DD-dataset, TG-dataset and EDD-dataset, respectively, in three cases: 1) using the top 10 attributes of the amino acids from Table 1, 2) using the top 15 attributes of the amino acids from Table 1, and 3) using all 30 attributes from Table 1. We use two criteria, brute-n and MA-based (as discussed in Section MD-SFS: backward elimination), to select the attributes. Since the results in Table 3 are reported using 5 attributes, we apply brute-5 to compare the results with those of Table 3. The selected attributes with their corresponding protein fold recognition (abbreviated as PFR in Tables 4, 5, 6, 7, 8, 9, 10 and 11) performance on the DD-dataset are given in Table 4 for the brute-5 criterion and in Table 5 for the MA-based criterion. The first row of results is for HPZXV (taken from Table 3). The first column indicates the number of attributes considered for attribute selection. The same setup has been used for all the remaining tables (Tables 6, 7, 8, 9, 10 and 11). It can be seen from Tables 4 and 5 that incorporating more attributes and then performing attribute selection helps in improving the recognition performance. By using only 5 attributes (Table 4), the recognition performance improves by 5.7% to 16.6% compared with the recognition performance of the HPZXV attributes. If the number of attributes is not fixed and selection is based on the MA criterion, then the improvement is between 14.1% and 18.1%.

Table 4 MD-SFS backward elimination approach on DD-dataset using brute-5 criterion
Table 5 MD-SFS backward elimination approach on DD-dataset using MA-based criterion
Table 6 MD-SFS backward elimination approach on TG-dataset using brute-5 criterion
Table 7 MD-SFS backward elimination approach on TG-dataset using MA-based criterion
Table 8 MD-SFS backward elimination approach on EDD-dataset using brute-5 criterion
Table 9 MD-SFS backward elimination approach on EDD-dataset using MA-based criterion
Table 10 MD-SFS forward selection approach on DD-dataset using brute-5 criterion
Table 11 MD-SFS forward selection approach on DD-dataset using MA-based criterion

A similar scheme has been applied using the TG-dataset and the results are reported in Tables 6 and 7 (Table 6 using brute-5 criterion and Table 7 using MA-based criterion). It can be observed from Table 6 that recognition performance has been improved between 7.5% and 10.2%. Also the improvement from Table 7 is between 12.6% and 18.1%.

We have also employed the EDD-dataset for the experiment and the results are reported in Tables 8 and 9 (Table 8 using brute-5 criterion and Table 9 using MA-based criterion). From Table 8, we note that the improvement in recognition performance is between 6.5% and 8.8%, and from Table 9, it is between 15.5% and 24.3%.

Subsequently we applied the MD-SFS forward selection approach on the DD, TG and EDD datasets. Again we use the brute-5 and MA-based criteria. The protein fold recognition performance on the DD-dataset is shown in Table 10 for the brute-5 criterion and in Table 11 for the MA-based criterion. It can be observed from Table 10 that by using only 5 attributes the recognition performance can be improved by between 4.5% and 12.2%. Similarly, the improvement using the MA-based criterion ranges from 13.3% to 17.7%.

On TG-dataset, MD-SFS forward selection with brute-5 criterion is depicted in Table 12 and with MA-based criterion is depicted in Table 13. The improvement from Table 12 using only 5 attributes is between 8.1% and 10.4%; and, from Table 13 we have improvement from 12.4% to 17.5%.

Table 12 MD-SFS forward selection approach on TG-dataset using brute-5 criterion
Table 13 MD-SFS forward selection approach on TG-dataset using MA-based criterion

Similarly, on EDD-dataset, MD-SFS forward selection with brute-5 criterion is shown in Table 14 and with MA-based criterion is shown in Table 15. The improvement from Table 14 using only 5 attributes is between 7.4% and 8.7%; and, from Table 15 we have improvement from 10.5% to 16.2%.

Table 14 MD-SFS forward selection approach on EDD-dataset using brute-5 criterion
Table 15 MD-SFS forward selection approach on EDD-dataset using MA-based criterion

From the results, we can deduce that physicochemical-based attributes are important for the prediction accuracy of protein folds. An appropriately selected subset of attributes could enhance the prediction accuracy significantly. The subsets of attributes selected for different datasets are different. The attributes in a subset also vary depending on the classifier used. However, some attributes repeatedly appear in the obtained subsets. For instance, the subset BPEVO is selected from all 30 attributes using the brute-5 criterion on the DD-dataset when LDA is used, and the subset BPDFM is selected when SVM is used (see Table 4). It can be observed that the attributes B and P are common to both subsets. This could imply that these attributes contain more discriminative information for protein fold recognition than others. When we analyzed all the subsets obtained using the brute-5 criterion on all three datasets (Tables 4, 6, 8, 10, 12 and 14), we found that the 5 most frequently occurring attributes are J (appeared 12 times), B (appeared 9 times), T (appeared 9 times), F (appeared 8 times) and M (appeared 6 times). Therefore, these attributes (J, B, T, F and M) can be seen as important attributes. However, this does not imply that a subset containing all 5 of these attributes would perform the best, as the performance of attributes in combination with other attributes is also crucial.

We have also carried out a statistical hypothesis test to establish the significance of the results achieved. To do this, we randomly selected m attributes from a given set of n attributes and computed the prediction accuracy using these m attributes. We repeated this random selection r times and computed the average prediction accuracy. All three classifiers (LDA, SVM and NB) are used for this purpose. We applied this test on all three benchmark datasets (DD, TG and EDD) and compared the results with those of the proposed schemes. In this test, we used m = 5, n = 30 and r = 20. The results are reported in Tables 16, 17 and 18. It can be observed from these tables that the prediction accuracy using the random selection approach is inferior to that of the proposed schemes. This indicates that systematically selecting attributes (using the MD-SFS procedures) contributed to the prediction accuracy of protein folds.
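
The random-selection baseline described above can be sketched as follows. This is a minimal illustration, assuming that each of the r repetitions draws m distinct attributes uniformly at random. The function name, the `seed` parameter and the `score` callable (a stand-in for the cross-validated prediction accuracy of a subset) are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, FrozenSet, Sequence


def random_subset_baseline(
    attributes: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    m: int = 5,
    r: int = 20,
    seed: int = 0,
) -> float:
    """Average accuracy over r random subsets of m attributes."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    return mean(
        # rng.sample draws m distinct attributes without replacement.
        score(frozenset(rng.sample(list(attributes), m)))
        for _ in range(r)
    )
```

The returned average can then be compared with the accuracy of the brute-5 subset chosen by MD-SFS for the same classifier and dataset.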

Table 16 Statistical analysis using DD-dataset
Table 17 Statistical analysis using TG-dataset
Table 18 Statistical analysis using EDD-dataset

Furthermore, we have carried out a paired t-test at the 5% significance level to study the statistical significance of the prediction accuracy obtained. We used the MD-SFS backward elimination method (using the brute-5 criterion) as a prototype and used all three classifiers (LDA, SVM and NB). For each classifier, we compared its results with those obtained for the HPZXV attributes on the DD, TG and EDD benchmarks (the number of degrees of freedom is 2). The paired t-test results for LDA, SVM and NB are 0.029, 0.003 and 0.004, respectively. These results show that the prediction accuracies obtained are significant.
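
For reference, the paired t statistic over matched accuracies can be computed as in the sketch below. It assumes one pair of accuracies per dataset (selected attributes versus HPZXV), so that three datasets give 2 degrees of freedom, as stated above. The function name is hypothetical, and the sketch returns only the statistic and the degrees of freedom. Converting these to a p-value requires the Student t distribution, which is available for instance through `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for the paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)              # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1
```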

We can summarize that the performance of the protein fold recognition improved when the attributes are appropriately selected. This also shows that physicochemical attributes can play an important role in protein fold recognition if selected appropriately. It should also be noted that the performance can be improved further by considering several other feature extraction techniques with sophisticated ensemble classifiers.

Conclusion

In this study, we have shown that selecting physicochemical attributes of amino acids improves protein fold recognition performance significantly. It is, therefore, beneficial to explore important attributes in the process of determining the three-dimensional structure of proteins. To do this, we have developed a multi-dimensional successive feature selection (MD-SFS) technique and demonstrated it with both backward elimination and forward selection approaches. There are several attributes available (e.g. a list of 544 attributes can be found in AAindex, http://www.genome.jp/aaindex/, [61]) and the investigation of these attributes by an exhaustive search would help in solving the problem better. Though it is always useful to explore as many attributes as possible, this comes at the expense of additional computational cost and memory requirements. Nonetheless, computationally efficient techniques for an exhaustive exploration of important attributes should be developed alongside feature extraction and classification techniques.

Endnote

a. Though a large number of physicochemical-based attributes are defined for amino acids, many authors (e.g. [31, 62]-[65]) have used a limited number of attributes (up to 8) in their studies. We attempted to study the attributes that were given more emphasis in the literature.