Background

Determining the function of a protein is one of the most challenging problems of the post-genomic era. Various techniques have been developed for predicting protein function using information derived from sequence similarity, clustering patterns of co-regulated genes, protein interactions, etc. To understand the function of a protein, it is important to understand its interactions with other proteins or ligands. Nucleotides are among the most important ligands that interact with proteins. Prediction of protein-nucleotide interactions can be divided into two categories: (I) short nucleotide interactions, where a short nucleotide (mono-, di- or trinucleotide) interacts with a protein, and (II) polynucleotide interactions, where a polynucleotide (DNA/RNA) interacts with a protein.

Many proteins use small nucleotides (adenine and guanine nucleotides) as a source of energy and as signaling molecules inside the cell. The flavin and nicotinamide nucleotides act as electron donors/acceptors in many cellular metabolic reactions. FAD is a redox cofactor involved in several important metabolic reactions. Living organisms mostly generate energy from glucose or fat molecules, and both metabolic pathways are regulated by enzymes whose prosthetic group is FAD. Thus, identification of FAD interacting residues (FIRs) is very important in molecular recognition. Despite tremendous progress in protein annotation, no online tool is currently available for predicting FIRs from primary sequence. Experimental identification of FIRs is difficult, time-consuming and costly. Absorption spectra can readily show whether FAD interacts with a protein, but they cannot identify which residues are FIRs.

In the past, a large number of tools have been developed for the prediction of polynucleotide (DNA/RNA) interacting residues using different machine learning techniques [1-7]. In contrast, only one method is available for the prediction of small nucleotide-protein interactions, developed by Saito et al. [8]. They developed a method for small nucleotide binding site prediction using an empirical score approach and detected 40% of FAD binding sites correctly. The method of Saito et al. gives only an idea of the binding site and provides no information about the FAD interacting residues. Kallberg et al. [9] used simple sequence information in a Hidden Markov Model to develop a method for identifying Rossmann folds and predicting their coenzyme specificity (NAD, NADP, FAD), and found that FAD was the least preferred cofactor. These studies suggest that FAD interacting residues cannot be predicted easily. Thus, the development of a computational method for predicting FIRs in a protein from its amino acid sequence is very important for the functional annotation of proteins.

In this work, a systematic attempt has been made to predict FIRs in protein sequences using binary patterns and PSSM profiles of 5172 FIRs and non-FIRs from 198 non-redundant protein chains. In the first step, FAD binding protein (FBP) chains were analyzed; SVM models were then developed using binary patterns of FIRs. Because evolutionary information has previously been observed to be more informative than the sequence alone, we also developed SVM models using PSSM profiles obtained from PSI-BLAST [10]. All models developed in this study were evaluated using the five-fold cross-validation technique. FADPred can directly predict FAD interacting residues using the binary pattern of a sequence and its evolutionary information. Our server will be useful for experimental biologists working on flavoproteins/flavoenzymes.

Methods

Dataset

In the first step of data collection, we used the SuperSite documentation [11] to extract 675 PDB IDs of proteins in contact with FAD in the PDB. We downloaded the sequences of all chains of these PDB IDs using the web download section of the PDB. In the next step, we submitted these PDB IDs to Ligand Protein Contact (LPC) [12] and obtained a total of 1539 chains that interact with FAD, together with their interacting residues and positions. We then removed redundant chains sharing more than 40% similarity using CD-HIT [13], which left 198 interacting chains containing a total of 5172 FIRs; all remaining residues are non-FIRs. In this study we used 5172 FIRs and 5172 non-FIRs for developing our models. The sequences of these 198 FBPs, with their PDB IDs and chain names, are freely available [14]; FIRs are shown in lowercase and non-FIRs in uppercase.
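The case-based annotation described above (lowercase for FIRs, uppercase for non-FIRs) can be turned into per-residue labels with a few lines of code. The following is only a minimal sketch: the function name `labels_from_case` and the example fragment are hypothetical and are not taken from the published dataset.

```python
def labels_from_case(annotated_seq):
    """Split a case-annotated chain into a plain sequence and 0/1 labels.

    Lowercase letters are treated as FIRs (label 1) and uppercase
    letters as non-FIRs (label 0), following the dataset convention.
    """
    sequence = annotated_seq.upper()
    labels = [1 if ch.islower() else 0 for ch in annotated_seq]
    return sequence, labels


# Hypothetical annotated fragment, used only for illustration.
seq, labels = labels_from_case("MKgaGSAvT")
print(seq)
print(labels)
```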

Five-fold cross-validation

The five-fold cross-validation technique was used to evaluate the performance of all models developed in this study. In this technique, the dataset is randomly divided into five sets, each consisting of a nearly equal number of interacting and non-interacting patterns. Four of the five sets are used for training and the remaining set for testing. This process is repeated five times so that each set is used once for testing. The final performance is obtained by averaging the performance over all five sets.
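One way to realize such a partition is sketched below. It is an illustration under stated assumptions rather than the procedure actually used: the function name `five_fold_splits`, the fixed random seed and the round-robin assignment within each class are all choices made here for clarity.

```python
import random


def five_fold_splits(labels, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for n-fold cross-validation.

    Indices of each class are shuffled and dealt round-robin into the
    folds, so every fold holds a nearly equal number of interacting
    and non-interacting patterns.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test
```

Each pattern index appears in exactly one test set; the per-fold scores would then be averaged to give the final performance.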

Pattern or window size

We generated overlapping patterns of 17 residues for each FAD interacting chain sequence. If the central residue of a pattern was an FIR, the pattern was classified as a positive (FIR) pattern; otherwise it was termed a non-interacting (negative) pattern. Here we followed an approach similar to that adopted by Kaur and Raghava [15-17] for the prediction of turns in protein sequences. In addition to the 17-residue window, we also generated patterns of 15 and 19 residues. Only unique residue patterns were used for binary and PSSM pattern generation. In total we obtained 4896, 4974 and 4974 unique patterns for interacting residues with the 15-, 17- and 19-residue windows respectively, and randomly picked an equal number of non-interacting patterns as negative data.
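The pattern generation step can be sketched as follows. Two details are assumptions made for this illustration and are not stated in the text: chain termini are padded with the dummy residue X so that every residue can sit at the centre of a window, and duplicates are removed separately within the positive and the negative set. The function names are hypothetical.

```python
import random


def make_patterns(sequence, labels, window=17, pad="X"):
    """Return one (pattern, label) pair per residue, centred on that residue.

    Termini are padded with a dummy residue (assumption) so that the
    first and last residues also get full-length windows.
    """
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + window], labels[i]) for i in range(len(sequence))]


def balanced_unique(patterns, seed=0):
    """Keep unique positive patterns and sample an equal number of negatives."""
    positives = sorted({p for p, y in patterns if y == 1})
    negatives = sorted({p for p, y in patterns if y == 0})
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives, sampled
```

Calling `make_patterns` with `window=15` or `window=19` gives the two alternative window sizes.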

Support Vector Machine (SVM)

The support vector machine (SVM) [18], an excellent machine learning technique, was used for the prediction of FIRs. All SVM models were developed using the freely available package SVM_light [19]. The SVM is particularly attractive for biological sequence analysis because of its ability to handle noise, large datasets and large input spaces. Further details about SVM can be obtained from Vapnik's paper [20]. The software allows users to run SVM with a range of parameters and to select built-in kernel functions, including linear, polynomial and radial basis function (RBF) kernels.
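SVM_light reads training and test examples from a plain-text file with one example per line, written as a target value followed by `index:value` pairs with indices starting at 1. A minimal sketch of writing feature vectors in that format is shown below; the helper name `to_svmlight_line` is hypothetical, and omitting zero-valued features is a convenience of the sparse format rather than something described in the text.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM_light input line.

    The target is +1 for an FIR pattern and -1 otherwise; feature
    indices are 1-based and zero-valued features are skipped.
    """
    target = "+1" if label == 1 else "-1"
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)


# Small illustrative vector, not real data.
print(to_svmlight_line(1, [0, 1, 0, 0.5]))
```

A file of such lines would then be passed to the package's `svm_learn` program for training and to `svm_classify` for prediction, with the kernel and its parameters chosen on the command line.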

Evolutionary information

Evolutionary information was obtained from the position-specific scoring matrix (PSSM) generated by a PSI-BLAST search against the non-redundant (nr) database [21] of protein sequences. The PSSM was generated by three iterations of searching, with an e-value cutoff of 0.001 for inclusion of sequences in the next iteration. The generated PSSM contains the probability of occurrence of each type of amino acid at each position, along with insertions/deletions. Hence, the PSSM is considered a measure of residue conservation at a given location. The evolutionary information for each residue is thus encapsulated in a vector of 20 dimensions, one for each of the 20 standard amino acids, so the PSSM of a protein with N residues has size 20 × N. We normalized each value to the 0-1 range using the equation:

Where val is the PSSM score and Val is its normalized value.
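To illustrate how normalized PSSM values can be assembled into a fixed-length SVM input, the sketch below scales each score and concatenates the rows that fall inside a window. It rests on explicit assumptions: the normalization equation is not reproduced in this text, so a logistic function is used here purely as one common way of mapping scores into the 0-1 range, and window positions beyond the chain termini are filled with zeros. Both function names are hypothetical.

```python
import math


def normalize_pssm(pssm):
    """Map each raw PSSM score into (0, 1) with a logistic function.

    The logistic form is an assumption made for this sketch; the
    matrix is given as one row of 20 scores per residue.
    """
    return [[1.0 / (1.0 + math.exp(-val)) for val in row] for row in pssm]


def window_features(norm_pssm, center, window=17):
    """Concatenate the 20-value rows of a window centred on one residue.

    Positions outside the chain are filled with zeros (assumption).
    """
    half = window // 2
    feats = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(norm_pssm):
            feats.extend(norm_pssm[pos])
        else:
            feats.extend([0.0] * 20)
    return feats
```

With a 17-residue window this yields 17 × 20 = 340 features per pattern.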

Figures of merit

The performance of the constructed modules was evaluated using the five-fold cross-validation technique. The following threshold-dependent parameters were used to assess the performance of the method: sensitivity (Sn), or percent coverage of FIRs, is the percentage of FIRs predicted as FIRs; specificity (Sp), or percent coverage of non-interacting residues, is the percentage of non-FIRs predicted as non-FIRs; and overall accuracy (Ac) is the percentage of correctly predicted interacting and non-interacting residues. These parameters, together with the Matthews correlation coefficient (MCC), can be calculated using the following equations:

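$$Sn = \frac{TP}{TP + FN} \times 100 \qquad (1)$$
$$Sp = \frac{TN}{TN + FP} \times 100 \qquad (2)$$
$$Ac = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (3)$$
$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (4)$$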

Here, TP is the number of correctly predicted FIRs, TN is the number of correctly predicted non-FIRs, FP is the number of non-FIRs predicted as FIRs and FN is the number of FIRs wrongly predicted as non-FIRs. A Matthews correlation coefficient (MCC) of 1 is regarded as a perfect prediction, whereas 0 corresponds to a completely random prediction. We also calculated the area under the curve (AUC) of the ROC plot, which is a threshold-independent parameter.
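As a minimal sketch, the threshold-dependent measures can be computed from the four counts using their standard definitions, and the AUC can be estimated as the fraction of positive-negative score pairs that are ranked correctly. The function names are hypothetical, and returning an MCC of 0 when the denominator vanishes is a convention adopted here, not one stated in the text.

```python
import math


def figures_of_merit(tp, tn, fp, fn):
    """Return sensitivity, specificity and accuracy (in percent) and MCC."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    ac = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc


def auc(scores_pos, scores_neg):
    """Area under the ROC curve as the probability that a positive
    example outscores a negative one (ties count as one half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

The pairwise AUC estimate is quadratic in the number of patterns, which is acceptable for a sketch but would be replaced by a rank-based computation for large datasets.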

Description of web server

The prediction method described in this paper is implemented as a web server, FADPred [22]. The common gateway interface (CGI) script of FADPred is written in Perl version 5.03. The FADPred server is installed on a Sun Server (420E) under a UNIX (Solaris 7) environment. It is a user-friendly web server that allows users to submit a protein sequence in two ways: by browsing for and uploading a FASTA sequence file, or by typing or pasting a FASTA sequence into the box on the submission page. The server allows users to predict FAD binding residues using both binary pattern and PSSM based SVM models, with a threshold ranging from -1 to +1. Users can select either the binary pattern or the PSSM option and can also receive the results by e-mail. The default method is PSSM with a threshold of 0.0, at which sensitivity and specificity were found to be roughly equal during five-fold cross-validation. The prediction result is presented in graphical form, with predicted FIRs and non-FIRs displayed in different colors. With the default PSSM option, predicting the FAD interacting residues of a protein takes several minutes.
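The role of the threshold can be illustrated with a short sketch that converts per-residue SVM scores into calls. This is not the server's code: the function name `call_firs` is hypothetical, treating a score equal to the threshold as an FIR is an assumption, and the lowercase/uppercase output simply mirrors the dataset convention in place of the server's colored display.

```python
def call_firs(sequence, scores, threshold=0.0):
    """Mark residues whose score reaches the threshold as FIRs.

    Predicted FIRs are returned in lowercase and predicted non-FIRs
    in uppercase; the threshold must lie between -1 and +1.
    """
    if not -1.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between -1 and +1")
    return "".join(
        res.lower() if score >= threshold else res.upper()
        for res, score in zip(sequence, scores)
    )


# Hypothetical scores, used only for illustration.
print(call_firs("MKGA", [0.3, -0.2, 0.0, -1.1]))
```

Raising the threshold makes the calls more specific at the cost of sensitivity, and lowering it has the opposite effect.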

Results

Compositional analysis

We calculated the percentage composition of interacting and non-interacting residues and found that Gly, Tyr and Ser were more abundant among FAD interacting residues than among non-interacting residues (Figure 1). The dominance of these residues suggests that they play a vital role in FAD interaction.

Figure 1 Percentage composition of interacting and non-interacting residues.

SVM model using binary pattern of amino acid sequence

In our study we generated multiple 17-residue windows for negative and positive patterns. These sequence patterns were converted to binary patterns, where a pattern of length N was represented by a vector of dimension N × 21 and each amino acid in the pattern was represented by a vector of 21 values (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), corresponding to the 20 amino acids and one dummy amino acid X. As shown in Table 1, this SVM-based model gave a maximum MCC of 0.39 with 69.65% accuracy at the minimum difference between sensitivity and specificity. The threshold at which sensitivity and specificity are nearly equal is shown in bold. Similarly, we achieved an accuracy of 70.31% with an MCC of 0.41 for the 15-residue window patterns and an accuracy of 70.49% with an MCC of 0.41 for the 19-residue window patterns. The AUCs are 0.769, 0.773 and 0.770 for the 15-, 17- and 19-residue window models respectively (Table 2). The performance of the models was evaluated at the residue level.
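The binary encoding can be sketched as below. The ordering of the 21 symbols is an assumption (alphabetical one-letter codes followed by X, which places Ala first as in the example given in the text), as is the mapping of any non-standard letter to X; the names `ALPHABET` and `binary_encode` are hypothetical.

```python
# 20 standard amino acids plus the dummy residue X (ordering assumed).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def binary_encode(pattern):
    """Encode a residue pattern as a flat 0/1 vector of length N x 21."""
    vector = []
    for residue in pattern:
        one_hot = [0] * len(ALPHABET)
        symbol = residue if residue in ALPHABET else "X"
        one_hot[ALPHABET.index(symbol)] = 1
        vector.extend(one_hot)
    return vector


# A 17-residue pattern gives 17 x 21 = 357 features.
print(len(binary_encode("A" * 17)))
```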

Table 1 The performance of the SVM model using binary patterns. Bold values indicate the threshold at which sensitivity and specificity are equal, or differ least, with the highest MCC.
Table 2 SVM parameters and AUC for our best models. The SVM parameters are d (in the polynomial kernel), g (in the RBF kernel), c (trade-off between training error and margin) and j (cost factor).

SVM model using evolutionary information

Several studies have shown that evolutionary information obtained from multiple sequence alignments provides more comprehensive information about a protein than a single sequence [1, 6, 15-17]. In this study, the evolutionary information obtained from PSSMs generated by PSI-BLAST was used for predicting FIRs. As shown in Table 3, performance increased significantly when the PSSM was used as input instead of the single sequence. A maximum MCC of 0.62 was achieved with 80.82% accuracy using evolutionary information. Similarly, we achieved an accuracy of 80.29% with an MCC of 0.61 for the 15-residue window patterns and an accuracy of 80.39% with an MCC of 0.61 for the 19-residue window patterns. The AUCs are 0.878, 0.904 and 0.876 for the 15-, 17- and 19-residue window models respectively (Table 2).

Table 3 The performance of the SVM model using evolutionary information. Bold values indicate the threshold at which sensitivity and specificity are equal, or differ least, with the highest MCC.

Discussion

Because of the vital roles of FAD binding proteins in cellular metabolism and the difficulty of identifying FIRs in vitro by biophysical methods, there is an urgent need for computational methods that identify FAD binding sites from the amino acid sequence of a protein. To this end, we made a systematic attempt to develop a highly accurate and robust method for predicting FAD binding residues in protein sequences. To our knowledge, no online method previously existed for the prediction of FIRs from primary sequence. We therefore first developed a method for predicting FIRs using the sequences of FBPs. For this study we collected the PDB IDs of FAD binding proteins from SuperSite, the FASTA sequences from the PDB and the FAD interacting residues using LPC. We then analyzed the FIRs and found a significant difference in the interacting residues as well as in the flanking residues.

Some earlier studies have reported that SVM performs better than other artificial intelligence (AI) techniques in interacting residue prediction. We first developed an SVM model based on binary patterns of the amino acid sequence. Manish et al. 2008 showed that evolutionary information performs better than simple sequence information in interacting residue prediction. We therefore used evolutionary information in the form of PSSM profiles as input for the SVM model and assessed the overall performance of FIR prediction. The SVM parameters for each model, with their AUCs, are given in Table 2. An obvious question is whether BLAST alone could be used to predict FIRs. We therefore also attempted to predict FAD interacting residues using BLAST and obtained very poor performance (data not shown). We also provide the public with direct access to our prediction method through the web server FADPred, which allows users to predict FAD interacting residues in their protein sequences.

Conclusion

In this study, for the first time, a highly robust method has been developed to predict FAD interacting residues from protein sequence using an AI technique, SVM. This study demonstrates that the PSSM based method performs better than the simple sequence based method. We also observed that the 17-residue window patterns perform better than the 15- and 19-residue window patterns (Table 2, Figure 2). This study will be helpful for biologists in proteome annotation. One of the major advantages of this study is that we have developed a free web server, FADPred. Our web server allows users to identify FAD interacting residues in a given sequence using the model trained on our dataset.

Figure 2 ROC plot for the binary and PSSM models with window sizes of 15, 17 and 19.