Background

A protein individually utilizes only a limited range of functionality present in its natural amino acid side chains, and the catalytic activity of many enzymes requires the involvement of a small-molecule that acts as a co-factor. These are required in almost all important metabolic pathways because they are specialized in certain types of reaction. One particular cofactor can be involved in several pathways and, conversely, several cofactors can be required in one particular pathway[1, 2]. Many vitamins have diverse biochemical functions but they are primarily known to assist enzyme-substrate reactions by playing the role of an enzyme cofactor[3, 4]. Some vitamins have hormone-like function as regulators of mineral metabolism (e.g. vitamin D), or regulators of cell and tissue growth and differentiation (e.g. some forms of vitamin A). The function of vitamin D as anti-infectious and anti-inflammatory is well-established[5, 6] and other functions as antioxidants (e.g. vitamin E and sometimes vitamin C). The majority of vitamins (e.g. B complex vitamins) function as precursors of enzyme cofactor that helps enzyme in their work as catalysts in metabolism[7].

As most vitamin biosynthetic pathway enzymes are not present in mammals and present in many of the pathogens[8], these enzymes have become attractive drug targets in several disease including tuberculosis[8, 9] and malaria[10, 11]. Several investigators have targeted Ornithine decarboxylase (ODC) for different diseases like African trypanosomiasis, Pneumocystis carinii pneumonia, ischemia, autoimmune diseases and hyperplasia[12]. Nonetheless, many groups are targeting Serine hydroxyl-methyltransferase (SHMT) as antitumor target knowing that enhanced levels of SHMT activity have been found in rapidly proliferating tumor cells[13]. A constitutive ODC activity observed in cancer cells, where its uncontrolled expression confers a cancer phenotype to the cells so ODC has been targeted in antitumor drugs[14]. In past, several studies have been done to identify the cofactor binding cleft and interacting residues in various enzymes. Pyridoxal 5'-phosphate (PLP)-dependent enzymes like 3,4-dihydroxyphenylalanine decarboxylase (DDC)[15, 16], Cystathionine beta-synthase (CBS)[17], 8-amino-7-oxononanoate synthase[18], Aminobutyrate aminotransferase[19], ODC and SHMT etc. have been investigated in various studies for identification of PLP and substrate interacting residues. These studies helped them to investigate the underlying mechanism and develop strategies for inhibitor designing. Similarly enzymes involved in folate (Vit-B9) metabolism such as Dihydropteroate synthase[20], Dihydrofolate synthase[21] and thiamin (Vit-B1) pathway[22] like Pyruvate dehydrogenase[23] and Oxoglutarate dehydrogenase[24] have also been taken as drug targets. In addition, binding of PLP also inhibits the activity of aminoacyl-tRNA synthetases[25]. Therefore, computational tool for the prediction of PLP and other vitamin-interacting site is highly desirable.

The advancement of genome sequencing produces huge amount of sequence data but reliable in silico annotation of these sequences still remains a challenge. There are several prediction tools available for the functional annotation of proteins. Broadly, the existing computational method can be divided in two categories; (i) protein level prediction, where function of whole protein is predicted[26-28] and (ii) residue level prediction where function of each residue in a protein is predicted[29-31]. The protein level prediction provides overall function of protein whereas residue level predictions are advancement over protein level and provides the information of functional residues. The residue level predictions mainly deal with prediction of interaction with other proteins, DNA, RNA and ligands. There are various methods to predict different interacting residues from the structure of protein but the major challenge is to predict interacting residues when only protein sequence is known. Several prediction methods have been developed for carbohydrates[32, 33], lipids[34, 35], DNA[29, 36-39] and RNA[30, 38, 40] interacting residues in protein sequence. Some methods have been developed for specific ligands such as ATP[41, 42], GTP[43], NAD[44], FAD[45] and mannose[46].

In this study, preliminary investigations revealed differential binding patterns of vitamins and other small-molecules. These differential patterns suggested that each ligand has specific residual preference for their binding with protein. Therefore, it becomes important to develop vitamin-specific interacting residue prediction methods. In this study, we developed different models for the sequence-based prediction of vitamin-interacting residues (VIRs), vitamin-A interacting residues (VAIRs), vitamin-B interacting residues (VBIRs) and PLP-interacting residues (PLPIRs). We utilized various classifiers and finally selected Support Vector Machines (SVMs) for developing the prediction models. SVM is a very powerful machine learning technique, which has been used for developing various bioinformatics methods in the past[38, 47-50]. It has been shown that the evolutionary information provided more information[40, 43, 45] than protein sequence, therefore we applied evolutionary information in the form of Position-Specific Scoring Matrix (PSSM) profile for developing a prediction method. This vitamin binding site prediction will be very useful for the study of enzyme activity and further advancement of drug development technologies.

Results

Analysis of protein-binding patterns of various ligands

It is important to analyze protein-binding patterns of different ligands in order to understand binding specificity of each ligand. Previously published datasets of different ligand-binding patterns for example ATP, GTP, NAD, FAD and mannose, were used to look at the preference of interacting residues. We analyzed the ligand-binding patterns for ATP (Additional file1: Figure S1), GTP (Additional file1: Figure S2), NAD (Additional file1: Figure S3), FAD (Additional file1: Figure S4) and mannose (Additional file1: Figure S5) with the help of Two Sample Logo (TSL) (See all Figures in Additional file1). It was observed that each ligand preferentially interacted with different residues of proteins. The ATP, GTP, NAD, FAD and mannose preferred the residues {Gly, Arg, Lys, Ser, His}, {Gly, Lys, Thr, Ser, Asp, Asn}, {Thr, Gly, Tyr}, {Gly, Tyr, Trp} and {Tyr, Asp, Trp, Asn, Glu}, respectively. The non-preferred residues were {Leu, Ala, Pro, Glu, Val}, {Leu, Glu, Ile, Met, Val}, {Leu, Glu, Ala, Lys}, {Glu, Asp, Lys, Ala, Pro} and {Leu, Val, Ile} for the ATP, GTP, NAD, FAD and mannose ligands respectively. We further analyzed and observed that significant differences were also present in the neighboring residues surrounding these preferred and non-preferred sets. This suggests the existence of different binding pockets for each small molecule ligand in the proteins. In order to predict these potentially differing binding pockets, there should be ligand specific binding site tools.

Analysis of different protein-interacting residues of different vitamin classes

After analysis of various ligand-protein interactions, we compared vitamins-interacting patterns with other ligands and found that significant differences were present. The Tyr, Phe, Ser, Trp, Thr, Gly and His are preferred as VIRs whereas Glu, Ala, Pro, Leu, Lys, Gln, Val and Asp are non-preferred. We analyzed amino acid compositions of the vitamin binding protein residues grouped by the sub-class to which the binding protein belonged: VIRs, VAIRs, VBIRs and PLPIRs (Figure 1). The interacting site of Vitamin A, Vitamin B and PLP preferred {Phe, Ile, Trp, Tyr, Leu, Val}, {Ser, Tyr, Gly, Thr, His, Trp, Asn, Glu} and {Ser, Thr, Gly, His, Tyr, Asn} whereas the non-preferred residues were {Glu, Pro, Asp, Asn, Ser, Arg, Gln}, {Leu, Glu, Ala, Pro, Val, Ile, Lys} and {Leu, Glu, Ala, Pro, Val, Ile, Ala} respectively. This implies that differences do exist at the protein-vitamin interaction sites even within vitamins sub-classes.

Figure 1
figure 1

Comparative average percent amino acids composition of VIRs, non-VIRs, VAIRs, VBIRs and PLPIRs.

In this study, we initially developed a model for the prediction of vitamin-interacting residues and then further classified VIRs into vitamin A, vitamin B and pyridoxal-5-phosphate (vitamin B6; PLP) interacting residues. Four different types of prediction methods were developed, one for each of the interacting residues: VIRs, VAIRs, VBIRs and PLPIRs. All the models developed in this study were evaluated using five-fold cross validation technique. In all cases, we used 10 times more negative instances than positive instances.

Prediction of vitamin-interacting residues (VIRs)

Here we developed the comprehensive prediction method for all VIRs. By generating sliding patterns and creating Two Sample Logo, we found that Phe, Gly, His, Ser, Thr, Trp and Tyr were more abundant in VIRs as compared to non-VIRs (See Additional file1: Figure S6). These patterns were converted into binary patterns and different kernels/parameters of SVM were employed to optimize the discrimination power between VIR and non-VIR patterns. We achieved 68.57% sensitivity, 64.88% specificity, 65.22% accuracy and 0.20 MCC. Preferences for neighboring amino acids between VIRs and non-VIRs patterns were also observed in the TSL (See Additional file:1 Figure S6). Thereafter, evolutionary information obtained from PSI-BLAST was used for the discrimination between VIRs and non-VIRs. Applying different machine learning algorithms of WEKA revealed that IBk method achieved maximum 50.70% sensitivity, 96.91% specificity, 92.71% accuracy and 0.52 MCC. SVM achieved highest 0.53 MCC with 52.19% sensitivity, 96.79% specificity and 92.73% accuracy. At the −0.8 thresholds level SVM achieved 78.52% sensitivity, 78.61% specificity, 78.60% accuracy and 0.37 MCC. Performances of all applied classifiers are provided in Table 1. As shown in Receiver Operating Curve (ROC) graph, binary (SVM), PSSM (IBk) and PSSM (SVM) achieved 0.74, 0.74 and 0.87 Area under curve (AUC) values, respectively (Figure 2). The performance increased significantly when PSSM was used as input instead of the binary patterns approach.

Table 1 Prediction performance of different classifiers for vitamin-interacting residues (VIRs)
Figure 2
figure 2

The ROC plot of the performance of different approaches for prediction of VIRs.

Prediction of vitamin A interacting residues (VAIRs)

We also developed prediction method for the VAIRs. The TSL of sliding patterns showed that Phe, Ile, Leu, Val and Trp were more abundant in VAIRs than in non-VAIRs (See Additional file1: Figure S7). These patterns were converted into the binary profile of patterns in order to develop the SVM-based prediction model. This model achieved 61.92% sensitivity, 65.09% specificity, 64.80% accuracy and 0.16 MCC. The IBk based prediction model of PSSM achieved maximum 44.05% sensitivity, 94.65% specificity, 90.05% accuracy and 0.39 MCC. SVM based PSSM approach achieved highest MCC of 0.48 with 42.75% sensitivity, 97.51% specificity and 92.54% accuracy. At the −0.8 thresholds level SVM achieved balanced performance of 72.70% sensitivity, 76.89% specificity, 76.51% accuracy and 0.32 MCC. Table 2 shows performances of all applied classifiers. As shown in ROC graph, binary (SVM), PSSM (IBk) and PSSM (SVM) achieved 0.70, 0.70 and 0.83 AUC values, respectively (Figure 3). The PSSM based approach enhanced the prediction performance with SVM.

Table 2 Prediction performance of different classifiers for vitamin A-interacting residues (VAIRs)
Figure 3
figure 3

The ROC plot of the performance of different approaches for prediction of VAIRs.

Prediction of vitamin B interacting residues (VBIRs)

The TSL analysis of VBIRs and non-VBIRs showed that Gly, His, Asn, Ser, Thr, Trp and Tyr were more abundant in VBIRs (See Additional file:1 Figure S8). The SVM-based prediction model was developed using binary patterns and achieved 73.22% sensitivity, 67.00% specificity, 67.57% accuracy and 0.24 MCC. The IBk based prediction model of PSSM achieved maximum 56.74% sensitivity, 98.04% specificity, 94.28% accuracy and 0.62 MCC. SVM based PSSM approach achieved highest 0.61 MCC with 55.57% sensitivity, 98.04% specificity and 94.18% accuracy. At the −0.8 thresholds level SVM achieved 81.39% sensitivity, 81.77% specificity, 81.73% accuracy and 0.43 MCC. Performances of all applied classifiers are provided in Table 3. As shown in ROC graph, binary (SVM), PSSM (IBk) and PSSM (SVM) achieved 0.78, 0.77 and 0.90 AUC values, respectively (Figure 4). The overall performance increased by PSSM profiles based model, in compare to binary patterns based approaches.

Table 3 Prediction performance of different classifiers for vitamin B-interacting residues (VBIRs)
Figure 4
figure 4

The ROC plot of the performance of different approaches for prediction of VBIRs.

Prediction of pyridoxal-5-phosphate interacting residues (PLPIRs)

The compositional and TSL analysis of PLPIRs and non-PLPIRs found that Gly, His, Asn, Ser, Thr and Tyr were more abundant in PLPIRs (See Additional file1: Figure S9). The binary patterns (17-length windows) based prediction model achieved 77.02% sensitivity, 83.17% specificity, 82.62% accuracy and 0.42 MCC. The IBk based PSSM approach achieved 76.10% sensitivity, 98.80% specificity, 96.74% accuracy and 0.79 MCC whereas SVM based achieved highest 0.81 MCC with 79.76% sensitivity, 98.62% specificity, 96.91% accuracy. At the −0.7 thresholds level SVM achieved 79.76% sensitivity, 98.62% specificity, 96.91% accuracy and 0.81 MCC. As shown in ROC graph, binary (SVM), PSSM (IBk) and PSSM (SVM) achieved 0.88, 0.87 and 0.97 AUC values, respectively (Figure 5). Table 4 shows performances of all applied classifiers. Here also PSSM profile based evolutionary information enhanced the prediction performance of SVM model.

Figure 5
figure 5

The ROC plot of the performance of different approaches for prediction of PLPIRs.

Table 4 Prediction performance of different classifiers for PLP-interacting residues (PLPIRs)

Performance of balanced datasets

We also developed the SVM-based prediction models on the balanced datasets using both binary and PSSM approaches. The binary approach achieved 0.32, 0.24, 0.37 and 0.52 MCC for VIRs, VAIRs, VBIRs and PLPIRs respectively (Table 5). The PSSM approach improved the prediction performance significantly and achieved 0.53, 0.47, 0.63 and 0.80 MCC for VIRs, VAIRs, VBIRs and PLPIRs respectively (Table 5).

Table 5 SVM-based prediction performances for four different types of prediction methods using equal positive and negative instances

Performance on the independent datasets

Four different independent datasets, V-IND-46, VA-IND-15, VB-IND-27 and PLP-IND-16, containing 46, 15, 27 and 16 protein sequences and utilized for the evaluation of VIRs, VAIRs, VBIRs and PLPIRs prediction methods, were used. We used SVM-based binary approach, calculated performances at already optimized threshold level (by 5-fold cross validation of main-dataset) and achieved highest 0.19, 0.23, 0.20 and 0.30 MCC for the prediction of VIRs, VAIRs, VBIRs and PLPIRs respectively (See Additional file1: Table S1). The performance enhanced significantly while using PSSM approach and achieved highest 0.38, 0.37, 0.35 and 0.63 MCC for the prediction of VIRs, VAIRs, VBIRs and PLPIRs respectively (Table 6).

Table 6 SVM-based prediction performances (at the default threshold) of PSSM approach on the different independent datasets

Surface accessibility based prediction

Most of binding residues reside inside the surface pockets and predicting these pockets is therefore important. For these predictions, it is required to firstly predict the surface accessibility (SA) of each residue from the protein sequence. Therefore, we used SARpred method[51] for the prediction of surface accessibility of all residues. On the basis of these surface accessibility values, we tried to develop SVM-based models but as shown in the Additional file1: Table S2 the performances were very poor on the realistic dataset. On the balanced dataset, SA-based approach achieved 0.15, 0.08, 0.22 and 0.30 MCC for the prediction of VIRs, VAIRs, VBIRs and PLPIRs respectively. The major limitation of this approach was that surface accessibility feature itself was predicted from the protein sequences. The results were showing that only PLP-interacting residues could be predicted (MCC 0.30) with surface accessibility while other predictors performed poorly (See Additional file1: Table S2). The performance of PLPIRs predictor was better than the performance from this study. This may be because of the presence of more than one ligand in the other predictors (VIR, VAIR, VBIR). There may be chances that binding pockets were very different for each ligand and therefore difficult to model. Sometime, it is better to combine more than two features, in order to achieve good prediction results. In-spite of a combined PSSM-surface accessibility approach, we were unable to achieve any improvement in performance measures over the single PSSM-based approach for both the realistic and balanced datasets (See Additional file1: Table S2). These results suggest that PSSM-based individual approach performances were as good as combined approach with both PSSM and surface accessibility features.

Quality of PSSM profiles

The number of homology sequences can affect the quality of PSSM profiles; therefore it is important to check the quality of PSSM profiles. Earlier this type of analysis has been done for the prediction of DNA-binding proteins in the DNAbinder method[27]. The number of homology sequences depends on total number of the protein sequences in the database. We used PSI-BLAST program for the default parameters with 3 iterations and checked the prediction performance on the different independent datasets. The independent datasets of VIRs, VAIRs, VBIRs and PLPIRs are V-IND-46, VA-IND-15, VB-IND-27 and PLP-IND-16 and containing 46, 15, 27, and 16 protein sequences respectively. The prediction performances (at default threshold level) of different independent datasets are shown in the Additional file1: Table S3. As the total numbers of homology sequences were different for each query sequence; by default it varied from the 0-500 sequences. On the basis of total PSI-BLAST hits, we divided each dataset into five different categories (overall 0-500, 0-10, 11-100, 101-400 and 401-500). As mentioned in the Additional file1: Table S3, it was observed that performances increased with the increment of number of homolog sequences. Prediction performances were poor for the 0-10 and 11-100 ranges of query sequences in all four cases whereas average for the 101-400 range and good for the 401-500 homolog sequences.

These results suggested that the quality of PSSM profiles depends on the number of homolog sequences. In most of cases, the major fraction of sequences ranged between 401-500 (PSI-BLAST hit range). The overall performances of simple binary-based approach (Additional file1: Table S1) were higher than the PSSM-based prediction that had range values between 0-10 (Additional file1: Table S3).

Methods

Datasets

In this study, we collected data from SuperSite documentation[52] and extracted 1061 PDB IDs of protein having contact with vitamins in PDB. We downloaded the sequence of all chains of these PDB Ids from Protein Data Bank[53]. In next step, we used these PDB IDs in Ligand Protein Contact (LPC) web-server[54] and get total 2720 chains that interact with vitamins with their corresponding interacting residues and its position. We used a cut-off of 5.0 Å to define the vitamin interacting residues. A residue was considered to be vitamin-interacting if the closest distance between atoms of the protein and the partner vitamin was within the cut-off (5 Å). The 25% non-redundant dataset of protein chains was created by using BLASTCLUST and finally retrieved a total 187 interacting chains with a total 3004 vitamin-interacting residues (VIRs) and remaining all residues are non-vitamin-interacting residues (non-VIRs). This step was repeated for the dataset development of vitamin A, vitamin B and PLP (vitamin B6-derived) interacting residue prediction and retrieved 538, 2207 and 1092 interacting residues in 31, 141 and 71 chains respectively. The interacting and non-interacting residues were used as positive and negative instances respectively. The number of non-interacting residues was very large than interacting residues so we have randomly picked up 10 times more non-interacting than interacting residues in order to create realistic dataset. The balanced datasets of equal positive and negative were also created, where equal numbers of random negative instances was taken from the total negative window patterns.

We created four different independent datasets: V-IND-46, VA-IND-15, VB-IND-27 and PLP-IND-16 of the 46, 15, 27 and 16 protein sequences for the prediction of VIRs, VAIRs, VBIRs and PLPIRs respectively. All these datasets were 25% non-redundant and all sequences of these independent datasets were less than 25% similar than sequences of main datasets.

Window patterns and size

We generated sliding (overlapping) patterns of 17-residue size, for each interacting chain sequence. In past, several studies have adopted this strategy for the interacting residue tools development[40, 45]. If the central residue of pattern was interacting, then we classified the pattern as interacting or positive pattern; otherwise it was termed as non-interacting or negative pattern. To generate the pattern corresponding to the terminal residues in a protein sequence, we have added (L-1)/2 dummy residue "X" at both terminals of protein (where L is the length of pattern). Here the length of pattern is 17 so we have added 8 "X" before N-terminal and 8 "X" after C-terminal, in order to create equal number of patterns from sequence length.

Binary profile of patterns

These positive and negative patterns were converted into the binary patterns and all amino acids represented by a vector of 21 dimensions (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), which contained 20 standard amino acids and one dummy amino acid “X”. We used these profiles as an input data of various machine-learning algorithms.

Position-Specific Scoring Matrix (PSSM)

We performed PSI-BLAST (position-specific iterative BLAST) search (default parameter) against the non-redundant (NR) database available at Swiss-Prot[55]. After three iterations, PSI-BLAST generated the PSSM profiles with the highest score from multiple alignments of the high-scoring hits by calculating the position-specific scores for each position in the alignments. The PSSM profile contains the occurrence probability of all amino acids at each position along with insertion/deletion and provides the evolutionary information for all amino acids. The final PSSM was normalized using a sigmoid function.

Surface accessibility

We calculated surface accessibility value for each residue of the all sequences using SARpred method[51]. We normalized these values (between minimum to maximum) and assigned a value for the each residue of the 17-length window patterns. We used these 17 input features for the SVM-based prediction of VIRs, VAIRs, VBIRs and PLPIRs. In the hybrid approach with PSSM, we combined these 17 input features with the PSSM features.

Support vector machine

In this study, a highly successful machine learning technique termed as a Support Vector Machine (SVM) was used. SVM is a machine-learning tool and based on the structural risk minimization principle of statistics learning theory. SVMs are a set of related supervised learning methods used for classification and regression[56]. The user can choose and optimize number of parameters and kernels (e.g. Linear, polynomial, radial basis function and sigmoidal) or any user-defined kernel. In this study, we implemented SVMlight Version 6.02 package[57] of SVM and machine learning was carried out using three different (linear, polynomial and radial basis function) kernels. SVM takes a set of fixed length input features, along with their output, which is used for training of model. After training, learned model can be used for prediction of unknown examples[58]. We optimized different parameters and kernels for all approaches and developed efficient prediction tools.

WEKA package

WEKA is a large collection of various machine-learning algorithms as single package[59]. We applied WEKA 3.6.4 version, which integrates different classifiers such as BayesNet, NaiveBayes, ComplementNaiveBayes, NaiveBayesMultinomial, RandomForest and IBk. All algorithms have been applied and optimized for different prediction tool development.

Five-fold cross validation

The validation of any prediction method is very essential part. In this study, we have used a five-fold cross-validation technique[60] for training, testing and evaluating our prediction methods. The protein sequences/patterns of positive and negative instances were randomly divided into five parts. Each of these five sets consists of one-fifth of positive and one-fifth of negative instances. In this technique, the training and testing was carried out five times, each time using one distinct set for testing and the remaining four sets for training.

Evaluation parameters

To assess the performance of various modules developed in this study, we calculated the sensitivity, specificity, accuracy and Matthew's correlation coefficient (MCC). These calculations were routinely used in these types of prediction-based studies[61, 62]. These parameters were calculated using following equations (14):

Sensitivity = TP TP + FN × 100
(1)
Specificity = TN TN + FP × 100
(2)
Accuracy = TP + TN TP + FP + TN + FN × 100
(3)
MCC = TP × TN FP × FN TP + FP TP + FN TN + FP TN + FN
(4)

Where TP and TN are correctly predicted positive and negative examples, respectively. Similarly, FP and FN are wrongly predicted positive and negative examples respectively.

The standalone version of VitaPred gives prediction results with probability score instead of SVM score. We have calculated probability score by using following equation -

Probability score = SVM score + 1.5 3 × 9
(5)

We rescaled the SVM scores with maximum 1.5 and minimum −1.5, where more than 1.5 and less than −1.5 both scores were used as 1.5 and −1.5 respectively. The probability score varies from 0-9 for each residue of protein sequence. The probability scores ranges between 0-4 and 5-9 predicted as non-interacting and interacting residues respectively at default 0.0 thresholds.

The five fold cross-validation technique created five test sets and calculated performance for each test set. The final performance of prediction model is an average performance of these five test sets. In this average performance, we also calculated standard error of the performance of these five test set. MCC is considered to the most robust parameters for the evaluation of any prediction method[63]. The MCC value ranges between +1 to −1. The MCC value of 1 corresponds to a perfect prediction, whereas 0 corresponds to a completely random prediction. The −1 MCC value indicates total disagreement between prediction and actual examples. The evaluation parameters of SVM performances are threshold-dependent and require parameters/kernels optimization for the better results. The complete optimization of all parameters is key step in SVM based machine learning. We manually optimized all parameters and selected the highly performed prediction models for different tasks. In order to have a threshold independent evaluation of our method, we also created ROC and calculated AUC value for the threshold independent evaluation using SPSS statistical package.

Two sample logo (TSL)

In this study, we have created Two Sample Logo (http://www.twosamplelogo.org/) for the graphical representation of positive and negative patterns[64]. It is a web-based application to calculate and visualize position-specific differences between positive and negative samples.

Web-server

A user-friendly web-server VitaPred developed for the prediction of VIRs, VAIRs, VBIRs and PLPIRs in protein sequence. The VitaPred is freely available fromhttp://crdd.osdd.net/raghava/vitapred/ web-address. It requires protein sequence in standard FASTA format. There are four different type of options provided for the prediction of VIRs, VAIRs, VBIRs and PLPIRs. We have also provided our datasets and other supplementary materials, which were used for the development of VitaPred web-server.

Standalone version of VitaPred

In the era of genomics, it is essential to develop computational tools for the huge amount of sequence data. We have developed standalone version of VitaPred by using Visual Basic .NET technologies. This is available from the site of web-server. User can download and install it in their system. This software gives the results with probability scores (Equation 5) for each residue of protein sequences. The multiple sequences can efficiently proceed with this software.

Discussion

The experimental determination of vitamin binding sites is very difficult task because of their complex chemical nature, and the fact that they are often made in very small amounts, making detection of the enzyme activities and intermediates difficult[4]. So there is a need to develop alternate technique, such as computational techniques for predicting vitamin-binding sites in a protein. The comparative analysis of different ligands with VIR (Additional file1: Figure S6) such as ATP (Additional file1: Figure S1), GTP (Additional file1: Figure S2), NAD (Additional file1: Figure S3), FAD (Additional file1: Figure S4) and mannose (Additional file1: Figure S5) revealed that each ligand has different protein-binding patterns (See all Figures in Additional file1). Thus, it is important to develop a separate vitamin-interacting residues prediction tool.

We have used available structural information (knowledge-based) for the prediction model development using different machine learning algorithms. The structural information of protein-vitamin complexes extracted from SuperSite[52]. We found total 1061 protein-vitamin complexes, in which 181 and 843 complexes proteins are bind with vitamin A and B respectively. Out of these total 843 complexes of vitamin B binding complexes, 553 are bind to vitamin B(6)-derived pyridoxal 5'-phosphate (PLP) binding protein. The structural availability of vitamin C, D, E and K binding protein complexes are very low in PDB. Thus, we have developed four different methods for the prediction of VIRs, VAIRs, VBIRs and PLPIRs. We identified interacting and non-interacting residues using Ligand Protein Contact (LPC) web server[54]. The interacting residues analysis suggested that Phe, Gly, His, Ser, Thr, Trp and Tyr amino acids are preferred in the vitamin binding pockets of Vitamin Binding Proteins (VBPs) (Figures 1). The preference of interacting and neighboring residues is vitamin class-specific (See Additional file1: Figure S6-S9). In the past, it has been shown in some studies that multiple sequence alignment based evolutionary information provides more comprehensive detail about the protein instead of single sequence[51, 65]. Thus, all sequences of datasets were created into PSSM profiles and used for the prediction tool development. The comparative analysis between vitamin A and B interacting sites showed that Phe, Ile, Leu, Val and Trp are abundant in VAIRs whereas Asp, Glu, Gly, His, Lys, Asn. Arg, Ser and Thr are abundant in VBIRs (Figure 1, See Additional file1: Figure S7-S8). The vitamin B(6)-derived pyridoxal 5'-phosphate (PLP) is the cofactor of enzymes catalyzing a large variety of chemical reactions (more than 140 enzymes are PLP-dependent) mainly involved in amino acid metabolism[66]. According to the Enzyme Commission, about 4% of enzyme-catalyzed reactions are PLP-dependent (EC;http://www.chem.qmul.ac.uk/iubmb/enzyme/). Therefore, it was very important to develop a separate prediction model for the PLPIRs in protein sequence. The PSSM based approach achieved maximum performance for PLPIRs because of separate model for a single PLP molecule. The VIRs, VAIRs and VBIRs modules performed relatively low because each class comprises more than one molecule. It means the overall prediction performance of VIRs is an approximately combined performance of all vitamins.

The performances of all the used classifiers are also provided in the Tables 1,2,3, and4. It was observed that PSSM feature based SVM classifier performed better in all cases, in term of balancing between sensitivity and specificity. The threshold-independent performance of SVM is better than IBk for all modules (Figures 2,3,4 and5). In the 5-fold cross validation, we got total five prediction performances corresponding to five test sets and computed average performance and standard error (SE) from these 5 performances. In most of cases, we found low value of SE, which is variation in the performance over five sets (it is not performance of variation on individual protein/chain). As patterns were divided randomly in five sets so it is expected that performance in each set will be nearly same. In other words, low SE values show that distribution of patterns in sets is not biased. Moreover, SE is not affected by similarity between patterns or protein chains, as this SE only measures biasness in distribution of patterns in five sets.

The prediction performances on the different independent datasets show that these modules can predict interacting residues of all vitamin classes with reasonably good accuracy (Table 6). The quality of PSSM profiles were also investigated and found that protein sequences in our dataset have fairly high number of hits. Furthermore we also found PSSM approach based prediction performances increase with the increasing number of PSI-BLAST hits of the query sequence. As discussed, vitamins are crucial for the activation of many enzymes and crystal structures of many VBPs are unsolved. Furthermore, many vitamin-dependent enzymes have been used as a potential drug targets, thus residue level study of vitamin-interacting and non-interacting sites will be use for the further drug discovery processes.

Conclusions

In order to assist the biologists in assigning the vitamin-interacting residues of VBPs, a systematic attempt has been made for predicting the vitamin-binding sites (VIRs, VAIRs, VBIRs and PLPIRs) from the amino acid sequence of VBPs. This study demonstrates that PSSM evolutionary information can be use to predict vitamin-binding sites in a protein sequence.