Background

Protein-DNA interactions play important roles in the regulation of key biological functions such as DNA transcription, replication, packaging, and recombination. With the increasing number of high-quality structures of complexes in the Protein Data Bank (PDB) [1] and the Nucleic Acid Database (NDB) [2], the available atomic interaction information is now sufficiently complete to support the analysis and prediction of protein-nucleic acid interactions. Atomic-level analyses have been carried out to understand how amino acids interact with nucleotide bases or sugar-phosphate backbones through hydrogen bonds, van der Waals contacts, or water-mediated hydrogen bonds [3], depending on the amino acid propensities [4, 5]. In recent years, predicting the residues in a protein chain that interact with DNA has attracted a high level of interest. Some studies were based purely on analysis of the polypeptide sequence [6-11], while others took structural information into account [12-17]. In particular, the issue of sequence-specific binding residue prediction has also been raised recently [18]. Transcription factors (TFs) are proteins that regulate gene expression and serve as integration centers of the different signal-transduction pathways affecting a given gene [19]. TFs regulate cell development, differentiation, and cell growth by binding to a specific DNA site and regulating gene expression [20-22]. Because a recent article reported that the tertiary structures of a large number of TFs are mostly disordered [23], sequence-based analysis aimed at identifying the residues in a highly disordered TF that play key roles in interaction with DNA is essential for obtaining a comprehensive picture of how TFs function.

Previous research has shown that proteins interacting with DNA change conformation from their free states as non-specific complexes convert to specific complexes [24]. During DNA recognition, residues play different roles, either recognizing nucleotide bases or stabilizing the protein-DNA conformation. In this work, we try to identify whether a residue performs sequence-specific or non-specific binding. Two types of binding mechanisms are involved in amino acid-nucleotide interactions, namely sequence-specific and non-specific binding [25-29]. Sequence-specific binding occurs between protein side-chains and nucleotide bases, while non-specific binding occurs between protein side-chains and the DNA sugar/phosphate backbone [28]. Sequence-specific binding is also commonly called specific binding. Specific binding corresponds to sequence-specific recognition of a gene and is therefore essential for the correct regulation of genes. Non-specific binding shows relatively little base-sequence preference and binds preferentially to either single- or double-stranded DNA. The role of non-specific binding residues is to stabilize the interactions between the protein and the nucleotide backbone, which helps the specific binding residues recognize base pairs correctly. As reported in the review article by Luscombe et al. [30], protein-DNA interactions can be grouped into eight structural/functional groups based on the structure of the DNA-binding region of the protein, which is also referred to as the binding mode of the protein [30-32]. The eight binding modes are (I) helix-turn-helix, HTH (including "winged" HTH), (II) zinc-coordinating, (III) zipper-type, (IV) other α-helix, (V) β-sheet, (VI) β-hairpin/ribbon, (VII) other, and (VIII) enzymes. Related research has investigated the classification of protein-DNA complexes and structural domains [33-35]. Proteins in the same class have similar binding site conformations despite having different DNA targets. Introducing DNA-binding mode information makes it possible to find the binding pattern that a protein uses to interact with its target DNA [36, 37], which could help to locate the sequence-specific and non-specific binding residues.

This article presents the design of a sequence-based predictor for identifying the residues in a TF that are involved in sequence-specific binding and non-specific binding with DNA, together with the binding mode. We use a support vector machine (SVM) as the classifier to predict binding residues as sequence-specific or non-specific. Originally, sequence-specific and non-specific binding residues were defined by identifying hydrogen bonds and van der Waals contacts between protein side-chains and DNA. In this work, we instead use a distance cut-off as a computational approximation to define the binding classes. A residue is regarded as involved in sequence-specific binding with the DNA if one or more heavy atoms on its side-chain are within 4.5 Å of any of the nucleic bases, and as involved in non-specific binding if one or more heavy atoms on its side-chain are within 4.5 Å of the sugar/phosphate backbone of the DNA. The cut-off is motivated by the criteria for hydrogen bonding and van der Waals contacts: (1) a hydrogen bond is defined as having a maximum donor-acceptor distance of 3.35 Å and a maximum hydrogen-acceptor distance of 2.7 Å; (2) atoms are considered to form a van der Waals contact if the distance between them is ≤ 3.9 Å and the contact has not been defined as a hydrogen bond [5]. A residue in a protein interacting with DNA may take part in specific binding, non-specific binding, or both. We predict both kinds of residues because the main determinants of specificity are the unfavorable contributions of "wrong" base pairs, and specific binding also requires a large non-specific contribution to the binding free energy to achieve sufficient binding affinity [38]. Furthermore, the predicted sequence-specific and non-specific binding residues can be used for protein-DNA binding mode prediction. Figure 1 uses PDB ID 2PRT:A as an example to show sequence-specific and non-specific binding residues in the tertiary structure. Sequence-specific binding residues are colored red, non-specific binding residues blue, and residues involved in both purple.

Figure 1

Sequence-specific and non-specific binding residues of PDB 2PRT:A. Sequence-specific binding residues are colored red, non-specific binding residues blue, and residues involved in both sequence-specific and non-specific binding purple.

Results and discussion

In this section, we report the experiments conducted to evaluate the performance of our proposed approach. In the first-stage experiments, we repeated the same testing procedure 20 times with randomly and independently generated testing data sets. The independent testing data set used in each run was derived from 30 TF chains randomly selected from the 253 TF-DNA complexes that we collected (see Materials and methods for details). To eliminate possible bias in our collection of TF complexes, we ensured that no two TF chains used to generate the testing data set in the same run share a sequence identity higher than 20%. Furthermore, so that the experimental results reflect the performance that users of our approach would actually observe, we removed from the training data any TF chain whose sequence identity with the protein chain under test is higher than 20%. LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm) was used for training and classification [39]. Table 1 shows the overall performance of the SVM predictor for sequence-specific and non-specific binding residues in the first stage. For sequence-specific binding residues, the results were obtained with the training parameters C = 2², γ = 2⁻⁵, a class weight of 1.5 for binding residues, and a class weight of 1 for non-binding residues, which gave better results than the other values tried. The predictor for sequence-specific binding residues achieves 96.45% accuracy with 50.14% sensitivity, 99.31% specificity, 81.70% precision, and 62.15% F-measure. For non-specific binding residues, the results were obtained with the training parameters C = 2⁰, γ = 2⁻⁵, a class weight of 2 for binding residues, and a class weight of 1 for non-binding residues, which again gave better results than the other values tried. The predictor for non-specific binding residues achieves 89.14% accuracy with 53.06% sensitivity, 95.25% specificity, 65.47% precision, and 58.62% F-measure. When the prediction results for sequence-specific and non-specific binding residues are combined with an OR operation, the predictor achieves 89.26% accuracy with 56.86% sensitivity, 95.63% specificity, 71.92% precision, and 63.51% F-measure. Table 2 breaks down the overall performance of the binding residue prediction by secondary structure element. The number of sequence-specific (or non-specific) binding residues in β-sheet elements is far smaller than the number in either α-helix or coil elements. As a result, our proposed framework cannot learn enough to identify sequence-specific (or non-specific) binding residues in β-sheet elements.
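The OR combination of the two first-stage predictors mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `combine_or` and the per-residue 0/1 lists are assumptions made for the example.

```python
def combine_or(specific_pred, nonspecific_pred):
    """Flag a residue as DNA-binding if either predictor flags it.

    Both arguments are per-residue predictions (0/1 or booleans) for the
    same protein chain, in the same residue order.
    """
    if len(specific_pred) != len(nonspecific_pred):
        raise ValueError("prediction lists must cover the same residues")
    return [bool(s) or bool(n) for s, n in zip(specific_pred, nonspecific_pred)]


if __name__ == "__main__":
    # Hypothetical predictions for a five-residue fragment.
    specific = [0, 1, 0, 0, 1]
    nonspecific = [0, 0, 1, 0, 1]
    print(combine_or(specific, nonspecific))
```

Because a residue is reported as binding when either predictor fires, the combined result can only add positives relative to each individual predictor, which is consistent with the higher sensitivity reported for the combination.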

Table 1 Overall performance of proposed approach
Table 2 Performance of the binding site prediction in terms of secondary structure elements

In the second-stage experiments, the protein-DNA binding mode prediction achieves 75.83% overall accuracy using LIBSVM multi-class prediction with the one-against-one approach. As shown in Table 3, the predictor delivers a precision of 100% and a sensitivity of 80.22% for the zipper-type binding mode, a precision of 70.45% and a sensitivity of 73.46% for the helix-turn-helix binding mode, a precision of 68.07% and a sensitivity of 88.98% for the zinc-coordinating binding mode, and a precision of 34.21% and a sensitivity of 52.00% for the β-hairpin/ribbon binding mode. The predictor did not perform well for TFs with the β-hairpin/ribbon binding mode, because the prediction of sequence-specific and non-specific binding residues is less reliable on β-sheet structure than on α-helix and coil. We select PDB 1LMB:4 as an example to show how the predicted binding mode information can be used to enhance the binding residue prediction. Figure 2 displays the prediction result for PDB ID 1LMB:4, which is a difficult case in our binding residue prediction experiment. Chain 1LMB:4 contains the HTH_3 domain, which is classified in the helix-turn-helix group, and has 10 sequence-specific binding residues and 18 non-specific binding residues. However, the predictor found none of the sequence-specific binding residues (10 false negatives) and found only 4 of the non-specific binding residues, with 14 false negatives and 5 false positives. The binding mode predictor correctly classifies 1LMB:4 into the helix-turn-helix group. A template protein is then selected from the helix-turn-helix group according to the best alignment of secondary structure elements. In Figure 2, residues are colored red for false positives, blue for false negatives, and green for true positives. Figures 2(a), 2(b), and 2(c) show the prediction of sequence-specific binding residues, the prediction of non-specific binding residues, and the combined result, respectively. Figure 2(d) shows the enhanced prediction obtained with the best aligned template from the correctly predicted protein-DNA binding mode. Clearly, a correct binding mode prediction can greatly help the binding residue prediction, especially in difficult cases. However, this idea requires further investigation before a systematic approach can be derived.
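The one-against-one scheme used for the multi-class binding mode prediction can be illustrated with a small voting sketch: one binary classifier is consulted for every pair of classes, and the class with the most votes wins. The code below is a generic illustration under stated assumptions, not LIBSVM's implementation; `toy_decide` is a hypothetical stand-in for a trained pairwise SVM, and the tie-breaking rule is an arbitrary choice made for the example.

```python
from collections import Counter
from itertools import combinations

# The five binding modes used in the second stage (ASCII spelling of beta).
MODES = ["zipper-type", "helix-turn-helix", "zinc-coordinating",
         "beta-hairpin/ribbon", "others"]


def one_vs_one_predict(x, classes, pairwise_decide):
    """Predict a class by majority vote over all pairwise classifiers.

    pairwise_decide(a, b, x) must return either a or b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, x)] += 1
    # Ties go to the class listed first in `classes` (assumption).
    return max(classes, key=lambda c: votes[c])


def toy_decide(a, b, x):
    # Hypothetical stand-in for a binary SVM: pick the mode with the larger score.
    return a if x[a] >= x[b] else b


if __name__ == "__main__":
    scores = {"zipper-type": 0.1, "helix-turn-helix": 0.9,
              "zinc-coordinating": 0.3, "beta-hairpin/ribbon": 0.2,
              "others": 0.0}
    print(one_vs_one_predict(scores, MODES, toy_decide))
```

With five binding modes, this scheme consults 5 × 4 / 2 = 10 pairwise classifiers per query protein.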

Figure 2

A difficult case (PDB ID 1LMB:4) of binding residue prediction, which can be enhanced with the best aligned template from the correctly predicted protein-DNA binding mode. Red residues are false positives, blue residues are false negatives, and green residues are true positives. (a) Prediction of sequence-specific binding residues. (b) Prediction of non-specific binding residues. (c) Combination of the sequence-specific and non-specific binding residue predictions. (d) Enhanced prediction with the best aligned template from the correctly predicted protein-DNA binding mode.

Figure 3

Overall framework for DNA-binding residues prediction.

Table 3 Overall performance of protein-DNA binding mode prediction

In the following, we discuss how the proposed approach performs in comparison with related studies reported in recent years. Note that our proposed approach is the only predictor listed in Table 4 that identifies the residues involved in both sequence-specific and non-specific binding with DNA; none of the other predictors distinguishes between sequence-specific and non-specific binding. Since Table 4 includes the main results extracted from recent studies along with the overall results of our proposed approach, it can be regarded as a survey of the latest advances in the field. It must also be noted that most related studies have adopted slightly different definitions of DNA-binding residues. In the article by Ahmad and Sarai [10] and in the article by Wang and Brown [40], a residue is regarded as interacting with the DNA if one of its heavy atoms is within 3.5 Å of a heavy atom of the DNA. In the article by Hwang et al. [7], a larger threshold of 4.5 Å is used instead of 3.5 Å. In the article by Yan et al. [8], a residue is regarded as interacting with the DNA if its solvent accessible surface area (ASA) in the protein-DNA complex is smaller than its ASA in the unbound protein by more than 1 Å².

Table 4 Performance delivered by alternative predictors of DNA-binding residues, where the F-measure is the harmonic mean of precision and sensitivity

The numbers marked with an asterisk in Table 4 have been derived from the numbers reported in the related studies. Since each of the four related studies addressed in Table 4 reported three of the four performance metrics listed in the table, we obtain three equations in the following four variables for each related study:

$$TP,\quad TN,\quad FP,\quad FN$$

In addition, expressing these variables as fractions of all residues, we have $TP + TN + FP + FN = 1$. Therefore, for each related study, we can derive the value of the fourth performance metric from the values of the other three. The only exception is the precision of the predictor proposed by Hwang et al. [7]. Accuracy is a weighted average of sensitivity and specificity, so by definition it cannot be higher than both of them simultaneously, yet this is the case with the numbers reported by Hwang et al. Therefore, there is no way to derive the exact value of precision for their predictor.
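The derivation described above can be made concrete for the case where accuracy, sensitivity, and specificity are reported and precision is missing. The sketch below is one way to solve the system, not necessarily the procedure the authors used; it works with fractions of the data set, and the function name `derive_precision` is an assumption made for the example. It also shows why inconsistent inputs (accuracy outside the range spanned by sensitivity and specificity) admit no solution.

```python
def derive_precision(accuracy, sensitivity, specificity):
    """Recover precision from accuracy, sensitivity and specificity.

    Works with fractions of the data set, so TP + TN + FP + FN = 1.
    Returns None when the three inputs are mutually inconsistent.
    """
    if sensitivity == specificity:
        return None  # the fraction of positives cannot be determined
    # accuracy = p * sensitivity + (1 - p) * specificity, with p = TP + FN
    p = (accuracy - specificity) / (sensitivity - specificity)
    if not 0.0 < p < 1.0:
        return None  # accuracy is not between sensitivity and specificity
    tp = p * sensitivity
    fp = (1.0 - p) * (1.0 - specificity)
    if tp + fp == 0:
        return None  # no positive predictions, precision is undefined
    return tp / (tp + fp)


if __name__ == "__main__":
    # First-stage figures for sequence-specific binding from this article.
    print(derive_precision(0.9645, 0.5014, 0.9931))
    # Accuracy above both sensitivity and specificity: no valid solution.
    print(derive_precision(0.99, 0.50, 0.90))
```

Applied to the sequence-specific figures reported in this article, the function returns about 0.82, close to the reported precision of 81.70%; a small difference is expected because the published values are rounded.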

From the predicted results, we observe that the predictor of non-specific binding residues tends to locate positively charged patches. However, not all positively charged patches in a protein come into contact with single- or double-stranded DNA. This may explain the performance gap between sequence-specific and non-specific binding residue prediction. When the prediction results for sequence-specific and non-specific binding residues are combined, the sensitivity is higher than that of the other predictors. The reason is that non-specific binding residues help a protein slide along the target DNA, while specific binding residues recognize base pairs during sliding. The role of the non-specific binding residues is thus to help the specific binding residues recognize base pairs precisely. Therefore, predicting non-specific binding residues increases the predictor's capability for predicting DNA-binding residues.

Conclusion

This article presents the design of a sequence-based predictor that aims to identify the sequence-specific and non-specific DNA-binding residues in a TF. Since a recent study has revealed that the tertiary structures of a large number of transcription factors are mostly disordered, a sequence-based predictor is essential for analyzing how a TF interacts with DNA. Furthermore, it is highly desirable to have a predictor capable of identifying the residues involved in sequence-specific binding with DNA, since sequence-specific binding corresponds to sequence-specific recognition of a gene and is therefore essential for correct gene regulation. Non-specific binding residues matter as well, because they help specific binding residues increase binding specificity.

In the experiments reported in this article, our proposed approach delivered a precision of 81.70% for sequence-specific and 65.47% for non-specific binding residue prediction. A precision of 81.70% implies that about 4 out of 5 predicted binding residues are really involved in sequence-specific binding with the DNA. A precision of 65.47% implies that about 2 out of 3 predicted binding residues are really involved in non-specific binding with the DNA. When the prediction results are combined, DNA-binding residue prediction reaches a sensitivity of 56.86%, which implies that our proposed approach catches about 6 out of 10 residues involved in binding with the DNA. In the DNA-binding segment of the protein, the regions where non-specific binding residues are located cover the regions where specific binding residues are located. Therefore, DNA-binding residue prediction improves when the prediction results for specific and non-specific binding residues are combined. Protein-DNA binding mode prediction is also part of this framework, and we select 1LMB:4 as an example to show how it can help improve DNA-binding residue prediction.

We anticipate that the prediction accuracy delivered by our proposed approach will continue to improve as the number of TF-DNA complexes deposited in the PDB grows, which will increase the number of training samples available to our learning algorithm. Nevertheless, the primary interest of computational biologists is to develop more advanced prediction mechanisms. In this respect, we believe that as the number of TF-DNA complexes deposited in the PDB increases, we can gain more insight into the key physicochemical properties that play essential roles in TF-DNA interactions and use it to develop more advanced prediction mechanisms. In addition, we will draw on the experience gained in this study to design binding-mechanism-aware predictors for other families of proteins that interact with DNA. We believe that different families of proteins may have very different characteristics. Therefore, a specifically designed predictor should be created for each type of protein in order to deliver superior performance in comparison with a general-purpose predictor.

Materials and methods

Datasets

Our analysis was based on the dataset for DNA-binding residue prediction collected by Ofran and Rost [6], which contains 691 protein-DNA complexes. Because we focus on transcription factors, we created a data set of 253 TF-DNA complexes, of which 227 were extracted from the 691 protein-DNA complexes and the remaining 26 were deposited into the PDB between September 2007 and November 2008. All protein structures were determined by X-ray crystallography at a resolution of 3.5 Å or better. We selected transcription factors using Gene Ontology (GO) terms [41], keeping proteins whose molecular function is transcription factor activity, whose biological process is transcription, and whose cellular component is nucleus. All 253 TF-DNA complexes are listed in Table 5.

Table 5 Dataset of 253 TF-DNA complexes for DNA-binding residues prediction

Defining the DNA-binding residue

Previous research used various distance cut-offs from 3.5 Å to 6 Å to define DNA-binding residues [6-10, 14, 40, 42]. In most, if not all, cases the cut-off distance is measured between the atoms of the amino acid and the atoms of the nucleotide bases or sugar-phosphate backbone. Most DNA-binding residue prediction tools use 3.5 Å or 4.5 Å as the distance cut-off. Considering electrostatic interactions, hydrogen bonding, water-mediated hydrogen bonding, and van der Waals contacts, we use a 4.5 Å distance cut-off to label DNA-binding residues. A residue is regarded as involved in sequence-specific binding with the DNA if one or more heavy atoms on its side-chain are within 4.5 Å of the nucleic bases of the DNA. A residue is regarded as involved in non-specific binding with the DNA if one or more heavy atoms on its side-chain are within 4.5 Å of the sugar/phosphate backbone of the DNA. In the 253 TF-DNA complexes, there are 1526 binding residues and 23371 non-binding residues for sequence-specific binding residue prediction, a positive-to-negative ratio of about 1:15. For non-specific binding residue prediction, there are 3831 binding residues and 21066 non-binding residues, a positive-to-negative ratio of about 1:5. The number of non-specific binding residues is thus about 2.5 times the number of sequence-specific binding residues. Without distinguishing between sequence-specific and non-specific binding, there are 4360 binding residues and 20537 non-binding residues. All missing residues, which have no coordinate information in the PDB data file, are excluded from the training and testing datasets.
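The 4.5 Å labeling rule above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `label_residue`, the input layout, and the set of backbone atom names (standard PDB names for the DNA sugar and phosphate, with both the current and the older phosphate-oxygen spellings) are assumptions made for the example. The caller is expected to supply heavy atoms only.

```python
import math

# Assumed PDB atom names for the DNA sugar/phosphate backbone.
BACKBONE_ATOMS = {"P", "OP1", "OP2", "O1P", "O2P", "O5'", "C5'", "C4'",
                  "O4'", "C3'", "O3'", "C2'", "C1'"}
CUTOFF = 4.5  # distance cut-off in angstroms, as used in this article


def label_residue(side_chain_atoms, dna_atoms, cutoff=CUTOFF):
    """Return (sequence_specific, non_specific) flags for one residue.

    side_chain_atoms: list of (x, y, z) heavy-atom coordinates of the side chain
    dna_atoms: list of (atom_name, (x, y, z)) heavy atoms of the DNA
    """
    specific = non_specific = False
    for name, dna_xyz in dna_atoms:
        is_backbone = name in BACKBONE_ATOMS
        # Skip DNA atoms whose class has already been flagged.
        if (is_backbone and non_specific) or (not is_backbone and specific):
            continue
        if any(math.dist(xyz, dna_xyz) <= cutoff for xyz in side_chain_atoms):
            if is_backbone:
                non_specific = True   # contact with sugar/phosphate backbone
            else:
                specific = True       # contact with a nucleic base
    return specific, non_specific


if __name__ == "__main__":
    # Hypothetical coordinates: one side-chain atom near a base atom (N7).
    side_chain = [(0.0, 0.0, 0.0)]
    dna = [("N7", (3.0, 0.0, 0.0)), ("P", (10.0, 0.0, 0.0))]
    print(label_residue(side_chain, dna))
```

A residue can receive both flags, which matches the residues shown as binding in both ways in Figure 1; a residue without side-chain heavy atoms, such as glycine, receives neither flag under this rule.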

Framework of DNA-binding residues and binding mode prediction using support vector machine

We propose a two-stage framework that predicts the DNA-binding residues in a query protein and then the corresponding binding mode. Figure 3 shows the overall framework for binding residue prediction and binding mode prediction. The first stage predicts the DNA-binding residues and the second stage predicts the protein-DNA binding mode. In the first stage, we use a well-known machine learning approach for prediction from amino acid sequences, namely a support vector machine with features derived from the evolutionary profile of the protein [43, 44]. The evolutionary profile, in the form of a position-specific scoring matrix (PSSM), is computed for a protein sequence by PSI-BLAST [45] against the NR database. In addition, to retain the evolutionary information of neighboring residues, we apply a sliding window that covers the residues immediately before and after each position in the sequence. For each residue in a protein sequence, we use a sliding window of size 11 to describe the neighborhood; each window position contributes the scores for the 20 amino acids plus a boundary flag, which gives an 11 × 21 = 231-dimensional feature vector. Finally, we used LIBSVM [39] as the predictor of DNA-binding residues. The best parameters for DNA-binding residue prediction are determined by leave-one-out cross validation (LOOCV).
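The sliding-window encoding described above (window size 11, with 20 PSSM scores and one boundary flag per position, for 231 dimensions in total) can be sketched as follows. This is an illustration under stated assumptions: the function name `window_features`, the zero padding outside the sequence, and the convention that the boundary flag is 1 for positions beyond either end of the chain are choices made for the example and are not specified in the article.

```python
WINDOW = 11          # sliding window size used in this article
HALF = WINDOW // 2   # five neighbors on each side of the central residue
N_SCORES = 20        # one PSSM score per amino acid type


def window_features(pssm, index):
    """Build the 231-dimensional feature vector for the residue at `index`.

    pssm: list of rows, one per residue, each holding 20 PSSM scores.
    """
    features = []
    for pos in range(index - HALF, index + HALF + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
            features.append(0)               # boundary flag off
        else:
            features.extend([0] * N_SCORES)  # assumed padding value
            features.append(1)               # boundary flag on
    return features


if __name__ == "__main__":
    # Hypothetical PSSM for a 15-residue chain: row i is filled with i + 1.
    pssm = [[i + 1] * N_SCORES for i in range(15)]
    vec = window_features(pssm, 0)
    print(len(vec), sum(vec[N_SCORES::N_SCORES + 1]))
```

In the usage example, the second printed number counts the window positions that fall outside the chain for the first residue; the resulting vectors would then be passed to the SVM, one vector per residue.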

In the second stage, the protein-DNA binding mode is predicted from the prediction results of the first stage. As listed in Table 6, the DNA-binding domains recognized by Pfam [46] are classified into five binding modes: zipper-type, helix-turn-helix (HTH), zinc-coordinating, β-hairpin/ribbon, and others. As shown in Table 7, there are 28 features for protein-DNA binding mode prediction, covering the information on non-specific binding residues, the predicted secondary structure elements, and the total number of residues. The secondary structure elements of each protein structure in the training data are determined by the DSSP program [47]. Because this is a sequence-based predictor of the protein-DNA binding mode, the secondary structure elements of each protein in the testing data (the query protein) are predicted by PSIPRED [48]. In the training dataset, we used only the residue information in the DNA-binding domain detected by the Pfam server.

Table 6 Protein-DNA binding modes and their corresponding Pfam domains
Table 7 Illustration of feature set for protein-DNA binding mode prediction

Predictor performance measures

The predictions made for the testing instances are compared with the defined class labels (binding or non-binding) to evaluate the predictor. The accuracy is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

where TP is the number of true positives (binding residues with positive predictions), TN is the number of true negatives (non-binding residues with negative predictions), FP is the number of false positives (non-binding residues predicted as binding sites), and FN is the number of false negatives (binding residues predicted as non-binding sites). Since the data for DNA-binding residue prediction are skewed, the accuracy alone may be misleading. For a dataset where the positive-to-negative sample ratio is 1:10, a predictor can achieve about 91% accuracy simply by predicting all residues as negative. Therefore, we focus on the sensitivity and specificity of the predictions, which are defined as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

The sensitivity is used to measure the prediction capability of positive samples; the specificity is used to measure the prediction capability of negative samples. In addition, precision and F-measure are also defined as follows:

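$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{5}$$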
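The five measures defined in this section can be computed directly from the confusion-matrix counts. The sketch below assumes the standard definitions (with the F-measure as the harmonic mean of precision and sensitivity, as stated for Table 4); the function name `evaluate` and the convention of reporting an undefined ratio as 0.0 are assumptions made for the example.

```python
def evaluate(tp, tn, fp, fn):
    """Return the five performance measures as a dict.

    Ratios with a zero denominator are reported as 0.0 (assumed convention).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    # Hypothetical counts for a predictor on 1000 residues.
    print(evaluate(tp=50, tn=850, fp=50, fn=50))
    # All-negative baseline on data with a 1:10 positive-to-negative ratio.
    print(evaluate(tp=0, tn=1000, fp=0, fn=100))
```

The second call illustrates why accuracy alone is misleading on skewed data: predicting every residue as non-binding yields a high accuracy while the sensitivity and F-measure are both zero.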

Note

Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.