Background

DNA-binding proteins recognize and bind DNA, and they play vital roles in many biological processes, such as DNA replication, recombination, repair, transcription, translation, and maintenance of telomeres [1–4]. DNA occurs in two forms, single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA). Accordingly, DNA-binding proteins are usually divided into single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). SSBs bind ssDNA with high affinity and low sequence specificity, and are mainly involved in DNA replication, recombination and repair. DSBs, in contrast, bind particular dsDNA sequences to modulate transcription, cleave DNA molecules, or take part in chromosome packaging and transcription in the cell nucleus. Although SSBs and DSBs have each been studied separately [5–7], little attention has been paid to what gives SSBs and DSBs such different binding specificities.

With the development of biotechnology, a large number of proteins have been sequenced. However, SSBs have been shown to have little sequence conservation [8]. Although DSBs involved in similar functions may share conserved subsequences, DSBs with different functions seem to share few common subsequences. It is therefore hard to tell SSB sequences from DSB sequences, or vice versa. Because the structure of a molecule determines its biological function, structural information is expected to provide insight into the binding mechanisms of SSBs and DSBs. Thanks to the progress of the structural genomics project [9], more and more high-resolution 3D structures of DSBs and SSBs are now available, which makes it possible to investigate the structural differences between SSBs and DSBs that are responsible for their binding specificity. At the same time, the results of such an investigation can help to annotate, or refine the annotation of, proteins whose structures are known but whose functions are unknown or not fully understood. In fact, as of Jan. 25, 2013, the Protein Data Bank (PDB) [10] contained 3390 structures of DNA-binding proteins (see Additional file 1); only about 30% and 5% of them are annotated as DSBs and SSBs, respectively, and it is still unclear whether the remaining ones are DSBs or SSBs. A computational method is therefore required to annotate a DNA-binding protein as a DSB or an SSB automatically. To address this question, this work characterizes the structural differences between DSBs and SSBs, and then constructs a classification model that can automatically refine the annotations of DNA-binding proteins.

The surface of a protein is generally irregular, containing many clefts and grooves of varying shapes and sizes [11]. Previous studies have shown that a large cleft provides an increased opportunity for the protein to interact with other molecules, particularly small ligands [12, 13]. Some studies have therefore used particularly large and deep clefts to characterize the binding sites of proteins [11, 13, 14]. We hypothesize that, for DNA-binding proteins, the properties of surface clefts may also play an important role in dsDNA/ssDNA binding specificity.

Previous results have shown that, although the sequences of different SSBs are very different, their structures contain well-conserved elements. In particular, most SSBs contain one or more OB (oligonucleotide/oligosaccharide binding)-fold domains [6, 15–18]. A typical OB-fold has a five-stranded beta-sheet coiled to form a closed beta-barrel. This barrel is capped by an alpha-helix located between the third and fourth strands. The OB-fold plays a critical role in binding ssDNA. Although the OB-fold cannot be said to be unique to SSBs, we consider it an important descriptor for distinguishing SSBs from DSBs.

In this paper, we investigate the structural differences between the collected SSBs and DSBs and extract structure-based features related to surface clefts and OB-folds. Based on these features, we construct a computational model that automatically classifies a DNA-binding protein as a DSB or an SSB by using the widely used support vector machine (SVM). The promising performance suggests that our method will be useful for protein function annotation and refinement.

Methods

Data sets

We first extracted the structures of all 3390 DNA-binding proteins from the PDB (Jan. 25, 2013 release) according to their annotations. They comprise 1039 DSBs (HOLO 890, APO 149), 158 SSBs (HOLO 70, APO 88) and 2193 unknowns. We then used PISCES (http://dunbrack.fccc.edu/PISCES.php) [19] to obtain a non-redundant set in which every structure is solved either by NMR or by X-ray with a resolution better than 3 Å, the pairwise sequence identity is less than 30%, and the chain length is greater than 40 amino acid residues. As a result, we obtained 204 DSBs (HOLO 154 and APO 50), 75 SSBs (HOLO 37 and APO 38) and 727 unknowns (Additional file 2). For simplicity, we call the set containing protein-DNA bound structures the HOLO set and the set containing unbound structures the APO set; the proteins in these sets are denoted DSB_holo, SSB_holo, DSB_apo, and SSB_apo hereinafter.

Features on clefts

The protein surface has a very complex and irregular shape that contains concave, convex and flat regions, which enable the protein to interact with its external environment. Clefts, pockets and cavities are generally considered to be the active sites on protein surfaces, so studying them is useful for understanding protein function.

It has been reported that a large cleft provides an increased opportunity for the protein to interact with other molecules [12, 13], and particularly large and deep clefts have been used to characterize the binding sites of proteins [11]. We therefore consider that, for DNA-binding proteins, the large clefts on the surface may also play an important role in dsDNA/ssDNA binding. In particular, we expect the large clefts on an SSB to be narrow enough to prevent it from binding dsDNA.

Several tools have been developed to detect clefts from protein structures, such as HOLE [20], MOLE [21, 22], MolAxis [23] and Caver [24, 25]. In this work, we applied the CAVER 3.0 package to detect the clefts (also called tunnels in this work) on the protein surfaces and to compute the corresponding indexes of the largest ones, in order to investigate whether these indexes can be used to distinguish the potential binding interfaces of SSBs from those of DSBs. Specifically, we obtained three indexes for each detected tunnel: length, curvature and bottleneck radius.

Length: indicating the length of the path from the start point to the end point along the tunnel axis.

Curvature: indicating how curved the tunnel is. It is calculated as Curvature = Length/Distance, where Distance is the length of the straight line from the start point to the end point of the tunnel. The greater the curvature, the more curved the tunnel.

Bottleneck radius: indicating the radius of the narrowest part of the tunnel, also representing the radius of the largest possible ball that can be centered at a given point of the tunnel axis without colliding with the input structure.

Since the protein surface contains many tunnels of varying shapes and sizes, the CAVER package returns as many tunnels as possible. For the reason mentioned above, we only examine the largest one, defined as the tunnel that maximizes (Length*Bottleneck Radius). For example, for protein 1A73, CAVER detects the 27 tunnels shown in Figure 1, and their indexes are listed in Table 1. According to this selection criterion, tunnel number 25 (Figure 2) is taken as the largest tunnel.
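The selection rule and the curvature definition above can be sketched in a few lines of Python. This is only an illustration: the `Tunnel` class, its field names and the numeric values are hypothetical (they are not CAVER output and are not taken from Table 1); only the two formulas, Curvature = Length/Distance and the maximization of Length*Bottleneck Radius, come from the text.

```python
from dataclasses import dataclass


@dataclass
class Tunnel:
    """Hypothetical container for the indexes of one detected tunnel."""
    tunnel_id: int
    length: float             # path length along the tunnel axis
    distance: float           # straight-line start-to-end distance
    bottleneck_radius: float  # radius of the narrowest part

    @property
    def curvature(self) -> float:
        # Curvature = Length / Distance, as defined in the text
        return self.length / self.distance


def largest_tunnel(tunnels):
    """Return the tunnel that maximizes Length * Bottleneck Radius."""
    return max(tunnels, key=lambda t: t.length * t.bottleneck_radius)


# Made-up values for illustration only
tunnels = [
    Tunnel(1, length=10.0, distance=8.0, bottleneck_radius=1.5),
    Tunnel(2, length=24.0, distance=20.0, bottleneck_radius=1.2),
    Tunnel(3, length=6.0, distance=6.0, bottleneck_radius=2.0),
]
best = largest_tunnel(tunnels)
# The three tunnel features of the protein: length, curvature, bottleneck radius
features = (best.length, best.curvature, best.bottleneck_radius)
```

With these made-up values the products are 15.0, 28.8 and 12.0, so the second tunnel is selected and its three indexes become the tunnel features of the protein.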

Figure 1

All detected tunnels of protein 1A73. The CAVER package detects 27 tunnels in protein 1A73; the 3D structures of all tunnels are shown in different colours on the protein surface.

Table 1 Index values for all tunnels of 1A73
Figure 2

The largest tunnel (no. 25) of protein 1A73. The red tunnel is the largest one, i.e. the tunnel that maximizes (Length*Bottleneck Radius).

Feature on OB-fold domain

The OB-fold is a small structural motif that was first characterized in 1992 in four proteins that bind either oligonucleotides or oligosaccharides [26]. Typically, the OB-fold comprises a five-stranded β-sheet coiled to form a closed β-barrel and capped by an α-helix located between the third and fourth β-strands [27–30]. Although the OB-fold has since been observed at protein/protein interfaces as well, the nucleic acid-binding superfamily is the largest among the OB-folds, and proteins containing OB-folds are involved almost whenever single-stranded DNAs or RNAs are present or require manipulation [8]. Because OB-folds are conserved and play important roles in SSB-ssDNA binding, we extract a feature indicating whether a protein contains an OB-fold, in the hope that this feature can distinguish SSBs from DSBs.

Because OB-folds, although highly conserved, have evolved into several variants, we chose chain A of six typical proteins (PDB: 1QUQ [31], 1V1Q [32], 4GS3 [33], 3ULL [34], 1O7I [35], 1JMC [36]), shown in Figure 3, as OB-fold templates. Figure 3 shows that these chains contain nothing but OB-fold domains. Moreover, chain A of each of the first five proteins contains exactly one OB-fold domain. Since 1JMC_A contains two OB-fold domains, we use only one of them as the template.

Figure 3

Six templates of the OB-fold domain. They are structurally similar but have different topologies, and their sequence similarity is below 30%.

For an unknown protein, we use the protein structure alignment package TM-align [37] to compare its structure with each of the templates and use the maximal alignment score TM-score as the OB-fold feature of the protein.
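As a minimal sketch of this step, the OB-fold feature is simply the maximum of six alignment scores. The snippet below assumes that the TM-scores against the six templates have already been obtained from TM-align by some other means; the function name, the dictionary layout and the numeric scores are hypothetical and serve only to illustrate the maximization.

```python
# PDB identifiers of the six OB-fold templates named in the text
OB_FOLD_TEMPLATES = ("1QUQ", "1V1Q", "4GS3", "3ULL", "1O7I", "1JMC")


def ob_fold_feature(tm_scores):
    """Return (best_template, max_tm_score) for one query protein.

    tm_scores maps each template identifier to the TM-score of the
    query aligned against that template (assumed precomputed).
    """
    missing = [t for t in OB_FOLD_TEMPLATES if t not in tm_scores]
    if missing:
        raise ValueError(f"missing scores for templates: {missing}")
    best = max(OB_FOLD_TEMPLATES, key=lambda t: tm_scores[t])
    return best, tm_scores[best]


# Made-up scores for illustration only
example_scores = {
    "1QUQ": 0.41, "1V1Q": 0.62, "4GS3": 0.55,
    "3ULL": 0.48, "1O7I": 0.37, "1JMC": 0.59,
}
best_template, feature_value = ob_fold_feature(example_scores)
```

Requiring a score for every template guards against silently computing the maximum over an incomplete set of alignments.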

Classification model and evaluation

In this work, we used support vector machine (SVM) to build the classification model. The SVM classifiers were implemented using Matlab 2012a SVM package with the Gaussian Radial Basis Function (RBF) as a kernel.

To evaluate the prediction performance, we used several measures: accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (AUC). Let TP (true positive) be the number of proteins correctly predicted as SSBs, FP (false positive) be the number of proteins incorrectly predicted as SSBs, TN (true negative) be the number of proteins correctly predicted as DSBs and FN (false negative) be the number of proteins incorrectly predicted as DSBs. The accuracy (ACC), sensitivity (SN), specificity (SP), F-measure (F1) and MCC are defined as follows:

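\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \tag{1}
\]
\[
\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{2}
\]
\[
\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3}
\]
\[
F\text{-measure} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{4}
\]
\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{5}
\]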
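As an illustration, measures (1) to (5) can be computed directly from the four confusion-matrix counts. The function name and the example counts below are hypothetical; the MCC uses the standard definition with a square root in the denominator, and the sketch assumes that both classes are present so that the other denominators are non-zero.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, SN, SP, F1 and MCC from confusion-matrix counts.

    SSBs are the positive class and DSBs the negative class.
    """
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC = 0 when the denominator vanishes
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "F1": f1, "MCC": mcc}


# Made-up counts for illustration only
metrics = classification_metrics(tp=30, fp=10, tn=40, fn=20)
```

For these made-up counts, ACC is 0.7, SN is 0.6, SP is 0.8, F1 is 2/3 and MCC is 1000/sqrt(6,000,000), approximately 0.408.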

We use 10-fold cross-validation to evaluate the classification performance. Because the two classes are imbalanced, in each fold we repeat the following procedure 15 times: we randomly down-sample the training data so that it contains equal numbers of SSBs and DSBs, and train a classifier on the balanced set. The class label of each test protein is then assigned by a vote over the 15 classifiers. To the best of our knowledge, there is no existing computational method for distinguishing SSBs from DSBs, so we also train a random classifier as the baseline in each test.
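The down-sampling and voting scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the classifier is a hypothetical nearest-centroid stand-in on a single scalar feature (the study used an SVM with an RBF kernel in Matlab), the 10-fold split is omitted, and all names and data values are made up.

```python
import random
from collections import Counter


def fit_nearest_centroid(samples):
    """Stand-in classifier (NOT an SVM): nearest class mean of one feature."""
    sums, counts = {}, Counter()
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def downsample_vote(train, test_x, fit, n_rounds=15, seed=0):
    """Label test_x by majority vote over classifiers trained on balanced subsets.

    train is a list of (feature, label) pairs; in every round each class is
    randomly down-sampled to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append((x, y))
    n_min = min(len(items) for items in by_label.values())
    votes = Counter()
    for _ in range(n_rounds):
        balanced = []
        for items in by_label.values():
            balanced.extend(rng.sample(items, n_min))
        predict = fit(balanced)
        votes[predict(test_x)] += 1
    return votes.most_common(1)[0][0]


# Made-up, imbalanced training data: (TM-score-like feature, label)
train = [(0.60, "SSB"), (0.70, "SSB"), (0.65, "SSB"),
         (0.20, "DSB"), (0.30, "DSB"), (0.25, "DSB"),
         (0.35, "DSB"), (0.28, "DSB"), (0.22, "DSB")]
label_high = downsample_vote(train, 0.62, fit_nearest_centroid)
label_low = downsample_vote(train, 0.26, fit_nearest_centroid)
```

With two classes, an odd number of rounds such as 15 rules out tied votes. In a full evaluation this function would be called once per test protein inside each cross-validation fold.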

Results and discussion

Investigation of the distinguishing ability of the features

Using CAVER 3.0, we detected 990 tunnels in the HOLO set (865 for DSBs, 125 for SSBs) and 1168 tunnels in the APO set (757 for DSBs, 411 for SSBs). According to the maximization criterion described above, we selected one maximal tunnel for each protein. As a result, we obtained 37 tunnels for bound (DNA-bound) SSBs, 38 tunnels for unbound (DNA-free) SSBs, 154 tunnels for bound DSBs and 51 tunnels for unbound DSBs, each with three feature values. Using TM-align, we aligned every protein with each of the six OB-fold templates shown in Figure 3 and took the maximal alignment score as the TM-score of the protein. To investigate the distinguishing ability of the features, we analysed the distribution of each feature statistically (Figure 4). The bottleneck radius shows little difference between DSBs and SSBs in either the bound or the unbound form. DNA-binding proteins in the bound form tend to have a larger bottleneck radius than those in the unbound form, which may be because the protein usually needs to widen the tunnel to bind the DNA. SSBs tend to have smaller tunnel lengths and curvatures than DSBs, and tunnel length seems to separate DSBs and SSBs better than tunnel curvature does; moreover, with either feature it seems easier to differentiate DSBs and SSBs in the bound form than in the unbound form. As expected, SSBs obtain much higher TM-scores than DSBs when compared with the OB-fold templates, illustrating that most SSBs have OB-fold-like domains. In conclusion, TM-score, tunnel length and tunnel curvature are usable features for constructing a model that distinguishes SSBs from DSBs, whereas bottleneck radius lacks distinguishing ability.
Since the statistical results for tunnel length and tunnel curvature are very similar, we further investigated the correlation between these two features. The results, listed in Table 2, show that they are indeed positively correlated.
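For reference, the Pearson coefficient between two feature vectors can be computed as in the following sketch. The function name and the length and curvature values are hypothetical and are not data from this study; the P-value reported alongside the coefficient is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up tunnel lengths and curvatures for illustration only
lengths = [10.0, 20.0, 30.0, 40.0]
curvatures = [1.1, 1.2, 1.4, 1.5]
r = pearson_r(lengths, curvatures)
```

A coefficient close to 1 indicates that the two features carry largely the same information.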

Figure 4

Feature distributions of different kinds of DNA-binding proteins. These graphs show the box plot of the four features for the HOLO and APO datasets. Those are (a) tunnel bottleneck radius, (b) tunnel length, (c) tunnel curvature and (d) TM-score.

Table 2 Correlation of tunnel length and tunnel curvature

This table shows the values of Pearson coefficient and P-value between tunnel length and curvature. The columns of Pearson coefficient and P-value correspond to the pairs of DSBs/SSBs in HOLO set and APO set, respectively.

Validation of the differentiating features

We performed validation experiments on the HOLO set and the APO set, using one, two or three features to construct the classification models. The validation performances are shown in Tables 3 and 4, respectively. The tables show that the TM-score feature recognizes SSBs with high accuracy, while the tunnel length/curvature features recognize DSBs with high accuracy, meaning that the distinguishing abilities of TM-score and length/curvature are complementary. The model constructed with the length feature performs better than the one constructed with curvature, and better than or nearly as well as the one constructed with both length and curvature. This further confirms that the curvature feature is redundant with the length feature, and that adding redundant features to the classification model does not necessarily improve it. Compared with the single-feature models, performance improves significantly when TM-score is used together with one or more other features, showing that classification models built from complementary features are preferable for discriminating DSBs from SSBs.

Table 3 Performance on HOLO set
Table 4 Performance on APO set

Independent test on APO set

In many cases, it is easier to collect information on DNA-binding proteins in the bound form than in the unbound form, whereas in practice we need to know whether an unknown unbound protein is an SSB or a DSB. We therefore trained the classifier on the HOLO set and tested it on the APO set. The results, listed in Table 5, show that the structural information on tunnels and OB-folds does reflect the differences between SSBs and DSBs and can thus be used as discriminative features to build the classification model.

Table 5 Performance of the independent test

Prediction on mixed set

In practice, the available dataset often includes proteins in both the bound and the unbound form, whereas we need to know whether an unknown DNA-binding protein is an SSB or a DSB. We therefore performed validation experiments on the mixed set, using one, two or three features to construct the classification models. The results are listed in Table 6. Among the single-feature models, the TM-score feature still recognizes SSBs with high accuracy. The best multi-feature model, with an accuracy of 0.8251, MCC of 0.6632, SN of 0.8605 and SP of 0.7904, performs much better than the single-feature models. We therefore further trained the classifier on the mixed set and used it to predict the 727 unknown proteins. The classification results are listed in Additional file 2.

Table 6 Performance on mixed set

Conclusion

Despite many similar properties, dsDNA and ssDNA are distinct entities that are recognized differently by specialized dsDNA-binding and ssDNA-binding proteins, respectively. The binding interfaces of SSBs and DSBs are thus expected to differ in their geometrical features, consistent with the different nature of dsDNA and ssDNA [29, 38, 39]. While the sequence and structural properties of DSB and SSB binding interfaces have been studied during the last decade [28, 40], computational methods for distinguishing between DSB and SSB binding interfaces are still lacking. In this study, we investigated the surface tunnel features of SSBs and DSBs and found that they have different ranges of tunnel lengths and tunnel curvatures; moreover, the alignment scores against OB-fold templates were also found to be a discriminative feature between SSBs and DSBs. We therefore made a first attempt at a method that computationally distinguishes SSBs from DSBs based on these discriminative features, and obtained satisfactory results.

The protein surface features should also be useful for the analysis of other types of molecular interactions, such as protein-ligand, protein-RNA, and protein-protein complexes, and for the study of a variety of proteins, multiple binding sites or a specific family of proteins. These problems would require modelling interface surfaces with different characteristics, such as compatibility, different sizes, and cooperativity between these surfaces; thus new surface features in addition to the solid angle may be needed.