Introduction

Carbohydrates are involved in a broad spectrum of pathophysiological processes, including protein folding, bacterial adhesion, viral infection, cancer metastasis, inflammatory reactions, cell proliferation, and cell–cell communication [1, 2]. Carbohydrate research has gained considerable momentum in the past decade owing to its potentially rewarding applications in therapeutics, drug delivery, diagnosis, and vaccine development [2–4]. Nevertheless, only a limited number of carbohydrate-based drugs have reached the market to date, and carbohydrates are still considered a relatively untapped source of new therapeutic agents [5]. The relatively slow development of carbohydrate-based therapeutics could be attributed to a number of factors, including the problematic synthesis of carbohydrate derivatives [6], inadequate pharmacokinetic profiles due to high water solubility [2], and the inherently low binding affinities (in the milli- to micromolar range) of naturally occurring carbohydrates [5, 7]. Moreover, carbohydrates present a unique set of structural and energetic features that make the accurate modeling of their properties a daunting task. Such features include: (1) complex stereochemistry and the high density of polar functional groups, which necessitate the accurate treatment of electrostatic interactions [8, 9], (2) the rich diversity of linear and branched structures formed by oligosaccharides as well as the multiple rotameric states of glycosidic bonds [8], (3) the importance of C–H···π interactions on the α-hydrophobic face of sugars [10–12], (4) the anomeric and exoanomeric effects [9, 13], and (5) the highly dynamic and relatively weak nature of carbohydrate–protein interactions [14, 15].

The increased interest in carbohydrate research over the past two decades has stimulated the development of computational tools specifically tuned for carbohydrate simulations. For instance, carbohydrate-specific force fields, e.g. GLYCAM06 [9], are increasing in number and quality and are being adopted more frequently in biomolecular simulations involving carbohydrate–macromolecule interactions [8]. However, the optimization of carbohydrate leads in drug discovery requires the correct identification of their native binding modes to macromolecular targets and the reliable estimation of binding affinities of putative complexes. Although a multitude of docking/scoring programs have achieved considerable success in reproducing crystal poses, the accurate prediction of binding affinity from these poses is still largely elusive [16, 17].

In addition to general-utility scoring and free-energy functions [18–25], three attempts specifically dealing with the quantification of carbohydrate–protein binding have been reported. In the first approach, Laederach and Reilly [26] employed a set of 30 carbohydrate–protein complexes to train an empirical model based on the AutoDock scoring function plus a special term for hydrogen bonding. The best-performing model yielded a residual standard error of 1.4 kcal/mol in the training set. Later, Hill and Reilly [27] expanded this study to a training set of 115 complexes and introduced a novel entropic term that accounts for the ligand’s translational and rotational degrees of freedom. Starting from the AutoDock scoring function, they examined 288 different free-energy models, and the best model (JA) achieved a root-mean-square error (RMSE) of 2.0 kcal/mol. The third approach was the sugar–lectin interactions and DoCKing (SLICK) scoring functions introduced by Kerzmann et al. [28], which employ a special term to account for C–H···π interactions [29]. The developed free-energy function predicted binding affinities in a training set of 20 lectin–sugar complexes within a maximum absolute error of 2.8 kJ/mol (0.7 kcal/mol). In an extended iteration of the study, the authors successfully redocked 17 out of 18 training complexes, with an average RMSD of 0.85 Å and an average absolute error of 3.6 kJ/mol (0.9 kcal/mol) in the binding free-energy estimate [30]. Notably, all three attempts were derived by recalibrating an existing scoring function on training sets of carbohydrate–protein complexes.

Despite the relative abundance of methodologies for calculating different free-energy components, it would seem that we still lack a better understanding of why the traditional free-energy functions generally fail to yield good correlation with experimental results. In this study, we gathered and refined a large and diverse set of carbohydrate–protein complexes with experimentally determined binding affinities. We investigated a larger number of combinations of computational methods accounting for one or more of the free-energy components (e.g. force fields, scoring functions, solvent-accessible surface area, desolvation penalties, etc.). The employed methods vary in their theoretical derivation, degree of sophistication, and associated computational cost; from a simple integer representing the number of freely rotatable bonds in the ligand up to a sophisticated free-energy function employing an implicit solvent model such as MM/GBSA. The aim was to find the computational tools that could, either individually or in combination, serve as an objective free-energy function for carbohydrate–protein complexes. In addition, our study addressed two fundamental questions related to the quantification of carbohydrate–protein interactions: (1) the target-dependence of scoring functions [16, 19, 31, 32]; i.e. why is it that certain scoring functions could predict binding affinities accurately in some protein families and fail in others, and (2) the impact of the binding-site topology and solvent accessibility.

Results and discussion

Traditional approaches for estimating binding free energy

Our investigation started by assessing the performance of the Glide XP scoring function and the MM/GBSA method, as examples of well-established free-energy models, on our carbohydrate-specific data set. The evaluated free-energy functions showed poor correlations with the experimental binding affinities in our carbohydrate data set (Fig. 1; Fig. S1 in Online Resource 1 for AutoDock and MM/PBSA). Although this finding is disappointing, it is not by any means surprising. Despite the reported success of Glide and AutoDock in reproducing crystallographic conformations and database screening, they were shown to yield inaccurate binding affinity predictions in several protein families [16]. In general, the prediction accuracy of scoring functions employed in widely used docking programs is known to be system-dependent [16, 19, 31, 32]. On the other hand, performance of MM/GBSA and MM/PBSA in free-energy predictions was in most cases assessed on uniform data sets of ligands binding to the same protein [25, 33] or on relatively small data set of different proteins [23]. In the latter case, MM/GBSA and MM/PBSA were shown to exhibit target-dependent variation in prediction accuracy in a manner similar to the scoring functions employed in docking [23, 34, 35]. However, the apparent lack of correlation in Fig. 1 is not dependent on the molecule size; i.e. the Glide XP and MM/GBSA energies incorrectly describe small rigid ligands and larger and more flexible ligands alike.

Fig. 1

Correlation plots of experimental free energies in the carbohydrate–protein data set versus the Glide XP scoring function (left) and the MM/GBSA free-energy function (right). Points are color-coded according to the ligand’s molecular weight (N = 236)

It is worth noting that both energy models were, at least to some extent, biased towards larger ligands, awarding them higher scores (i.e. more negative values) than smaller ligands. Indeed, it has been reported that the binding free energy improves by approximately −1.5 kcal/mol for each non-hydrogen atom in the ligand up to a limit of 15 atoms, where it reaches a plateau [36, 37]. In addition, the solvent-accessible surface that becomes buried when the ligand and the protein associate (i.e. the contact area) is a major determinant of the strength of interaction [30, 38–40]. In our data set, however, no correlation was observed between binding affinities and ligand sizes or contact areas (Fig. S2, Online Resource 1). This could be attributed to the large diversity and the wide affinity range of the studied carbohydrate–protein complexes. The underlying physical model and mathematical formulation of empirical scoring functions, e.g. Glide XP, differ significantly from those of the implicit-solvent MM/GBSA free-energy function. Surprisingly, however, the energy scores of both methods correlate well with each other and suffer similarly from size-dependent bias in the calculated energies (Fig. S3, Online Resource 1).

It is important to note, however, that in the preliminary assessments above the four methods were used as black boxes and the calculated energies were used “as is” without parameter fitting to the carbohydrate data set. Previous studies on similar problems highlighted the difference in relative importance of certain components of binding free energy in carbohydrate–protein interactions. For example, Laederach and Reilly [26] reported that electrostatic interactions play a more important role in determining the affinity between a carbohydrate and a protein. Since the MM/GBSA model uses equal weights for the different energy components (electrostatic, vdW, etc.), it is crucial to introduce empirical weighting coefficients when applying it for carbohydrate–protein systems. Similarly, the coefficients employed in the evaluated scoring functions were optimized to reproduce the experimental affinities of specific training sets of 30 complexes in case of AutoDock [41] and 198 complexes in case of Glide XP [42]. Since the proteins employed to train these scoring functions are not necessarily carbohydrate binders, it would seem beneficial to recalibrate their coefficients for our carbohydrate-specific set.

Empirical free-energy functions

The use of linear regression models, or linear response models, is a recurring theme with several successful examples in the development of free-energy functions [43–46]. The reported carbohydrate-specific scoring functions are, in fact, empirical models derived by recalibrating an existing scoring function on training sets of carbohydrate–protein complexes, with the occasional addition of terms to improve the treatment of special interaction motifs, e.g. C–H···π interactions [26–28, 30]. The following Master Equation was employed as a testing device to assess different combinations of computational methods as potential free-energy models for carbohydrate–protein interactions.

$$\Delta G_{bind} = c_{1}\Delta G_{inter} + c_{2}\Delta G_{solv} + c_{3}\Delta G_{strain} + c_{4} {\text{T}} \cdot\Delta S_{lig} + c_{5}\Delta G_{reward/penalty} ,$$

where $\Delta G_{inter}$ is the ligand–protein interaction energy, $\Delta G_{solv}$ is the desolvation penalty associated with binding, $\Delta G_{strain}$ is the conformational strain penalty, $\Delta S_{lig}$ is the entropy lost by the ligand upon binding, and $\Delta G_{reward/penalty}$ represents special rewards and penalties, e.g. the polar surface buried on binding. All permutations obtainable using different complex descriptors at each position in the Master Equation were evaluated (Fig. S4, Online Resource 1), aiming to investigate, as thoroughly as possible, the ability of the available repertoire of methodologies for modeling molecular interactions to formulate a reliable free-energy model for carbohydrate–protein systems. A total of 51,520 models were exhaustively enumerated and evaluated by linear fitting to the training set comprising 236 carbohydrate–protein complexes. The adjusted coefficient of determination (adjusted-r²) was used to assess the quality of the resultant models.
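The enumeration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function names (`adjusted_r2`, `fit_linear`, `enumerate_models`), the dictionary-based descriptor pools, and the inclusion of an intercept in the fit are all hypothetical choices made for the example.

```python
import itertools
import numpy as np

def adjusted_r2(y, y_pred, n_predictors):
    """Adjusted coefficient of determination for a fitted linear model."""
    n = len(y)
    ss_res = float(np.sum((y - y_pred) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def fit_linear(X, y):
    """Ordinary least squares with an intercept (the intercept is an assumption)."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A @ coef

def enumerate_models(descriptor_pools, y):
    """Fit every combination of one descriptor per Master Equation term.

    descriptor_pools: list of dicts (one per term) mapping a hypothetical
    descriptor name to a 1-D array with one value per complex.
    Returns (adjusted_r2, descriptor_names) tuples, best model first.
    """
    results = []
    for combo in itertools.product(*(pool.items() for pool in descriptor_pools)):
        names = tuple(name for name, _ in combo)
        X = np.column_stack([values for _, values in combo])
        _, y_pred = fit_linear(X, y)
        results.append((adjusted_r2(y, y_pred, X.shape[1]), names))
    return sorted(results, reverse=True)
```

Each pool here stands for one position of the Master Equation, so the number of fitted models is the product of the pool sizes. Ranking by adjusted-r² rather than plain r² penalizes models that gain fit quality only by adding variables.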

The examined empirical models ranged in complexity from simple equations using a single predictor variable to complex equations using 21 variables. To our surprise, none of the assessed functions satisfactorily predicted binding affinities in our data set (Fig. 2). This was rather disappointing, since the employed pool of descriptors covered a very wide scope of structural and energetic features, including their ensemble averages from molecular dynamics (MD) simulations. It would seem, therefore, that contemporary molecular modeling methodologies with relatively low computational cost cannot be used reliably to predict binding affinity of carbohydrate–protein complexes.

Fig. 2

Statistical assessment of the free-energy models resulting from the combinations of complex descriptors in the Master Equation (Fig. S4, Online Resource 1). The number of independent variables in the model is plotted on the horizontal axis, while the adjusted-r², as a measure of model predictive quality, is plotted on the vertical axis. The dotted line marks the value of adjusted-r² = 0.5, which can be used as an arbitrary threshold separating potentially predictive models from non-predictive models

Topological classification of carbohydrate-binding sites

Accounting for solvation effects is one of the most challenging issues in structure-based design. Methods combining force fields with implicit solvation models, such as MM/PBSA and MM/GBSA, are examples of rigorous methods with numerous successful applications in a variety of ligand–protein systems. Their performance, however, is known to be largely system-dependent [47, 48]. The physical model employed by both methods pictures the interacting molecules as regions of low dielectric constant embedded in a high-dielectric continuum, i.e. the solvent. Among other factors, the limited accuracy of this model can be attributed to the difficulty in accurately defining the boundary between the two regions of differing dielectric properties [49–53]. Moreover, Hou et al. [23] demonstrated that MM/GBSA predictions are quite sensitive to the solute’s dielectric constant. The authors recommended that the dielectric parameter ‘should be carefully determined according to the characteristics of the protein/ligand binding interface’. Inaccuracy in the treatment of dielectric properties could result in errors in the final estimates of the solvation contribution to the binding free energy. In principle, these errors would be relatively uniform in homogeneous sets and consequently have a smaller negative impact on the final free-energy estimates. In heterogeneous sets, however, binding sites exhibit larger variations in shape and solvent accessibility. In such cases, the errors introduced by inaccurate dielectric boundary assignment will vary significantly with the topological features of the binding site, and hence have a more detrimental effect on the accuracy of the calculated free energies.

The extent to which the carbohydrate-binding site is in continuity with the bulk solvent is governed by its shape and solvent accessibility, which in turn influence key parameters of the micro-environment where the intermolecular interaction takes place, e.g. dielectric properties. Nevertheless, analytical treatment of these parameters is practically infeasible, as it typically requires long, converged conformational sampling in explicit solvent, as in free-energy perturbation [54, 55] and thermodynamic integration [56]. However, the complexity of the free-energy landscape could, in principle, be reduced by defining families of binding-site topologies within which the binding micro-environments are roughly identical. Such a topological classification could reduce the large and heterogeneous problem to a set of smaller, more homogeneous problems, for which simple free-energy formulations could be applied. Therefore, the topologies of the carbohydrate–protein interfaces in the studied complexes were analyzed using DoGSite [57] combined with clustering, and the complexes were allocated to one of five topological categories based on the shape and degree of surface exposure of the binding site: fully buried, partially buried, small-mouth groove, big-mouth groove, and shallow (Fig. 3).

Fig. 3

Complexes were classified into five categories based on topology and solvent exposure of the carbohydrate-binding site. From top to bottom, the figure shows: category name; schematic representation of the category; PDB code for an example carbohydrate–protein complex; and the solvent-accessible surface representation of the example complex (blue ligand, grey protein). In the left-most complex, the protein surface is rendered transparent to show the completely buried ligand

Figure 4 shows the distribution of key properties within the different binding-site categories in our data set. As seen from the topmost plot, the proposed classification did not segregate complexes according to binding affinity, i.e. carbohydrate ligands could exhibit high or low affinity to their targets regardless of the binding-site topology. Complexes in the fully buried category span a similar range of binding affinities to those in the shallow category. There are, however, differences in the molecular-weight distributions among the different categories. Fully buried binding sites tend to accommodate smaller ligands, while the three middle categories bind medium-sized ligands. On the other hand, fully exposed shallow binding sites can accommodate a wide range of ligand sizes, including relatively large molecules. The area of the contact surface, however, follows a qualitatively different trend, with the three middle categories exhibiting relatively larger interaction surfaces. The smaller average contact surfaces in fully buried binding sites could be explained by the small sizes of the bound ligands in this category. Surprisingly, the shallow binding sites show, on average, contact surfaces of the same scale as those observed for the fully buried sites, although the former bind relatively larger ligands. This could indicate that ligands in shallow carbohydrate-recognition sites require relatively smaller contact areas to bind to their targets. This observation matches the picture of carbohydrate-binding proteins involved, for instance, in cell–cell communication, e.g. lectins, where the carbohydrate ligand is typically a large biopolymer interacting via a small di- or trisaccharide motif at its tip. Finally, Glide XP seems to mirror the trends seen in molecular weights and contact surface areas. Glide XP tends to assign lower scores on average to ligands in the fully buried category (smaller ligands) and to those in the shallow category (small contact surface). This trend matches our earlier observation of the size-dependent bias in Glide XP scores.

Fig. 4

Distribution of key properties within binding-site categories of the studied carbohydrate data set (non-shaded box plots) and the entire uncategorized data set (shaded box plot). The median is indicated by a black bar and the average by a cross marker. Boxes span the first (25 %) to third (75 %) quartiles. Whiskers are plotted at 1.5 × the interquartile range, which for normally distributed data encompasses roughly 99.3 % of the values (mean ± 2.7σ). Circles represent individual outliers beyond the upper/lower whiskers

The influence of categorization on the prediction accuracy of empirical scoring functions is presented in Fig. 5. Training the empirical free-energy functions independently for individual categories results in a substantial improvement in prediction accuracy compared with training the models on the entire data set without categorization. A significant proportion of the evaluated empirical scoring functions were capable of reproducing the binding affinities of the training set with acceptable accuracy (adjusted-r² > 0.6). This result indicates that the problem at hand, i.e. predicting carbohydrate–protein binding affinities, is likely a collectively heterogeneous problem composed of smaller, internally more homogeneous sub-problems. It is important to note, however, that the proposed classification scheme did not segregate the data set into distinct protein families (e.g. glycogen phosphorylases, neuraminidases, etc.), which could be inherently easier to model.

Fig. 5

Comparison of the performance of free-energy models derived from the Master Equation on the uncategorized data set and after categorization according to binding-site topology. The vertical axis shows the fraction of all assessed models with adjusted-r² in the range defined on the horizontal axis

Free-energy models from the exhaustive search depicted in Fig. 5 (257,600 models resulting from 51,520 × 5 categories) were further analyzed to identify physically and statistically valid free-energy models. Firstly, scoring functions showing good prediction accuracy in all categories and exhibiting no collinearity among the employed descriptors were kept. Secondly, models exhibiting regression coefficients that made no physical sense, e.g. entropic penalty or ligand strain energy contributing favorably to affinity, were excluded. Finally, the remaining models were subjected to stringent statistical tests, including cross-validation and y-scrambling. Results of the statistical quality-based and physics-based filtering are summarized in Fig. S5 in Online Resource 1. The best-performing free-energy models are listed in Fig. 6, and the results of their statistical validation are shown in Table 1 (details for models GA2 and GA3 are given in Table S1 in Online Resource 1). Models GA2d and GA3d were developed by replacing terms in the corresponding static models, GA2 and GA3, with the corresponding MD-derived averages (Fig. S8, Online Resource 1). Despite the evident fluctuations in the calculated interaction energies along the MD simulations (Fig. S9, Online Resource 1), the use of dynamic averages of interaction energies had a negative impact on the prediction quality of the free-energy models (Table 1), which could indicate that longer and more extensive simulations are required [23, 47, 58].

Fig. 6

Free-energy models showing the best performance after statistics and physics-based filtering

Table 1 Results of statistical validation for the best-performing free-energy models GA1, GA2, and GA3 and the corresponding models GA2d and GA3d using ensemble averages from MD simulations

The GA1 model exhibited the best balance between complexity and predictive quality. It comprises the Coulombic and van der Waals interaction energies from the Glide XP scoring function, two terms accounting for the non-polar and polar solvent-accessible surface area (SASA) that becomes buried on binding, and two reward/penalty terms for the number of rotatable bonds ($N_{rot}$) and the formal charge of the ligand ($Q_{lig}$). The statistical performance of the model is summarized in Table 1. The GA1 model reproduced binding free energies within the topological categories with r² values ranging from 0.67 to 0.82, RMSE values from 0.89 to 1.32 kcal/mol, and mean unsigned errors of 0.76–1.04 kcal/mol in the predicted free energies. Results of leave-one-out and leave-k-out cross-validation confirm the robustness and internal consistency of the model. In the leave-k-out cross-validation, k is chosen such that in each cycle one-seventh of the training set is removed and then predicted using the model trained on the remaining complexes. The perturbation introduced by removing one-seventh of the complexes is larger than that introduced by removing a single complex in leave-one-out cross-validation. Leave-k-out cross-validation is therefore a more stringent test of model robustness. Finally, randomization of the experimental affinities across the carbohydrate–protein complexes in each category resulted in a substantial drop in prediction quality.
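The two validation tests described above can be sketched as follows. This is an illustrative example under stated assumptions rather than the protocol used in the study: the function names are hypothetical, the linear model includes an intercept, the cross-validated q² is computed against the overall mean of the experimental values, and the random fold assignment and number of scrambling runs are arbitrary choices.

```python
import numpy as np

def _fit_predict(X_train, y_train, X_test):
    """Least-squares fit with intercept on the training part, prediction on the test part."""
    A = np.column_stack([X_train, np.ones(len(y_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_test, np.ones(len(X_test))]) @ coef

def q2_leave_k_out(X, y, n_folds=7, seed=0):
    """Cross-validated q2; n_folds=7 removes about one-seventh of the set per cycle.

    Setting n_folds=len(y) gives leave-one-out cross-validation.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_pred = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        y_pred[test_idx] = _fit_predict(X[train_mask], y[train_mask], X[test_idx])
    press = np.sum((y - y_pred) ** 2)  # predictive residual sum of squares
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))

def y_scrambled_r2(X, y, n_runs=100, seed=0):
    """Mean r2 obtained after randomly permuting the affinities (y-scrambling)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        y_perm = rng.permutation(y)
        fitted = _fit_predict(X, y_perm, X)
        ss_res = np.sum((y_perm - fitted) ** 2)
        ss_tot = np.sum((y_perm - y_perm.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```

A robust model should retain a q² close to its fitted r², while the scrambled r² should collapse towards zero; a high scrambled r² would suggest chance correlation.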

To assess the overall performance of the GA1 free-energy model, prediction errors were pooled from the five binding-site topological categories. The GA1 model reproduces binding free energies in the entire data set with an RMSE of 1.25 kcal/mol, which corresponds to roughly a tenfold deviation from the experimental binding affinities. The prediction accuracy of the GA1 model is substantially reduced when it is applied to the entire uncategorized data set. Notably, the GA1 model did not exhibit the size-dependent bias observed in the traditional scoring functions (Fig. S6, Online Resource 1). Furthermore, Fig. 7 presents the influence of the proposed categorization scheme on the performance of the GA1 free-energy model. The GA1 model does not seem to exhibit systematic over- or underestimation of the predicted ΔG values. However, it shows a slight bias in the plot of residuals against experimental ΔG values (Fig. S7, Online Resource 1), i.e. some high-affinity ligands are underestimated while some low-affinity ligands are overestimated. On the other hand, in the range 3.0 ≤ $\Delta G_{bind}$ ≤ 12.0 kcal/mol, the residuals are more evenly distributed with no clear bias.

Fig. 7

Distributing the carbohydrate–protein data set into binding site topological categories according to the proposed classification scheme leads to a substantial improvement in the performance of the GA1 empirical free-energy model (N = 236). Dashed lines mark tenfold deviations from experimental binding affinity

The improvement in the performance of the GA1 model could be a mere consequence of reducing the dimensionality of the problem from the total of 236 complexes in the complete data set to smaller subsets of 29–70 complexes per category. To examine this possibility, carbohydrate–protein complexes were randomly allocated to five dummy categories having the same sizes of the binding-site topological categories disregarding the actual binding-site topology. The GA1 model was then applied to the resultant categories and its performance was evaluated. Average performance results from 100 category-randomization runs are presented in Table 1. The apparent deterioration of the GA1 model performance confirms that mixing complexes with differing binding site topologies in small categories is not alone sufficient to yield useful free-energy correlations. This further confirms the relevance of actual binding site topology in defining the free-energy response surface within categories and also verifies the validity of the proposed classification scheme.
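The category-randomization control described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names are hypothetical, the per-category model is a plain least-squares fit with an intercept, and pooled RMSE stands in for the full set of statistics reported in Table 1. Permuting the label array keeps the category sizes unchanged while discarding the actual topology.

```python
import numpy as np

def pooled_rmse(X, y, labels):
    """Fit one linear model per category and pool the residuals into a single RMSE."""
    residuals = np.empty(len(y), dtype=float)
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        A = np.column_stack([X[idx], np.ones(len(idx))])
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        residuals[idx] = y[idx] - A @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

def randomized_category_rmse(X, y, labels, n_runs=100, seed=0):
    """Average pooled RMSE over random category assignments of identical sizes."""
    rng = np.random.default_rng(seed)
    runs = [pooled_rmse(X, y, rng.permutation(labels)) for _ in range(n_runs)]
    return float(np.mean(runs))
```

If the improvement came only from fitting smaller subsets, `randomized_category_rmse` would be close to `pooled_rmse` computed with the true labels; a clearly larger value for the randomized assignment points to the categories themselves carrying information.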

Since the GA1 free-energy model was fitted five times, once for each binding-site topological category, five sets of empirical weighting coefficients were obtained. The empirical coefficients are listed in Table 2 after multiplying each of them by the mean and the standard deviation of the corresponding energy component in each category. The resulting values are the mean (±SD) of the free energy contributed by each component of the GA1 model to the total binding free energy within individual categories. As seen from Table 2, the values of the average energy contributions (and the underlying empirical weighting coefficients) show evident category-dependent variations. Interpretation of these coefficients, however, could be complicated by their unavoidable dependence on the training set and the inherent complexity of the free-energy landscape. Nevertheless, a few interesting trends can be noted. Firstly, the contribution of electrostatic interactions to the total free energy is relatively larger in the fully buried and partially buried categories. This could be attributed to differences in the reward for releasing the more tightly trapped water molecules in these two categories compared with the more easily exchangeable waters in the remaining categories. Secondly, the presence of charged groups (reflected by the formal charge of the ligand, $Q_{lig}$) is associated with a moderate penalty in the fully buried, partially buried, and small-mouth categories. In the big-mouth and shallow categories, however, the contribution of $Q_{lig}$ to the binding free energy is nearly negligible. This could be explained by the expected higher cost of moving charges from the bulk solvent to the protein interior in the former three categories, while in the latter two categories the formal charge could still interact with the solvent to some extent. It is also noteworthy that the contribution of electrostatic interactions to the binding free energy is roughly similar to that of vdW interactions, which is in agreement with the JA model reported by Hill and Reilly on the expanded carbohydrate data set [27].

Table 2 Average contributions of individual free-energy components in the GA1 free-energy model to the total binding free energy in different binding site topological categories

Conclusion

The increasing interest in carbohydrate-based therapeutics in the past few decades has intensified the need for reliable and efficient molecular modeling tools specifically dealing with the quantification of carbohydrate–protein interactions. We thoroughly investigated the performance of well-established computational methodologies on a specially curated set of 236 diverse carbohydrate–protein crystal structures with known binding affinity. Although the descriptor pool (with approximately 170 entries) extends across a significant portion of the potential solution space, none of the assessed models satisfactorily predicted the binding affinities in our data set. Binding-site topologies were clustered and the complexes in our data set were allocated to five topological categories based on the shape and degree of surface exposure of the carbohydrate-binding site: fully buried, partially buried, small-mouth groove, big-mouth groove, and shallow. Free-energy models independently fitted for individual categories exhibited a substantial improvement in prediction accuracy. The best-performing free-energy model (GA1) exhibited an overall r² of 0.71 and an RMSE of 1.25 kcal/mol in the predicted binding affinity (corresponding to a factor of 10 in the affinity). The results would seem to indicate that topological classification could be used to reduce the large and heterogeneous problem to a set of smaller, more homogeneous problems, for which simple free-energy formulations could be applied.

Despite the known difficulties in calculating binding affinities for carbohydrate–protein complexes, this study has achieved three important goals. First, a high-quality binding affinity data set for a large and diverse collection of carbohydrate–protein complexes has been compiled and thoroughly revised. Second, we propose a rigorous function for predicting binding affinity from the atomic configuration of carbohydrate–protein complexes. Finally, we propose a classification of carbohydrate-binding proteins according to the topology and surface exposure of the binding site. Differences between the free-energy models individually calibrated for each topological class reflect the differences in the nature of the local binding micro-environments. Although it might be difficult to fully explain how such differences affect the shape of the free-energy response surface, the results of this study show how these differences complicate the free-energy prediction problem and demonstrate the usefulness of calibrating free-energy functions individually according to binding-site topology and surface exposure.

Computational methods

Preparing carbohydrate–protein complexes

Compiling the data set

A pool of ligand–protein complexes was gathered by mining three databases: the Protein Data Bank for structural information, and Binding MOAD [59] and BindingDB [60] for binding affinities. Complexes used previously in similar studies were also included [26–28, 30]. The crude collection was refined to a data set of 273 entries of reviewed experimental affinities for carbohydrate–protein complexes (a detailed listing is given in Table S3 in Online Resource 1). Some complexes were excluded during the structure preparation step due to uncertainties in geometry or the inability of common force fields to handle some ligand atoms (cf. Online Resource 1). The final data set employed in the study of free-energy models contained 236 complexes. The employed set comprised 90 unique proteins (corresponding to 65 unique SCOP and 43 unique CATH domain classes) and 175 unique carbohydrate ligands (cf. Fig. S10 in Online Resource 1 for more details). All binding affinity values were converted to binding free energies (ΔG, kcal/mol) using the thermodynamic master equation ΔG = −RT ln K.
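The conversion of affinities to free energies can be sketched as follows. This is a minimal example under stated assumptions: K in ΔG = −RT ln K is interpreted as the association constant, so that ΔG = RT ln Kd for a dissociation constant given in mol/L; the temperature of 298.15 K is an assumed default because the text does not state one; and the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L.

    Uses dG = RT ln(Kd), equivalent to dG = -RT ln(Ka) with Ka = 1/Kd.
    The default temperature is an assumption, not a value from the study.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

def fold_change_from_ddg(delta_delta_g, temperature_k=298.15):
    """Fold change in the binding constant implied by a free-energy difference."""
    return math.exp(abs(delta_delta_g) / (R_KCAL * temperature_k))

# Example: a millimolar and a micromolar binder
for kd in (1e-3, 1e-6):
    print(f"Kd = {kd:g} M -> dG = {delta_g_from_kd(kd):.2f} kcal/mol")
```

Under these assumptions, a free-energy difference of about 1.36 kcal/mol corresponds to a tenfold change in the binding constant, which is the relation behind expressing free-energy errors as fold deviations in affinity.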

Preprocessing complexes

All ligand–protein complexes were retrieved from the Protein Data Bank (www.pdb.org) and processed using Maestro’s Protein Preparation Wizard (Maestro, version 9.2, 2011, Schrödinger, LLC, New York). All hydrogen atoms in the input structures were deleted, bond orders were automatically assigned, and hydrogens were added accordingly. Water molecules within 5.0 Å of non-standard residues (e.g. ligands, cofactors, metals) were kept and all other water molecules were deleted. Missing side chains were completed and optimized using Prime (Prime, version 3.0, 2011, Schrödinger, LLC, New York).

Multiple ligand copies

When a complex exhibited multiple chains with several copies of the ligand molecule in the asymmetric unit, the individual chains were superimposed and heavy-atom RMSDs were computed for the ligand and the surrounding residues. In most complexes all the copies had RMSD values within 1.0 Å, in which case the first chain having a resolved ligand was used and its chain identifier was noted. Complexes in which ligand copies differed significantly in conformation and/or orientation in the binding site, i.e. RMSD > 1.0 Å, were discarded (examples: 1A0T and 1JZ7). In some complexes, the ligand had two overlapping representations, mostly resulting from the α- and β-anomers being simultaneously resolved in the binding pocket. Unless the affinity measurement explicitly referred to the β-anomer, the α-anomer was used in subsequent computations and the β-anomer copy was deleted. In some complexes there was a ligand copy in an allosteric binding site, as indicated in the original publication of the PDB structure. In such cases, we confirmed that the measured affinity was competitive by revisiting the respective publication, and subsequently deleted the allosteric copy of the ligand (examples: 2QN8 and 2QNB). Before proceeding, we made sure that each complex had one, and only one, ligand copy. Relevant processing notes (e.g. retained chains in multiple-chain PDB entries and deleted ligand copies) are given in Table S3 in Online Resource 1.
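The consistency check on multiple ligand copies can be illustrated with a short sketch. It assumes that the chains have already been superimposed and that the heavy atoms of each copy are listed in the same order; comparing all pairs of copies is one possible reading of the criterion, and the function names are hypothetical.

```python
import math
from itertools import combinations

def heavy_atom_rmsd(coords_a, coords_b):
    """RMSD (in Å) between two equally ordered lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))

def copies_are_consistent(ligand_copies, cutoff=1.0):
    """True if every pair of superimposed ligand copies lies within `cutoff` Å RMSD."""
    return all(
        heavy_atom_rmsd(a, b) <= cutoff
        for a, b in combinations(ligand_copies, 2)
    )

# Toy coordinates (not taken from any PDB entry)
copy_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
copy_b = [(x + 0.3, y, z) for x, y, z in copy_a]   # small shift: kept
copy_c = [(x, y, z + 2.0) for x, y, z in copy_a]   # large shift: discarded
print(copies_are_consistent([copy_a, copy_b]))
print(copies_are_consistent([copy_a, copy_b, copy_c]))
```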

Covalent structure and protonation

Each ligand’s chemical structure was cross-checked against the corresponding primary citation and inconsistencies resulting from incorrect bond order assignments were corrected manually. Protonation and tautomeric states for all HET groups were automatically assigned using Epik [61]. We used the protonation state of the ligand whenever it was explicitly mentioned in the original publication; otherwise the top-ranked suggestion from Epik was used. At this stage, fully atomistic models of all 236 ligand–protein complexes, each having a unique ligand molecule with revised chemical structure and protonation state, were ready for the subsequent analyses.

Geometry optimization

The geometry and orientation of all added hydrogen atoms were exhaustively sampled for optimal H-bond formation, including any necessary flipping of glutamine, asparagine, and histidine side chains. Finally, each complex was refined by full minimization using the OPLS_2005 force field as implemented in Schrödinger’s MacroModel (MacroModel, version 9.9, 2011, Schrödinger, LLC, New York). Minimization was set to converge within a heavy-atom RMSD of 0.3 Å from the input geometry to avoid significant deviations from the experimental geometry.

Complex descriptors

A complex descriptor is a quantity measuring some geometric or energy-based feature of a given ligand–protein complex. In the context of this study, complex descriptors serve as the building blocks of the investigated empirical scoring functions (cf. Table S2 in Online Resource 1 and Online Resource 2).

Non-bonded interaction energies from force fields

The first force field employed in this study was OPLS_2005, the MacroModel implementation of the OPLS-All-Atom force field [62]. The optimized potentials for liquid simulations (OPLS) force field was originally developed for protein simulations [63], later upgraded to the all-atom variant OPLS-AA [64], and then extended to carbohydrates by refitting some of the parameters to ab initio results for complete hexopyranoses [65] and by applying additional scaling factors for the 1,5 and 1,6 electrostatic interactions [66]. Moreover, OPLS-AA-driven MD simulations have been successfully employed for studying carbohydrate–protein interactions [67, 68]. The second force field employed in this study was MMFFs, the MacroModel implementation of the MMFF94s force field [69–71]. The Merck molecular force field (MMFF) was parameterized using a wide variety of chemical systems and targets simulations of small molecules as well as proteins and other biological systems. The MMFF94s variant enforces planarity around sp²-hybridized nitrogens. The chemical classes included in the core MMFF94 parameterization do not include carbohydrates, however. We included MMFFs as a general-utility biomolecular force field to compare its performance against OPLS-AA, which has been optimized for carbohydrates. The non-bonded interaction energy components (electrostatic, van der Waals, and solvation) were calculated for each complex by performing a single-point energy calculation using the respective force field on the ligand–protein complex, the protein alone, and the ligand alone according to the formula:

$$E_{{non{-}bonded}} = E_{complex} - (E_{ligand} + E_{protein} )$$
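Applied to each energy component separately, the subtraction above can be sketched as follows. The component names and the numerical values are purely illustrative and do not come from the study; the function name `interaction_energy` is hypothetical.

```python
def interaction_energy(complex_terms, ligand_terms, protein_terms):
    """Apply E_complex - (E_ligand + E_protein) to each energy component."""
    return {
        name: complex_terms[name] - (ligand_terms[name] + protein_terms[name])
        for name in complex_terms
    }

# Illustrative single-point energies in kcal/mol (made-up numbers)
e_complex = {"electrostatic": -1250.0, "vdw": -830.0, "solvation": -410.0}
e_ligand = {"electrostatic": -45.0, "vdw": 5.0, "solvation": -30.0}
e_protein = {"electrostatic": -1190.0, "vdw": -805.0, "solvation": -395.0}

components = interaction_energy(e_complex, e_ligand, e_protein)
print(components)
```

In this toy case the electrostatic and van der Waals terms are favourable (negative), whereas the solvation term is positive, reflecting a desolvation cost.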

MM/GBSA and MM/PBSA free-energy functions

Combined molecular mechanics/implicit solvent approaches, such as the Generalized Born Surface Area (MM/GBSA) and the Poisson–Boltzmann Surface Area (MM/PBSA) methods, offer a good compromise between computational efficiency and accurate treatment of solvation effects [72, 73]. In the current study, MM/GBSA computations were performed in Schrödinger’s Prime, using the VSGB 2.0 energy model [74] to calculate the GBSA contribution and the OPLS-AA force field to calculate the molecular-mechanics energy [64–66]. The VSGB 2.0 model includes physics-based correction terms for improved handling of π–π stacking, hydrogen-bonding interactions, hydrophobic interactions, and self-contacts of the side chains of certain residues. Moreover, the VSGB 2.0 model employs a Surface Generalized Born (SGB) model [75, 76] in conjunction with a variable dielectric (VD) treatment to account for polarization effects from protein side chains by varying the internal dielectric constants from 1.0 to 4.0 [77].

For MM/PBSA calculations, carbohydrate–protein complexes were prepared with the Leap module of the AMBER 12 suite [78] using the AMBER 99SB force field [79]. Prior to processing, structures were minimized with the Sander module (25 cycles). The MMPBSA.py script was used for all energy calculations [80]. Ions and water molecules were removed and the ionic strength was set to 0.15 M. The PB equation was solved numerically by the pbsa program. The MM/GBSA- and MM/PBSA-derived ΔG_bind values and their components employed as complex descriptors are listed in Table S2 in Online Resource 1.

Glide XP and AutoDock scoring functions

We included two well-established scoring functions as sources of complex descriptors in our study, namely Glide XP and AutoDock. Glide (Grid-based Ligand Docking with Energetics) is a widely used docking program [81], which has been successfully employed to predict and rank binding configurations of carbohydrate ligands to protein targets [82–84]. The scoring function employed in Glide is based on the empirical ChemScore function [85] and has two variants: Glide SP (Standard Precision) and Glide XP (eXtra Precision). Glide XP has numerous specific reward and penalty terms and covers a wider range of ligand–protein interaction motifs, which makes it more suitable for our study [42]. Glide (Glide, version 5.7, 2011, Schrödinger, LLC, New York, NY) was used to calculate the docking scores for the studied complexes. Scores were computed using two modes: (1) the in place mode, where the input ligand coordinates are used directly for scoring, and (2) the refine input mode, where the input ligand coordinates are optimized in the field of the receptor prior to scoring.

The second scoring function considered in this study was the AutoDock empirical scoring function [41, 86]. AutoDock has been used in several studies for modeling and quantification of ligand–protein interactions [19, 82, 84] and has provided the basis for two empirical carbohydrate-specific free-energy models [26, 27]. The AutoDock scoring function employs the change in solvent-accessible surface area of non-polar ligand atoms to account for the solvation contribution [41]. AutoDock scores for the studied complexes were computed using the scoring function implemented in AutoDock 4.2 [87].

Entropic penalty

The change in entropy upon ligand–protein association is probably the most elusive component of the binding free energy. Commonly, a constant penalty is assigned for each freely rotatable bond in the ligand, ranging in value from 0.4 to 1.0 kcal/mol [20]. We also included the entropic term proposed by Hill and Reilly, which employs an empirical coupling coefficient, ξ, to account for the loss of translational and rotational degrees of freedom upon binding [27]. Moreover, we included the entropic penalty term employed in the Glide scoring function, which accounts for the residual ligand mobility by applying the penalty only to bonds expected to be frozen in the bound conformation [85]. Finally, we used the rigid-rotor harmonic oscillator approximation to estimate the changes in the vibrational, rotational, and translational components of the ligand’s entropy upon binding (MacroModel, 2011, Schrödinger, LLC, New York).
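The simplest of these terms, the constant per-bond penalty, can be written in a few lines. The sketch below only illustrates the arithmetic; the function name is hypothetical, and the per-bond value is left as a parameter restricted to the 0.4 to 1.0 kcal/mol range quoted above rather than fixed to any value used in the study.

```python
def rotor_penalty(n_rotatable_bonds, per_bond_kcal):
    """Entropic penalty (kcal/mol) as a constant cost per rotatable bond."""
    if n_rotatable_bonds < 0:
        raise ValueError("number of rotatable bonds cannot be negative")
    if not 0.4 <= per_bond_kcal <= 1.0:
        raise ValueError("per-bond penalty outside the commonly used 0.4-1.0 range")
    return n_rotatable_bonds * per_bond_kcal

# Bounds of the penalty for a ligand with five rotatable bonds
low = rotor_penalty(5, 0.4)
high = rotor_penalty(5, 1.0)
print(low, high)
```

A Glide-like variant would pass only the number of bonds expected to be frozen in the bound conformation instead of the total count.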

Characterization of binding sites

Changes in the polar and non-polar molecular surfaces play a key role in ligand–protein interactions [20, 38–40]. To account for these changes, several SASA components were calculated in Maestro using a water-sized spherical probe (radius = 1.4 Å) scanning the surface of the analyzed molecule(s) at grid points spaced 0.1 Å apart (cf. Table S2 and Fig. S11 in Online Resource 1). To characterize the topology of carbohydrate-binding sites, the studied complexes were analyzed using DoGSite [57]. DoGSite employs a 3D Difference-of-Gaussian filter to identify and characterize binding pockets and splits identified pockets into subpockets, thereby allowing a refined structural description of the topology of active sites. DoGSite captures the key topological features of binding sites, including volume, surface area (total, protein-contact, and solvent-exposed), pocket depth, ligand coverage, and pocket coverage. Carbohydrate–protein complexes were allocated into five non-overlapping categories by applying the Density Based Spatial Clustering of Applications with Noise (DBSCAN) unsupervised clustering algorithm [88] to the pool of SASA and DoGSite descriptors (cf. Online Resource 2).
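To illustrate how density-based clustering groups complexes by binding-site descriptors, a minimal pure-Python DBSCAN is sketched below. It is not the implementation used in the study: the two-dimensional descriptor vectors are toy values, and the `eps` and `min_samples` settings are arbitrary choices for the example.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: one label per point (-1 = noise, 0..k-1 = cluster index)."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

# Toy descriptor vectors: two dense groups and one outlier
descriptors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(descriptors, eps=0.5, min_samples=3)
print(labels)
```

In practice, descriptors measured on different scales (e.g. volumes and surface areas) would need to be standardized before distances are computed, and a library implementation such as scikit-learn's `DBSCAN` would normally be preferred.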

Ligand-based descriptors

A number of ligand-derived descriptors were included in our descriptor pool to represent potentially relevant structural and energetic features. The molecular weight and number of heavy atoms of the ligand were included to compensate for the potential size bias observed in scoring functions [19], e.g. by penalizing large ligands and/or rewarding relatively smaller ligands [42]. We also included descriptors to account for ligand internal strain, defined as the energetic cost paid for forcing the relaxed unbound conformation of the ligand to assume the bioactive conformation. The relaxed conformation could be taken to be either the nearest local minimum found by typical energy minimization or the global minimum [89]. The global minima for the studied carbohydrate ligands were obtained through an exhaustive conformational search using MacroModel, setting the maximum number of generated conformers to 5,000 and employing a wide energy window (40.0 kcal/mol) for conformer rejection. In addition, the SM8 quantum mechanical aqueous continuum solvation model [90] was employed to estimate the ligands’ desolvation penalties. The computation was carried out on the crystallographic ligand conformation using the B3LYP density functional and the 6-31G** basis set in Jaguar (version 7.8, Schrödinger, LLC, New York, NY). We also employed the SM8 solvation free energy weighted according to the ligand’s buried surface area to account for partial ligand desolvation, particularly for ligands bound close to the surface.
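The strain and partial-desolvation descriptors can be sketched as below. Two points are assumptions made for illustration only: the weighting is taken to be linear in the buried fraction of the ligand surface, and the desolvation penalty is taken as the negative of the solvation free energy. The exact form used in the study is not given in this section, and the function names and numbers are hypothetical.

```python
def ligand_strain(e_bound, e_reference):
    """Strain (kcal/mol): energy of the bound conformation relative to a relaxed
    reference (nearest local minimum or global minimum)."""
    return e_bound - e_reference

def weighted_desolvation(solvation_free_energy, sasa_free, sasa_bound):
    """Desolvation penalty scaled by the fraction of ligand surface buried on binding.

    Assumes a linear weighting; the penalty is the negative of the (usually
    negative) solvation free energy.
    """
    if sasa_free <= 0:
        raise ValueError("free-ligand SASA must be positive")
    buried_fraction = (sasa_free - sasa_bound) / sasa_free
    return -solvation_free_energy * buried_fraction

# Illustrative values (kcal/mol and Å^2), not taken from the data set
strain = ligand_strain(e_bound=-120.0, e_reference=-126.5)
penalty = weighted_desolvation(-20.0, sasa_free=400.0, sasa_bound=100.0)
print(strain, penalty)
```

With these toy numbers, a ligand that buries three quarters of its surface pays three quarters of its full desolvation cost, whereas a fully buried ligand would pay all of it.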

Statistical validation

The empirical free-energy models investigated in this study were linear combinations of terms, each representing a component of the free-energy change associated with binding:

$$\Delta G_{bind} = c_{1}\Delta G_{1} + c_{2}\Delta G_{2} + \cdots$$

The experimental binding affinity, ΔG_bind, is the dependent (or response) variable (y), while the complex descriptors, ΔG_i, constitute the independent (or predictor) variables (x). Standard multiple linear regression was used to derive the weighting coefficients, c_i, by fitting the linear equation(s) to experimental binding affinities. All generated models were subjected to rigorous validation using traditional statistical methods, including the coefficient of determination (r²), the cross-validated r² (q²), scrambling of the response variable (binding affinity), and random allocation of the complexes to topological sub-categories (cf. Online Resource 1 for details). In all cases, models lacking physicochemical sense were not considered.
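The fitting and validation statistics can be illustrated with a self-contained sketch. Several choices here are assumptions rather than details from the study: an intercept is included in the fit (the equation above does not show one explicitly), q² is computed by leave-one-out cross-validation, and the data are synthetic. All function names are hypothetical.

```python
def fit_linear(X, y):
    """Least-squares coefficients [c0, c1, ...] for y ~ c0 + c1*x1 + ... (normal equations)."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y]
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict(coef, x):
    return coef[0] + sum(c * xi for c, xi in zip(coef[1:], x))

def r_squared(y, y_pred):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def loo_q_squared(X, y):
    """Leave-one-out cross-validated r2: each complex is predicted by a model
    fitted to all the other complexes."""
    preds = []
    for k in range(len(y)):
        coef = fit_linear(X[:k] + X[k + 1:], y[:k] + y[k + 1:])
        preds.append(predict(coef, X[k]))
    return r_squared(y, preds)

# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, -2.0, 0.0, 2.0, -3.0]

coef = fit_linear(X, y)
r2 = r_squared(y, [predict(coef, x) for x in X])
q2 = loo_q_squared(X, y)
print(coef, r2, q2)
```

For real data q² is lower than r², and a large gap between the two suggests overfitting. Response scrambling follows the same pattern: the fit is repeated on randomly permuted y values, which should yield markedly lower r² and q² than the unscrambled model.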