Background

Worldwide, cancer ranks second only to cardiovascular disease among incurable diseases. A wide range of proteins have been found to be related to tumor formation and metastasis. However, only proteins with broad biological significance for the regulation of tumor cell growth are likely to be targets of broad-spectrum, low-toxicity antitumor drugs. Recent studies have shown histone deacetylases (HDACs) to be novel epigenetic targets for the treatment of cancer [1–3]. Histone deacetylase inhibitors (HDACi) have demonstrated antitumor efficacy extensively, both in vitro and in vivo. The study of HDACi has therefore become one of the most important research fields in antitumor drug development, especially in the coming era of epigenetics.

Histone deacetylases comprise a superfamily of 18 genes that is divided into two families and four classes in eukaryotic cells. Classes I, II, and IV consist of 11 family members, which are referred to as “classical” HDACs, whereas the 7 class III members are called “sirtuins” [4]. Classical HDACs, which require Zn²⁺ as a cofactor for their deacetylase activity, are a promising novel class of anticancer drug targets that can be inhibited by Zn²⁺-chelating compounds such as hydroxamic acids. In contrast, these compounds are not active against sirtuins, because the class III enzymes act through a different mechanism that requires NAD⁺ as an essential cofactor [5]. Recent research indicates that sirtuins are linked to aging as well as to metabolic and neurodegenerative diseases [6].

Classical HDACs are classified by their homology to yeast proteins. HDACs 1, 2, 3, and 8, which belong to Class I, are homologous to yeast RPD3 and are located within the nucleus. HDACs 4, 5, 6, 7, 9, and 10, which belong to Class II, are homologous to yeast HDA1 and are located in both the nucleus and the cytoplasm. Class II HDACs can be further subdivided by sequence homology and domain organization: Class IIa includes HDACs 4, 5, 7, and 9, which contain an N-terminal extension with regulatory function, and Class IIb includes HDACs 6 and 10, which contain two catalytic domains. HDAC 11 is assigned to Class IV; the conserved residues in its catalytic center are shared by both Class I and Class II HDACs. The classification of classical HDACs is summarized in Table 1.

Table 1 “classical” HDACs

Histone deacetylase inhibitors (HDACi) that act on the 11 zinc-dependent HDAC isozymes generally possess a zinc-binding group that coordinates the zinc ion in the active site, a cap substructure that interacts with amino acids at the entrance of the N-acetylated lysine binding channel, and a linker that connects the cap and the zinc-binding group at a proper distance [18]. HDACi can be categorized into four subtypes based on their chemical structures: (1) short-chain fatty acids; (2) hydroxamic acids; (3) benzamides; and (4) cyclic peptides. Because HDACi do not inhibit all HDAC isoforms to the same extent, they can also be categorized into pan-HDAC inhibitors and selective HDAC inhibitors, the latter including class I-specific, class II-specific, and class IV-specific inhibitors. Many HDAC inhibitors have already been tested in clinical trials and have shown antitumor or other biological activity. However, some HDAC inhibitors, especially pan-inhibitors, cause serious side effects such as fatigue, nausea, anorexia, diarrhea, thrombus formation, thrombocytopenia, neutropenia, anemia, myalgia, hypokalemia, and hypophosphatemia [3]. HDAC inhibitors are therefore likely to improve efficacy and reduce these toxicities only when they target the most relevant HDAC isoform rather than multiple isoforms. Consequently, a single method that can analyze the interactions of inhibitors with multiple HDACs, and thereby identify isoform- or class-specific inhibitors, should be useful for discovering or designing novel antitumor drugs with fewer side effects.

Many methods are available for in silico drug discovery, such as molecular docking [19, 20], pharmacophore models, quantitative structure-activity relationship (QSAR) models [21–23], protein-ligand interaction fingerprint-based screening [24, 25], and others [26–29]. QSAR is a widely applied computational method for predicting the interactions of chemical compounds with a single target protein. However, when thousands of chemical compounds interact with 11 different HDAC isoforms, 11 separate QSAR models must be created, one for each isoform, which is complicated and time consuming. In addition, these separate models cannot be extended to predict the inhibition of new HDACs [30]. A method is therefore needed that predicts the cross-interactions of chemical compounds with multiple HDAC isoforms simultaneously.

More recently, proteochemometric (PCM) modeling has been widely used to study the cross-interactions between a series of compounds and a series of proteins. In this area, Maris Lapinsh et al. studied melanocortin chimeric receptors using partial least-squares projections (PLS) to derive PCM models [31, 32]; Hanna Geppert et al. derived PCM models of eleven proteases from four different protease families with support vector machines [33]; and Ilona Mandrika, Maris Lapinsh et al. applied PLS to model the interactions of HIV mutants [30, 34] and antibodies [35]. In contrast to traditional QSAR, PCM is based on the similarity of a group of ligands together with that of a group of targets [36]. Consequently, PCM can integrate several separate QSAR models into a global one. With the global PCM model in hand, we can study the cross-interactions of all the ligands with all the targets in the data set, or even outside the data set. By predicting the affinity for each ligand-target pair, PCM models can describe the specific interaction between a ligand and a target and discriminate the interaction strengths of different ligand-target pairs. In our study, PCM models were therefore applied to study the cross-interactions of a series of HDAC inhibitors with five HDAC isoforms, i.e., HDAC2, HDAC4, HDAC6, HDAC7, and HDAC8.

Results and discussion

Proteochemometric modeling

In our study, 18 proteochemometric models were created from the training set with different combinations of descriptors. As shown in Table 2, the goodness-of-fit (R²) of all models was higher than 0.9619, and the cross-validation coefficients (Q²cv) ranged from 0.5734 to 0.7162. The model based on P1 and GD was the best model, with the highest predictive ability (Q²cv = 0.7162 and Q²test = 0.7542). Accordingly, the P1-GD model was used in the subsequent analysis.

Table 2 Goodness-of-fit (R²) and predictive ability (Q²cv, Q²test) of the obtained 18 models

P0 vs P1 vs P2

Three protein descriptors, i.e., the sequence similarity descriptor (P0), the structure similarity descriptor (P1), and the geometry descriptor (P2), were used to describe HDACs in our study. The sequence similarity descriptor is based on the sequence identities of HDACs, while the structure similarity descriptor and the geometry descriptor characterize HDACs based on their 3D structures. Protein descriptors differ from ligand descriptors because proteins have much larger molecular structures to describe. Where available, crystal structures are the preferred basis for describing proteins. The protein structure similarity descriptor was calculated by protein 3D-structure alignment, which takes more information into account. In contrast to P1, P0 characterizes proteins by sequence alignment only and may lose certain 3D information. Not surprisingly, models derived from P1 showed better predictive ability than those derived from P0 (Table 3). In addition, although P2 is also derived from 3D structure, it only measures bond lengths, bond angles, and dihedral angles statistically, without much of the detailed information of the proteins, so it is not sufficient to characterize proteins comprehensively. Accordingly, models based on the geometry descriptor had the worst predictive ability: the Q²test of the P2-based model is the lowest in every group.

Table 3 R² and Q²test of 18 models grouped to compare the three protein descriptors

GD vs DLI

As with the protein descriptors, two typical kinds of ligand descriptors were applied, i.e., General Descriptors (GD) and the Drug-Like Index (DLI). Our results indicate no significant difference between the Q² values of models based on GD and those based on DLI (Table 4), with a p-value greater than 0.1 in a paired t-test.

Table 4 Q²test of 18 models grouped by ligand descriptors

A large number of different descriptors are available for ligands, and no single one is optimal for all data sets. It is therefore wise to try several descriptors to identify the best one for a particular scenario [37]. In our study, we used two different ligand descriptors, GD and DLI, to create PCM models. These two kinds of descriptors characterize the physical properties and the topological indices of ligands, respectively. For our particular data set, there was no statistically significant difference in predictive ability between the two ligand descriptors.

Model performance with or without cross-terms

A multiplicative cross-term was used in our models, but it did not improve the predictive ability of the PCM models. The Q²test of the models with cross-terms is lower than that of the models without cross-terms in every group (Table 5).

Table 5 R² and Q²test of 12 models grouped by presence or absence of cross-terms

Although cross-terms are intended to describe the properties of the interface between ligand and protein, there is still no good descriptor for representing local receptor-ligand interfaces [37], which may explain the worse performance of the multiplicative cross-term in our PCM models. Recently, a new protein-ligand interaction fingerprint was derived for in silico screening [24, 25]. This interaction fingerprint is a local descriptor of the receptor-ligand interface and has proved to be a good candidate cross-term in PCM. In theory, it should perform better when the crystal structure of the complex is available. However, because no crystal structure is available for most of the receptor-ligand pairs in our data set, thousands of complex structures would have to be generated by molecular docking in order to apply the interaction fingerprint, which may introduce bias. The interaction fingerprint was therefore not adopted in our study.

Selective ability of proteochemometric model

In our study, we aimed to develop an effective method for screening selective HDAC inhibitors, i.e., inhibitors that act selectively on a single HDAC isoform or a specific class of HDAC isoforms. For this purpose, proteochemometrics was applied to analyze the interaction strength of inhibitors against multiple HDACs and then to identify isoform-specific, class-specific, and pan inhibitors. To verify the performance of the derived PCM models, an external validation was carried out in which the affinities of ten inhibitors were predicted with the best model (the P1-GD model). The predicted values are compared with the corresponding experimental values in Table 6.

Table 6 Activity data and affinities predicted by the P0-GD model for ten HDAC inhibitors

Among the ten inhibitors used for external validation, TSA, SAHA, LBH589, and PXD-101 are reported as pan-HDAC inhibitors, and almost all of their predicted affinity values are high for all the HDAC isoforms in our test (e.g., LBH589: HDAC2 0.742, HDAC8 0.391, HDAC4 0.524, HDAC7 0.347, HDAC6 0.996). MGCD0103, FK228, and Apicidin are reported as class I-specific inhibitors, and their predicted values for class I HDACs are indeed higher than those for the other isoforms (e.g., Apicidin: HDAC2 0.238, HDAC8 0.096, HDAC4 -0.501, HDAC7 -0.176, HDAC6 -0.120). Finally, APHA, Tubacin, and NCT-10a are reported as class II-specific inhibitors, and consistent with the validation data, their predicted values are higher for class II HDACs (e.g., NCT-10a: HDAC2 -0.405, HDAC8 -0.731, HDAC4 0.137, HDAC7 0.159, HDAC6 0.010).
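The comparison described above can be sketched in a few lines of Python. The predicted values are the ones quoted in the text; the decision rule (all values above zero means pan, otherwise the class with the higher mean predicted value) is a hypothetical illustration and not a criterion stated in this study, and the function names are likewise assumptions.

```python
# Class membership of the five modeled isoforms, as given in the Methods.
CLASS_I = ("HDAC2", "HDAC8")
CLASS_II = ("HDAC4", "HDAC6", "HDAC7")

# Predicted (scaled) affinity values quoted in the text for three inhibitors.
predicted = {
    "LBH589": {"HDAC2": 0.742, "HDAC8": 0.391, "HDAC4": 0.524,
               "HDAC7": 0.347, "HDAC6": 0.996},
    "Apicidin": {"HDAC2": 0.238, "HDAC8": 0.096, "HDAC4": -0.501,
                 "HDAC7": -0.176, "HDAC6": -0.120},
    "NCT-10a": {"HDAC2": -0.405, "HDAC8": -0.731, "HDAC4": 0.137,
                "HDAC7": 0.159, "HDAC6": 0.010},
}


def mean_over(values, isoforms):
    """Average predicted value over the given isoforms."""
    return sum(values[name] for name in isoforms) / len(isoforms)


def label_selectivity(values):
    """Hypothetical rule: 'pan' if every isoform is predicted above zero
    (the data-set average after scaling); otherwise the class whose mean
    predicted value is higher."""
    if all(v > 0 for v in values.values()):
        return "pan"
    if mean_over(values, CLASS_I) > mean_over(values, CLASS_II):
        return "class I"
    return "class II"


labels = {name: label_selectivity(values) for name, values in predicted.items()}
print(labels)
```

With this simple rule, the three example inhibitors fall into the categories reported for them in the literature; a real screening workflow would need a threshold chosen and validated on experimental data.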

In conclusion, our best PCM model performs well in screening selective HDAC inhibitors and successfully distinguishes pan-HDAC inhibitors, class I-specific inhibitors, and class II-specific inhibitors. This model can therefore be used further to screen for class-selective and isoform-selective HDAC inhibitors with fewer side effects.

Conclusion

Although more and more HDAC inhibitors have been identified to date, the number of class-selective or isoform-selective inhibitors remains insufficient. It is therefore important to find such selective inhibitors, which are candidate therapeutic agents for tumors with reduced side effects. In this study, proteochemometric models were derived to analyze 1275 compound-HDAC inhibitory activity data points across 5 HDAC isoforms simultaneously. Among these models, the best one, the P1-GD model, was highly predictive (Q²test = 0.7542) and showed a strong ability to distinguish selective HDAC inhibitors from pan inhibitors. In conclusion, proteochemometric modeling proves to be a suitable methodology for predicting inhibitor interactions with HDAC isoforms. Our study also indicates that the optimal model obtained can potentially be used to design candidate antitumor drugs that selectively target a single HDAC or a specific class of HDAC isoforms.

Methods

Data set

To describe proteins more efficiently, five HDAC isoforms with known crystal structures were selected (Table 7). Among these isoforms, HDAC2 and HDAC8 are Class I HDACs; HDAC4, HDAC6, and HDAC7 are Class II HDACs, and more specifically, HDAC4 and HDAC7 belong to Class IIa; HDAC6 belongs to Class IIb.

Table 7 HDACs’ sequences and 3D structures from NCBI and PDB

The half maximal inhibitory concentration (IC50) values for 1443 chemical compounds (Additional file 1: Table S4) interacting with these HDAC isoforms were collected from the Binding Database (BindingDB, http://www.bindingdb.org). After filtration, the data set was reduced to 1275 compound-HDAC pairs with IC50 values, and it contained 215 pairs for HDAC2, 197 for HDAC4, 531 for HDAC6, 46 for HDAC7, and 286 for HDAC8 respectively (Table 8).

Table 8 The distribution of binding affinity IC50 data

The data set is unevenly distributed across the HDAC isoforms. We therefore divided it into a training set (65%) and a test set (35%) by stratified sampling [38] (Additional file 2: Table S1, Additional file 3: Table S2).
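As an illustration of stratified sampling by isoform, the following minimal Python sketch splits compound-HDAC pairs so that each isoform contributes roughly 65% of its pairs to the training set. The per-isoform counts come from the text; the compound identifiers, the random seed, the rounding rule, and the function name are assumptions, so the resulting split is not the one used in this study.

```python
import random
from collections import defaultdict


def stratified_split(pairs, train_fraction=0.65, seed=0):
    """Split (compound_id, isoform) pairs so that every isoform keeps
    roughly `train_fraction` of its pairs in the training set."""
    rng = random.Random(seed)
    by_isoform = defaultdict(list)
    for pair in pairs:
        by_isoform[pair[1]].append(pair)

    train, test = [], []
    for isoform in sorted(by_isoform):
        members = by_isoform[isoform][:]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Pair counts per isoform as reported in the text; the ids are placeholders.
counts = {"HDAC2": 215, "HDAC4": 197, "HDAC6": 531, "HDAC7": 46, "HDAC8": 286}
pairs = [(f"cpd{i}", isoform) for isoform, n in counts.items() for i in range(n)]

train, test = stratified_split(pairs)
print(len(train), len(test))
```

Splitting within each isoform keeps the small HDAC7 group (46 pairs) represented in both sets, which a purely random split would not guarantee.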

Description of proteins

Three different sets of descriptors were used to characterize the five HDAC isoforms, i.e., the sequence similarity descriptor (P0) [32], the structure similarity descriptor (P1), and the geometry descriptor (P2).

Sequence similarity descriptor

The amino acid sequences of all the HDACs were retrieved from NCBI (the entries are listed in Table 7). EMBOSS [39, 40] was used to calculate sequence identities of the five selected HDAC isoforms with all the HDAC isoforms. Finally we obtained 11 sequence similarity descriptors (Table 9).

Table 9 11 sequence similarity descriptors of HDAC2, 4, 6, 7 and 8

Structure similarity descriptor

This descriptor extends the idea of the sequence similarity descriptor from sequence alignment to structure alignment. By pairwise structure alignment with the Protein Comparison Tool [41], we calculated the pairwise structure identities of the five selected proteins and obtained five descriptors (Table 10).

Table 10 Five protein structure similarity descriptors of HDAC2, 4, 6, 7 and 8

Geometry descriptor

Proteins contain various bonds and bond angles, such as C-N, C-O, C-N-CA, and CA-C-O. By measuring the bond lengths, bond angles, and dihedral angles [42], 30 protein geometry descriptors were obtained for each HDAC protein (Additional file 4: Table S3).

Description of inhibitors

In our study, the HDAC inhibitors were represented in two kinds of feature space, i.e., 32-dimensional General Descriptors (GD) and the 28-dimensional Drug-Like Index (DLI). These descriptors are widely applied in the construction of QSAR models. The general descriptors include atomic contributions to van der Waals surface area, log P (octanol/water), molar refractivity, and partial charge. GD characterize physical properties and have been used to describe organic compounds in terms of boiling point, vapor pressure, free energy of solvation in water, solubility in water, thrombin/trypsin/factor Xa activity, blood–brain barrier permeability, and compound classification [43]. DLI, on the other hand, characterize simple topological indices of compounds and measure the hierarchy of drug structures in terms of rings, links, and molecular frameworks [44].

Protein-inhibitor cross-terms

Ligand-receptor recognition can only be partially explained by linear combinations of ligand and receptor descriptors. In reality, protein-ligand interactions are governed by complex processes that depend on the complementarity of the properties of the interacting entities. In PCM, this is accounted for by protein-inhibitor cross-terms [31, 36], which in the simplest case are obtained by multiplying the mean-centered descriptors of proteins and inhibitors. We thus obtained 11 × 32 = 352, 5 × 32 = 160, 30 × 32 = 960, 11 × 28 = 308, 5 × 28 = 140, and 30 × 28 = 840 cross-terms for P0-GD, P1-GD, P2-GD, P0-DLI, P1-DLI, and P2-DLI, respectively.
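The multiplicative cross-terms can be illustrated with a minimal, dependency-free Python sketch. Each row is one protein-inhibitor pair; every protein descriptor is multiplied by every inhibitor descriptor after mean centering. The function names and the toy numbers are assumptions for illustration; the block sizes are those given in the text.

```python
def mean_center(rows):
    """Subtract the column mean from every column of a row-major table."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - m for value, m in zip(row, means)] for row in rows]


def cross_terms(protein_rows, ligand_rows):
    """For each pair (row), multiply every mean-centered protein descriptor
    by every mean-centered ligand descriptor: p x q products per pair."""
    centered_p = mean_center(protein_rows)
    centered_l = mean_center(ligand_rows)
    return [[p * l for p in p_row for l in l_row]
            for p_row, l_row in zip(centered_p, centered_l)]


# Descriptor block sizes reported in the text.
block_sizes = {"P0": 11, "P1": 5, "P2": 30, "GD": 32, "DLI": 28}
n_cross = {f"{p}-{l}": block_sizes[p] * block_sizes[l]
           for p in ("P0", "P1", "P2") for l in ("GD", "DLI")}
print(n_cross)

# Toy example: two pairs, two protein descriptors, one ligand descriptor.
toy = cross_terms([[1.0, 3.0], [3.0, 5.0]], [[0.0], [2.0]])
print(toy)  # [[1.0, 1.0], [1.0, 1.0]]
```

The number of cross-terms grows as the product of the two block sizes, which is why the P2-GD combination (960 terms) is much larger than P1-DLI (140 terms).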

Preprocessing of data

To reduce the bias of the model, all descriptors were mean centered and scaled to unit variance prior to the calculation of protein-ligand cross-terms. Moreover, the binding affinities (IC50) were logarithmically transformed to pIC50 and then mean centered and scaled to unit variance.
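A minimal sketch of this preprocessing is shown below. It assumes that IC50 values are expressed in nM, so that pIC50 = 9 − log10(IC50), and that unit-variance scaling uses the sample standard deviation; neither detail is stated in the text, and the IC50 values are hypothetical.

```python
import math
import statistics


def pic50_from_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); with IC50 in nM this is 9 - log10(IC50)."""
    return 9.0 - math.log10(ic50_nm)


def autoscale(values):
    """Mean-center a column and scale it to unit variance (sample SD assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


ic50_nm = [1.0, 10.0, 100.0, 1000.0, 10000.0]  # hypothetical IC50 values in nM
pic50 = [pic50_from_nm(v) for v in ic50_nm]
scaled = autoscale(pic50)
print(pic50)
print(scaled)
```

After scaling, a value of zero corresponds to the average pIC50 of the data set, so positive predicted values indicate above-average potency.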

Proteochemometric modeling

The support vector machine (SVM) is a non-linear modeling technique that has been applied many times in PCM [33, 45–50]. We created PCM models using the support vector regression (SVR) implementation in the Weka suite (“SMOreg”). Eighteen different combinations of descriptor blocks were constructed to derive PCM models: six combinations of protein and ligand descriptors (P0-DLI, P1-DLI, P2-DLI, P0-GD, P1-GD, and P2-GD), the same six combinations with cross-terms added, and the six kinds of cross-terms alone.

Many kernel functions are available for SVM, such as Normalized Poly Kernel (normalized polynomial kernel), Poly Kernel (polynomial kernel), Precomputed Kernel Matrix Kernel, Puk (Pearson VII function-based universal kernel), RBF Kernel (radial basis function kernel), and String Kernel. Although Poly Kernel and RBF Kernel are the most commonly used kernel functions, the Puk kernel is considered a universal kernel that can serve as a generic alternative to the common linear, polynomial, and RBF kernel functions [51]. We also found that the Puk kernel had stronger mapping power than the other kernels for our data set. For this reason, all models were created using SVR with the Puk kernel.
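For reference, the Pearson VII universal kernel is commonly written as K(x, y) = 1 / [1 + (2·‖x − y‖·√(2^(1/ω) − 1) / σ)²]^ω. The sketch below implements this form in plain Python. The parameter values ω = 1 and σ = 1 are assumed defaults for illustration only; the kernel parameters used in this study are not given in the text.

```python
import math


def puk_kernel(x, y, omega=1.0, sigma=1.0):
    """Pearson VII function-based universal kernel (Puk).

    omega controls the tailing of the peak and sigma its half-width;
    the values used here are illustrative assumptions.
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    scaled = 2.0 * distance * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


k_same = puk_kernel([0.2, -1.0], [0.2, -1.0])          # identical points
k_unit = puk_kernel([0.0], [1.0])                      # unit distance
k_half = puk_kernel([0.0], [0.5], omega=3.0, sigma=1.0)  # distance = sigma / 2
print(k_same, k_unit, k_half)
```

Two properties are visible in the example: the kernel equals 1 for identical points, and it falls to one half at a distance of σ/2 regardless of ω, which is why σ is interpreted as the width of the peak.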

Validation of PCM models

For each combination of descriptors, 10-fold cross-validation was carried out for the model. The performance of the eighteen derived models was assessed by goodness-of-fit (R²) and predictive ability (Q²cv, Q²test).
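The following Python sketch shows one common way to compute a cross-validated Q², defined here as 1 − PRESS/SS, where PRESS is the sum of squared prediction errors and SS is the total sum of squares of the observed values. The fold assignment, the helper names, and the one-variable least-squares model used as a stand-in for SVR are all assumptions made for illustration; the study itself used Weka's SMOreg.

```python
def q2(y_true, y_pred):
    """Predictive ability: 1 - PRESS / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - press / ss_total


def kfold_indices(n_samples, k=10):
    """Yield (train, held_out) index lists; fold i holds samples i, i+k, ..."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        yield [i for i in range(n_samples) if i not in held], held_out


def cross_validated_q2(x, y, fit, k=10):
    """Predict every sample from a model trained without its fold."""
    predictions = [0.0] * len(y)
    for train, held_out in kfold_indices(len(y), k):
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = model(x[i])
    return q2(y, predictions)


def fit_line(xs, ys):
    """Stand-in model (NOT the SVR used in the study): 1-D least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return lambda value: my + slope * (value - mx)


x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]           # noise-free toy data
q2_cv = cross_validated_q2(x, y, fit_line)
print(q2_cv)
```

R² is obtained from the same formula by predicting the training samples with a model fitted on all of them, and Q²test by predicting the held-out test set.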

Finally, ten known inhibitors [4] were selected as the external validation data set to assess the specificity performance of the best model. These inhibitors are listed in Table 6 and include four pan-HDAC inhibitors (TSA, SAHA, panobinostat, and belinostat), three class I-specific inhibitors (MGCD0103, depsipeptide, and apicidin), and three class II-specific inhibitors (APHA, Tubacin, and NCT-10a). We predicted the affinity values of the ten inhibitors against all the HDACs with the best model. Based on the predicted results, we analyzed the interaction strength of the inhibitors with multiple HDACs and then identified isoform-specific, class-specific, and pan inhibitors.

The framework of this work is presented in Figure 1.

Figure 1 General framework for our proteochemometric modeling.