NLDB: a database for 3D protein–ligand interactions in enzymatic reactions
NLDB (Natural Ligand DataBase; URL: http://nldb.hgc.jp) is a database of automatically collected and predicted 3D protein–ligand interactions for the enzymatic reactions of metabolic pathways registered in KEGG. Structural information about these reactions is important for studying the molecular functions of enzymes; however, a large number of the 3D interactions are still unknown. To complement this missing information, we predicted protein–ligand complex structures and constructed a database of the 3D interactions in reactions. NLDB provides three different types of data resources: the natural complexes are experimentally determined protein–ligand complex structures in PDB, the analog complexes are predicted based on known protein structures in a complex with a similar ligand, and the ab initio complexes are predicted by docking simulations. In addition, NLDB maps the known polymorphisms found in the human genome onto the protein structures. The database has a flexible search function based on various types of keywords, and an enrichment analysis function based on a set of KEGG compound IDs. NLDB will be a valuable resource for experimental biologists studying protein–ligand interactions in specific reactions, and for theoretical researchers wishing to undertake more precise simulations of interactions.
Keywords: Protein–ligand interactions; Docking; Enzymatic reactions; Variant residues; Database
Protein–ligand interactions play key roles in almost all biological processes, ranging from enzyme catalysis to signal transduction. The molecular recognition of a ligand by its host protein requires non-covalent interactions between the molecules, such as hydrogen bonds and hydrophobic interactions. Thus, the characteristics of these interactions, including the ligand-binding modes and binding affinities, are valuable for elucidating the molecular mechanisms of ligand recognition and for understanding molecular functions in vivo.
Large-scale structural information on protein–ligand interactions is currently available in the Protein Data Bank (PDB), and has contributed to the physicochemical analyses of these interactions. However, this wealth of information is still not enough to explain physicochemically all of the enzymatic reactions whose enzymes are important potential drug targets.
Information about various metabolic pathways and their related reactions has been manually curated and stored in the Kyoto Encyclopedia of Genes and Genomes (KEGG). In the KEGG REACTION database, enzymes that catalyze reactions are linked to their structural information in PDB, if their structures are known and accessible. However, even if the structure of an enzyme catalyzing a reaction of interest is available in PDB, its structures in a complex with the substrates or products of the reaction have not always been experimentally determined. In such cases, detailed information about the 3D interactions with the natural ligands cannot be obtained.
Because information about 3D protein–ligand interactions is important for studying the molecular functions of enzymes, it would be valuable to organize the structural information of the metabolic pathways and to complement the missing parts with computational approaches. There are some high-quality databases of biologically relevant ligands bound to proteins, such as Binding MOAD and BioLiP; however, these databases do not register such missing structural data, which can be pre-calculated with predictions. Therefore, we collected and predicted the complex structures of protein–ligand interactions by focusing on the interactions in the reactions of the KEGG REACTION database, and then constructed a database named NLDB (Natural Ligand DataBase; URL: http://nldb.hgc.jp).
Materials and methods
Procedure for the NLDB data construction
The NLDB data were constructed through four main steps: (1) preparation of a list of protein–ligand interactions in enzymatic reactions, (2) collection of natural complex structures, (3) prediction of analog complex structures, and (4) prediction of ab initio complex structures (Fig. 1).
Preparation of a list of protein–ligand interactions in enzymatic reactions
All the reactions in the KEGG REACTION database, along with the KEGG IDs for the metabolic pathways, enzymes and compounds related to each reaction, were obtained using the KEGG API [7, 8]. In KEGG, the reactions are manually curated, and each reaction has information on its reactants (substrates and products). In addition, other IDs corresponding to each of the enzymes, such as PDB IDs and UniProt accession numbers, and the chemical compound IDs found in the PDB chemical component dictionary (PDB-CCD) for ligands bound to the enzyme, were also obtained using the same API. These IDs were then merged, and the reactions with enzymes to which PDB IDs were assigned, i.e. whose structures were known from the KEGG REACTION database, were selected for the list used in the NLDB data construction. The remaining reactions, whose enzymes had no assigned PDB IDs, were excluded from this list and registered directly in NLDB.
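The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Reaction` class, the `split_by_structure` function, and the enzyme keys and the second reaction ID in the usage note are hypothetical names introduced here, and the sketch assumes the KEGG, PDB and UniProt IDs have already been merged per reaction.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    """One merged record (hypothetical layout): a KEGG reaction ID and,
    for each of its enzymes, the PDB IDs linked to that enzyme."""
    reaction_id: str
    enzymes: dict = field(default_factory=dict)  # enzyme key -> list of PDB IDs


def split_by_structure(reactions):
    """Split reactions into those with at least one enzyme structure in PDB
    (used for complex construction) and those without any (registered as-is)."""
    with_structure, without_structure = [], []
    for rxn in reactions:
        has_pdb = any(pdb_ids for pdb_ids in rxn.enzymes.values())
        (with_structure if has_pdb else without_structure).append(rxn)
    return with_structure, without_structure
```

For example, a record such as `Reaction("R01662", {"enzyme_A": ["2A9J"]})` would go to the first list, while a record whose enzymes all have empty PDB ID lists would go to the second.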
The following two clustering processes were then carried out. The first, for the set of chemical compounds, was based on the chemical similarity scores (Tanimoto coefficients) between pairs of compounds, calculated by the connected maximum common substructure (C-MCS) search using the fkcombu program. In C-MCS, the common substructure between two small molecules is defined as a connected graph with the same atom types and bond connections. A Tanimoto coefficient of 0.7 was used as the threshold to define groups of similar compounds, because the average RMSD of 3D conformations was reportedly 2.0 Å for compound pairs with a chemical similarity of more than 0.7. The second clustering, for the set of proteins, was based on sequence similarity, using the CD-HIT program with the default sequence identity threshold of 0.9 and with an alignment coverage of 0.8 for the longer sequence. In each cluster, the protein sequences were further aligned using the clustalw2 program, and the locations of their known ligand-binding residues, defined as residues within 4.5 Å of any atom of a ligand, were then shared and mapped between the aligned sequences.
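As a rough illustration of the compound grouping, the sketch below computes a Tanimoto coefficient from atom counts and groups compounds at the 0.7 threshold. Several points are assumptions rather than statements from the text: the coefficient is taken in its usual MCS form, N_common / (N_A + N_B − N_common), the grouping is single-linkage, and a score equal to the threshold counts as similar. The function names are hypothetical, and the atom count of the common substructure is supplied by the caller rather than computed by fkcombu.

```python
from itertools import combinations


def mcs_tanimoto(n_atoms_a, n_atoms_b, n_atoms_common):
    """Tanimoto coefficient from the atom counts of two compounds and of
    their common substructure (assumed form, see lead-in)."""
    if n_atoms_common > min(n_atoms_a, n_atoms_b):
        raise ValueError("common substructure cannot exceed either compound")
    return n_atoms_common / (n_atoms_a + n_atoms_b - n_atoms_common)


def group_similar(names, similarity, threshold=0.7):
    """Single-linkage grouping: two compounds share a group if a chain of
    pairwise similarities >= threshold connects them (union-find)."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())
```

Single linkage is the simplest choice for a sketch; a stricter linkage would yield smaller, tighter groups.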
The two data sets were then merged to form three different types of protein–ligand complex lists: (I) a list of known protein–ligand complex structures, (II) a list of known protein–analog complex structures, and (III) a list of known apo-protein structures. For list (III), proteins in which the ratio of missing residues (#missing residues/#total residues in the protein) exceeded 10 % were removed, and a representative protein structure was selected for each reaction, choosing the structure with the largest sequence length first and the highest resolution second.
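The filtering and selection rule for list (III) can be sketched as below. The `ApoStructure` class and `select_representative` function are hypothetical names, and the sketch assumes that "sequence length" means the total number of residues in the protein and that ties in length are broken by the best (numerically lowest) resolution.

```python
from dataclasses import dataclass


@dataclass
class ApoStructure:
    pdb_chain: str     # identifier of the structure (illustrative)
    n_total: int       # total residues in the protein
    n_missing: int     # residues missing from the structure
    resolution: float  # in Å; a lower value means a higher resolution


def select_representative(structures, max_missing_ratio=0.10):
    """Drop structures whose missing-residue ratio exceeds the cutoff, then
    pick the longest one, breaking ties by the highest resolution."""
    kept = [s for s in structures
            if s.n_missing / s.n_total <= max_missing_ratio]
    if not kept:
        return None
    # Longer sequence wins; among equal lengths, lower resolution value wins.
    return max(kept, key=lambda s: (s.n_total, -s.resolution))
```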
Collection of natural complex structures
The natural complexes were defined as the complexes experimentally determined and registered in PDB. We used the word ‘natural’ to distinguish compounds naturally found in vivo from compounds artificially generated in vitro. Note that NLDB deals only with the compounds found in the KEGG REACTION database. According to list (I), the coordinates of the ligands in complex with proteins were extracted from the PDB files.
Prediction of analog complex structures
Prediction of ab initio complex structures
The ab initio complexes are protein–ligand complex structures predicted by docking simulation, according to list (III). The quality of the ab initio complexes depends largely on the accuracy of the predicted binding sites and the initial conformations of the ligands. Thus, we used two different flows to construct the complex structures. When homolog information about the ligand-binding sites was available, the target compound was docked onto the binding sites of the target protein using VINA. When homolog information was not available, the target-specific ligand-binding sites and the energetically favorable conformations of the target ligand were predicted using BUMBLE, which predicts them based on known fragment–fragment interactions observed in PDB (Fig. 2II-A). For BUMBLE, the reported average success rate, defined as an RMSD of <5.0 Å between the native and predicted ligands, was 53 % for the first-ranked predicted conformations and 70 % within the top 10 conformations in the test on bound structures. However, even though the prediction of the ligand-binding sites was reportedly more accurate than that of AutoDock, the conformations of the compounds built from the predicted interaction hotspots were still less accurate. Therefore, the ideal coordinates of the target compound were superimposed onto the predicted conformation using the fkcombu program, and the aligned conformation was then optimized using VINA, in the same manner as in the data construction of the analog complexes (Fig. 2II-B, C).
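The success-rate criterion quoted above (RMSD of <5.0 Å for the first-ranked or top 10 conformations) can be illustrated with the following sketch. It is not the BUMBLE evaluation code: the function names are hypothetical, and the RMSD is computed without superposition, assuming both ligands are in the same coordinate frame with identical atom ordering.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of
    (x, y, z) atom coordinates, with no superposition applied."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def success_rate(ranked_rmsds_per_target, top_n, cutoff=5.0):
    """Fraction of targets for which any of the top_n ranked predictions
    has an RMSD below the cutoff (in Å)."""
    hits = sum(1 for ranked in ranked_rmsds_per_target
               if any(value < cutoff for value in ranked[:top_n]))
    return hits / len(ranked_rmsds_per_target)
```

Calling `success_rate(values, top_n=1)` and `success_rate(values, top_n=10)` on the same ranked RMSD lists gives the two kinds of figures reported in the text.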
Keyword search and enrichment analysis
NLDB also provides an enrichment analysis of a set of KEGG compounds. This function enables users to retrieve enriched KEGG pathways with expected p-values. In addition, a list of the reactions in each enriched pathway is also shown, in the same table format as that in the first result page of the keyword search.
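The text does not state which statistical test underlies the enrichment analysis. As an illustration only, the sketch below uses a one-sided hypergeometric test, a common choice for pathway enrichment; this test, the function name and the parameter names are assumptions and may differ from what NLDB actually computes.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_pathway, n_query, n_overlap):
    """Probability of observing at least n_overlap pathway compounds when
    n_query compounds are drawn at random from n_background compounds,
    of which n_in_pathway belong to the pathway."""
    upper = min(n_in_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(comb(n_in_pathway, k)
               * comb(n_background - n_in_pathway, n_query - k)
               for k in range(n_overlap, upper + 1))
    return tail / total
```

A small p-value indicates that the query compound set overlaps a pathway more than expected by chance; when many pathways are tested, a multiple-testing correction would normally be applied on top of this.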
An example: complementation of missing complex structures
NLDB can be used to complement the missing complex structures in chemical reactions as in the following example.
Bisphosphoglycerate mutase (BPGM; EC: 5.4.2.4) is an erythrocyte-specific enzyme, and its main function is to regulate the oxygen affinity of hemoglobin by controlling the synthesis of 2,3-bisphosphoglycerate (DG2; C3 H8 O10 P2), which is an allosteric effector of hemoglobin, via a phosphoryl transfer reaction [15, 20]. BPGM deficiency (BPGMD) lowers the DG2 concentration and thereby increases the hemoglobin oxygen affinity, and is characterized by hemolytic anemia. There are three relevant reactions catalyzed by BPGM in the KEGG REACTION database, and they can be searched with the keywords ‘bisphosphoglycerate mutase’ in NLDB (KEGG reaction IDs: R01516, R01518 and R01662). In particular, the structure of BPGM in complex with 1,3-bisphosphoglycerate (X15; C3 H8 O10 P2) in R01662, the reaction that represents the main function of BPGM (the conversion of X15 to DG2), is not available in PDB. This missing complex structure is needed to clarify the binding mode of the ligand and to identify the key residues in the reaction. In the case of BPGM with X15, there are no analog complexes in which a ligand similar to X15 binds. In addition, the chemical similarity score between X15 and DG2, based on C-MCS, is 0.43, which is below the threshold of 0.7. Thus, the ab initio complex structure was predicted, according to our data construction procedure (Fig. 2II). The final conformation of X15 bound to BPGM (PDB ID: 2A9J-AB) is shown in Fig. 2II-D. Note that water molecules are not considered in the docking calculation, even though they can enhance ligand stability and activity by forming hydrogen bonds or water bridges. As a result, X15 was predicted to have a docking pose in the same binding pocket as DG2, with a low binding affinity of −5.08 kcal/mol (Fig. 2II-C, D). Furthermore, X15 was observed to form two hydrogen bonds with one of the variant residues of BPGMD, ARG-62, as well as a hydrogen bond with CYS-23, which is considered to have a large effect on the reactivity of BPGM.
Moreover, ARG-90, which is a variant residue with a large effect on the stability of the protein, participated in the binding. While this ab initio complex structure of BPGM and X15 is reasonable in terms of the binding affinity and the binding mode, it represents a starting point for further detailed analyses to evaluate the stability of the predicted pose, considering the effects of both protein flexibility and water solvation.
Discussion and conclusions
NLDB is a unique, up-to-date database that collects 3D protein–ligand interactions from known structures, and also automatically predicts missing complex structures using reliable, state-of-the-art software programs, in the reactions of the KEGG REACTION database. As far as we know, there are no comparable databases focused on protein–ligand complex structures involved in these reactions.
NLDB registers 68,551 natural, 28,441 analog and 64,204 ab initio complexes for 3248 KEGG reactions in which 1654 enzymes are involved, and also registers 4379 KEGG reactions in which 3291 enzymes without structural information are involved (as of July 2016). In total, 7627 reactions have been registered in NLDB. Furthermore, 1679 and 2131 entries with variant residues in their binding sites, linked to rs numbers and OMIM IDs, respectively, are registered and can be viewed on the ‘Variants’ page. The former entries are associated with 89 pathways and 367 reactions, and the latter entries are associated with 103 pathways and 354 reactions.
[Table: The number of reactions with 3D protein–ligand interactions registered in NLDB, at binding affinity thresholds of ≤ −3.0 and ≤ −5.0]
In particular, we believe that these predicted structures with lower binding affinity can provide some insights for experimental biologists studying protein–ligand interactions in specific chemical reactions, in which the 3D structures of the interaction are as yet unknown, and will facilitate breakthroughs in understanding or verifying chemical reactivities at specific ligand-binding sites, as shown in the above example. Furthermore, NLDB will be a starting point for theoretical researchers wishing to undertake more accurate simulations of the ligand-binding affinity and stability in the predicted conformation of a complex, by considering the effects of protein flexibility and water solvation. Therefore, NLDB will be continually improved, for the prediction of more accurate structures involved in reactions and for web interface usability. NLDB is freely accessible at http://nldb.hgc.jp, and will be regularly updated every 3 months.
We thank Dr. Takeshi Kawabata (Institute for Protein Research, Osaka University, Japan) and Dr. Kota Kasahara (Institute for Protein Research, Osaka University, Japan) for kindly providing the KCOMBU and BUMBLE programs and their thoughtful comments. This research is (partially) supported by the Platform Project for Supporting Drug Discovery and Life Science Research (Platform for Drug Discovery, Informatics, and Structural Life Science) from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) and the Japan Agency for Medical Research and Development (AMED). Furthermore, YM was supported by JSPS KAKENHI Grant Number 26870045, and KK was supported by JSPS KAKENHI Grant Number 15H02773.
Compliance with ethical standards
Conflicts of interest
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.