PubChem3D: a new resource for scientists
- 11k Downloads
- 48 Citations
Abstract
Background
PubChem is an open repository for small molecules and their experimental biological activity. PubChem integrates and provides search, retrieval, visualization, analysis, and programmatic access tools in an effort to maximize the utility of contributed information. There are many diverse chemical structures with similar biological efficacies against targets available in PubChem that are difficult to interrelate using traditional 2-D similarity methods. A new layer called PubChem3D is added to PubChem to assist in this analysis.
Description
PubChem generates a 3-D conformer model description for 92.3% of all records in the PubChem Compound database (when considering the parent compound of salts). Each of these conformer models is sampled to remove redundancy, guaranteeing a minimum (non-hydrogen atom pair-wise) RMSD between conformers. A diverse conformer ordering gives a maximal description of the conformational diversity of a molecule when only a subset of available conformers is used. A pre-computed search per compound record gives immediate access to a set of 3-D similar compounds (called "Similar Conformers") in PubChem and their respective superpositions. Systematic augmentation of PubChem resources to include a 3-D layer provides users with new capabilities to search, subset, visualize, analyze, and download data.
A series of retrospective studies help to demonstrate important connections between chemical structures and their biological function that are not obvious using 2-D similarity but are readily apparent by 3-D similarity.
Conclusions
The addition of PubChem3D to the existing contents of PubChem is a considerable achievement, given the scope, scale, and the fact that the resource is publicly accessible and free. With the ability to uncover latent structure-activity relationships of chemical structures, while complementing 2-D similarity analysis approaches, PubChem3D represents a new resource for scientists to exploit when exploring the biological annotations in PubChem.
Keywords
Rotatable Bond Conformer Model Substance Conformer PubChem Compound Similar ConformerAbbreviations
- 2-D
(2-dimensional)
- 3-D
(3-dimensional)
- MMFF
(Merck Molecular Force Field)
- RMSD
(root-mean-square distance).
Background
PubChem [1, 2, 3, 4] (http://pubchem.ncbi.nlm.nih.gov) is an open repository for small molecules and their experimental biological activities. The primary goal of PubChem is to be a public resource containing comprehensive information on the biological activities of small molecules. PubChem provides search, retrieval, visualization, analysis, and programmatic access tools in an effort to maximize the utility of contributed information. The PubChem3D project adds a new layer to this infrastructure. In the most basic sense, PubChem3D [5, 6, 7, 8, 9, 10] generates a 3-D conformer model description of the small molecules contained within the PubChem Compound database. This 3-D description can be employed to enhance existing PubChem search and analysis methodologies by means of 3-D similarity. Prior to PubChem3D, this similarity approach was limited to a 2-D dictionary-based fingerprint (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt) to help relate chemical structures. With the advent of PubChem3D, this is now expanded to use a Gaussian-based similarity description of molecular shape [11, 12, 13] used in software packages such as ROCS [14] and OEShape [15] from OpenEye Scientific Software, Inc.
It is reasonable to ask, why do we consider 3-D similarity methodologies at all? To put it simply, 2-D methods, while very useful and far cheaper computationally, may not be enough. A pitfall of most 2-D similarity methods is a general lack of ability to relate chemically diverse molecules with similar biological efficacy and function. For example, if a small molecule adopts an appropriate 3-D shape and possesses compatible functional groups properly oriented in 3-D space, it will likely bind to the biological moiety of interest. This "lock and key" binding motif is a major premise of structure-based drug design, docking, and molecular modelling applied with varying degrees of success over the past twenty years or more [16, 17, 18, 19, 20, 21, 22, 23]. These "compatible functional groups" involved in binding small molecules to proteins, which are typically used to define pharmacophores, are referred to here simply as "features". Therefore, in this context, 3-D similarity considering both shape and feature complementariness may be useful to find or relate chemical structures that may bind similarly to a protein target.
In its essence, 3-D similarity adds another dimension to data mining and it can provide some degree of orthogonality from 2-D similarity results. With 2-D similarity, one can typically see by eye increased changes in the chemical structure molecular graph with increasing dissimilarity [8, 10]. With 3-D similarity, it is not always obvious by looking only at the molecular graph, often requiring one to visualize 3-D conformer alignments to relate diverse chemistries. In all, 3-D similarity is complementary to 2-D similarity and provides an easy-to-grasp understanding (i.e., one can readily see by examining a conformer pair superposition that both shape and features are similar) that may help to provide a contrast or new insight to the same (biological) data.
This work gives an overview of the PubChem3D project and its current capabilities. The technology and background that allowed 3-D methodologies to be economically applied to the tens of millions of chemical structures in the PubChem Compound database are described elsewhere [5, 6, 7, 8, 9, 10] covering various aspects of the project, including conformer model generation validation [6], the relative uniqueness of molecular shape [7], and 3-D neighboring methodology [8].
Construction and Content
1. PubChem3D Coverage
- (1)
Not too large (with ≤ 50 non-hydrogen atoms).
- (2)
Not too flexible (with ≤ 15 rotatable bonds).
- (3)
Consists of only supported elements (H, C, N, O, F, Si, P, S, Cl, Br, and I).
- (4)
Has only a single covalent unit (i.e., not a salt or a mixture).
- (5)
- (6)
Has fewer than six undefined atom or bond stereo centers.
PubChem Compound database 3-D coverage. As one can see, 89.6% of all records have a 3-D conformer model. If one includes the parent compound of salts, this coverage can be considered to be 92.3%. Of the cases not having a 3-D conformer model, the majority are due to the flexibility of the chemical structure being too great to be suitable for conformer generation.
2. Conformer Models
The computed coordinates for the 3-D representations are the essence of the PubChem3D project. Creation of the stored conformational models consists of multistep processes involving separate conformer generation, sampling, and post processing steps.
All conformers were generated by the OpenEye Scientific Software, Inc., OMEGA software [27, 28, 29, 30, 31] using the C++ interface, the MMFF94s force field [24, 25, 26] minus coulombic terms, and an energy filter of 25 kcal/mol. (Removal of coulombic terms [6, 32, 33, 34, 35] eliminated a bias towards conformations with energy-lowering intra-molecular interactions that tend not to be important for inter-molecular interactions, an important consideration given that the 3-D coordinates are generated in vacuo. Removal of attractive van der Waals terms did not have any noticeable effect [6].) A maximum of 100,000 conformers per chemical structure stereo isomer were allowed. When undefined stereo centers were present, each stereo isomer was enumerated and conformers independently generated. These stereo isomer conformers were then combined (2**5 = 32 maximum stereo permutations, 32 * 100,000 = maximum 3.2 million conformers).
Limiting to 100,000 conformations per stereo isomer can be a significant factor in limiting exploration of the conformational space. Ideally, one would want to explore the conformational space of a molecule exhaustively. In reality, it is not tractable to do so. For example, if one considers only three angles per rotatable bond and there are eleven rotatable bonds, this would yield 3**11 (= 177,147) possible conformers. If one considers four torsion angles per rotatable bond, and there are nine rotatable bonds, this would yield 4**9 (= 262,144) possible conformers. One can see how quickly systematic approaches can run into trouble with such exponential growth in the count of conformations and why there is a limit on how flexible a molecule is allowed to be.
With conformers generated, another important consideration is immediately obvious. It is not practical to store many thousands of conformers per compound. Therefore, after conformer generation is complete, the conformation count is reduced by sampling using root-mean-square-distance (RMSD) of pair-wise comparison of non-hydrogen atomic coordinates using the OEChem [36] OERMSD function with the automorph detection (which considers local symmetry equivalence of atoms such that, for example, rotation of a phenyl ring does not yield an artificially high RMSD) and overlay (which minimizes RMSD between conformers by rotation and translation of one conformer to the other) options selected. In some rare cases, the automorph detection was prohibitively expensive computationally and not used.
The sampling procedure employed is described elsewhere [7] but involves a two-stage clustering approach with an initial pass to partition-cluster conformers using an exclusion region hierarchy of decreasing dissimilarity (NlogN computational complexity, each cluster representative forms an exclusion region at a particular RMSD), followed by a step to remove edge-effects from the partition clustering (N2 computational complexity using only the cluster representatives at the desired RMSD). The RMSD value used when sampling was dependent on the size and flexibility of the chemical structure.
where "er" is the effective rotor count, "rb" is the rotatable bond count (computed using the OEChem "IsRotor" function) and "nara" is the count of non-aromatic ring atom count (OEChem OpenEye aromaticity model) excluding bridgehead atoms and SP2 hybridized atoms.
A post processing step was performed, after conformer model RMSD sampling, to completely relax the hydrogen atom locations by performing a full energy minimization where all non-hydrogen atoms were kept frozen. A subsequent "bump" check removed any conformers that had MMFF94 atom-atom interactions greater than 25 kcal/mol. Finally, each conformer was rotated and translated to their principal steric axes (i.e., non-mass weighted principal moments of inertia axes) considering only non-hydrogen atoms.
It is important to note that the conformers produced are not stationary points on a potential energy hypersurface. In fact, one can readily achieve lower-energy conformations of a given chemical structure by performing an all-atom energy minimization to remove any bond, angle, or torsion strain present in vacuo. The PubChem3D conformer model for a chemical structure is meant to represent all possible biologically-relevant conformations that the molecule may have. In theory, one should have a reasonable chance to find any biologically accessible conformation within the RMSD sampling distance of the conformer model.
3. Conformer Model Properties
PubChem3D properties and descriptors
For each Compound | For each Conformer | |
---|---|---|
MMFF94 partial charges | X | |
Pharmacophore features | X | |
Conformer model sampling RMSD | X | |
Diverse conformer ordering | X | |
Conformer identifier | X | |
MMFF94s Energy (minus coulombic terms) | X | |
Conformer volume | X | |
Steric shape moments | X | |
Shape self-overlap volume | X | |
Feature self-overlap volume | X | |
Shape fingerprint | X |
The feature definition lists the set of non-hydrogen atoms that comprise a given fictitious feature atom. The feature definitions are computed using the OEShape "ImplicitMillsDeans" forcefield [15, 37]. Care is taken to (iteratively) merge feature definitions of common type that are within 1.0 Å distance of each other. Each feature definition is used to generate a fictitious "color" atom, whose 3-D coordinates are at the steric center of the atoms that comprise it (i.e., at the average {X, Y, Z} value). There are six feature types used: anion, cation, (hydrogen-bond) acceptor, (hydrogen-bond) donor, hydrophobe, and ring.
where ComboT is the combo Tanimoto, ST is the shape Tanimoto, and CT is the color Tanimoto.
A diverse ordering of conformers is provided for each compound conformer ensemble [8, 39, 40]. Using the lowest energy conformer in the ensemble as the initial default conformer, the conformer most dissimilar to the first is selected as the second diverse conformer. The conformer most dissimilar to the first two dissimilar conformers is chosen as the third diverse conformer. This process is repeated until there are no more conformers to be assigned a dissimilarity ordering. Similarity is measured by ST (Equation 3) and CT (Equation 4), involving a conformer superposition optimization [11, 36] to maximize the shape volume overlap between two conformers by means of rotation and translation of one conformer to the other. This is followed by a single point CT computation at the ST-optimized conformer pair overlay. The ST and CT are then added to yield a combo Tanimoto (Equation 5). The conformer with the smallest sum of combo Tanimoto to all assigned dissimilar conformers is selected as the next most dissimilar. In the case of a tie, the one with the largest sum of combo Tanimoto to unassigned conformers is used.
Note that PubChem has another source of 3-D information of small molecules, besides PubChem3D. The PubChem Substance database (unique identifier: SID) contains 3-D structures of small molecules deposited from individual depositors, which can be either experimentally determined or computationally predicted. For clarification, these depositor-provided structures are called "substance conformers", and the theoretical conformers generated by PubChem3D for each PubChem Compound record (unique identifier: CID) are called "compound conformers". For an efficient use of the PubChem3D resources, it is necessary to assign a unique identifier to each of compound conformers in the PubChem Compound database and substance conformers in the PubChem Substance database. The global conformer identifier (GID) uniquely identifies each conformer and is stored as a hex-encoded 64-bit unsigned integer, where the first 16-bits (0x000000000000FFFF) correspond to the local conformer identifier (LID), which is specific to a given conformer ensemble, the next 16-bits (0x00000000FFFF0000) are the version identifier (always zero for PubChem3D compound conformers, but nonzero for deposited substance conformers) and the last 32 bits (0xFFFFFFFF00000000) correspond to the structure identifier. This identifier is a compound identifier (CID), if the version identifier is zero, and a substance identifier (SID), when the version identifier is non-zero (the version identifier indicates the substance version to which the conformer corresponds). Substance conformer identifiers allow deposited 3-D coordinates to be utilized effectively by the PubChem3D system. As one can see, the GID provides global conformer identification system across all PubChem conformers.
A shape fingerprint is computed for the first ten diverse conformers. To generate this property, each conformer is ST-optimized to a set of reference conformers that describe the entire shape space diversity of the contents of PubChem3D. If the conformer is shape similar beyond a particular threshold to a reference conformer, the reference conformer identifier (CID and LID) and a packed rotational/translational matrix (64-bit integer) are retained. This makes each set reference conformer like a bit in a binary fingerprint, however; in this case, additional information (the superposition) is also retained. One can imagine that these shape fingerprints are a little like coordinates in shape space, mapping where a given conformer is located.
This shape fingerprint can be used in several ways during 3-D similarity computation and was born out of our earlier research [8, 41] on "alignment recycling." This work demonstrated that similar conformers align to a reference shape in a similar way. This means that, if one is interested only in finding similar shapes, conformer pairs that do not have common shape fingerprint "bits" can be ignored (i.e., there is no need to perform a computationally intensive conformer alignment overlap optimization between two conformers when no common shape fingerprint reference exists, because the two conformer shapes are dissimilar to the extent that they may not need to be considered further). Additionally, when a common shape fingerprint reference exists between two conformers, one can "replay" the alignments of the two conformers to the common reference shape to yield a conformer alignment overlap between conformers that is (typically) very close to the optimal overlay; thus speeding up any conformer alignment overlap optimization but also providing an opportunity to further skip overlap optimization, when the best pre-optimized alignment overlap is not sufficient.
4. Similar Conformer Neighboring Relationship
Analogous to the precomputed "Similar Compounds" relationship for 2-D similarity, PubChem3D now provides a "Similar Conformers" neighboring relationship [8] using 3-D similarity. This neighboring takes into account both conformer shape similarity and conformer pharmacophore feature similarity. Essentially, this is equivalent to performing a shape-optimized similarity search using ROCS [14, 15] at a threshold of ST > 0.795 and CT > 0.495, when both conformers have defined pharmacophore features. To allow for compounds devoid of features to be neighbored, a threshold of ST > 0.925 is used, but with the caveat that both conformers must not have any defined pharmacophore features. Currently, three diverse conformers per compound are neighbored; however, this may change, with up to ten conformers per compound used as computational resources allow. The conformers used for neighboring correspond to the first "N" conformers in the diverse conformer list property. (See the Conformer Model Properties section.) This ensures maximal coverage of the unique shape/feature space of a chemical structure as additional conformers are considered in neighboring.
5. FTP Site
PubChem3D data is available on the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D). One may download in bulk 3-D descriptions of PubChem Compound records. On average there are approximately 110 conformers per compound in the PubChem3D system; however, not all data is provided for public download, in part due to the overall size being many terabytes, more data than one can readily share publicly. Therefore, two different subsets are provided in various file formats (SDF, XML, and ASN.1) that correspond to either the default conformer or the first ten conformers in the diverse conformer list property. (See the Conformer Model Properties section.) Beyond these two conformer subsets of PubChem3D, one may also find a description of the conformers that comprise the PubChem3D shape fingerprint. These conformers represent all shape diversity present in the PubChem3D system for a given analytic volume range and a given level of shape similarity ST threshold.
The "Similar Conformers" neighboring relationship is also provided for download. This conformer pair relationship (one per line) includes the respective conformer identifiers, ST, CT, and the 3 × 3 rotation matrix and translation vector (applied in that order) to superimpose the second conformer to the first. The rotation/translation refers to the coordinates provided in the download set of ten diverse conformers or otherwise available for download from our PubChem download facility. (See the Utility: Download section.)
Utility
1. NCBI Entrez Interface
PubChem3D Entrez indexes
Index name | Description and example query |
---|---|
ConformerCount3D | Conformer count (e.g., for compounds with only 1-5 conformers in their ensemble: "1:5[ConformerCount3D]") |
ConformerModelRmsd3D | Conformer sampling RMSD in Å (e.g., for compounds with 0.4 Å sampling RMSD: "0.4[ConformerModelRmsd3D]") |
EffectiveRotorCount3D | Effective rotor count (e.g., for compounds with 5.2-5.6 effective rotors: "5.2:5.6[EffectiveRotorCount3D]") |
FeatureCount3D | Total count of individual features (e.g., for compounds with 0-4 total features: "0:4[FeatureCount3D]") |
FeatureAcceptorCount3D | Hydrogen-bond acceptor count (e.g., for compounds with three acceptor features: "3[FeatureAcceptorCount3D]") |
FeatureAnionCount3D | Anion count at pH 7 (e.g., for compounds with 0-1 anion features: "0:1[FeatureAnionCount3D]") |
FeatureCationCount3D | Cation count at pH 7 (e.g., for compounds with one cation feature: "1[FeatureCationCount3D]") |
FeatureDonorCount3D | Hydrogen-bond donor count (e.g., for compounds with 5-11 donor features: "5:11[FeatureDonorCount3D]") |
FeatureHydrophobeCount3D | Hydrophobe count (e.g., for compounds with no hydrophobe features: "0[FeatureHydrophobeCount3D]") |
FeatureRingCount3D | Ring count (e.g., for compounds with 1-2 ring features: "1:2[FeatureRingCount3D]") |
Volume3D | Analytic volume of the first diverse conformer (default conformer) for each compound (e.g., for volume range 110-220.5 Å3: "110:220.5[Volume3D]") |
XStericQuadrupole3D | Qx (describes length) of the first diverse conformer (default conformer) for each compound (e.g., for Qx range 20.2-30.4 Å5: "20.2:30.4[XStericQuadrupole3D]") |
YStericQuadrupole3D | Qy (describes width) of the first diverse conformer (default conformer) for each compound (e.g., for Qy range 5.1-9.8 Å5: "5.1:9.8[YStericQuadrupole3D]") |
ZStericQuadrupole3D | Qz (describes height) of the first diverse conformer (default conformer) for each compound (e.g., for Qz range 1.3-4.2 Å5: "1.3:4.2[ZStericQuadrupole3D]") |
The indexes for "Volume3D", "XStericQuadrupole3D", "YStericQuadrupole3D", and "ZStericQuadrupole3D" correspond, respectively, to the analytic volume and the three steric quadrupole moments [9, 12, 42] for only the first conformer in the diverse conformer list (i.e., the default conformer). The steric quadrupoles essentially correspond to the extents of the compound, where X, Y, and Z correspond to the length, width, and height. For example, to find very long, near-linear compounds, one may give the PubChem Compound Entrez query "50:100[XStericQuadrupole3D] AND 0:1[YStericQuadrupole3D] AND 0:1[ZStericQuadrupole3D]". Please note that shortcuts exist for most indexes. These are documented in the PubChem Help "PubChem Indexes and Filters in Entrez" section (http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index).
PubChem also provides filtering capabilities. Unlike indexes, which hold discrete values, filters are Boolean-based (i.e., either a record is in the list or it is not). PubChem3D provides some additional filtering capabilities. In the case of the PubChem Compound database, there is a filter "has 3d conformer" that will indicate whether a given compound record has a 3-D conformer model by means of the PubChem Compound query: ""has 3d conformer"[filter]".
Filtering capabilities were also expanded in the PubChem Substance database. Two filters were added: "has deposited 3d" and "has deposited 3d experimental" to indicate when a substance record has 3-D coordinates and when the contributed 3-D coordinates were determined experimentally, respectively. For example, to find all experimentally determined 3-D structures for substance records, one would use the PubChem Substance databases query: ""has deposited 3d experimental"[filter]".
2. Visualization
Summary page enhancements. A snapshot of the PubChem Compound summary page of dopamine (CID 681). Clicking on the "3D" tab on the right side of the page shows the 3-D structure of the molecule. Clicking the "Compound information" in the "Table of Contents" box directs users to 2-D neighbors ("Similar Compounds") and 3-D neighbors ("Similar Conformers").
Visualization of a 3-D structure conformer. Clicking on the 3-D image on the PubChem Compound summary page (left) shows links to the web-based 3-D viewer (top right) and the Pc3D desktop helper application (bottom right).
The Pc3D viewer application can be downloaded and installed on PC, Mac, or Linux computers. A link to download this application can be found below the image on a given summary page or other PubChem3D aware pages (e.g., see the "Pc3D Viewer Download" icon in Figure 2). The viewer provides an interface for rendering 3-D structures of PubChem Compound records and visualizing their superpositions. With a customizable 3-D rendering engine that provides dynamic molecular visualization experience, it has the ability to create high-resolution, publication-quality images. It allows use of XYZ model files and SDF files and supports PubChem native formatted files (with the .pc3d or .asn extension).
Visualization of 3-D structure conformer superpositions. Superpositions between compound conformers are accessible from various PubChem3D-aware applications. The PubChem Compound summary page (top left) allows the "Similar Conformers" neighboring relationship to be visualized. The PubChem3D web-based viewer (bottom left) allows arbitrary superpositions to be generated. The PubChem Structure Clustering tool (bottom right) enables all pair-wise superpositions to be examined.
3. Search
The PubChem Structure Search system [1] (accessible via http://pubchem.ncbi.nlm.nih.gov/search/) allows one to search the PubChem Compound database using a chemical structure in various formats. PubChem3D adds a new capability to this system by allowing one to perform a 3-D similarity search and to visualize the results. At the time of writing, this similarity search is essentially equivalent to that described in the Similar Conformer Neighboring Relationship section. If 3-D coordinates are not provided for a chemical structure query, they are generated automatically, as-is possible, while keeping in mind that not all chemical structures can be covered by the PubChem3D system. (See the PubChem3D Coverage section for more details.) To aid in performing automated queries, a programmatic interface is available. (See the Programmatic Interface section for more details.)
A 3-D conformer search currently considers the first three diverse conformers per compound as candidates for "Similar Conformers". (See diverse conformer ordering in the Conformer Model Properties section.) Given that there are more than 27 million CIDs and three conformers per compound are being considered, this means that there are around 81 million conformers considered by each 3-D query. This count will change as a function of time as data is added to PubChem and as the count of conformers per compound are increased. To achieve adequate query throughput, an "embarrassingly parallel divide-and-conquer" strategy is employed. The PubChem Compound conformer data set is subdivided into multiple evenly-sized subsets. Each subset is then searched in parallel. If more query throughput is desired, and the computational capacity exists, the solution is simple; one simply needs to increase the count of evenly-sized subsets to simultaneously process.
4. Download
The PubChem Download facility [1] (http://pubchem.ncbi.nlm.nih.gov/pc_fetch) allows one to download PubChem records resulting from a search or a user-provided identifier list. With the advent of the PubChem3D layer, there is now the ability to download up to ten diverse conformers per compound. Alternatively, 3-D images may be downloaded (for the default conformer, only). A programmatic interface is available. (See the Programmatic Interface section for more details.)
5. Similarity Computation
The PubChem Score Matrix facility (http://pubchem.ncbi.nlm.nih.gov/score_matrix) allows one to compute pairwise similarities of a set of PubChem compound records (up to 1,000,000 similarity pairs per request). The PubChem3D layer adds the ability to compute 3-D similarities using up to ten conformers (either the first N-diverse conformers or a user-provided conformer set) per compound per request. Additionally, this service allows one to select the type of superposition optimization (shape or feature) to perform. A programmatic interface is available. (See the Programmatic Interface section.)
6. Clustering and Analysis
The PubChem Structure Clustering tool [10] (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=clustering) allows one to perform single-linkage clustering for up to 4,000 compounds at a time. This interactive tool provides visualization, subset, selection, and analysis capabilities. For example, the dendrogram allows compounds to be grouped into clusters by clicking the Tanimoto bar provided above and below the dendrogram (see the bottom right panel in Figure 4). One can then click on the cluster to view the individual compounds or perform other operations. The PubChem3D layer adds the ability to cluster compounds according to their 3-D similarities, with up to ten diverse conformers per compound. This service allows one to select: the superposition optimization type (shape or feature); whether to cluster all conformers or just the most similar conformer pair; and the conformer similarity metric.
7. Programmatic Interface
PubChem provides a programmatic interface called the Power User Gateway (PUG) [1]. This extends the capabilities provided by the NCBI eUtils programmatic interface [43], which interfaces the NCBI Entrez search engine contents. PUG can be used to send programmatic requests (e.g., to perform queries or other tasks). If a request does not complete, a request ID is returned. One uses this to "poll" whether the request is completed, at which point an URL is provided to obtain the results. This is necessary, considering that most user requests are queued and may not be executed or completed immediately. A PUG/SOAP interface exists to allow the SOAP-based protocol to be used to route requests. SOAP-interfaces are readily available for most programming (e.g., Java, C#, VisualBasic) and scripting languages (e.g., Perl, Python), as well as workflow applications (e.g., Taverna [44], Pipeline Pilot [45]). The PubChem3D layer extensions are now available in individual PUG-aware interfaces and by means of the PUG/SOAP interface.
Examples of use
To assist in understanding how PubChem3D can be useful to locate additional biological annotation and enhance one's ability to identify potential structure-activity relationships, a series of illustrative examples were prepared. These examples benefit from a recent study [10] of the statistical distribution of random 3-D similarities of more than 740,000 biologically tested small molecules in PubChem using a single conformer per compound, where the average (μ) and standard deviation (σ) of the shape-optimized ST, CT, and ComboT scores between two randomly selected conformers were found to be 0.54 ± 0.10, 0.07 ± 0.05, and 0.62 ± 0.13, respectively. The probability of two random conformers having a ST-optimized similarity score greater than or equal to the μ + 2σ threshold (i.e., 0.74, 0.17, and 0.88 for ST, CT, and ComboT, respectively) was 2%, 4%, and 3% for ST, CT, and ComboT, respectively. This statistical information is meaningful to provide reasonable 3-D similarity thresholds, whereby one can be confident that most of the 3-D similarities between chemical structures is not simply by chance. When a group of chemical structures with similar biological activity and function are shown to have 3-D similarity to each other above these thresholds, it suggests that a common macromolecule binding interaction orientation exists and, furthermore, that the features required for such binding are present.
1. Finding additional biological annotation
3-D similarity relationship finds additional biological annotation. Comparison of the 2-D "Similar Compound" and 3-D "Similar Conformer" neighboring relationships using dopamine to demonstrate how both neighboring relationship complement each other when locating related chemical structures with unique biological annotation.
2. Relating chemical probes for same biological target
Relating biologically active compounds by means of PubChem3D. Chemical probes ML088 (CID 704205) and ML087 (CID 25199559) from PubChem BioAssay 1548 against tissue non-specific alkaline phosphatase (TNAP, GI:116734717) are not similar by 2-D similarity but are by 3-D similarity.
3. Relating chemically diverse structures with same pharmacological action
Similarity score matrix for selected histamine H1 receptor antagonist anti-inflammatory drugs. The lower triangle of the score matrix corresponds to the 2-D similarity computed using the PubChem fingerprint. The upper triangle corresponds to the 3-D similarity ST/CT scores. The matrix elements in red text indicate a 2-D similarity ≥ 0.75 or 3-D similarity with ST ≥ 0.74 and ComboT ≥ 1.0. The first ten diverse conformers per molecule were superimposed using shape-based optimization and the single conformer-pair per compound-pair with the largest ComboT retained.
3-D superposition of selected histamine H1 receptor antagonist anti-inflammatory drugs. Although there is little 2-D similarity, using the PubChem fingerprint, substantial 3-D similarity is found between various structurally diverse anti-inflammatory drugs.
Conclusions
A new resource for scientists, PubChem3D, layered on top of PubChem, provides a new dimension to its ability to search, subset, export, visualize, and analyze chemical structures and their associated biological data. With a broad suite of tools and capabilities, 3-D similarity is given equal footing to assist in finding non-obvious trends in experimentally observed biological activity. As a complement to 2-D similarity, 3-D similarity demonstrates a capability to relate chemical series that are not sufficiently 2-D similar.
Notes
Acknowledgements
This research was supported (in part) by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U. S. Department of Health and Human Services. This effort utilized the high-performance computational capabilities of the Helix Systems at the National Institutes of Health, Bethesda, MD (http://helix.nih.gov).
EEB is very appreciative of the folks at OpenEye Scientific Software, Inc., for allowing their tools and methodologies to be utilized in the PubChem3D project and for the many fruitful suggestions/discussions/insights. EEB is also very thankful for the significant and continuing support of the NCBI systems staff, specifically Ron Patterson, Charlie Cook, and Don Preuss, without which the PubChem3D project would not exist.
We are very grateful to the reviewers for their careful consideration of this manuscript and useful suggestions.
Supplementary material
References
- 1.Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: integrated platform of small molecules and biological activities. Annual Reports in Computational Chemistry. Edited by: Ralph AW, David CS. 2008, Elsevier, 4: 217-241.Google Scholar
- 2.Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.CrossRefGoogle Scholar
- 3.Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH: An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010, 38: D255-D266. 10.1093/nar/gkp965.CrossRefGoogle Scholar
- 4.Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011, 39: D38-D51. 10.1093/nar/gkq1172.CrossRefGoogle Scholar
- 5.PubChem3D Thematic Series. [http://www.jcheminf.com/series/pubchem3d]
- 6.Bolton EE, Kim S, Bryant SH: PubChem3D: conformer generation. J Cheminformatics. 2011, 3: 4-10.1186/1758-2946-3-4.CrossRefGoogle Scholar
- 7.Bolton EE, Kim S, Bryant SH: PubChem3D: diversity of shape. J Cheminformatics. 2011, 3: 9-10.1186/1758-2946-3-9.CrossRefGoogle Scholar
- 8.Bolton EE, Kim S, Bryant SH: PubChem3D: similar conformers. J Cheminformatics. 2011, 3: 13-10.1186/1758-2946-3-13.CrossRefGoogle Scholar
- 9.Kim S, Bolton EE, Bryant SH: PubChem3D: shape compatibility filtering using molecular shape quadrupoles. J Cheminformatics. 2011, 3: 25-10.1186/1758-2946-3-25.CrossRefGoogle Scholar
- 10.Kim S, Bolton EE, Bryant SH: PubChem3D: biologically relevant 3-D similarity. J Cheminformatics. 2011, 3: 26-10.1186/1758-2946-3-26.CrossRefGoogle Scholar
- 11.Grant JA, Pickup BT: A Gaussian description of molecular shape. J Phys Chem. 1995, 99: 3503-3510. 10.1021/j100011a016.CrossRefGoogle Scholar
- 12.Grant JA, Gallardo MA, Pickup BT: A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996, 17: 1653-1666. 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.CrossRefGoogle Scholar
- 13.Grant JA, Pickup BT: Gaussian shape methods. Computer Simulation of Biomolecular Systems. Edited by: van Gunsteren WF, Weiner PK, Wilkinson AJ. 1997, Dordrecht: Kluwer Academic Publishers, 150-176.CrossRefGoogle Scholar
- 14.ROCS - Rapid Overlay of Chemical Structures, Version 2.2. 2006, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 15.ShapeTK - C++, Version 1.8.0. 2010, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 16.Meanwell NA: Synopsis of some recent tactical application of bioisosteres in drug design. J Med Chem. 2011, 54: 2529-2591. 10.1021/jm1013693.CrossRefGoogle Scholar
- 17.Nicholls A, McGaughey GB, Sheridan RP, Good AC, Warren G, Mathieu M, Muchmore SW, Brown SP, Grant JA, Haigh JA, et al: Molecular shape and medicinal chemistry: a perspective. J Med Chem. 2010, 53: 3862-3886. 10.1021/jm900818s.CrossRefGoogle Scholar
- 18.Mohan V, Gibbs AC, Cummings MD, Jaeger EP, DesJarlais RL: Docking: successes and challenges. Curr Pharm Design. 2005, 11: 323-333. 10.2174/1381612053382106.CrossRefGoogle Scholar
- 19.Simmons KJ, Chopra I, Fishwick CWG: Structure-based discovery of antibacterial drugs. Nat Rev Microbiol. 2010, 8: 501-510. 10.1038/nrmicro2349.CrossRefGoogle Scholar
- 20.Andricopulo AD, Salum LB, Abraham DJ: Structure-based drug design strategies in medicinal chemistry. Curr Top Med Chem. 2009, 9: 771-790. 10.2174/156802609789207127.CrossRefGoogle Scholar
- 21.van Montfort RLM, Workman P: Structure-based design of molecular cancer therapeutics. Trends Biotechnol. 2009, 27: 315-328. 10.1016/j.tibtech.2009.02.003.CrossRefGoogle Scholar
- 22.Sun H, Scott DO: Structure-based drug metabolism predictions for drug design. Chem Biol Drug Des. 2010, 75: 3-17. 10.1111/j.1747-0285.2009.00899.x.CrossRefGoogle Scholar
- 23.Kuntz ID: Structure-based strategies for drug design and discovery. Science. 1992, 257: 1078-1082. 10.1126/science.257.5073.1078.CrossRefGoogle Scholar
- 24.Halgren TA: Merck molecular force field. 1. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem. 1996, 17: 490-519. 10.1002/(SICI)1096-987X(199604)17:5/6<490::AID-JCC1>3.0.CO;2-P.CrossRefGoogle Scholar
- 25.Halgren TA: Merck molecular force field. 2. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J Comput Chem. 1996, 17: 520-552. 10.1002/(SICI)1096-987X(199604)17:5/6<520::AID-JCC2>3.0.CO;2-W.CrossRefGoogle Scholar
- 26.Halgren TA: MMFF VI. MMFF94s option for energy minimization studies. J Comput Chem. 1999, 20: 720-729. 10.1002/(SICI)1096-987X(199905)20:7<720::AID-JCC7>3.0.CO;2-X.CrossRefGoogle Scholar
- 27.OMEGA, Version 2.0. 2006, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 28.OMEGA, Version 2.1. 2006, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 29.OMEGA, Version 2.2. 2007, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 30.OMEGA, Version 2.3. 2008, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 31.OMEGA, Version 2.4. 2009, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 32.Musafia B, Senderowitz H: Biasing conformational ensembles towards bioactive-like conformers for ligand-based drug design. Expert Opin Drug Discov. 2010, 5: 943-959. 10.1517/17460441.2010.513711.CrossRefGoogle Scholar
- 33.Hawkins PCD, Skillman AG, Warren GL, Ellingson BA, Stahl MT: Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. J Chem Inf Model. 2010, 50: 572-584. 10.1021/ci100031x.CrossRefGoogle Scholar
- 34.Sadowski J, Bostrom J: MIMUMBA revisited: Torsion angle rules for conformer generation derived from X-ray structures. J Chem Inf Model. 2006, 46: 2305-2309. 10.1021/ci060042s.CrossRefGoogle Scholar
- 35.Bostrom J: Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools. J Comput Aided Mol Des. 2001, 15: 1137-1152. 10.1023/A:1015930826903.CrossRefGoogle Scholar
- 36.OEChem, Version 1.7.4. 2010, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 37.Mills JEJ, Dean PM: Three-dimensional hydrogen-bond geometry and probability information from a crystal survey. J Comput Aided Mol Des. 1996, 10: 607-622. 10.1007/BF00134183.CrossRefGoogle Scholar
- 38.Szybki TK, Version 1.5.0. 2010, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- 39.Barnard JM, Downs GM: Clustering of chemical structures on the basis of two-dimensional similarity measures. J Chem Inf Comput Sci. 1992, 32: 644-649. 10.1021/ci00010a010.CrossRefGoogle Scholar
- 40.Lajiness MS, Johnson MA, Maggiora GM: Implementing drug screening programs using molecular similarity methods. Prog Clin Biol Res. 1989, 291: 173-176.Google Scholar
- 41.Fontaine F, Bolton E, Borodina Y, Bryant SH: Fast 3D shape screening of large chemical databases through alignment-recycling. Chem Cent J. 2007, 1: 12-10.1186/1752-153X-1-12.CrossRefGoogle Scholar
- 42.Mansfield ML, Covell DG, Jernigan RL: A new class of molecular shape descriptors. 1. Theory and properties. J Chem Inf Comput Sci. 2002, 42: 259-273. 10.1021/ci000100o.CrossRefGoogle Scholar
- 43.Entrez Programming Utilities Help. [http://www.ncbi.nlm.nih.gov/books/NBK25501]
- 44.Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006, 34: W729-W732. 10.1093/nar/gkl320.CrossRefGoogle Scholar
- 45.Pipeline Pilot, Version 8.5. 2011, Accelrys, Inc.: San Diego, CAGoogle Scholar
- 46.Chan X, Brown B, Hedrick M, Rascon J, Millan JL, Sergienko E, Roth G, Reddy S, Dad S, Chung TDY, et al: HTS identification of compounds activating TNAP at an intermediate concentration of phosphate acceptor detected in luminescent assay. Probe Reports from the NIH Molecular Libraries Program. 2010, Bethesda, MD: National Center for Biotechnology Information, 2011/03/25 editionGoogle Scholar
- 47.Medical Subject Headings. [http://www.ncbi.nlm.nih.gov/mesh]
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.