1. Overall architecture
We have since extended the ChemBioGrid infrastructure to be the primary data source for WENDI. Additionally, for WENDI we have introduced the idea of aggregate web services that call multiple individual, or atomic, web services and aggregate the results from these services in XML. For example, the main web service used by WENDI takes as input a SMILES string representing a compound of interest, and outputs an XML file of information about the compound aggregated by calling multiple web services. This XML file can then be parsed by an intelligent client to extract information pertinent to compound properties. The overall architecture uses a four-layer approach, described previously [14], comprising storage, interface, aggregation and smart interface layers (see Figure 1). The storage and interface layers are implemented using the Web Service Infrastructure; the aggregate web services and smart clients are the work described here.
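The aggregation idea can be illustrated with a minimal sketch. It assumes each atomic service is wrapped as a Python callable that takes a SMILES string and returns a list of records; the element names (`wendi`, `section`, `hit`) and the stand-in services are hypothetical and are not taken from the actual WENDI schema.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

AtomicService = Callable[[str], List[dict]]


def aggregate(smiles: str, services: Dict[str, AtomicService]) -> str:
    """Call every atomic service with the query and wrap the results in one XML document."""
    root = ET.Element("wendi", {"query": smiles})  # hypothetical root element
    for name, call in services.items():
        # One section per atomic service, so a client can pick out what it needs.
        section = ET.SubElement(root, "section", {"service": name})
        try:
            for record in call(smiles):
                hit = ET.SubElement(section, "hit")
                for key, value in record.items():
                    ET.SubElement(hit, key).text = str(value)
        except Exception as exc:
            # A failing atomic service is recorded but does not abort the report.
            section.set("error", str(exc))
    return ET.tostring(root, encoding="unicode")


# Stand-in atomic services; real ones would be SOAP or REST calls.
def fake_drug_search(smiles: str) -> List[dict]:
    return [{"name": "drug-1", "similarity": 0.9}]


def fake_assay_search(smiles: str) -> List[dict]:
    raise RuntimeError("service unavailable")


xml_report = aggregate(
    "c1ccccc1",
    {"drugs": fake_drug_search, "assays": fake_assay_search},
)
```

Keeping one section per atomic service means that a slow or failing service only affects its own part of the aggregate document.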
Web services follow either the Simple Object Access Protocol (SOAP) standard [16] or the REpresentational State Transfer (REST) approach [17]; RESTful services are often better integrated with the Hypertext Transfer Protocol (HTTP) than SOAP-based services. Whilst we have both kinds of web service in operation, we primarily use REST services. For example, we have created a 3D similarity searching web service based on our local PubChem 3D database, which stores 3D structures [18] and 12 distance moments [19] for all the compounds in the PubChem database. This service is called by the WENDI web service.
Our SOAP-based services are deployed in a Tomcat 5.5 application container, which allows us to maintain these services easily and provides a high level of integration with our development environments; the services themselves are developed in Java 1.6.0. Our Web service layer is handled by the AXIS libraries 1.6 [20], which accept a SOAP message, decode it to extract the relevant function arguments, call the appropriate Web service classes, and finally encode the return value into a SOAP document for return to the client. Our Web services are published as WSDL [21], an XML-based standard for describing Web services and their parameters. Increasingly, we are converting our services to REST for even easier maintenance and access. A list of some of our atomic web services can be found on the web [22].
2. Database services
Our infrastructure contains a large number of compound-related databases, including mirrors of existing databases (such as PubChem), databases derived from these (such as 3D structures of PubChem compounds), and completely new databases (particularly those derived from the literature). Our databases are housed on a Linux server running the PostgreSQL database system, with gNova CHORD [23] installed to allow chemical structure searching and 2D similarity searching through the generation of fingerprints. Mirrored databases are updated monthly. Because the databases are housed in a homogeneous environment, it is easy to perform searches that cross multiple databases using single SQL queries, and to routinely expose the databases through web service interfaces. The following databases are used in the WENDI system:
PubChem Compound
A mirror of the PubChem Compound database, containing compound IDs (CIDs), InChI, SMILES, compound properties, and 166-key MACCS-style fingerprints [24] generated by the gNova CHORD system.
PubChem Bioassay
A mirror of the PubChem Bioassay database containing assay IDs (AIDs), CIDs of compounds tested, and bioassay outcomes and scores.
PubChem BioDesc
Descriptions of all PubChem bioassays.
Pub3D
A similarity-searchable database of minimized 3D structures for PubChem compounds.
DrugBank
A mirror of the DrugBank dataset [25] containing CIDs (mapping to PubChem), DBIDs (DrugBank IDs), drug names, SMILES, usage descriptions, and 166-key fingerprints. The database contains nearly 4,800 drug entries including >1,350 FDA-approved small molecule drugs, 123 FDA-approved biotech (protein/peptide) drugs, 71 nutraceuticals and >3,243 experimental drugs.
MRTD
An implementation of the Maximum Recommended Therapeutic Dose (MRTD) set [26] including name, SMILES, and 166-key fingerprints. The database contains 1,220 current prescription drugs available in SMILES format from the FDA Web site.
Medline Chemically-aware Publications Database
PubMed IDs of papers indexed in Medline [27], with SMILES of chemical structures (from the title and abstract) extracted using the Oscar3 program [28].
PhenoPred
A matrix of predictions of gene-disease relationships based on known relationships mined from the literature and machine learning predictions [29].
Comparative Toxicogenomics Database (CTD)
Cross-species chemical-gene/target interactions and chemical-disease relationships derived from experimental sets and the literature [30].
HuGEpedia
An encyclopedia of human genetic variation in health and disease [31].
ChEMBL
A database of bioactive drug-like small molecules, containing 2D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data) [32].
2D Tanimoto similarity searching of these datasets is made available by the gNova CHORD tanimoto function applied to the public 166-key 2D fingerprints, an implementation of the popular MACCS keys. Without indexing, it runs very effectively for a single query or on a small dataset, but the speed reduces significantly for large datasets. We have 56,911,891 compounds in our PubChem Compound table at the time of writing. To speed up the searching, we implemented a method described by Swamidass & Baldi to reduce the subset of molecules that need to be searched in similarity calculations [33]. The method uses simple bounds on similarity that can be applied when a similarity threshold is used: given two fingerprints A and B and a threshold t, the maximum possible similarity between the fingerprints is min(a, b)/(a + b - min(a, b)), which equals min(a, b)/max(a, b), where a and b are the number of bits set in A and B respectively. Any database fingerprint whose bound falls below t can be skipped without computing its actual similarity.
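The bit-count bound can be sketched in a few lines. This is an illustration only, not the PostgreSQL/CHORD implementation used in WENDI: fingerprints are assumed to be held as Python integers used as bit sets, and the function names and the toy library are hypothetical.

```python
from typing import Dict, List, Tuple


def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint held as an integer."""
    return bin(fp).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Exact Tanimoto similarity: shared bits over bits set in either fingerprint."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


def max_tanimoto(a: int, b: int) -> float:
    """Upper bound from bit counts alone: min(a, b) / (a + b - min(a, b))."""
    low = min(a, b)
    denominator = a + b - low  # equals max(a, b)
    return low / denominator if denominator else 0.0


def search(query: int, library: Dict[str, int], t: float) -> List[Tuple[str, float]]:
    """Return (id, similarity) pairs with similarity >= t, best first."""
    a = popcount(query)
    hits = []
    for cid, fp in library.items():
        if max_tanimoto(a, popcount(fp)) < t:
            continue  # the bound shows this compound cannot reach the threshold
        score = tanimoto(query, fp)
        if score >= t:
            hits.append((cid, score))
    return sorted(hits, key=lambda hit: -hit[1])


# Toy 8-bit fingerprints; real MACCS-style fingerprints have 166 keys.
library = {"A": 0b11110000, "B": 0b11100000, "C": 0b00000001}
results = search(0b11110000, library, t=0.7)
```

The condition `max_tanimoto(a, b) >= t` is equivalent to `a*t <= b <= a/t`, so if bit counts are stored alongside the fingerprints, candidates could in principle be narrowed with a simple range filter before any fingerprint comparison is made.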
In addition to 2D similarity searching, 3D similarity searching is provided on the Pub3D database using 12-dimensional molecular shape descriptors [20] calculated for the minimized 3D structures of PubChem compounds. Similarity to a query is calculated using Euclidean distance. We store the 12D vectors for all compounds in PostgreSQL using the CUBE type extension [34].
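As an illustration of the distance calculation, the following sketch ranks a small in-memory library by Euclidean distance to a 12-dimensional query vector. It is a plain Python stand-in for the database-side search, and the function names and example vectors are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def nearest(
    query: Sequence[float], library: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k library entries closest to the query, smallest distance first."""
    scored = [(cid, euclidean(query, vec)) for cid, vec in library.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Invented 12-dimensional shape descriptors.
query_vec = [0.0] * 12
shape_library = {
    "far": [3.0, 4.0] + [0.0] * 10,
    "near": [0.0] * 11 + [1.0],
}
top_hit = nearest(query_vec, shape_library, k=1)
```

A smaller distance means a more similar shape, so in contrast to Tanimoto searching the cutoff is an upper limit on the score rather than a lower one.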
3. Prediction services
We have made available a variety of predictions through our web service framework, particularly:
- Tumor cell line predictions. We created 40 Random Forest models for prediction of human tumor cell line inhibition, trained using data from the NCI Developmental Therapeutics Program Human Tumor Cell Lines [13]. These predictions output a probability of activity for a compound (0-1).
- Toxicity prediction. We implemented a specially modified Web service implementation of ToxTree [35] for prediction of toxic effects.
- Gene-disease relationships. We have implemented a table of predictions of gene-disease relationships extracted from the PhenoPred tool developed at Indiana University [29]. We also employed the CTD and HuGEpedia data to explore gene-disease relationships.
4. Aggregate web service and client
We have created a main WENDI aggregate web service, and a web-based client that employs the web service. The web service takes a query SMILES string as input (through a SOAP or REST interface), and calls a variety of web services and database searches using the query. Results are returned as an aggregate XML file with sections delineated according to the atomic web service that was called. Additional XML tags are added by the web service: in particular, Gene Ontology terms in the PubChem Bioassay descriptions, drug descriptions (from DrugBank), and paper titles and abstracts are extracted and tagged with Gene Ontology IDs (GOIDs). These permit associations to be made between genes and assays, drugs and papers.
The client permits the user to input a SMILES string, or to draw a structure using the JME editor [36]. It then uses JSP (Java Server Pages) to submit the query to the web service and to parse and display the XML results, and JavaScript to handle the XML response returned from the server. The exchange between the request submitted by the client and the response returned from the server is handled using Ajax (shorthand for Asynchronous JavaScript + XML) technology. With Ajax, web applications can retrieve data from the server asynchronously without interfering with the display and behavior of an existing page.
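To make the client-side parsing step concrete, here is a minimal sketch of how an aggregate document might be split back into per-service results. The WENDI client itself uses JSP and JavaScript; this Python version only illustrates the idea, and the XML layout, element names, and sample values (including the placeholder GOID) are all invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample document; the real WENDI schema may differ.
SAMPLE = """<wendi query="c1ccccc1">
  <section service="drugs">
    <hit><name>drug-1</name><similarity>0.91</similarity><goid>GO:0000000</goid></hit>
  </section>
  <section service="papers">
    <hit><pmid>0</pmid><similarity>0.88</similarity></hit>
  </section>
</wendi>"""


def hits_by_service(xml_text: str) -> Dict[str, List[Dict[str, str]]]:
    """Group the hits of an aggregate document by the service that produced them."""
    root = ET.fromstring(xml_text)
    grouped: Dict[str, List[Dict[str, str]]] = {}
    for section in root.findall("section"):
        # Each child element of a hit becomes one field of the record.
        hits = [
            {field.tag: (field.text or "") for field in hit}
            for hit in section.findall("hit")
        ]
        grouped[section.get("service", "unknown")] = hits
    return grouped


parsed = hits_by_service(SAMPLE)
```

Because each section is labelled with its originating service, a client can render only the sections it understands and ignore the rest.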
The primary way that the databases are employed in WENDI is through similarity searching, that is, finding compounds in the databases that are similar to the query and have some known property. For example, we retrieve compounds that are similar (>0.85 Tanimoto) to a query molecule and that are active in a given bioassay, are known drugs, or are referenced in a journal article. Based on the similar property principle [37], we can assume that these molecules are likely to have similar properties to the query compound, and thus to be of interest in understanding the potential properties of the query.
The WENDI interface is organized into six major sections:
Predictive models results presents the predicted probability of activity of the compound in 40 Human Tumor Cell line assays, organized by panel type (renal, non-small cell lung, breast, colon, etc.) and color coded according to probability of activity (red for ≥0.7, yellow for ≥0.6 and <0.7, and grey for <0.6). Confusion metrics are also presented to allow the validity of these models to be assessed. Also presented are the results of a ToxTree analysis, particularly the classification according to Cramer rules [38] and a breakdown of presence or absence of known toxic fragments.
Activities of similar compounds presents a list of similar compounds (Tanimoto similarity values given) in PubChem that have been tested in bioassays and shown to be active. A link to the bioassay along with the bioassay name is given, and an additional column uses the extraction of Gene Ontology terms from the bioassay description along with the PhenoPred predictions of gene-disease relationships to list possible related diseases. The DrugBank and MRTD sets are also similarity searched with the results presented in a similar fashion; in the case of DrugBank, drug usage descriptions are given along with predictions of diseases extracted in a similar way to the PubChem section.
Similar compounds from chemogenomics data presents a list of similar compounds (Tanimoto similarity values given) from the CTD and ChEMBL data, together with the relationships between those compounds and genes/diseases.
Similar compounds from Systems data presents a list of similar compounds (Tanimoto similarity values given) from KEGG data, together with the relationships between those compounds and pathways/enzymes.
Similar compounds in the literature lists journal articles in Medline where the title or abstract contains compounds with a Tanimoto similarity >0.85 to the query. Links are given to the journal articles.
Inactivities of similar compounds presents the same information as the Activities of similar compounds section, but for the similar PubChem compounds that have been tested in bioassays and shown to be inactive.
Finally, links are given to the raw XML file and to a PDF file for download.
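The color coding used in the predictive models section follows directly from the stated probability thresholds and can be sketched as below. The function name is hypothetical, and rejecting values outside 0 to 1 is an assumption added for the example.

```python
def activity_color(probability: float) -> str:
    """Map a predicted probability of activity (0-1) to its display color."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability >= 0.7:
        return "red"
    if probability >= 0.6:
        return "yellow"
    return "grey"


# Example: color a few invented predictions.
colors = [activity_color(p) for p in (0.95, 0.65, 0.2)]
```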