Introduction

The path to successful drug discovery frequently demands understanding the complex network-pharmacology (NP) landscape [1]. Understanding how the action of a drug perturbs the intricate regulatory relations among protein targets is pivotal to surmounting this challenge. Thus, it is imperative to associate drugs, targets, clinical outcomes and side effects within an integrated network [2]. Specifically, for drug-target interactions (DTIs), NP opens the possibility of evaluating potential DTIs in the context of their complex biological and chemical setting. Adopting and utilizing NP-based workflows in everyday research remains a challenging task, yet one that is likely to lead to more fruitful scenarios.

A rational design workflow that integrates NP needs to consider multiple interactions between a drug molecule and its intended (target, or “on-target”) and not-intended (“off-target”) interaction partners. Here, we consider proteins to be the key DTI partners. Due to the existence of multiple pathways between proteins, the effects of a drug molecule might ripple to proteins not in direct contact with the drug in question. Such perturbations may also alter gene-expression profiles of cells. The complex nature of this interaction-landscape demands novel and integrated target-chemistry-bioactivity approaches [3] in drug discovery and facilitates the devising of new therapeutic strategies, e.g., multitarget therapies. Therefore, it is imperative to develop new ways of analyzing such complex relationships. The ever-growing nature of biomedical data demands the use of modern technologies such as graph-databases, to facilitate the analysis, processing and retrieval of drug-target and target-target relations at scale. A computational NP framework addressing these aspects is essential in the quest to translate biomedical information into clinically relevant knowledge. Here we present such a framework: “SmartGraph”, which has two aims:

First, create an easy-to-use graphical user interface (GUI) that can facilitate various steps of the drug discovery process without requiring in-depth expertise in bioinformatics and cheminformatics. Target deconvolution of phenotypic assays, on- and off- target identification, lead and chemical probe discovery as well as drug repurposing are among the major activities that SmartGraph can support. Second, SmartGraph facilitates the development of predictive and analytic NP workflows.

Other network-analysis inspired tools can facilitate the analysis of DTIs and PPIs, such as the CARLSBAD Web Application and Cytoscape plugin [4], the ChemProt database [5], the STRING-based STITCH platform [6, 7], the Drug Target Explorer [8], Bioactivity-explorer [9], and the PathInsight [10] and PathLinker Cytoscape plugins [10, 11]. SmartGraph combines some of the distinctive characteristics of these platforms: It integrates the use of common chemical patterns as implemented in CARLSBAD and CarlsbadOne [12], the pathway analytic features of STITCH and STRING, the ability to define starting and end nodes (PathLinker), the integration of DTIs with PPIs (PathInsight) and the ability to predict new drug-target interactions (CarlsbadOne and Drug Target Explorer). Bioactivity-explorer has some similar functionalities with SmartGraph, mainly the use of scaffolds for relating targets but it lacks the integration of pathway analysis and bioactivity prediction. In addition, SmartGraph is built on Neo4j [13], a state-of-the-art graph database, and uses their bolt websocket interface [14] for fast data transfer. Angular is used to generate object models and UI templates, and RxJS [15] is used to asynchronously update the data retrieved from Neo4j. This data is coupled with a robust D3.js force-directed graph visualization to efficiently handle and display this large-scale data, and to support the integration of additional layers of biomedical data [16, 17]. Finally, distinctive features of SmartGraph compared to existing tools are a Google-Maps-like [18] “start and destination”-based pathway discovery feature, and an integrated bioactivity prediction algorithm based on target-class agnostic “potent chemical patterns” [19]. The use of the Neo4j database engine enables these features to work on a large scale of data and in a practically instantaneous manner, unlike existing tools.

Computational methods and datasets

Rationale

With compounds and targets as nodes and relationships as edges, we can model DTIs and PPIs as a network. An edge linking a compound and a target could represent an experimentally determined bioactivity value, whereas an edge between two targets could be a regulatory relation between the two. Nodes and edges can have various attributes, such as the IC50 (bioactivity) value or the specific regulatory mechanism linking two targets.

From a structure–activity-relations (SAR) standpoint, targets tend to recognize certain chemotypes, i.e. common chemical patterns in a series of molecules. Derived from known active compounds, chemotypes can be exploited in the prediction of novel DTIs. The Bemis-Murcko scaffolds [20] are examples of chemotypes, but other algorithms could be used for this task, such as Maximal Common Edge Subgraph algorithm [21], Scaffold Network Generator [22] and Hierarchical Scaffolds [23, 24] to name a few. We refer to chemotypes as ‘patterns’ throughout this text to imply the general nature of these objects. Detailed descriptions of object (node) and relation (edge) types are provided in the “Terminology” section in Additional file 1.

Preprocessing DTI and PPI information

Information for compounds, targets and their interactions were extracted from the ChEMBL (version: 24.1, downloaded on: 10/26/2018) [25, 26] and SIGNOR (version: 2.0, Homo Sapiens, downloaded on: 04/23/2017) [27] databases. With the help of a KNIME [28] workflow (detailed in Additional file 1) bioactivity data related to human targets were aggregated by unique compound-target-bioactivity type triples, in conjunction with Bemis-Murcko pattern detection (see “Bioactivity Data Aggregation” in Additional file 1). The aggregated bioactivity values represent the median value of all observed bioactivities for a given triple.

Data integration

Pre-processed DTI and PPI information (compounds, targets, patterns and their relationships) was then assembled into a backend Neo4j database, via JDBC-based Java code connecting the PostgreSQL [29, 30] (ChEMBL and SIGNOR) and Neo4j [13, 31] databases; the code is available publicly at https://github.com/ncats/smartgraph_backend.git. This repository contains an SQL script to create a database and load corresponding data. As input, the script uses KNIME-created files [28]. The Java code then uses DTI compounds, and DTI and PPI targets, represented by their InChI [32] keys and UniProt IDs [33], respectively. For convenience, the non-stereo part of the InChI key (NS-InChI key) of compounds are also provided as a compound-attribute, see: “Terminology” in Additional file 1. Each node and edge in the Neo4j database is assigned a unique UUID identifier [34] for efficient data handling and database queries.

Metadata

SmartGraph includes metadata associated with compounds, patterns, targets and their relationships as node and edge attributes, respectively. Such attributes include the median activity between a compound and a target (when multiple activities are available), the confidence value of a given PPI, its nature (e.g., phosphorylation, ubiquitination, etc.), and the overlap of a pattern with the associated compound as the ratio of their heavy atoms. Compound and pattern nodes are also annotated with the respective SMILES and structure depictions. Target nodes include gene symbols and names as attributes as well.

Network perturbation

To interpret phenotypic screening results, one can use SmartGraph to find all shortest paths between user-defined start and end nodes. When multiple such shortest paths are found, all connecting shortest paths between the selected two nodes are returned. Once merged into sub-networks, these shortest paths can be used to generate hypotheses for the explanation of mechanism-of-action of phenotypic screen hits. Despite the large amount of information stored in the backend database, this operation is performed efficiently by the underlying Neo4j database engine using the Cypher query language [13] and bolt websocket interface [14].

When the activity of a protein is perturbed, downstream proteins within the same pathway are affected as well. Given the complex nature of each pathways, such perturbations affect targets at increasing distances in a less and less deterministic manner. To account for this phenomenon, the parameter (k) introduced into the shortest path detection process limits the length of shortest paths in connecting start and end nodes. By default, k = 5.

Start nodes can be defined in two ways: First, a set of small molecules with their associated NS-InChI keys can be start nodes. In such cases, SmartGraph identifies proteins on which any of these small molecules were assayed using the DTIs extracted above. An additional parameter (p) can be used to filter DTIs that are of potency lower than p (in negative log units). Hence, proteins associated with DTIs of AC50/IC50/EC50 below p will not be considered as start nodes after filtering. By default, p = 5. Second, UniProtID IDs for proteins can also be used. SmartGraph supports mixed queries, i.e. input sets can contain both small molecules and proteins as start nodes.

The shortest path exploration process can be influenced by an additional parameter (c), regardless of the start node definition method. The SIGNOR database stores confidence (c) values for each PPI. By adjusting the value of c, low confidence PPIs can also be excluded from the shortest path discovery process.

It is possible to leave the end nodes undefined, which enables manual exploration of pathway perturbations. SmartGraph includes features to expand the node neighbors list on a node-type basis. For instance, in the case of a protein node it is possible to expand compound/protein/pattern neighbors. It is also possible to selectively hide neighbors based on node types.

Bioactivity prediction

Experimental bioactivities are often used to classify compounds as active or inactive. The cut-off value for such categorization is often context-dependent, particularly with respect to target type, and the wealth of information (e.g., number of compounds with measured bioactivities) available at any given time. The active/inactive separation may further depend on SAR, as some chemotypes may be considered more active than others.

In seeking to emulate a medicinal chemist's approach, we decided to classify the top 20% of the compounds (ranked in decreasing order of bioactivity) as “potent”, with the remaining compounds tested on that same target being labeled as “not potent”. This approach is preferable when working with multiple target classes, since the typical enzyme or ion channel bioactivity profile may vary by two or three orders of magnitude compared to a nuclear or G-protein coupled receptor. Thus, the 0.1 μM cut-off value (7 on the –log scale) does not always appropriately separate actives from inactives across all target classes. The 80/20 principle was generally applied to partition compounds, but for certain proteins, an alternative formula was required. When a protein had at most five (aggregated) bioactivity values, the cutoff value was set to 7 (0.1 μM). Cutoff values were lumped across all bioactivity values, regardless of type, e.g. AC50, IC50, EC50. Although some might use Ki as surrogate for IC50/AC50/EC50 values, in this study we did not consider Ki or other type of bioactivity values besides IC50/AC50/EC50 types.

Potent compounds for each target were tagged using bioactivity values from ChEMBL and the above criteria. We then examined the relationship between proteins based on their molecular recognition ability, estimated by chemical patterns (here: scaffolds) featured in potent compounds. These target-compound and compound-pattern associations allow us to infer novel associations between targets and patterns. All patterns contained in a potent compound can thus be considered potent patterns for that specific target. Associations between a target and potent patterns suggest that certain targets might be specifically perturbed by other (untested) compounds featuring the same potent patterns.

Results and discussion

Knowledge base of SmartGraph

The knowledgebase of the framework was implemented in Neo4j graph-database which contains 1,732,553 unique compounds and 4583 protein targets. With the help of Bemis-Murcko scaffold analysis, we extracted 523,944 unique scaffolds from compounds. We observed that some of these scaffolds are connected to too few or too many compounds. In order make the bioactivity prediction efficient, we decided to remove scaffolds deemed too general or too specific from the knowledge base, i.e. scaffolds connected at least to 100 or to less than 5 compounds, respectively. This process gave rise to 63,783 unique scaffolds, i.e. patterns, that participate in 793,085 one-to-many pattern-compound relations. Furthermore, 271,098 out of all compounds participate in 420,526 DTIs involving 2018 protein targets. Out of these targets, 1455 are associated with a potent pattern and 901 participate in PPIs, see: Fig. 1. All information about compounds, targets and scaffolds is stored in the Neo4j graph-database.

Fig. 1
figure 1

Knowledge base of SmartGraph. C2P pattern-of relations, T2P potent pattern-of relations, DTI drug-target interactions, PPI protein–protein interactions

Graphical user interface of SmartGraph

The network analysis and bioactivity prediction methods of SmartGraph are exposed through a web-based GUI available at https://smartgraph.ncats.io/. We recommend Chrome, Firefox and Safari browsers to use SmartGraph. This GUI was implemented using the D3 visualization library to generate a force directed graph interface, and the Angular [35] framework to track and update the graph data as it changed. The GUI also uses RxJS, an asynchronous event programming library to set up a websocket to communicate with the backend server through the Neo4j “bolt” protocol [14], making the queries much faster than if they were done through traditional API calls. The GUI itself is organized into three main panels (see: Fig. 2).

Fig. 2
figure 2

Web-based graphical user interface (GUI) of SmartGraph. Shown is an example network that was generated by 8 compounds and one destination target and after adjusting parameters k = 2, p = 3 and c = 0.10 in the control panel. The structures of two compounds are oriented on the left-hand side of the subnetwork. The structures of pattern nodes (BM-scaffolds) are outlined in blue, and their nodes are embordered by a yellow line. The end node is colored by dark red, whereas other targets are colored by orange. These targets constitute the shortest paths between the compounds and the destination target

The control panel (left side) provides the means to construct and fine-tune subnetworks for pathway analyses and is collapsible to allow for a larger view of the generated network. The selection of start and end nodes and the values of k, p and c can be set using this panel. The main (visualization) panel is used to visualize and interact with the generated subnetworks, clicking on a node reveals a floating menu with the following features: Expand/collapse compound/pattern/target/all neighbor nodes; and generate bioactivity predictions, respectively. The main panel also contains a smaller sub panel, which shows node details on hover, and allows the user to search for protein nodes within the displayed graph and isolate the nearest neighbors. The third, bottom (information) panel is used to visualize selected node and edge metadata in two separate tabs. Information is displayed in this table upon hovering over edges or by clicking on nodes. Clicking on a node or edge will persist it in this table, allowing users to generate a list of interesting nodes or edges. Some of the information displayed is the depiction of chemical structures and patterns, bioactivity in µM unit, and confidence of PPIs, to name a few.

The information displayed on the nodes can be configured via a collapsed menu accessed by clicking on the gear icon on the top-right corner of the visualization panel (next to “Help & Settings”). The user can set the label type on various nodes: protein/gene name, symbol, identifier and compound/pattern identifier. Structures of compounds and patterns can be displayed by clicking on “Show Chemical Structure” under “Compound Labels” and “Show Structures” under Pattern Labels, respectively. This menu also contains a legend denoting the different node types as well. The control panel can be collapsed by clicking on the icon with three bars on the top-left corner of the visualization panel to provide more visual real estate for the graph visualization. Finally, an additional panel overlaid on the network-panel provides means to search for target nodes in the network based on their names and to lookup their record in the Pharos database [36] via link-out. The same panel also provides the most important information of compound and pattern nodes. The source code used to generate the GUI is available publicly at https://github.com/ncats/smartgraph_frontend.

Use cases

To illustrate some of the features and typical use cases of the SmartGraph platform, we selected an mTOR signaling pathway assay (PubChem [37, 38] AID: 2660), using the qHTS technology [39] in confirmatory mode, i.e., dose–response using 7 concentration points. mTOR is linked to a number of diseases, including cancer [40]. The assay identified a downstream effect of mTOR activation. Out of 1280 compounds, 8 were confirmed actives. Below, we describe three typical use-cases for SmartGraph.

Manual exploration of biological relations

When active compounds are identified via a phenotypic screening, one of the next steps is to investigate targets associated with these actives and to try to link them to other targets that may play important role in the observed phenotype. Such an exploratory analysis can be initiated by providing the NS-InChI key of active compounds in the “Start Nodes” field of the control panel or by inputting the UniProt identifier of known targets. Here we use a selected active compound (NS-InChI key: DGWXOLHKVGDQLN, PubChem CID: 398148) as a starting point. The identified start node(s) appear in the visualization panel (compound and target nodes shaded blue and orange, respectively). Next, when left-clicking on the nodes a floating menu with several options appears. Manual exploration of relevant biological relationships can then be performed using the “Expand Target” feature, which in this case reveals three targets for which the compound at hand was tested: CDK1, CDK2 and ATR (gene symbols). On the information panel the bioactivity value of a DTI can be revealed by clicking on the “Link View” tab and hovering the mouse over, or clicking on the edge defined between the respective compound and the target. The result of further expanding this network on ATR target is shown in Fig. 3.

Fig. 3
figure 3

First use-case. Shown is a subnetwork that can be revealed by tracking the effect of a compound by the means of manual exploration. The network was assembled by defining a compound as a starting node (PubChem CID: 398148), and subsequently the neighboring target nodes of the compounds were expanded. Finally, the neighboring target nodes of ATR protein target, as a potential target of interest, were also expanded

Automated hypothesis generation for mechanism-of-action

While the first use-case can be performed using multiple start nodes, the process can be laborious given that some of targets are involved in multiple PPIs. As described in the “Network perturbation” section, SmartGraph can reveal shortest path(s) between a start node and an end node. Such an analysis can be initiated, for example, by entering a space/new-line/comma-separated list of the NS-InChI keys of the 8 active compounds pasted into “Start Nodes” and the UniProt identifier of mTOR (P42345) into “End Nodes”, given that the assay at hand interrogates mTOR activation. We recommend clicking the “clear graph” button, followed by re-inserting the respective node identifiers when one wants to re-define start or end nodes. After clicking the “find shortest path” button, SmartGraph will attempt to find shortest paths between the two sets using the default values of parameters. By setting the three filtering parameters to permissive values, i.e. k = 5, p = 3, c = 0.01, the resulting subnetwork includes three compounds with multiple shortest paths. This subnetwork and subsequent ones in this section are shown in Fig. 4.

Fig. 4
figure 4figure 4

Second use-case. a Shown is a network generated by using the 8 active compounds (PubChem AID: 2660) as start nodes and mTOR as end node. Values of parameters k = 5, p = 3, c = 0.01 were equal to their default values as they were not adjusted in the control panel. b Showcasing weak interactions: values of parameters k = 5, p = 5, c = 0.01. c The value of k was set to 2 allowing for a maximal length of 2 of shortest paths (k = 2, p = 5, c = 0.01). Other parameters were not adjusted. d Generating subnetwork comprised of interactions of higher confidence based on curation (k = 10, p = 5, c = 0.73)

It is possible to restrict the network to only strong DTIs, which we can achieve by increasing the value of p, e.g. p = 5 (or 10 µm) which sets a typical upper limit for active compounds. Consequently, one of the compounds (PubChem CID: 237) and paths induced by only that compound are eliminated from the subnetwork. Sometimes, it may be useful to limit the length of the shortest paths in order to eliminate distant (less certain) downstream effects. Accordingly, setting the value of k to 2 results in a smaller subnetwork. In this graph, the two compounds are more likely to perturb the action of the associated proteins.

Finally, the integration of an annotated and curated PPI database into the knowledge-base of SmartGraph (e.g., SIGNOR) allows the user to depict only PPI subnetworks with higher confidence. From a subnetwork generated with permissive parameter settings (k = 10, p = 5), we identified the maximal value of c that resulted in a subnetwork: c = 0.73. Increasing the value of c further eliminates the “weakest link” between AKT1 and mTOR, consequently destroying the path between the only remaining compound and mTOR. This use-case might be preferable for someone who prioritizes curated data in MoA analysis. SmartGraph additionally provides the literature source (extracted from SIGNOR) of individual PPIs in the “Link View” of the information panel upon hovering the mouse over, or clicking on target-target edges.

In summary, the above scenarios allow the generation of hypotheses regarding the potential MoA of the active molecules. In these workflows, parameters k, p and c provide great versatility in prioritizing different aspects of the exploration process. Furthermore, the generated subnetwork can assist in analyzing how small molecules might perturb other targets in the network besides the end node(s).

Bioactivity prediction via potent patterns

The final aspect of SmartGraph is its integrated predictive functionality. As described in “Bioactivity prediction” the investigator can predict potentially novel compounds for a target. In the frame of our example, bioactivity predictions can be made for mTOR as follows. By left-clicking on the mTOR node, then “Get predictions” in the revealed menu, compounds predicted to be associated with mTOR are shown. These compounds are clustered around potent patterns of mTOR that provided the basis for the prediction (see: Fig. 5a). The overlap ratio between compounds and patterns can be obtained using “Link View” when mouse-hovering over or clicking on the respective compound-pattern edges. Generally, the higher the overlap ratio the more likely the predicted compound will be confirmed as a perturbagen of the target at hand. If the predicted compound is an approved drug, such a prediction could serve as the basis of a drug-repurposing study. When the number of predictions is large, one can perform a similarity-network based clustering of the predicted compounds to select a set of diverse compounds for further studies [41,42,43].

Fig. 5
figure 5

Third use-case. a Result of prediction of active compounds for mTOR using the “Get Prediction” function of SmartGraph. Yellow nodes: patterns (BM-scaffolds), nodes depicting structures: compounds, orange and brown nodes: protein targets. This subnetwork is derived from the subnetwork generated by the last use-case of “Automated hypothesis generation for mechanism-of-action” section. b Predicting polypharmacology for compound A (NS-InChI key: QNUKRWAIZMBVCU, PubChem CID: 5289419). Scaffold overlap ratio: 0.89. Predicted targets: ALK (p68 kinase, P19525) and NEK2 (NimA-like protein kinase 1, P07949). Compound A was already tested in EGFR because it is connected to it

Off-target interactions can be predicted in a similar fashion, i.e. taking advantage of the concept of potent patterns. In this use-case we use one of the active compounds from the above set of eight actives as a start node that is referred-to as Compound A (PubChem CID: 5289419, NS-InChI key: QNUKRWAIZMBVCU, InChI key: QNUKRWAIZMBVCU-WCIBSUBMSA-N). Of note, defining Compound A as a start node via its NS-InChI key results in two compound nodes that differ only in the second section of their InChI key considering that these compounds are E/Z-isomers. Expansion of Compound A results in 11 neighbor nodes (10 targets, one pattern) for Compound A. Expanding the protein nodes of the pattern node reveals three targets (EGFR, NEK2, ALK) for which this is a potent pattern. Compound A was already tested on EGFR, but not on the other two targets (see: Fig. 5b). Thus, we hypothesize that Compound A is a modulator of NEK2 and ALK. The overlap ratio between the Bemis-Murcko scaffold and Compound A is 0.89, a reasonable probability for compound A being active on these two targets.

Conclusions

In this study we presented SmartGraph, an integrated and predictive web-based platform which supports complex cheminformatics and bioinformatics workflows in a simple-to-use manner. The knowledge-base of SmartGraph is derived from ChEMBL v24.1 and the SIGNOR PPI databases, and contains 420,526 unique DTIs involving 271,098 unique compounds and 2,018 unique targets. Utilizing state-of-the-art technologies for backend (Neo4j) and frontend (Angular, RxJS, D3) designs, SmartGraph features powerful analysis workflows and data visualization. This was showcased through several use-cases that cover hypothesis generation for elucidating MoA, drug repurposing and off-target-prediction.

The design of SmartGraph and the underlying Neo4j graph-database allows for the facile integration of additional layers of biomedical data, such as pharmacological action of drugs, non-small molecule drugs, disease information and target development categories [44] to name a few. We also intend to integrate more cheminformatics and network analysis features into the platform in the future. Further, the graphical interface of SmartGraph can be used in many scenarios when efficient visualization is required to display, organize and analyze large-scale data. Finally, we plan to create a stand-alone API based on the GUI of SmartGraph and release it as an open-source code base.

SmartGraph is available at https://smartgraph.ncats.io/.