NetControl4BioMed: a pipeline for biomedical data acquisition and analysis of network controllability
Network controllability focuses on discovering combinations of external interventions that can drive a biological system to a desired configuration. In practice, this approach translates into finding a combined multi-drug therapy that induces a desired response from a cell; this can lead to the development of novel therapeutic approaches for systemic diseases like cancer.
We develop a novel bioinformatics data analysis pipeline called NetControl4BioMed based on the concept of target structural control of linear networks. Our pipeline generates novel molecular interaction networks by combining pathway data from various public databases starting from the user’s query. The pipeline then identifies a set of nodes that suffices to control a given, user-defined set of disease-specific essential proteins in the network, i.e., a set through which these proteins can be driven from any initial state to any final state. We provide both the source code of the pipeline and an online web service based on it at http://combio.abo.fi/nc/net_control/remote_call.php.
Researchers can use the pipeline to control and better understand molecular interaction networks through combinatorial multi-drug therapies, in support of more efficient therapeutic approaches and personalised medicine.
Keywords: Network controllability; Software pipeline; Web service; Data acquisition and integration; Protein-protein interaction networks; Personalized medicine; Cancer
Over the last decade, high-throughput experimental technologies such as gene sequencing and proteomics have become the core of biomedical research and have generated large amounts of biomedical data . Recent advances in experimental data acquisition allow researchers to study the functions and properties of proteins, RNAs and genes, and to explore the network of interactions between them. The signal transduction network of protein-protein interactions (PPIs) is the backbone of signalling pathways , metabolic pathways , and various cell processes essential for normal cell function [4, 5]. Such networks are modelled mathematically as directed graphs, with nodes standing for the proteins in the network and directed edges standing for the signal transduction relationships between them. Each edge carries a positive “weight” signifying the relative strength of the corresponding interaction. To each node one may associate a variable that tracks the dynamic level of the corresponding protein. Each variable is influenced, through its incoming edges, by the levels of its predecessors in the network, and in turn influences, through its outgoing edges, the levels of all its successors. The quantitative level of this influence is usually described through a computational model based on difference equations or ordinary differential equations. The result is a linear dynamical system in which changes in one variable cascade through the network, eventually influencing the levels of many nodes. We call configuration or state (at some given time point) the collection of the levels of all variables associated to nodes in the network (at that time point).
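The cascading behaviour of such a linear model can be illustrated with a minimal sketch. It assumes a discrete-time difference equation x(t+1) = A x(t), where the entry of A for an edge u → v is the edge weight; the three-protein chain, its node names and its weights are invented for illustration and do not come from any database.

```python
from typing import Dict, List, Tuple

# Hypothetical toy network: directed edges (source, target) -> positive weight.
EDGES: Dict[Tuple[str, str], float] = {
    ("P1", "P2"): 0.5,
    ("P2", "P3"): 2.0,
}
NODES: List[str] = ["P1", "P2", "P3"]


def step(state: Dict[str, float]) -> Dict[str, float]:
    """One update x(t+1) = A x(t): each node sums the weighted levels of its predecessors."""
    nxt = {node: 0.0 for node in NODES}
    for (src, dst), weight in EDGES.items():
        nxt[dst] += weight * state[src]
    return nxt


def simulate(initial: Dict[str, float], steps: int) -> List[Dict[str, float]]:
    """Return the configurations at t = 0, 1, ..., steps."""
    trajectory = [dict(initial)]
    for _ in range(steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory


# A change injected at P1 reaches P2 after one step and P3 after two steps.
trajectory = simulate({"P1": 1.0, "P2": 0.0, "P3": 0.0}, steps=2)
```

Each element of `trajectory` is one configuration in the sense defined above: the levels of all node variables at a given time point.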
In recent years, the analysis of such directed signalling PPI networks through linear dynamical systems has been central to biological research, providing novel insights into modern molecular biology from the network perspective . To study the structure, function and dynamics of directed PPI networks, multiple computational systems biology approaches have been employed to reveal important links in various biological networks . This includes, among others, finding physical interactions (e.g., between proteins in PPI networks) and functional interactions (e.g., between genes with similar or related functions, or direct or indirect regulatory relationships between genes), and identifying network modules (clusters of intensively interacting molecules) , interaction patterns and topological properties of disease networks (such as cancers, HIV infections, diabetes mellitus, Parkinson’s disease, Alzheimer’s disease, etc.) .
A number of computational pipelines and software tools have been developed  to analyse interaction patterns and topological properties of PPI networks and to visualise them. The majority of these approaches focus on finding structurally important disease-associated protein interactions in a network [10, 11]. Recently, several algorithms have been developed to perform network structural analysis and to suggest optimal sets of so-called driven nodes through which one can control a network [12, 13, 14]. However, so far there are no known software solutions that analyse interaction networks for the purpose of identifying strategies to gain control over (parts of) the network. This paper aims to fill this gap by introducing the first open web-based tool implementing network controllability for biomedical networks.
A linear dynamical system is said to be (fully) controllable through a set of driven nodes if there exists a time-dependent sequence of input signals delivered through these nodes such that, through cascading changes, the system can be driven from any initial state to any desired final state within finite time [12, 15]. In the biomedical domain, the interventions can be thought of as drugs delivered to a patient, and the driven nodes can be thought of as the drug targets. An efficient method to select a minimal set of driven nodes in a gene regulatory network in order to reach its full controllability was recently presented in . However, computer-based experimental tests in  show that in biological networks one may have to control as many as 80% of the nodes of a gene regulatory network in order to gain full controllability. This makes the full network controllability approach impractical for biological and medical purposes. In many cases, it is more practical to control only a certain subset of the network’s nodes (for instance, a disease-specific set of essential proteins) in order to reach a desired overall behavior of the system [13, 14, 16]. This approach, called target controllability, may lead, for instance, to realistic suggestions for combined multi-drug therapies for a particular disease . We focus in this paper on target controllability.
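For one fixed choice of edge weights, target controllability can be checked with the standard Kalman-type rank condition for output controllability: the rows of the controllability matrix [B, AB, ..., A^(n-1)B] that correspond to the target nodes must be linearly independent. The sketch below is only an illustration of that condition, not the algorithm used in the pipeline. It assumes the convention that `a[i][j]` is the weight of the edge j → i, and it checks a single numeric weight assignment, whereas structural controllability asks about almost all weight assignments.

```python
from fractions import Fraction
from typing import List, Sequence

Matrix = List[List[Fraction]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a and b."""
    return [
        [sum((a[i][k] * b[k][j] for k in range(len(b))), Fraction(0))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def rank(m: Matrix) -> int:
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [row[:] for row in m]
    found = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[found])]
        found += 1
    return found


def is_target_controllable(a: Sequence[Sequence[float]],
                           driven: Sequence[int],
                           targets: Sequence[int]) -> bool:
    """Check rank(C [B, AB, ..., A^(n-1) B]) == len(targets) for one weight choice."""
    n = len(a)
    a_exact = [[Fraction(x) for x in row] for row in a]
    # B has one column per driven node (a unit vector at that node).
    block = [[Fraction(1 if i == d else 0) for d in driven] for i in range(n)]
    target_rows: Matrix = [[] for _ in targets]
    for _ in range(n):
        for k, t in enumerate(targets):
            target_rows[k].extend(block[t])
        block = mat_mul(a_exact, block)
    return rank(target_rows) == len(targets)


# Chain 0 -> 1 -> 2: one input at node 0 controls both downstream nodes.
chain = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
# Star 0 -> 1 and 0 -> 2: one input at node 0 cannot steer 1 and 2 independently.
star = [[0, 0, 0], [1, 0, 0], [1, 0, 0]]
chain_ok = is_target_controllable(chain, driven=[0], targets=[1, 2])
star_ok = is_target_controllable(star, driven=[0], targets=[1, 2])
```

The star example shows why the number of driven nodes matters: both targets sit at the same distance from the single input and always receive proportional signals, so a second driven node is needed.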
We develop a bioinformatics data analysis pipeline (called NetControl4BioMed) and its web-based front-end in order to provide a web-based service for the automatic generation of combined multi-drug therapy suggestions through the analysis of directed biochemical interaction networks. The pipeline automatically generates intracellular molecular interaction networks by combining the seed nodes provided by the user with interactions among proteins and other intracellular components from several public pathway repositories: KEGG, WikiPathways, and Pathway Commons. The core of the pipeline is an implementation of the algorithm proposed in . For a given set of disease-specific essential proteins, the algorithm identifies in the network a small set of driven nodes through which one can gain control over the essential proteins. To boost the practical applicability of the pipeline, we implemented a version of the algorithm that uses data from DrugBank to maximize the use of drug-targetable proteins as driven nodes. The pipeline can be accessed and downloaded from .
Structural network control
We give a brief presentation of the network controllability approach and of the algorithm proposed for it in . This algorithm aims to find a small set of driven nodes that can be used to control a given set of target nodes. The algorithm uses several heuristic strategies for an efficient exploration of the search space, which leads to faster and better (smaller sets of driven nodes) results in comparison to the original version of the target controllability algorithm proposed in .
We denote by N the set of nonnegative integers and by R the set of real numbers.
The targeted structural controllability problem was proved to be computationally hard in , where it was shown to be NP-hard. This means that no polynomial-time algorithm is known for calculating the minimal (in the sense of smallest) set of driven nodes that controls a given set of targets; exact methods need time exponential in the size of the network in the worst case, and are thus infeasible for practical real-life case studies. Instead, the authors in  proposed heuristics that return some set of driven nodes, hopefully small, but not guaranteed to be minimal. In  faster algorithms were proposed, based on stochastic searches for paths to the target nodes. These algorithms remain approximation heuristics and give no guarantee that they will find a minimal set of driven nodes; in the tests we made they returned results that are an order of magnitude smaller than those in . The implementation we chose for them in our pipeline is based on thousands of independent runs of the algorithm, with the best of the results reported as the final result.
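The "many independent runs, keep the best" strategy can be sketched as follows. This is not the authors' algorithm: the randomized heuristic here is a deliberately simple stand-in that covers targets from a hypothetical, precomputed map of which targets each candidate node could control, and it ignores the path conditions that real structural target control requires. All names and data are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, FrozenSet, Optional, Set


def randomized_cover(coverage: Dict[str, Set[str]],
                     targets: Set[str],
                     rng: random.Random) -> FrozenSet[str]:
    """Stand-in heuristic: visit candidates in random order until all targets are covered."""
    remaining = set(targets)
    chosen: Set[str] = set()
    candidates = sorted(coverage)
    rng.shuffle(candidates)
    for node in candidates:
        if not remaining:
            break
        if coverage[node] & remaining:
            chosen.add(node)
            remaining -= coverage[node]
    if remaining:
        raise ValueError("some targets cannot be covered by any candidate")
    return frozenset(chosen)


def best_of_runs(heuristic: Callable[[random.Random], FrozenSet[str]],
                 runs: int,
                 seed: int = 0) -> FrozenSet[str]:
    """Run a randomized heuristic `runs` times and keep the smallest result."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    rng = random.Random(seed)
    best: Optional[FrozenSet[str]] = None
    for _ in range(runs):
        candidate = heuristic(rng)
        if best is None or len(candidate) < len(best):
            best = candidate
    assert best is not None
    return best


# Hypothetical coverage map: node "A" alone reaches every target.
coverage = {"A": {"t1", "t2", "t3"}, "B": {"t1"}, "C": {"t2"}, "D": {"t3"}}
targets = {"t1", "t2", "t3"}
best = best_of_runs(lambda rng: randomized_cover(coverage, targets, rng), runs=200)
```

A single run may return a larger set (for example when "B" is visited before "A"); repeating the run many times makes it very likely that a small set is found, but, as noted above, gives no guarantee of minimality.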
Here we discuss the software tools used to build our pipeline and the data used in it.
Workflow engine: Anduril
The pipeline is developed for the Anduril workflow framework . Anduril is an open source component-based pipeline engine for scientific data analysis. Anduril defines an API (application programming interface) that allows rapid integration of a vast range of existing software analysis and simulation tools and algorithms into a single data analysis pipeline. An Anduril pipeline is a set of executable programs (called components) interconnected through well-defined I/O ports. When an Anduril component finishes executing, its outputs are delivered as inputs to its downstream components through the connections between its output ports and their input ports. During the execution of a pipeline, a component can be executed as soon as all the necessary data from its upstream components is available at its input ports.
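The execution rule described above (a component runs once all of its upstream outputs exist) can be sketched in plain Python. This is not Anduril's API or syntax; the function `run_pipeline` and the three component names are hypothetical and only mirror the dataflow idea.

```python
from typing import Any, Callable, Dict, List, Tuple

Component = Callable[[Dict[str, Any]], Any]


def run_pipeline(components: Dict[str, Component],
                 upstream: Dict[str, List[str]]) -> Tuple[Dict[str, Any], List[str]]:
    """Run each component as soon as all of its upstream outputs are available."""
    outputs: Dict[str, Any] = {}
    order: List[str] = []
    pending = set(components)
    while pending:
        ready = sorted(
            name for name in pending
            if all(dep in outputs for dep in upstream.get(name, []))
        )
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for name in ready:
            inputs = {dep: outputs[dep] for dep in upstream.get(name, [])}
            outputs[name] = components[name](inputs)
            order.append(name)
            pending.remove(name)
    return outputs, order


# Hypothetical three-step pipeline: load seeds, build a network, count its edges.
components: Dict[str, Component] = {
    "load_seeds": lambda inputs: ["AKT1", "MTOR"],
    "build_network": lambda inputs: [
        (a, b) for a in inputs["load_seeds"] for b in inputs["load_seeds"] if a != b
    ],
    "count_edges": lambda inputs: len(inputs["build_network"]),
}
upstream = {"build_network": ["load_seeds"], "count_edges": ["build_network"]}
outputs, order = run_pipeline(components, upstream)
```

Components that become ready in the same round do not depend on each other, so a real engine could run them in parallel; the sketch runs them sequentially for simplicity.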
Biological data and network generation
Our pipeline uses the Moksiskaan platform  to generate molecular interaction networks based on the user’s query. Moksiskaan integrates pathways, protein-protein interactions, genome and literature mining data into comprehensive networks, starting from a given list of proteins (so-called “seed nodes”). It combines the relations among proteins from different known pathways in order to address the fact that pathways crosstalk and influence each other. The Moksiskaan platform defines a generic database schema to store the pathways from a number of different pathway databases and can be scaled to include the pathway data from new sources (such as new databases and user’s own data). Currently, Moksiskaan has built-in support for the integration of the pathway data from, among others, KEGG pathway database , Pathway Commons , and WikiPathways [23, 24].
In our pipeline, Moksiskaan constructs a comprehensive network for the list of seed nodes by combining all imported pathways in the following manner: it connects the seed nodes by all known paths that contain at most “gap” intermediate nodes. The gap, a parameter that the user may set in the pipeline GUI, is the maximum number of intermediate nodes a path may have between two seed nodes. In other words, the pipeline searches for all paths of length up to gap+1 between the seed nodes and adds them to the network, along with all the intermediary nodes, so the network grows quickly in size for higher gap values. The higher the gap, the more comprehensive the network and the smaller the set of identified driven nodes, but also the slower the network analysis. The pipeline currently allows gap values up to 5.
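The role of the gap parameter can be illustrated with a small sketch that collects the edges of all directed paths between seed nodes having at most `gap` intermediate nodes. How Moksiskaan actually performs this search is not described here, so the function below, the restriction to simple paths, and the toy interaction map are all assumptions.

```python
from typing import Dict, List, Set, Tuple


def gap_network(interactions: Dict[str, List[str]],
                seeds: Set[str],
                gap: int) -> Set[Tuple[str, str]]:
    """Edges of all simple seed-to-seed paths with at most `gap` intermediate nodes."""
    edges: Set[Tuple[str, str]] = set()

    def extend(path: List[str]) -> None:
        for nxt in interactions.get(path[-1], []):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if nxt in seeds:
                full = path + [nxt]
                edges.update(zip(full, full[1:]))
            elif len(path) - 1 < gap:
                # path holds one seed plus len(path) - 1 intermediates so far
                extend(path + [nxt])

    for seed in seeds:
        extend([seed])
    return edges


# Hypothetical interaction data: A -> B directly, via X, and via X and Y.
interactions = {"A": ["X", "B"], "X": ["Y", "B"], "Y": ["B"], "B": []}
seeds = {"A", "B"}
net_gap0 = gap_network(interactions, seeds, gap=0)
net_gap1 = gap_network(interactions, seeds, gap=1)
net_gap2 = gap_network(interactions, seeds, gap=2)
```

In this toy case the network grows from one edge at gap 0 to three edges at gap 1 and five edges at gap 2, which mirrors the rapid growth with the gap value noted above.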
We use drug-target protein data from the open source DrugBank database . The DrugBank database combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug-target (i.e. sequence, structure, and pathway) information from bioinformatics and cheminformatics resources. For drug-target identifiers we selected all FDA (Food and Drug Administration)-approved drug-target proteins with known mechanisms, in total 1507 proteins.
We provide the user with a number of predefined sets of target proteins associated with specific cancer cell lines. These target proteins are cancer-specific essential proteins. We have included in the pipeline data for three types of cancer after mapping from the COLT-Cancer database . In particular, we considered 29, 23 and 15 cell lines for breast, pancreatic and ovarian cancer, respectively. Previous studies  showed that proteins with a lower GARP (Gene Activity Rank Profile) score are more strongly associated with oncogenesis. Therefore, we have selected only those essential proteins whose GARP value is negative and whose GARP-P value is less than 0.05. For more details about calculating the GARP score, see .
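The selection rule above (negative GARP value and GARP-P value below 0.05) amounts to a simple filter. The record layout and the numbers in this sketch are invented for illustration; they are not values from the COLT-Cancer database.

```python
from typing import List, NamedTuple


class GarpRecord(NamedTuple):
    protein: str
    garp: float      # GARP score for one cell line
    p_value: float   # associated GARP-P value


def select_essential(records: List[GarpRecord], p_threshold: float = 0.05) -> List[str]:
    """Keep proteins with a negative GARP score and a GARP-P value below the threshold."""
    return [r.protein for r in records if r.garp < 0 and r.p_value < p_threshold]


# Made-up example records for a single hypothetical cell line.
records = [
    GarpRecord("MTOR", -2.1, 0.01),   # kept: negative score, significant
    GarpRecord("ERBB3", -1.4, 0.03),  # kept
    GarpRecord("AKT3", -0.8, 0.20),   # dropped: not significant
    GarpRecord("NRG1", 0.5, 0.01),    # dropped: score not negative
]
essential = select_essential(records)
```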
Seed proteins: List of proteins that will be used as seed nodes by Moksiskaan to generate the network. This input can be any protein ID of Homo sapiens.
User-defined network: The user has the option to use a custom network in the pipeline instead of the Moksiskaan network.
Cancer Cell Lines: The user has the option to include data on a cancer cell line, whose set of essential proteins will be used as target nodes and/or as seed nodes. If the user does not include any cancer line, then the next field should not be empty.
Additional target proteins: A set of target nodes defined in addition to those in the “Cancer Cell Lines”. This input can be left empty if the previous field is set to a cancer cell line. These nodes may also be included as seed nodes.
Gap: The gap parameter used by Moksiskaan to generate the network.
Include drug information: Whether the pipeline should also include drug-target information for the driven nodes. If so, the driven nodes for which FDA-approved drugs exist are highlighted in the output of the pipeline.
User-defined drug-target proteins to be included in the analysis: The user has the option to also include a set of custom drug-target proteins. If the “Target By Drug” field is chosen, the user-defined custom drug-targets are considered along with the FDA-approved drug-targets.
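The input fields listed above, together with the rule that at least one source of target proteins must be given, can be captured in a small validation sketch. The class and field names are hypothetical and are not the names used by the web interface; the lower bound of 0 for the gap is also an assumption, while the upper bound of 5 is the one stated for the pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PipelineQuery:
    """Hypothetical container for the pipeline inputs described above."""
    seed_proteins: List[str]
    cancer_cell_line: Optional[str] = None
    additional_targets: List[str] = field(default_factory=list)
    gap: int = 1
    include_drug_info: bool = False
    custom_drug_targets: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the query is acceptable."""
        issues: List[str] = []
        if self.cancer_cell_line is None and not self.additional_targets:
            issues.append("choose a cancer cell line or list additional target proteins")
        if not 0 <= self.gap <= 5:
            issues.append("gap must be between 0 and 5")
        return issues


no_targets = PipelineQuery(seed_proteins=["AKT1", "NRG1"])
valid = PipelineQuery(seed_proteins=["AKT1", "NRG1"],
                      additional_targets=["MTOR", "ERBB3"], gap=2)
```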
Results and discussion
The network in Fig. 3 was generated from breast cancer specific proteins. We selected the AKT1, AKT3, NRG1, MTOR and ERBB3 proteins as seed nodes to generate the network. We chose MTOR and ERBB3 as target proteins because we found them to be essential proteins in the cancer cell line MDA-MB-231. Here, AKT1 is a drug-targetable driven node through which control can be gained over the cancer essential protein MTOR. Dysregulation of MTOR pathways leads to oncogenesis in breast cancer . HER2 over-expression by MTOR has been reported as one of the main causes of breast cancer [29, 30]. It has also been shown that AKT is a critical anticancer drug-target for rational drug discovery, as it takes part in multiple oncogene and tumor suppressor signaling networks . Our algorithm also predicts that the non-drug-targetable node NRG1 can be used to gain control over the cancer essential protein ERBB3. NRG1 is known to be involved in the dysregulation of ERBB3, which has a prominent role in oncogenesis [32, 33].
To demonstrate the wide applicability of the pipeline and its algorithmic back-end, we also analyzed two case studies on Type 2 diabetes and on Alzheimer’s disease protein-protein interaction networks. For Type 2 diabetes we gathered literature data on essential proteins from [34, 35, 36, 37]. Essential protein data for Alzheimer’s disease was gathered from [38, 39, 40, 41, 42].
In the case study on Alzheimer’s disease, our pipeline reported MTOR as a driven node through which control can be gained over the essential protein NOS3 (see Additional file 1: Figure S1). The G894T polymorphism of NOS3 is a known risk factor for Alzheimer’s disease [43, 44]. Previous research shows that MTOR could be a remarkable target for Alzheimer’s disease [45, 46]: the dysregulation of the MTOR signaling pathway is involved in the pathogenesis and progression of Alzheimer’s disease. Also, the use of MTOR inhibitors was reported as a therapeutic strategy for Alzheimer’s disease in .
In Type 2 diabetes, our pipeline reported MYC as a driven node through which control can be gained over the essential protein CDKN2B (see Additional file 1: Figure S2). This result correlates with earlier predictions of MYC as a drug-target in various cancers ; interestingly, MYC is not yet documented to be used in treatment options for Type 2 diabetes. SNPs in the 3’ UTR miRNA binding sites of CDKN2B increase the risk phenotype. Further, pancreatic beta-cell replication is regulated by CDKN2B , and its faulty regulation increases the risk of diabetes.
The structural network controllability approach gives better insight into a system modeled as a directed graph: for a set of target nodes it is possible to identify a set of driven nodes through which one can control the target nodes by an external intervention that uses the internal “wiring” of the network. It is a promising approach that gives one a system-level handle for directing the evolution of a complex system. Moreover, the approach allows the modeler to focus on the structure of the network, while avoiding the need to measure or identify many numerical parameters. It is widely applicable to any model presented as a directed network, with a set of key nodes over which indirect control is to be gained. Signalling transduction networks are particularly suitable for this approach. Other types of networks, e.g., metabolic networks, remain outside the applicability domain of this approach, as they are not amenable to being modeled as directed graphs.
We use here a recently developed algorithm  for structural targeted network controllability that identifies a small (though not necessarily minimal) set of driven nodes for a user-given set of target nodes. We implemented this algorithm in a pipeline (that can be downloaded and installed as stand-alone software) and in a related online service (a publicly available web interface for an instance of the pipeline installed on our servers). The pipeline automatically generates intracellular molecular interaction networks (by combining publicly available pathway data) and identifies driven nodes (preferably proteins targeted by FDA-approved drugs) for a set of target proteins defined by the user.
In this paper we also address the interesting problem of applying the controllability approach to a combination of data on FDA-approved drug-targets and data on cancer essential proteins for different types of cancers. Users can also apply this pipeline to other disease-specific target proteins. We anticipate that our pipeline has the potential to suggest novel therapeutic strategies that use currently known drugs.
Benchmark tests of the pipeline gave the following results. With fewer than 10 seed nodes and gap 1, the pipeline generates networks of about 30 nodes and 100 edges (the exact values depend on which seed nodes are chosen and on which interactions between them are known in the databases). Our structural network controllability algorithm processes networks of this scale and finds the driven nodes (called input nodes in the pipeline GUI) in about 1 second. For 10 seed nodes and gap 2, the pipeline generates networks of 20 to 50 nodes and 30 to 300 edges, which the algorithm analyzes in 1 to 3 seconds. With about 20 seeds and gap 1, or fewer than 10 seeds and gap 3, the pipeline generates networks of about 100 nodes and 1,000 edges, which are analyzed in about 5 seconds. With about 20 seeds and gap 2, we get networks of about 200 nodes and 2,500 edges, and the analysis takes about 20 seconds. For 20 seeds with gap 3 or 4, we get networks of 300 to 600 nodes and 6,000 to 9,000 edges, and the analysis takes 30 to 50 minutes. For about 20 seeds with gap 5, the pipeline generates networks of about 800 nodes and 11,000 edges, for which the algorithm computes the driven nodes in about 7 hours.
We conclude that our pipeline is practical for the analysis of networks of up to 1,000 nodes and 10,000 edges, since the results can be obtained within 1 day. For small networks (up to 100 nodes and 2,000 edges) the result is obtained in at most 2 minutes. We note that in practice the computational time needed by the algorithm starts growing extremely fast when the network size approaches 3,000 nodes. Also, the efficiency of the pipeline strongly depends on how many free CPU cores the host system provides, since the Python implementation of our network target controllability algorithm relies heavily on parallel threads. In particular, while benchmarking for this article we ran several computationally heavy pipeline tasks on a single system with 12 free CPU cores.
The software discussed in this article opens up network controllability methods for applications in a variety of domains. The focus has been on a user-friendly interface that includes a text-based input, a visual output, output files that are compatible with standard modelling software, and a web-based interface requiring no special installation on the user’s end. The software offers extra support for users in cancer medicine through its pre-loaded lists of essential genes for several types of cancer. We believe that researchers can use the pipeline to control and better understand molecular interaction networks through combinatorial multi-drug therapies, in support of more efficient therapeutic approaches and personalised medicine.
Availability and requirements
Project home page: http://combio.abo.fi/research/network-controlability-project/
Operating system(s): Platform independent, browser-based.
Programming language: Anduril, Python, PHP.
Other requirements: Modern web browser.
License: FreeBSD.
Any restrictions to use by non-academics: none.
Publication of this article was funded by the Academy of Finland through grant 272451, by the Finnish Funding Agency for Innovation through grant 1758/31/2016, and by the Romanian National Authority for Scientific Research and Innovation, through the POC grant P_37_257.
Availability of data and materials
The source code of the pipeline as well as an online web-service based on this pipeline are available at http://combio.abo.fi/nc/net_control/remote_call.php.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 7, 2018: 12th and 13th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2015/16). The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-7.
KrKa collected the data for the case studies and analyzed the results. VR and KeKa integrated the Anduril and Moksiskaan components into the pipeline and designed the web interface. VR deployed the Web service. EC designed the heuristic strategies implemented in the back-end of the pipeline. All authors contributed to designing the software. KrKa, VR, EC and IP wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
- 8. Zhou X, Menche J, Barabási A-L, Sharma A. Human symptoms–disease network. Nat Commun. 2014; 5(4212).
- 11. Jiang P, Wang H, Li W, Zang C, Li B, Wong YJ, Meyer C, Liu JS, C AJ, Liu XS. Network analysis of gene essentiality in functional genomics experiments. Genome Biol. 2015; 16(239). https://doi.org/10.1186/s13059-015-0808-9.
- 14. Czeizler E, Gratie C, Chiu WK, Kanhaiya K, Petre I. Target controllability of linear networks. In: Bartocci E, Lio P, Paoletti N, editors. Computational Methods in Systems Biology. CMSB 2016. Lecture Notes in Computer Science, vol 9859. Cham: Springer: 2016.
- 17. COMBIO. NetControl4BioMed: Network Controllability for Biomedicine. 2017. http://combio.abo.fi/software/netcontrol/. Accessed Apr 2018.
- 18. Shields RW, Pearson JB. Structural controllability of multi-input linear systems. In: 1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes. IEEE: 1975. p. 807–9. https://doi.org/10.1109/CDC.1975.270615.
- 21. Kanehisa M. Toward pathway engineering: a new database of genetic and molecular pathways. Sci Technol Japan. 1996; 59:34–8.
- 25. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2013; 42(D1):1091–7. https://doi.org/10.1093/nar/gkt1068.
- 28. Lee JJ, Loh K, Yap Y-S. Pi3k/akt/mtor inhibitors in breast cancer. Cancer Biol Med. 2015; 12(4):342–54. https://doi.org/10.7497/j.issn.2095-3941.2015.0089.
- 29. O’Brien NA, Browne BC, Chow L, Wang Y, Ginther C, Arboleda J, Duffy MJ, Crown J, O’Donovan V, Slamon DJ. Activated phosphoinositide 3-kinase/akt signaling confers resistance to trastuzumab but not lapatinib. Mol Cancer Ther. 2010; 9:342–54. https://doi.org/10.1158/1535-7163.MCT-09-1171.
- 30. Nagata Y, Lan K-H, Zhou X, Tan M, Esteva FJ, Sahin AA, Klos KS, Monia BP, Nguyen NT, Hortobagyi GN, Hung M-C, Yu D. Pten activation contributes to tumor inhibition by trastuzumab, and loss of pten predicts trastuzumab resistance in patients. Cancer Cell. 2004; 6(2):117–27. https://doi.org/10.1016/j.ccr.2004.06.022.
- 31. Cheng JQ, Lindsley CW, Cheng GZ, Yang H, Nicosia SV. The akt/pkb pathway: molecular target for cancer drug discovery. Oncogene. 2005; 24:7842–492. https://doi.org/10.1038/sj.onc.1209088.
- 37. Ayub Q, Moutsianas L, Chen Y, Panoutsopoulou K, Colonna V, Pagani L, Prokopenko I, Ritchie GRS, Tyler-Smith C, McCarthy MI, Zeggini E, Xue Y. Revisiting the thrifty gene hypothesis via 65 loci associated with susceptibility to type 2 diabetes. Am Soc Hum Genet. 2010; 94:176–85. https://doi.org/10.1016/j.ajhg.2013.12.010.
- 38. Talwar P, Silla Y, Grover S, Gupta M, Agarwal R, Kushwaha S, Kukreti R. Genomic convergence and network analysis approach to identify candidate genes in alzheimer’s disease. BMC Genomics. 2014; 199(15).
- 39. Zirnheld AL, Regalado EL, Shetty V, Chertkow H, Schipper HM, Wang E. Target genes of circulating mir-34c as plasma protein bio markers of alzheimer’s disease and mild cognitive impairment. J Aging Sci. 2015; 140(3).
- 41. Kim S, Nho K, Risacher SL, Shen L, Shaw LM, Trojanowski JQ, Weiner MW, Saykin AJ. Mapre2 as a novel alzheimer’s disease target gene from gwas of csf amyloid beta 1-42, tau and hyperphosphorylated tau in the adni cohort. J Alzheimer’s Assoc. 2015; 11(7):767.
- 43. Liu S, Zeng F, Wang C, Chen Z, Zhao B, Li K. The nitric oxide synthase 3 g894t polymorphism associated with alzheimer’s disease risk: a meta-analysis. Sci Rep. 2015; 13598(5). https://doi.org/10.1038/srep13598.
- 45. Cheng X, Zhang L, Lian Y-J. Molecular targets in alzheimer’s disease: From pathogenesis to therapeutics. BioMed Res Int. 2015; 2015:204–8.
- 47. Cai Z, Chen G, He W, Xiao M, Yan L-J. Activation of mtor: a culprit of alzheimer’s disease?. Neuropsychiatr Dis Treat. 2014; 11:1015–30.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.