CytoSolve: A Scalable Computational Method for Dynamic Integration of Multiple Molecular Pathway Models
- First online:
A grand challenge of computational systems biology is to create a molecular pathway model of the whole cell. Current approaches involve merging smaller molecular pathway models’ source codes to create a large monolithic model (computer program) that runs on a single computer. Such a larger model is difficult, if not impossible, to maintain given ongoing updates to the source codes of the smaller models. This paper describes a new system called CytoSolve that dynamically integrates computations of smaller models that can run in parallel across different machines without the need to merge the source codes of the individual models. This approach is demonstrated on the classic Epidermal Growth Factor Receptor (EGFR) model of Kholodenko. The EGFR model is split into four smaller models and each smaller model is distributed on a different machine. Results from four smaller models are dynamically integrated to generate identical results to the monolithic EGFR model running on a single machine. The overhead for parallel and dynamic computation is approximately twice that of a monolithic model running on a single machine. The CytoSolve approach provides a scalable method since smaller models may reside on any computer worldwide, where the source code of each model can be independently maintained and updated.
KeywordsSystems biology SBML Model composition Biochemical simulation Distributed computing Molecular pathways Kinetic modeling Parallel simulation Collaboratory Systems architecture Software
A molecular pathway model M can be thought of as “black box” which describes the biomolecular kinetics of a set of molecular species, S M . The input to the black box are the concentrations of species at time t = n, denoted as S M,n,. The internals of the black box contain computer source codes that evaluate the input using mathematical methods to yield an output. The output of the black box are the concentrations of species at time t = n + 1, denoted as S M,n+1.
The current methods for integrating models attempt to use direct computation to solve the problem, i.e. developing a program from scratch for each set of coupled reactions, or a monolithic approach, which takes individual component models in a single supported computer source code format such as Systems Biology Markup Language (SBML)11 and manually integrates them to create one monolithic software program. A variation on this second approach is to use semi-automation tools that help to automatically read and integrate source codes together to create one monolithic software program.24 The most common architectures such as Cell Designer,16 COPASI21 or E-CELL,26 use the monolithic approach.
The monolithic approach presents significant hurdles in scaling to large numbers of models. All models have to reside in the same geographical location, and the person integrating the models must be intimately acquainted with all the multiple pathways to be merged. Maintenance of the larger monolithic model emerges as a critical issue since new biological experiments may yield results that require changes to individual molecular pathways and their models’ source codes. If each model is in different source code format, conversion to a single format is required.
This paper describes CytoSolve, a computational environment for integrating multiple independent pathways dynamically in parallel to address the shortcomings of monolithic methods.
Characteristics of a Dynamic Methodology
A method that dynamically integrates the computations of the smaller molecular pathway models may obviate the need to: (1) merge source codes and (2) centrally maintain and update the source codes of each model. In fields such as ecology and earth sciences, platforms such DANUBIA3 have shown the value of integrating distributed models in parallel without the need to merge source codes. There are a number of important characteristics that such a dynamic method should exhibit.
First, it should be scalable. Scalability means the effort to integrate a new model is comparable to the effort to integrate the first model. Scalability has little to do with complexity. Two models with numerous equations, for example, can integrate easily if they are in the same programming language, have similar time scales, belong to the same knowledge domain and were developed on the same hardware and operating systems. Second, the method should be able to connect multiple knowledge domains in an opaque manner; such opacity will treat each model as a “black box” shielding the integrator from internal details. The integrator needs only know the inputs and outputs of each model. This will allow for integration of emerging compartmental models.
Third, the method should provide support for both public and proprietary models. Public models have accessible source codes. Proprietary models have inaccessible source codes. A pharmaceutical company, for example, with proprietary models may seek to integrate with public models, and researchers in an academic environment, alternatively, may seek to integrate their public models with proprietary models. Fourth, it should be extensible to provide support for heterogeneous source code formats including support for long standing formats such as MML generated by JSIM,21 and more recent standards such as SBML and CellML.12 Other important models have been published using programming languages such as MATLAB, Java or C. A computational architecture that supports dynamic integration will not require conversion to one standard format. Keeping a model resident on its native format will reduce time in source code rewriting and testing. An analogy is having a single program to display a collection of digital pictures stored in various graphical formats.
Last, the architecture should support localized integration. This means that users at all locations can initiate integration from their own local environment. Consider the scenario of the author of a Model A, who wishes to quickly test or integrate with an ensemble of three other models: Model B, C and D which are distributed across different machines. The author of Model A should not have to download the other three models to his/her local computer to perform the integration.
Ease of integrating an ensemble of disparate and distributed molecular pathway models, each owned by different authors is therefore critical to building large-scale models such as the whole cell. In his seminal work, Brooks5 demonstrated that as the number of different authors (or models) increases, the effort to perform such integration increases geometrically with the number of personnel communications required to simply coordinate the software development among the different authors. CytoSolve provides a new methodology to integrate models without requiring personnel interaction with the authors since each model need not be manually loaded, understood and interconnected into a single monolithic program.
The Presentation layer includes a Graphical User Interface (GUI) and Web Services. The user interacts with the GUI to specify one or more molecular pathway models to be integrated. A set of molecular pathway models may have common species and/or duplicate reaction pathways. For the former case, e.g. two models may refer to species Calcium but one may have refer it as “Ca++” and another as “Cal”, a Web Service is provided, which parses the models and detects potential naming conflicts and allows the user through the GUI to confirm or reject identical species. The Web Services also provides the user a mechanism to identify common reaction pathways to enable alignment across models. All user-defined changes or such annotations to the models to resolve species differences and reaction duplications are stored and updated within the Ontology of the Database, for later use by the Controller, during model integration. Once the user has specified models and resolved conflicts, the Controller via the Monitor, is invoked for executing model integration.
The Controller coordinates individual computations and couples models to derive the integrated solution. The Controller includes libraries that support direct model-to-model messaging as well as model-to-controller messaging. CytoSolve’s Controller has three components: the Monitor, the Communications Manager (Comm Mgr) and the Mass Balance. The Monitor serves to track the progress of each model’s computation. The Monitor knows, for a particular time step, which models have completed and which models have not completed their calculation. The Comm Mgr or Communications Manager coordinates the communication across all models. The Comm Mgr initiates a model to compute a time step of calculation and also can instruct a model to wait or hold on computing the next time step. The Mass Balance integrates, for each time step, the calculations across an ensemble of models by ensuring mass conservation of species, to derive the integrated solution.
The Communications layer contains the Inter-process Communications (IPC) infrastructure. The IPC allows communication of user parameters (e.g. which models to run) and results between the Controller and the Models. IPC allows the Controller to perform dynamic messaging using two important operations. First, the Controller may message a model with input values of species concentrations at time step n, S M,n and request the model to execute one time step of calculation. Second, a model, following execution of one time step of calculation, can message the Controller to send the output values of species concentrations at time step n + 1, S M,n+1. These operations enable the Controller to manage and steer the individual computations across multiple models in parallel.
The Data Base layer consists of storage of the Solution and the Ontology. The Solution (as detailed in Appendix B) holds memory resident data to track species concentrations across all models for each time step. The Ontology manages nomenclature and the annotations of species identification and duplicate reaction pathways, across models to ensure consistency during the Controller’s computation of the integrated solution (see Appendix C). The Ontology can be evolved to support more complex descriptions.
The Models layer denotes the set of models to be integrated. These models may each reside on different servers, remote to the GUI. Or, the models may reside on a single server, possibly the same server as the GUI, but may run as individual processes. CytoSolve treats each model as a module whose model code can be as simple or as complex as possible. However, what is important is the format of the inputs and outputs to and from the model, respectively. Appendix D provides a diagrammatic representation of a model, its input and outputs and how these inputs and outputs are linked with the temporal inputs and outputs of other models to evaluate a solution.
At the Controller layer, the three components, Monitor, Comm Mgr and Mass Balance, are programmed using J2EE. The Controller is executed on a Pentium 4 CPU 3.00 GHz Dell Workstation with 2 GB of RAM running Windows XP with Service Pack 2. Each component can be multi-threaded and communicates using message passing. At the Data Base layer, Virtual Memory is allocated for the Solution storage (as detailed in Appendix B) and a Java DB and ASCII files are used for the Ontology (as described in Appendix C).
At the Communications layer, IPC were originally implemented using the Simple Object Access Protocol (SOAP),23 but for performance reasons have been rewritten using the Java Native Interface (JNI lib).19 This XML-based messaging format established a transmission framework for IPC communication via HTTP. Both SOAP and JNI are vendor-neutral technologies that provide attractive alternatives earlier protocols, such as CORBA or DCOM. The Web Services Description Language (WSDL)6 supplied a language for describing the interface of the web services.
The Models used in the implementation are written in SBML and were acquired from the BioModels Database17 which provides access to published, peer-reviewed, quantitative models of biochemical and cellular systems delivered in SBML and CellML formats.
The CytoSolve architecture uses an interpreter to enter reaction and specie data relative to the pathway. In the current production version, we provide an input filter only for models coded in SBML. Other input filters for models stored in other formats can be incorporated within the architecture, provided the input filter for that format is developed and tested. In our earlier internal development efforts, we have produced prototype input filters for model formats written in MATLAB and Java. Our development roadmap, for future production versions looks to release these and other input filters to support CellML and MML. Such development may occur by our research team or through collaboration with the developers of those two formats.
Fundamentally, CytoSolve treats each individual pathway as a “black box” to which input concentrations are sent and from which time-evolved outputs are received. In principle, CytoSolve’s architecture is not dependent on the internal computational methods e.g. ODE, Petri Net, Stochastic, etc., for evaluating the models dynamics, nor the source code formats they are written. As aforementioned, MATLAB representations in parallel with SBML ODE solvers have been tested in earlier versions. The critical issue for integrating the black boxes across different computational methods is the pre-computational alignment of the individual pathways. This requires that the species and the equations be precisely defined such that common species and common pathways among models are identified. We are developing automated methods to do this for pathways expressed in SBML that we believe can likely be generalized to CellML and MML. We do not have standard packages to do this with models written in C and MATLAB, however, for example. For those cases, the alignment between the different pathways would have to be done by hand.
The web-based GUI, recently ported from a non-GUI environment, currently runs on a server through a web connection to http://www.cytosolve.com. The browser is cross-platform a design that can be run on workstations, laptops and even mobile devices such as the iPhone. The servers which the Models run on is a Pentium 4 CPU 3.00 GHz Dell Workstation with 2 GB of RAM running Windows XP with Service Pack 2, or a Mac Pro Server with two Intel quad-core processors with 4 GB of RAM running OSX10.6 (Snow Leopard).
The SBML ODE Solver (SOS) library (SOSlib)20 is used to enable symbolic and numerical analysis of chemical reaction networks. SOSlib takes as input a file encoded in the SBML and computes time history of species concentrations for specified initial conditions and time steps. In this implementation, the SOSlib was modified, using the C programming language, to enable single time-step evaluation. Other solvers supporting CellML, MML,4 and other pathway description dialects can be used interchangeably with SOSlib. Even MATLAB programs have been used. After the setting of the parameters, the model(s) are executed and the results are displayed and stored. An Appendix E has been added which provides the main steps for using the CytoSolve system. As the web-based environment is new and developing, documentation, on-line help, tutorials and video demonstrations are forthcoming and being updated to provide more detailed instructions.
The Controller performs initialization of the system by allocating memory storage for the computed species concentrations of each model and the integrated model in Local Vectors, and Global Vector, respectively, as detailed in Appendix B. During this initialization, the initial species concentrations, S O,0, are set in the Global Vector. In addition, CytoSolve during this initialization performs various types of pre-checks on models prior to integration.
One pre-check is to ensure coordination of physical dimensions e.g. conversion of species concentrations to uniform units, for example, molecules/cell or nM units. The other pre-check is to ensure coordination of common species, e.g. conversion of species names referring to the same species to a uniform name, for example, “Ca++” or “Calcium”. Finally, one other pre-check is to ensure coordination of common pathways, e.g. conversion of reactions referring to the same reaction to a consistent and common one. For the coordination of physical dimensions, CytoSolve currently performs unit checking and unit conversion on all input parameters. This is done both at the species level and at the equation level. For the latter two pre-checks, coordination of species names and pathways, CytoSolve uses a graph-theoretic approach, based on reaction-component (species) reachability to identify duplicate graph (reaction) pathways. A combination of Uniprot ID’s and sub-string matches are used to identify common species and duplications to align the graphs. During pre-checking, the user is alerted on any inconsistencies across species naming. In that event, the user can confirm or reject CytoSolve’s identification of common species. The user’s changes are annotated and both the Uniprot ID’s and species names in the model are updated.
Monitor, which monitors the progress of each model’s computation, during its initialization, accesses the initial conditions, S O,0, from the Global Vector, and sets these as the initial conditions for each model’s species concentration, S M,0. Control is then passed to the Comm Mgr which awakens all the models to start up and become ready to process a time step of calculation, and then invokes the Monitor.
The Monitor proceeds to invoke all models in parallel to execute a time step of calculation using S O,n, the species concentration values of the integrated model O at time step n, as the input to all models. Each model executes and computes one time step of calculation on its own Remote Server. Monitor tracks the progress of each model’s completion. Once a model completes its computation, the output is stored in its Local Vector within the Virtual Memory, and the model then goes to sleep to optimize use of the Remote Server’s CPU usage. By sleep, we mean that the model goes dormant until invoked again by the Comm Mgr to process another time step of calculation. Once all models have completed their processing for a time step, the Monitor passes control back to the Comm Mgr, and the Monitor itself goes to sleep.
NOTE: The Mass Balance component uses the Ontology (as described in Appendix C), to ensure that the species identities are correct. For example, if two different nomenclatures (e.g. Ca++ and CALCIUM) exist for the same species, and the species are treated as different, then one species may get depleted while the other floats at near its original value, which is not going to give the correct result. The Ontology is therefore critical in ensuring the species identification across models is correct.
If the last time step has been computed, the Controller stops, performs a variety of cleanup functions to release resources, memory, etc., and returns control back to the GUI; otherwise, the Comm Mgr cycles again for the next time step. All models, even those with different time scales, as of now are invoked with a homogeneous time step. This will be area for future research as noted in the sub-section “Adaptive Time Stepping of the Controller” in the “Discussion and Conclusions” section.
Currently CytoSolve is only concerned with the concentrations of species changes at each time step, and has limited its input output stream to these values; however, the architecture supports transfer of other attributes, which could be invoked, tracked and coordinated by the Controller.
Validation of Cytosolve
Note, in Fig. 7, that species (EGF_EGFR)2-P, encircled in red, is shared by all four models; however the species SOS, encircled in green, is shared only between two models. CytoSolve and Cell Designer are run for a total of 10 s in simulation time.
Comparison of cytosolve and cell designer
Solution Difference (%)
CytoSolve compute time (ms)
Cell Designer compute time (ms)
Column 2 and Column 3 in Table 1 provide the compute times for using CytoSolve and Cell Designer, respectively. CytoSolve’s compute time is roughly twice that of Cell Designer. The additional compute time is primarily due to network latency required for CytoSolve’s Controller to contact and receive information back from each model running on different Remote Servers. Cell Designer has no network latency since each model runs on a single computer.
Discussion and Conclusions
This paper has introduced CytoSolve, a new computational environment for integrating biomolecular pathway models. The initial results from the EGFR example has demonstrated that CytoSolve can serve as an alternative to the monolithic approaches, such as Cell Designer. Most important is CytoSolve’s core feature for integrating multiple pathway models, which can be distributed across multiple computing systems and solved in parallel, obviating the need to merge models into one system, running on a single computer. In the EGFR example, this means that if changes are made to Model 1, in Fig. 7, then CytoSolve simply has to be executed to evaluate a new solution; however, Cell Designer will require that changes to be manually merged back into the whole model and then executed. So in practicality, if each Model (1, 2, 3 and 4) is each owned by different authors, who make changes, then constant maintenance will be needed using Cell Designer to ensure the model is up to date.
The purpose of CytoSolve is to offer a platform for building large-scale models by integrating smaller models. Clearly, modeling the whole cell from hundreds of sub-models, each of which is owned by various authors (each making changes to their models) using the monolithic approach is not scalable. CytoSolve’s dynamic messaging approach offers a scalable alternative since the environment is opaque (treats each model as a black box) support for both public and proprietary models is extensible to support heterogeneous source code formats, and finally supports localized integration, a user can initiate integration from their own local environment.
From CytoSolve’s viewpoint a sub-model or module is a “black box”. We are agnostic to the meaning of that module and its biological context, and CytoSolve will simply integrate any set of modules. The user can of course intervene and decide which species are duplicates and provide context through the user interface. CytoSolve’s requirement, therefore, in this aspect, is minimal: each sub-model or module must accept as input a vector of species concentrations at a particular time step, and provide an output vector of species concentrations at the next time step.
Two other systems, CellAK and Cellulat, also offer an alternative to the monolithic approach.9,28 However, both of them use a static messaging approach. In the static messaging approach, the models remain independent programs and do not affect each other as they are executing. Any one model accepts as input a dataset and executes to completion to generate an output dataset. That output dataset is then given to another model, which that model uses it as input and also executes through to completion. This process can then be continued with other models, and they can be executed concurrently if there are no dependencies between their datasets.
CellAK and Cellulat treat each biological pathway model as a single entity (or agent) obeying its own pre-defined rules and reacting to its environment and neighboring agents accordingly. These two approaches offers many positive ways for integrating biomolecular pathway models; however, a non-specialist has a very high learning curve in preparing a set of biological pathway models for use with this approach because the integrator has to understand deeply the biology and architecture behind them. Furthermore these tools do not use ordinary differential equations to determine the time evolution of cellular behavior, since differential equations find it difficult to model directed or local diffusion processes and sub-cellular compartmentalization and they lack the ability to deal with non-equilibrium solutions. Most common biological modeling systems use traditional ODEs to simulate the models. Finally these two approaches are not designed to perform simulations on a distributed computational environment, which CytoSolve offers.
The implementation of CytoSolve to enable distributed and parallel computations across an ensemble of models also presents many new and unique challenges. Such challenges include the need to advance and optimise elements of the architecture, which particularly becomes relevant in a software implementation, which is a dynamic work in progress. These challenges have become important areas for our ongoing research, some of which are noteworthy to discuss in this manuscript.
Optimization of Time Step Selection
CytoSolve’s Controller, for example, ensures mass conversation of all species in the integrated molecular network. To ensure thermodynamic feasibility, however, the detailed balance constraint, which demands that at thermodynamic equilibrium all fluxes vanish must be imposed during computational integration by the Controller.8 This is a subtle but important issue that CytoSolve is aware of, which becomes apparent during the Controller’s evaluation of a common species concentration, across a set of models. To discuss by example, consider the case of two models, both with large rate constants, which share a common species, that are being communicated through the Controller. In this case, detailed balance demands that the associated species concentrations balance each other off. However, if the two opposing reactions do not communicate, the computational problems become far more difficult. Whereas, if the two competing rates net to zero the solution is stable and expressible as a simple ratio of concentrations. Segmenting the two competing rates into two separate independent pathways requires high accuracy calculations and a very small time step for the finite difference calculation to achieve a stable equilibrium meeting the requirements of detailed balance, one that will always be slightly displaced from the “true” equilibrium solution. We recognize that the displacement error is probably infinitesimal compared to the probable error in the reaction rates used in the calculation, but it is nonetheless a finite calculable error. The accurate and optimal selection of the time step during integration must be small to keep the error of species concentration, at each time step to an acceptable level. We are aware of this problem and are currently examining efficient ways to select time steps, by sharing information across pathways.
Spatial Scale Variation
At the present time, CytoSolve supports only computational models that represent one single pool of material or several distinct pools connected with specific transport relations. We have not considered changes in concentrations on a continuous spatial scale. We believe that the architecture, based on its modular approach and support for multiple compartments, can support varying spatial scales. However, more testing will have to be performed to understand the computation times required to fully support such spatial variations. The description language FieldML is available to support this process.
Adaptive Time Stepping of the Controller
All models are currently computed using a single adaptive time step, which is taken to be the fastest time step among the ensemble of models. This is not optimal, as some component models may be varying more slowly than others. Additional effort is required to implement intelligent adaptive time stepping at the Controller level to observe the time scales of different models and invoke them only when necessary. Such an effort will result in improved computation time performance.
Stiffness of ODE Systems
Stiffness is a common problem of ODE systems where integration of molecular pathway models may involve processes at different time scales. While adaptive time steps at the Controller level (during integration) will help towards addressing this problem, control of the ODE solver for simulation of sub-models will also be necessary. The current method within CytoSolve is to choose the time constant sufficiently small that additional decreases in the time step do not lead to different results. We are currently pursuing a more sophisticated algorithm that accounts for the different stiffness of different pathways that are being merged. This is complicated by the fact that some pathways assume a particular molecular component has a constant value whereas that component has production or removal within a parallel pathway. These interdependencies couple the time steps in novel ways. Additional work is underway to completely resolve this issue.
Implementation and Integration with Emerging Ontologies
The CytoSolve PID has support for integrating other ontologies such as MIRIAM; however, future research needs to be done to fully integrate MIRIAM and other such ontologies. This effort will enable CytoSolve to support many more model formats with greater ease, leveraging standards that the systems biology community globally accepts. Future work will include a more sophisticated native Ontology to manage nomenclature and species identification across all individual biological pathway models to be integrated by means of the web application, and automated searching for related biological pathways.
CytoSolve is now available at http://www.cytosolve.com as on-line web computational resource. We are providing access to source code on an as request basis. In such cases, we assume that the receivers of the source code are familiar with the coding language and are capable of independently updating and maintaining their revisions. The user interface of the CytoSolve system, as an ongoing research effort, continues to evolve based on feedback. On-line documentation, context-based help, tutorials, and video presentations are being added and will be updated and made available on an ongoing basis. Appendix E provides the important steps for using CytoSolve.
The author wishes to thank EchoMail, Inc., which made Dr. Ayyadurai’s research funding possible, and the International Center for Integrative Systems for additional support during the preparation of this document.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.