Assisting in the creation of computational models

The adoption of standard structured formats for scientific data, such as the Systems Biology Markup Language (SBML) [2], enables software developers to offer a greater diversity of powerful tools to researchers. These tools help to accelerate the pace of research and enable researchers to develop increasingly elaborate theories and models. This trend holds not only for the kinds of process-based network models that are SBML's bread and butter, but also in other areas of biological research (for instance, the PDB format for three-dimensional molecular structures). The past decade has seen the production of a large number of software packages aimed at systems biologists: more than 200 packages are known to support SBML today, embodying a wide variety of capabilities. This abundance of SBML-compatible systems is just one measure of the wealth of software resources available today in systems biology - many other kinds of software resources, such as databases of chemical entities, are used routinely by systems biologists.
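To make the idea of a standard structured format concrete, here is a minimal sketch: a toy single-reaction SBML model (the model and its identifiers are invented for illustration), read with nothing more than Python's standard library. Because the format is standardized, any tool can enumerate the model's contents without special knowledge of the biology it describes.

```python
# A toy SBML (Level 2 Version 4) document: one reaction converting S to P.
# The model and its identifiers are hypothetical; parsing uses only the
# Python standard library.
import xml.etree.ElementTree as ET

SBML_NS = "{http://www.sbml.org/sbml/level2/version4}"

MINIMAL_MODEL = """<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_conversion">
    <listOfCompartments>
      <compartment id="cell" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="S" compartment="cell" initialConcentration="10"/>
      <species id="P" compartment="cell" initialConcentration="0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="conversion" reversible="false">
        <listOfReactants><speciesReference species="S"/></listOfReactants>
        <listOfProducts><speciesReference species="P"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

root = ET.fromstring(MINIMAL_MODEL)
model = root.find(SBML_NS + "model")

# Enumerate the model's contents generically - no biology-specific code needed.
for species in model.iter(SBML_NS + "species"):
    print("species:", species.get("id"))
for reaction in model.iter(SBML_NS + "reaction"):
    print("reaction:", reaction.get("id"))
```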

The more comprehensive modeling environments for systems biology, such as CellDesigner [3], COPASI [4], Virtual Cell [5] and others, have been gradually enriched and now offer a wealth of features for modelers. They provide extensive capabilities for working with models once they are created (for example, for parametrization, simulation, analysis and visualization), but as yet, very few widely available systems provide significant automation to assist modelers with the intellectual activity of creating a model in the first place. Some systems, such as CellDesigner and Virtual Cell, now allow the importation of complete ready-to-run models from databases such as BioModels Database [6], letting researchers start from an existing base rather than having to recreate everything from scratch. A few others, such as MetNetMaker [7], also provide users with the means to search and import individual reactions and other entities from databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [8], building up a model interactively without having to type in the details of every component. Even fewer systems allow users to search for similar models based on model annotations; one of those few, semanticSBML [9], goes further by providing facilities for clustering models on the basis of similarities in their annotations, and also for merging sets of models into more comprehensive ones.
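As a hedged illustration of this kind of model reuse, the sketch below fetches a curated model from BioModels Database by its identifier. The REST URL pattern is an assumption about the service's interface rather than a documented guarantee; consult the BioModels documentation for the current API.

```python
# Sketch: import a ready-to-run model from BioModels Database by identifier.
# The download URL pattern below is an assumption, not a guaranteed API.
import urllib.request

def fetch_biomodel(model_id: str) -> str:
    """Download the SBML file for a given BioModels identifier."""
    # Assumed URL scheme; BIOMD0000000012 (a repressilator model) is used
    # purely as an example of a curated entry.
    url = (f"https://www.ebi.ac.uk/biomodels/model/download/"
           f"{model_id}?filename={model_id}_url.xml")
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    sbml_text = fetch_biomodel("BIOMD0000000012")
    print(sbml_text[:200])  # show the start of the retrieved document
```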

As capable as these systems are, they still place the responsibility for selecting all the model's components on the modeler's shoulders. Li et al.'s efforts push this frontier forward by allowing a modeler to start with, for example, lists of pathway terms or biochemical entities, and then have the software automatically retrieve matching data about networks, species and their cellular locations. In the authors' work, this process is guided by a large consensus model of the yeast metabolic network [10]. As more large network maps and models become available to serve as guides, future modelers will increasingly be able to start with existing maps and focus their efforts on creating subsets of the overall system - and that is where automation such as Li et al.'s will accelerate research. Their system is not the only one to provide this kind of capability (MetNetMaker, mentioned above, which draws on the KEGG Ligand database, is another example), but it provides the greatest degree of automation so far, and it offers users the ability to adjust the workflow visually rather than relying on fixed procedures hard-coded into a given software environment.
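The following toy sketch conveys the flavor of this kind of automation under simplified assumptions: given a list of pathway terms, it extracts the matching subset of an (invented) annotated reference network. A real system would draw its reactions and annotations from a consensus model such as the yeast network used by Li et al.

```python
# Toy sketch: select the subset of a reference network matching pathway terms.
# The network and its pathway annotations are invented for illustration.
from typing import Dict, List, Set

# Hypothetical reference network: reaction id -> pathway annotations.
REFERENCE_NETWORK: Dict[str, Set[str]] = {
    "HXK1": {"glycolysis"},
    "PFK1": {"glycolysis"},
    "ZWF1": {"pentose phosphate pathway"},
    "CIT1": {"TCA cycle"},
}

def extract_submodel(pathway_terms: List[str]) -> List[str]:
    """Return the reactions whose annotations match any requested term."""
    wanted = {term.lower() for term in pathway_terms}
    return [rxn for rxn, pathways in REFERENCE_NETWORK.items()
            if wanted & {p.lower() for p in pathways}]

print(extract_submodel(["glycolysis"]))  # -> ['HXK1', 'PFK1']
```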

Workflows as sharable data sets in their own right

The procedures required to produce a finished simulation result from a model are important to communicate, publish and store. SBML, by design, expresses only the static structure of a model: the variables, their relationships, and the values of the numerical constants used in it. SBML does not provide a script for analyzing, simulating or otherwise doing something with the model. The nascent Minimum Information About a Simulation Experiment (MIASE) project and its associated effort to develop a structured file format, SED-ML [11], aim to create a software-independent representation of such procedures. A workflow of the sort described in Li et al.'s paper is similar to SED-ML but goes beyond it in both scope and kind. Whereas SED-ML describes the simulation and processing steps that start with a parametrized model and produce a set of numerical results, Li et al.'s workflows start with a qualitative, unparametrized model and apply a wider variety of steps. Starting from a fully annotated, MIRIAM-compliant [12] model, these steps include, but are not limited to, completing the model using identifier matching and iterative inference, parametrizing it with experimental data retrieved from multiple online databases, performing structural and numerical validations on the resulting model to help reduce errors, calibrating it to match specific experimental conditions more closely, optimizing parameters via batched distributed processing, generating visualizations of the results, and storing numerical results in a software-independent format.
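Schematically, such a workflow can be pictured as an ordered pipeline of steps, each consuming and producing a model. The sketch below uses placeholder step functions purely to show the shape of the pipeline; Li et al.'s actual workflows are composed and adjusted in a visual workflow environment, not written as plain Python.

```python
# Schematic pipeline: each step transforms the model and hands it onward.
# The step bodies are placeholders standing in for the real procedures
# (identifier matching, database retrieval, validation, and so on).
from typing import Any, Callable, Dict, List

Model = Dict[str, Any]  # stand-in for a parsed, annotated model
Step = Callable[[Model], Model]

def complete_model(model: Model) -> Model:
    model["completed"] = True  # placeholder for identifier matching/inference
    return model

def parametrize(model: Model) -> Model:
    model["parameters"] = {"k1": 0.1}  # placeholder for database retrieval
    return model

def validate(model: Model) -> Model:
    # Placeholder for structural and numerical consistency checks.
    assert model.get("completed"), "model must be completed before validation"
    return model

def run_workflow(model: Model, steps: List[Step]) -> Model:
    """Apply each step in order, passing the evolving model along."""
    for step in steps:
        model = step(model)
    return model

result = run_workflow({"id": "example"}, [complete_model, parametrize, validate])
print(result)
```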

Looking to the future, we believe that, because such workflows can be stored, exchanged and built on (both by humans and by other software workflows), they will eventually become standardized data objects in their own right - stored and exchanged just as models themselves are at the moment. Much as semanticSBML [9] pushes the frontier for comparing sets of SBML-based models, so too will the day come when researchers perform analyses on sets of workflows. Indeed, Li et al. [1] already report an examination of the execution performance of different workflows. The possibilities for ingenious new kinds of analyses, transformations, and perhaps even automated mutations of workflows stir the imagination.

Challenges ahead

Several common problems continue to confront all modelers, including workflow users. The first is simply the lack of fundamental biological data. A frequently cited example is the kinetic data characterizing biochemical reactions. Data resources such as SABIO-RK (System for the Analysis of Biochemical Pathways - Reaction Kinetics) [13], used by Li et al.'s workflow system, are making a difference, but the number of biochemical reactions of interest to modelers is vast and the existing data sources are minuscule by comparison. Even sparser is the information on the locations and amounts of biomolecular participants in cellular reactions. A more fundamental change is also needed in the way the relevant experimental data are produced and shared, in order to keep up with the needs of computational modelers. In particular, the lack of metadata (that is, information about the experimental context used to produce the data, sample information, and post-processing of the experimental results) makes it very hard to evaluate, compare and select suitable data sets.
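For readers curious what programmatic access to such a resource looks like, here is a hedged sketch of querying SABIO-RK for kinetic laws by enzyme EC number. The endpoint and query syntax are assumptions about the service's REST interface (which has changed over time), not a documented guarantee.

```python
# Hedged sketch: query SABIO-RK for kinetic laws matching an EC number.
# The endpoint and query syntax below are assumptions, not a guaranteed API.
import urllib.parse
import urllib.request

def query_sabiork(enzyme_ec: str) -> str:
    """Ask SABIO-RK for kinetic-law entries matching an EC number."""
    base = "https://sabiork.h-its.org/sabioRestWebServices/searchKineticLaws/sbml"
    query = urllib.parse.urlencode({"q": f"ECNumber:{enzyme_ec}"})
    with urllib.request.urlopen(f"{base}?{query}") as response:
        return response.read().decode("utf-8")

# Example: kinetic data for hexokinase (EC 2.7.1.1).
print(query_sabiork("2.7.1.1")[:200])
```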

A second problem is obtaining sufficient annotations from the creators of models so that the entities referenced can be uniquely identified and matched to their entries in centralized databases. Modelers will continue to use their own preferred names for biochemical entities, and that is perfectly acceptable if they also provide enough identifying information for those entities to be matched to the appropriate database entries; such matching is crucial to allowing workflows such as Li et al.'s to operate. Contextual information can sometimes be used to disambiguate entities that are poorly specified or identified by uncommon synonyms, but this process is error-prone. The ideal scenario is when the modelers themselves provide sufficient information to identify uniquely what they have in mind. Software tools can help by providing facilities that make the identification process easier, and thankfully some, such as semanticSBML [9], do so - but more work is needed in this direction.
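The heart of the matching problem can be seen in a few lines. The sketch below resolves informal species names to unambiguous identifiers.org-style URIs of the kind promoted by MIRIAM; the synonym table is a toy stand-in for the annotation services a real tool would consult.

```python
# Toy sketch: resolve a modeler's informal names to database identifiers.
# The synonym table is invented; real tools consult annotation services.
from typing import Optional

SYNONYMS = {
    "glucose": "http://identifiers.org/chebi/CHEBI:17234",
    "glc": "http://identifiers.org/chebi/CHEBI:17234",
    "atp": "http://identifiers.org/chebi/CHEBI:15422",
}

def annotate(species_name: str) -> Optional[str]:
    """Return a database URI for a name, or None if it cannot be resolved."""
    return SYNONYMS.get(species_name.strip().lower())

for name in ["Glc", "ATP", "mystery_metabolite_X"]:
    print(name, "->", annotate(name))
```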

A final challenge concerns the long-term survival of software systems and web services. It is all very well to store and exchange workflows, but if the resources they rely upon go out of existence, the workflows become useless. Of course, this challenge is faced by biological research as a whole and is not unique to systems biology. Innovative and useful software packages are continually being created, but they are often small-scale efforts without the means for continued support over years or decades. The loss of these resources wastes time and funding at least twice: the first time when the system is abandoned, and the second time when someone else unknowingly recreates the same thing in a different way. Finding ways to mitigate this problem has been a surprisingly difficult, and so far intractable, challenge.