Inferring Executable Models from Formalized Experimental Evidence
Executable symbolic models have been successfully used to analyze networks of biological reactions. However, the process of building an executable model from published experimental findings is still carried out manually. The process is very time consuming and requires expert knowledge. As a first step in addressing this problem, this paper introduces an automated method for deriving executable models from formalized experimental findings called datums. We identify the relevant data in a collection of datums. We then translate the information contained in datums to logical assertions. Together with a logical theory formalizing the interpretation of datums, these assertions are used to infer a knowledge base of reaction rules. These rules can then be assembled into executable models semi-automatically using the Pathway Logic system. We applied our technique to the experimental evidence relevant to Hras activation in response to Egf available in our datum knowledge base. When compared to the Pathway Logic model (curated manually from the same datums by an expert), our model makes most of the same predictions regarding reachability and knockouts. Missing information is due to missing assertions that require reasoning about the effects of mutations and background knowledge to generate. This is being addressed in ongoing work.
Executable models of signal transduction provide insights into how cells work, and a means to understand and predict the effects of perturbations and mutations, which is key to understanding disease and therapeutics at the cellular level. For example, using an executable model one can apply algorithms to determine how to prevent a given state from being reached, or to compute alternative execution paths that reach a given state. Developing such models is extremely difficult. It requires collecting, organizing and interpreting experimental evidence, and assembling rules representing the hypothesized biochemical reactions that make up a signaling network. This is very labor intensive, and inferring a rule from experiments requires substantial biological knowledge. Several curated models of signaling and metabolic pathways are available [3, 11, 17, 18, 19]. However, there is a great need for tools to help automate the curation of executable models.
We describe a formal representation of experimental evidence called datums. Each datum captures relevant information about one or more experiments recording conditions under which a specific state or change in state (modification, activity, location) of a protein or other biochemical happens.
We define a language of logical assertions that corresponds to the elements of a datum, and a translation from datum syntax to logical assertions.
We define axioms that capture the semantics of datums interpreted as partial information about rules to be used as components of an executable model. The logic is that of Answer Set Programs, and we use an existing engine (DLV) to derive minimal models called answer sets. Each answer set corresponds to one reaction rule. These models are then parsed into rules of an executable model.
Aspect (3) is being addressed as part of an ongoing DARPA project to advance machine reading and reasoning techniques. We use Pathway Logic (PL) as the formal system for representing and querying executable models of cellular processes. Automated analysis techniques such as forward collection and model-checking are used to assemble executable models and execution pathways by specifying a problem of interest (experimental conditions, targets, ...). The PL algorithms rely crucially on the fact that the rules are curated to work together. For example, rules that connect must use the same level of detail concerning location and modifications of participants. In contrast, automatically inferred rules capture all the relevant available experimental information, resulting in a knowledge base that is more precise and extensible. However, the model assembly process will require automation of the process of transforming rules to work together, without losing information unnecessarily. This is the topic of ongoing work.
We applied our algorithms to a collection of datums supporting a model of activation of Hras in response to Egf. The model is part of the PL collection of models manually curated by an expert. Although this first version of the rule generation logic does not account for some of the information in datums, the resulting model makes the same predictions as the curated model concerning response to Egf stimulation and effects of knockouts, with a small number of expected exceptions.
Plan. Section 2 gives a brief overview of Pathway Logic executable models and an informal introduction to datums. Section 3 gives an informal introduction to the rule inference process using an Hras activation rule as an example. Section 4 presents the answer set programming axioms/rules of the datum logic. Section 4.2 describes the mapping of datums to assertions in the logic. Section 5 presents the Hras case study. Section 6 concludes with related and future work.
2 About Pathway Logic and Datums
2.1 Pathway Logic
Pathway Logic (PL) is a system for modeling and reasoning about cellular processes such as signal transduction, metabolism, and cell-cell communication in the immune system. The PL execution model is based on rewriting logic [14, 15]. In PL, a cell state is represented as a ‘soup’ of occurrences, where each occurrence has three components: a protein or other biomolecule (gene, metabolite, ...), a modifier, and a location. The modifier indicates the state of the protein, including binding of small molecules or phosphates, or ability to act on other proteins (enzyme activity). For example, the term < [Hras - GTP], CLi> is the occurrence of the protein Hras modified by binding to the small molecule GTP (Guanosine-5’-triphosphate), attached to the inside of the cell membrane (CLi). The names used to form occurrences are semantically grounded using meta-data to provide links to standard databases.
Signal transduction steps are formalized as local rewrite rules operating on the relevant part of the cell state. Each rule describes a change in state of a small number of biomolecules (often just one) and the biological context that enables the change. A PL Rule Knowledge Base (RKB) consists of symbolic rules containing variables that range over a finite set of proteins, modifications or locations. STM (Signal Transduction Model) is a curated PL RKB that constitutes an executable model of signal transduction in the following sense: given an initial state called a (Petri) dish, which is a set of occurrences representing an experimental setup, the rules can be applied repeatedly, using the Maude rewrite engine, to transform the state. This represents a possible sequence of signaling events in a cell. A set of rule instances that can be applied/fired in some order from an initial state is called an execution pathway. Specific model networks can be obtained from an RKB by starting with a dish and using forward collection to collect all rule instances that might fire in an execution pathway of this dish. Such models can naturally be viewed as Petri Nets.
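Forward collection, as described above, can be illustrated with a minimal Python sketch. This is not the PL or Maude implementation: the `Rule` class, the two toy rules, and the location names other than CLi are assumptions made for illustration. The sketch over-approximates execution by never removing occurrences, so a rule is collected as soon as all of its inputs have appeared.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

# An occurrence is (biomolecule, modifier, location); "" means unmodified.
Occurrence = Tuple[str, str, str]


@dataclass(frozen=True)
class Rule:
    name: str
    consumed: FrozenSet[Occurrence]               # occurrences the rule rewrites
    produced: FrozenSet[Occurrence]               # occurrences the rule creates
    context: FrozenSet[Occurrence] = frozenset()  # required but unchanged


def forward_collect(dish: Set[Occurrence],
                    rules: List[Rule]) -> Tuple[List[Rule], Set[Occurrence]]:
    """Collect every rule that might fire in some execution starting from dish."""
    seen = set(dish)
    collected: List[Rule] = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule in collected:
                continue
            if (rule.consumed | rule.context) <= seen:
                collected.append(rule)
                seen |= rule.produced
                changed = True
    return collected, seen


# Hypothetical toy rules, not taken from the STM knowledge base.
egf = ("Egf", "", "XOut")
egfr = ("EgfR", "", "CLm")
egfr_act = ("EgfR", "act", "CLm")
hras_gdp = ("Hras", "GDP", "CLi")
hras_gtp = ("Hras", "GTP", "CLi")

toy_rules = [
    Rule("load-Hras", frozenset({hras_gdp}), frozenset({hras_gtp}),
         context=frozenset({egfr_act})),
    Rule("activate-EgfR", frozenset({egfr}), frozenset({egfr_act}),
         context=frozenset({egf})),
    Rule("never-fires", frozenset({("Foo", "", "CLc")}), frozenset({hras_gtp})),
]

collected, seen = forward_collect({egf, egfr, hras_gdp}, toy_rules)
collected_names = sorted(r.name for r in collected)
```

Here `collected_names` contains the two rules whose inputs become available from the dish, and the rule that needs the absent occurrence is left out of the model network.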
2.2 Datums: Formal Representation of Experimental Results
The PL STM model is an RKB whose rules are inferred from cell culture and test tube experiments. In cell-based experiments, cells are grown under known conditions. The cells may be modified by overexpressing some (possibly mutated) proteins, or knocking out some proteins (preventing expression). The resulting population of cells is treated with a stimulus or stress. Some property of the cells is measured before treatment and at one or more times after treatment to determine change in state, if any. The procedure that measures the property change is called an assay. Experiments can also be done in a test tube, and some experiments observe untreated cells.
Datum Structure. The syntax of a datum is designed to be readable by an experimental biologist, but constrained by structure rules and controlled vocabularies so it can be automatically parsed into a formal data structure. The full collection of datums collected for the STM RKB can be accessed via a web query page at light.csl.sri.com/datum. A more detailed description and query examples can be found at pl.csl.sri.com/datumkb.html. The curator's notebook (pl.csl.sri.com/CurationNotebook/index.html) contains an intuitive description of datum syntax, catalogs of assays (with their detection methods and other attributes) and cell lines, and a glossary of terms.
The datum in Fig. 1 is a change datum that records an experiment in which the binding of GTP to the protein Hras is increased after addition of Egf (Epidermal Growth Factor) to a cell for 5 min. The first line contains the subject (Hras), the assay (GTP-association), the treatment (Egf) and the change (increased). The parenthetical text (times) at the end says the measurement was taken 5 min after the treatment. GTP-association is an assay that measures the amount of Hras bound to GTP. The first element of the second line describes the cellular environment: in this case, VERO cells (a defined cell line) transfected with Gab1 (xGab1) and grown in BMLS (Basal Medium Low Serum). Transfection results in overexpression of the transfected protein. The second element is called an “extra”. It records the result of an experiment that is a perturbation of the original experiment. In this case ([substitution]), instead of wild-type Gab1 the cells were transfected with Gab1 carrying a point mutation (xGab1(Y627F)), in which the tyrosine (Y) at position 627 is replaced by phenylalanine (F). The third element gives the PubMed identifier of the paper in which the experiment was reported, and the figure where the experimental results were found (15574420-Fig-5a). Source information is not directly used to infer rules, but is crucial for review and updates.
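To make the structure of the datum just described concrete, the following Python sketch records its elements in a small data class. The class and field names are assumptions chosen for illustration; they are not the format produced by the datum parser, and the `summary` string is not actual datum syntax.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Extra:
    kind: str    # extra type, e.g. "substitution"
    detail: str  # what differs in the perturbed experiment


@dataclass(frozen=True)
class Datum:
    subject: str
    assay: str
    treatment: str
    change: str
    times_min: Tuple[int, ...]    # measurement times after treatment
    cell_line: str
    transfected: Tuple[str, ...]  # overexpressed proteins
    medium: str
    extras: Tuple[Extra, ...] = ()
    source: str = ""              # PubMed id and figure, kept for review

    def summary(self) -> str:
        """Informal one-line rendering (not the real datum syntax)."""
        times = ", ".join(f"{t} min" for t in self.times_min)
        return (f"{self.subject} {self.assay} is {self.change} "
                f"by {self.treatment} ({times})")


# The elements of the Fig. 1 datum as described in the text.
fig1 = Datum(
    subject="Hras",
    assay="GTP-association",
    treatment="Egf",
    change="increased",
    times_min=(5,),
    cell_line="VERO",
    transfected=("xGab1",),
    medium="BMLS",
    extras=(Extra("substitution", "xGab1(Y627F)"),),
    source="15574420-Fig-5a",
)
```

Keeping the source as a separate field mirrors the point made above: it plays no role in inference but stays attached to the evidence.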
3 Inferring Rules from Datums: An Example
Additional evidence that this happens in live cells is needed. The second datum provides such evidence. itpo is a treatment type in which a plasmid for the treatment (Sos1) is introduced into a cell culture and incubated for sufficient time for the treatment protein to become overexpressed. This datum tells us that it is possible that Sos1 can act as a GEF in a cellular environment. We say that Sos1 is an itpoGef for Hras. There are datums that report that knocking out Sos1 does not prevent the GDP-GTP exchange. This tells us that there are additional GEFs to be discovered.
4 A Logical Specification for Datums
The interpretation of datums is formalized using Answer Set Programming (ASP). We start by briefly explaining ASP before proceeding with the logical specifications of datums.
The meaning of an ASP program is a set of ground facts called an Answer Set. An answer set of a program contains a minimal number of facts that makes each clause of the program true. For a formal definition see [9, 12].
There are a number of engines that can compute the answer sets of an ASP program. In the present work we have used the DLV engine. Following the usual convention, variables appearing in programs are considered to be shorthand for the set of all possible ground instantiations using the constant and function symbols appearing in the program itself.
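The notion of an answer set can be made concrete with a small brute-force Python sketch. It is not how DLV works and covers only ground normal programs (no disjunction, constraints, or variables). A candidate set of atoms is accepted when it equals the least model of the program's reduct with respect to that candidate, following the standard stable-model construction. The three-clause example program is invented for illustration and only borrows atom names from the text.

```python
from itertools import combinations
from typing import FrozenSet, List, NamedTuple, Set


class Clause(NamedTuple):
    head: str
    pos: FrozenSet[str]  # atoms that must hold
    neg: FrozenSet[str]  # atoms under default negation ("not a")


def least_model(definite: List[Clause]) -> Set[str]:
    """Least model of a negation-free program, by fixpoint iteration."""
    model: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for c in definite:
            if c.pos <= model and c.head not in model:
                model.add(c.head)
                changed = True
    return model


def answer_sets(program: List[Clause]) -> List[Set[str]]:
    """Enumerate all answer sets of a ground normal program by brute force."""
    atoms = sorted({c.head for c in program}
                   | {a for c in program for a in c.pos | c.neg})
    result = []
    for k in range(len(atoms) + 1):
        for candidate in combinations(atoms, k):
            s = set(candidate)
            # Reduct: drop clauses blocked by s, then erase negative literals.
            reduct = [Clause(c.head, c.pos, frozenset())
                      for c in program if not (c.neg & s)]
            if least_model(reduct) == s:
                result.append(s)
    return result


# increased.
# reactRule :- increased, not moveRule.
# moveRule  :- increased, not reactRule.
program = [
    Clause("increased", frozenset(), frozenset()),
    Clause("reactRule", frozenset({"increased"}), frozenset({"moveRule"})),
    Clause("moveRule", frozenset({"increased"}), frozenset({"reactRule"})),
]
models = answer_sets(program)
```

The example program has two answer sets, one containing `moveRule` and one containing `reactRule`, which shows how a single program can yield several alternative minimal models.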
4.1 Assertions and Inference Rules for Datums
subject(S,Dt) denotes that S is the subject of the datum Dt.
assay(Type,Aux,Dt) denotes that Type is the assay type specified by Dt, for example, a phosphorylation or GTP-association. Aux is used for assay parameters such as modification sites (phos!Y627) or hooks in a binding assay (none is used if there are no relevant parameters).
treatment(T,Dt) denotes that T is the treatment specified by Dt.
increased(Dt), irt(Dt) denote that Dt specifies an increase in the changed state of the subject in response to the treatment.
We also have a collection of assertions that are common knowledge, or are implicit in datums collected from experiments by convention. The common knowledge assertions constitute a library used in the inference of the executable rules. An example is the fact that EgfR and its modifications are located at EgfRC. This is specified by assertions of the form: location(EgfR, EgfRC, ck), where ck stands for common knowledge.
Handling Multiple Datums. As described in Sect. 3, some datums contain the evidence for the changes of the subject of a reaction rule. We call these main datums. Other datums, called auxiliary datums, contain evidence about non-subject elements of the reaction, for example, required biomolecules or GEFs. We distinguish these datums using the assertions of the form useM(Dt) and useA(Dt), where the former specifies that Dt is the main datum and the latter that Dt is an auxiliary datum. We specify that an answer set should have exactly one main datum. We do not show the rules here.
occBf(X1,L1) denotes that before the reaction, X1 is located at L1.
occAf(X2,L2) denotes that after the reaction, X2 is at location L2.
occ(X,L) denotes that the reaction requires X at location L in order to occur. Such an assertion can be used for a treatment complex or a required composite.
moveRule and reactRule denote that the rule to be extracted is either a rule specifying that the subject moves from one location to another without changing its modifications or it is a rule specifying that the subject changes its modifications without changing its location. This separation between move and react rules provides a finer grained specification of a model that simplifies the (meta) reasoning.
in(X) says that there is a possibly modified protein in the rule context, e.g., a treatment complex. inBf(X) and inAf(X) specify the state of the subject protein before and after the rule, respectively.
loc(X,L) says that a non-subject element X is at location L. locBf(X,L) and locAf(X,L) say that the location of the subject X is L before and after the reaction.
Here hasLocation(X) is an auxiliary assertion (rule omitted), denoting that it is possible to infer a concrete location for X.
4.2 Mapping Datums to Assertions
Each datum is mapped to a set of logical assertions that captures the subject, assay, treatment, treatment type, and change elements of a datum. The mapping algorithm takes as input the JSON representation of datums produced by the datum parser and produces input for the DLV engine as described above.
We ignore datums where the interpretation is complex and often requires specific biological knowledge. We currently ignore any datum with no subject, a mutated subject, a mutated treatment or more than one treatment.
In version 1 of the mapping algorithm, only extras of type “reqs” are captured as their interpretation is relatively straightforward. Extending the mapping algorithm to use “inhibited by” extras is a topic of future work.
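The filtering and translation steps described above might look roughly like the following Python sketch. The JSON field names (`id`, `subject`, `subject_mutated`, `assay`, `assay_aux`, `treatments`, `change`) are assumptions, since the parser's actual output format is not shown here. Names are emitted as quoted strings so that an ASP engine reads them as constants rather than variables; how the original tool encodes names is likewise an assumption. Extras are not handled in this sketch.

```python
from typing import Dict, List, Optional


def is_supported(datum: Dict) -> bool:
    """Skip datums with no subject, a mutated subject, a mutated
    treatment, or more than one treatment."""
    treatments = datum.get("treatments", [])
    return (
        bool(datum.get("subject"))
        and not datum.get("subject_mutated", False)
        and len(treatments) <= 1
        and not any(t.get("mutated", False) for t in treatments)
    )


def quote(name: str) -> str:
    """Quote a name so it is treated as a constant."""
    return '"' + name + '"'


def to_facts(datum: Dict) -> Optional[List[str]]:
    """Translate one datum into assertion strings, or None if ignored."""
    if not is_supported(datum):
        return None
    dt = datum["id"]
    aux = quote(datum["assay_aux"]) if datum.get("assay_aux") else "none"
    facts = [
        f"subject({quote(datum['subject'])},{dt}).",
        f"assay({quote(datum['assay'])},{aux},{dt}).",
    ]
    for t in datum.get("treatments", []):
        facts.append(f"treatment({quote(t['name'])},{dt}).")
    if datum.get("change") == "increased":
        facts.append(f"increased({dt}).")
    return facts


# Hypothetical parsed form of an Hras GTP-association datum.
example = {
    "id": "d1",
    "subject": "Hras",
    "assay": "GTP-association",
    "treatments": [{"name": "Egf"}],
    "change": "increased",
}
facts = to_facts(example)
ignored = to_facts({**example, "id": "d2", "subject_mutated": True})
```

For the example datum, `facts` holds one assertion each for the subject, assay, treatment, and change, while the datum with a mutated subject is dropped.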
Many datums report the same basic experiment, i.e. the same subject, assay and treatment. If these datums also have the same change (result) then the mapping will merge them, otherwise the datums are reported to the user as a conflict for manual inspection. Conflicts may be particularly troubling because datums span many different cell lines and cell types.
In the case of merged datums, the identifiers of the contributing datums are merged, thus "d1-d2" above. This allows us to track evidence and eventually reason about the quality/quantity of evidence used in generating a rule. The actual merged datum in our case study (Sect. 5) combines 51 datums from the datum knowledge base.
Because we merge all datums for the same change, each set of assertions corresponds to one rule in the model, and contains all information for the set of controls for the rule. Note that auxiliary datums will still be used to find assay specific enzymes such as GEFs or Kinases.
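The merge-or-report-conflict behavior can be sketched in a few lines of Python. This is a simplified illustration, not the actual mapping tool: the `Datum` tuple carries only the fields needed for grouping, and the protein `ProtX` and assay `phos` in the conflicting pair are made-up placeholders.

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class Datum(NamedTuple):
    id: str
    subject: str
    assay: str
    treatment: str
    change: str


def merge_datums(datums: List[Datum]) -> Tuple[List[Datum], List[List[str]]]:
    """Merge datums reporting the same experiment and the same change.

    Returns the merged datums and, separately, the id lists of groups
    whose members disagree on the change (left for manual inspection).
    """
    groups = defaultdict(list)
    for d in datums:
        groups[(d.subject, d.assay, d.treatment)].append(d)

    merged: List[Datum] = []
    conflicts: List[List[str]] = []
    for (subject, assay, treatment), group in groups.items():
        changes = {d.change for d in group}
        if len(changes) == 1:
            merged_id = "-".join(d.id for d in group)  # keeps evidence traceable
            merged.append(Datum(merged_id, subject, assay, treatment,
                                changes.pop()))
        else:
            conflicts.append([d.id for d in group])
    return merged, conflicts


datums = [
    Datum("d1", "Hras", "GTP-association", "Egf", "increased"),
    Datum("d2", "Hras", "GTP-association", "Egf", "increased"),
    Datum("d3", "ProtX", "phos", "Egf", "increased"),
    Datum("d4", "ProtX", "phos", "Egf", "unchanged"),
]
merged, conflicts = merge_datums(datums)
```

The two agreeing Hras datums collapse into a single datum with identifier `d1-d2`, and the disagreeing pair is reported as a conflict instead of being merged.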
5 Signaling Model of Hras Activation by Egf
As discussed in Sect. 1, the final step is assembly of these rules into a model—a connected set of rules that can be executed to reach expected goals, including the activation of Hras. The basic assembly process is carried out using the PL model generation process. We adapted the initial state for the STM Hras subnet to specify the desired model. The abstraction of details to form a connected rule network was carried out by hand, guided by principles developed by the curator of the STM model. Abstracting includes dropping site details from modifications and formalizing knowledge/conjectures such as ‘modification implies activation’ in specific cases.
The resulting model is more detailed than the STM Hras model. This is expected, due to the separation of modification and translocation rules (the STM model typically collapses these into one step), and the use of location and modification variables that have multiple possible instantiations.
The inferred model answers most of the queries supported by PL in the same way that the STM Hras model does. Examples include reachability of given states, existence of multiple execution paths to the Hras goal, and (RasGrp3, Sos1) as a double knockout pair.
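A knockout query of the kind mentioned above can be illustrated with a small Python sketch over a toy network. The two rules below are assumptions chosen so that the toy network has the same double knockout pair as reported in the text; they are not the inferred rules, and the reachability check is a simple monotone over-approximation rather than the PL analysis.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

# A rule is (required inputs, produced outputs) over state names.
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def reachable(dish: Set[str], rules: List[Rule], goal: str,
              knocked_out: Iterable[str] = ()) -> bool:
    """Can goal appear when the knocked-out items are never present?"""
    ko = set(knocked_out)
    state = set(dish) - ko
    changed = True
    while changed:
        changed = False
        for inputs, outputs in rules:
            new = set(outputs) - state - ko
            if inputs <= state and new:
                state |= new
                changed = True
    return goal in state


def knockouts(dish: Set[str], rules: List[Rule], goal: str):
    """Single knockouts, and pairs that block the goal only together."""
    singles = sorted(x for x in dish
                     if not reachable(dish, rules, goal, {x}))
    rest = sorted(set(dish) - set(singles))
    doubles = [pair for pair in combinations(rest, 2)
               if not reachable(dish, rules, goal, pair)]
    return singles, doubles


# Toy network: either of two hypothetical GEF routes loads GTP onto Hras.
toy_rules: List[Rule] = [
    (frozenset({"Egf", "Sos1", "Hras-GDP"}), frozenset({"Hras-GTP"})),
    (frozenset({"Egf", "RasGrp3", "Hras-GDP"}), frozenset({"Hras-GTP"})),
]
toy_dish = {"Egf", "Sos1", "RasGrp3", "Hras-GDP"}
singles, doubles = knockouts(toy_dish, toy_rules, "Hras-GTP")
```

In this toy network, removing either GEF alone leaves the other route open, so only the pair blocks the goal, while the stimulus and the subject itself show up as single knockouts.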
An execution pathway corresponding to the STM model pathway is shown in Fig. 2b. The STM rule 197 for phosphorylation of Sos1 (arrows labeled 1) becomes 3 rules in the inferred model (a move, a modification, and an activation). The inferred model has Abl1 (red border) as a requirement for Sos1 phosphorylation. There is a single datum specifying this requirement; the STM curator did not consider one datum showing this requirement as sufficient evidence. Future work includes associating rules with some measure of quantity/quality of evidence, in order to be able to assemble models using different criteria for inclusion of rules.
The STM rule 529 for Hras activation (GTP association, arrows labeled 2) includes a requirement for [Shp2 - Yphos] and a requirement for Pi3k (red borders), while the inferred rule does not. These requirements come from extras such as inhibited by: xPik3r?(mnr)"DN"... and inhibited by: xShp2(mnr)"CIA" that require substantial background knowledge to interpret. For example, CIA stands for ‘Constitutively InActive’. The inference is that if the endogenous protein is overwhelmed by a mutated form that is lacking some function, then that protein (with that function) is required. Future versions of the assertion generation tool will capture more of these inferences.
6 Related Work and Conclusion
Related work. An excellent survey of executable models of biological processes is given in . There are a number of network reconstruction algorithms based on statistical reasoning techniques such as Bayesian inference or belief propagation. They provide a means of elucidating the networks underlying transcriptomics and proteomics data generated from perturbation experiments. These methods postulate causal relations, but do not capture mechanistic details such as necessary conditions.
Methods more closely related to our approach include the following. Net-synthesis [1, 2] is software for synthesis, inference and simplification of signal transduction networks. The main idea is representing observed indirect causal relationships as network paths, introducing pseudo-vertices for unknown intermediaries of these paths and using techniques from combinatorial optimization to find the most parsimonious graph consistent with all experimental observations. A method based on Petri nets is described in . The reactions of individual proteins are represented as Petri net modules, stored in a database. These modules are similar in spirit to datums. Each place in a module corresponds to a specific functional state of a specific protein domain (e.g. a phosphorylated or unphosphorylated side chain, a catalytically active or inactive domain etc.). For each module, literature references are annotated as part of the module's database entry. Selected modules can be combined to assemble executable Petri net models. The method has been applied to assemble a model of JAK/STAT signaling. In , two methods to build signaling models from qualitative data (protein interactions from databases) are proposed, based on analyzing network connectivity and on non-linear optimization. Methods to convert BioPAX models into fully executable models have been proposed, including [5, 22]. The work presented here differs from these works in starting from experimental evidence to build knowledge bases and executable models, rather than relying on existing pathway databases.
Conclusion. We have presented an inference system for deriving signal transduction rules from formally represented experimental findings and applied the system to derive rules for a model of Hras activation. Future work includes: extending the mapping of datums to assertions to capture the meaning of experimental perturbations using mutations and fragmentation, extracting formal background knowledge from databases, extending the logic to cover more assays and capture more complex reasoning, such as hypothesizing rule requirements and alternatives by similarity, adding logic to generate common rules (rules about protein interactions independent of stimulus), and automating assembly of models from generated rules.
- 2. Albert, R., DasGupta, B., Sontag, E.: Inference of signal transduction networks from double causal evidence. In: Fenyo, D. (ed.) Methods in Molecular Biology: Topics in Computational Biology. Springer Science+Business Media LLC, New York (2010)
- 3. Biocyc pathway/genome database collection (2015)
- 7. DARPA Big Mechanism Project (2015)
- 9. Gelfond, M., Lifschitz, V.: Logic programs with classical negation. In: ICLP, pp. 579–597 (1990)
- 11. KEGG: Kyoto encyclopedia of genes and genomes (2015)
- 13. Lincoln, P.D., Talcott, C.: Symbolic systems biology and pathway logic. In: Iyengar, S. (ed.) Symbolic Systems Biology, pp. 1–29. Jones and Bartlett, Boston (2010)
- 17. Pathway logic (2015)
- 18. Protein interaction database (2015)
- 19. Reactome pathway database (2015)
- 20. Ruths, D.: Deriving Executable Models of Biochemical Network Dynamics from Qualitative Data. Rice University (2009)
- 22. Willemsen, T., Feenstra, K.A., Groth, P.T.: Building executable biological pathway models automatically from BioPAX. In: Linked Science 2013: Supporting Reproducibility, Scientific Investigations and Experiments, pp. 2–14 (2013)