1 Introduction

In systems biology, dynamics play a crucial role in cellular decision making (e.g. whether a cell responds appropriately to a particular signal) (Voit 2017). Molecular interactions can be modelled as systems of chemical reactions with a choice of kinetics, such as the law of mass action, which assumes that the rate at which a chemical reaction proceeds is proportional to the product of the concentrations of its reactants. From a finite set of reactions, the mass-action modelling assumption gives rise to a system of polynomial ordinary differential equations (ODEs), in which each right-hand side is a sum of monomials whose variables are the concentrations of molecular species and whose coefficients are reaction rate constants. Chemical reaction network theory (CRNT) is a mathematical field developed by Horn and Jackson, and independently by Bykov, Gorban, Volpert and Yablonsky, for analysing such reactions; the mathematical techniques employed extend beyond dynamical systems theory to include algebraic geometry, differential algebra, algebraic statistics and discrete mathematics (Dickenstein 2016).

CRNT often focuses on steady-state analysis through the lens of computational and real algebraic geometry, asking questions about the capability or preclusion of multiple positive real steady-states (i.e. multistationarity) or more complex dynamics, often without requiring specialised parameter values (Banaji and Craciun 2009; Craciun and Feinberg 2005; Millán et al. 2012; Angeli 2009; Wang and Sontag 2008; Feliu and Wiuf 2012; Müller 2016; Conradi and Pantea 2019). Multi-site protein phosphorylation systems, such as the ERK/MEK signalling pathway, can be translated into such chemical reactions and their multistationarity, corresponding to different biological cellular decisions, has attracted much attention (Thomson and Gunawardena 2009b; Gunawardena 2007; Aoki 2011; Takahashi et al. 2010; Markevich et al. 2004). Algebraic analyses and invariants of multi-site phosphorylation have revealed geometric information of steady-state varieties, informed experimental design and enabled model comparison using steady-state data (Manrai and Gunawardena 2008; Thomson and Gunawardena 2009a; Harrington et al. 2012; Gross et al. 2016; MacLean et al. 2015). However, such systems have also been shown to exhibit nontrivial transient dynamics and oscillations (Conradi et al. 2019; Qiao et al. 2007). In recent years, the fields of systems biology and CRNT have extended the repertoire of techniques to assert other dynamics (Banaji 2020; Conradi et al. 2019; Domijan and Kirkilionis 2009; Mincheva and Roussel 2007; Kay 2017; Errami 2015; Angeli et al. 2013), reduce models systematically (Pantea et al. 2014; Goeke et al. 2017; Feliu et al. 2019; Sweeney 2017; Boulier et al. 2011; Hubert and Labahn 2013), and assess identifiability (Ljung and Glad 1994; Ollivier 1990; Meshkat et al. 2009; Hong et al. 2020; Bellu et al. 2007). Furthermore, combinatorial structures, such as simplicial complexes, and techniques from computational algebraic topology have enabled comparison of chemical reaction network models and their parameters (Vittadello and Stumpf 2020; Nardini et al. 2020).

A previous algebraic systems biology case study (Gross et al. 2016) analysed a chemical reaction network model at steady state, by studying the steady-state ideal, chamber complex and algebraic matroids of the model. Here we present a sequel to that analysis, studying the dynamics of chemical reaction networks with time-course data; this relies on studying the QSS variety (Sect. 3), the model prediction map (Sect. 4) and the topology of parameter inference (Sect. 5).

We perform a detailed mathematical analysis of recently published models and experimental data (Yeung et al. 2019). The Full ERK model describes dual phosphorylation of ERK by MEK, two molecular species whose activation regulates cell division, cell specialisation, survival and cell death (Shaul and Seger 2007). The dynamics of the six ERK/MEK molecular species \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^{n=6}\) in the Full ERK model are governed by a polynomial dynamical system \({\dot{x}}(t) = f(x(t),\theta )\), where \(\theta \in {{\,\mathrm{{\mathbb {R}}}\,}}^{m=6}\) is the vector of parameters and there are two conservation relations between the species. The Full ERK model is presented in Sect. 2. Analysing the kinetic parameters of a model depends on the available data. The accompanying time-course experimental observations include measurements of ERK in 3 different states, at 7 time points following activation by its activated enzyme kinase MEK, which is either wild-type (WT) or mutated. Mutations of MEK are known to be involved in human cancer and embryonic developmental defects; therefore, understanding their kinetics and the differences between wild-type MEK and 4 mutants (e.g. Y130C, F53S, E203K or SSDD) may increase fundamental biological understanding of the pathway and contribute to the development of potential therapies. The experimental data and relevant biological information are presented in Sect. 2.

Using algebraic approaches first presented by Goeke, Walcher and Zerz in Goeke et al. (2017), we decrease the number of variables and parameters in the Full ERK model. We derive two model reductions: the Rational ERK model and the Linear ERK model. We show, with known biological information (see Sect. 2), that the reduction to the Linear ERK model by Yeung et al. (2019) is mathematically sound. We note that the Rational ERK model was not analysed in Yeung et al. (2019), although it can be derived from the Full ERK model using singular perturbation methods. A natural question is whether a quasi-steady-state approximation is justified given the experimental setup, which equates to solving an algebraic problem (Goeke et al. 2017). We identify algebraic varieties \(V_\theta \) that are (analytic) invariant sets of the ODE system and characterise neighbourhoods in parameter space for which the ODE solutions remain close to these varieties. This systematic analysis allows us to simplify the model equations such that the dynamics of both reduced models are good approximations to the Full ERK model. Algebraic model reduction and derivation of the reduced ERK models are given in Sect. 3.

Before estimating the parameters of a model from observations, one must determine its identifiability. Identifiability is concerned with asking whether it is possible to recover values of the model parameters given data. A model is structurally identifiable if parameter recovery is possible with perfect data. Mathematically, this task is equivalent to asking whether the model prediction map is injective. The model prediction map, defined precisely in Sect. 4.1, is a map that takes a parameter to the corresponding predicted noise-free data point(s) (Dufresne et al. 2018). Real data is often noisy; testing whether parameter recovery is possible with imperfect data is the problem of practical identifiability (Raue et al. 2009; Dufresne et al. 2018). Mathematically, measurement noise induces a probability distribution in data space. Assuming that the model prediction map is injective (at least generically), practical identifiability can be defined in terms of the boundedness (with respect to a reference metric in parameter space) of the confidence regions of a likelihood test. Under our assumptions, this translates to asking whether the preimages of small bounded regions in data space are bounded in parameter space. We prove the following:

Theorem 1

The Linear ERK/MEK model, with the given experimental setup (number of species, number of replicates, number of measurement time-points and initial conditions), is structurally and practically identifiable.

We provide a definition of practical identifiability that improves on a previous definition (Dufresne et al. 2018) and is an alternative to that of Raue et al. (2009). We also propose a computable algorithm for practical identifiability, implement it and apply it to the ERK models. We prove Theorem 1 in Sect. 4.

We use the differential algebra method to show that the Full ERK model and the Rational ERK model are generically structurally identifiable. By Sontag’s result (Sontag 2002), these results are guaranteed to be valid if we have at least \(2m+1\) generic time points, but they can remain valid with fewer. Indeed, as the Linear ERK model admits analytic solutions, we can prove that it is globally structurally identifiable for any choice of three distinct time points. Determining structural identifiability for specific time points in the absence of analytic solutions is an open problem.

We numerically show that the Full ERK model and Rational ERK model are not practically identifiable; however, the source of this practical non-identifiability is not completely clear (see Sect. 4).

Finally, for a model that is structurally and practically identifiable, one would like to infer the parameters, i.e. to determine which parameter values are consistent with the observations. We perform Bayesian inference, as done in Yeung et al. (2019), and extend this to the Rational ERK model. The result of the parameter inference on the Linear ERK model is a point cloud of samples from the posterior density of the ERK kinetic parameters that are consistent with the data. We obtain five such sample densities, corresponding to the five MEK variants.

In Sect. 5, we compare the geometry of the admissible regions of parameter space for the five MEK variants. The computational field of topological data analysis quantifies the shape and connectivity of metric data by computing topological invariants across resolutions (or threshold values). In recent years, topological methods have seen dramatic improvements in computational speed, as well as theoretical advances that facilitate the analysis of scientific datasets. We implement a theoretical framework originally proposed by Taylor (2019a), in order to quantify the shape of the resulting posterior distributions of kinetic parameters and facilitate a comparison between mutants. Specifically, their theorem provides a basis for hypothesis testing of two densities using distances between topological summaries. The framework relies on approximating the persistent homology of super-level sets of posterior densities by simplicial complexes. We perform these measurements on the distributions obtained from Bayesian parameter inference for the five MEK variants and compare them via a topological bottleneck distance.
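As a minimal computational illustration of this pipeline (our sketch, not the implementation used in Sect. 5 or the framework of Taylor (2019a)): estimate each posterior density from its parameter sample with a kernel density estimate, compute the persistent homology of its super-level sets on a cubical grid, and compare two densities via the bottleneck distance. The synthetic samples, grid, resolution and the use of the gudhi library are all our assumptions.

```python
import numpy as np
import gudhi
from scipy.stats import gaussian_kde

def superlevel_diagram(sample, grid_min, grid_max, res=30):
    """Dimension-0 persistence of super-level sets of a KDE fitted to `sample`.

    `sample` has shape (n_params, n_points). Super-level sets of the density are
    the sub-level sets of its negative, which is what the cubical complex filters.
    """
    kde = gaussian_kde(sample)
    axes = [np.linspace(lo, hi, res) for lo, hi in zip(grid_min, grid_max)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=0)
    density = kde(grid.reshape(len(axes), -1)).reshape([res] * len(axes))
    cc = gudhi.CubicalComplex(top_dimensional_cells=-density)  # negate: super-level filtration
    cc.persistence()
    diag = cc.persistence_intervals_in_dimension(0)
    return diag[np.isfinite(diag).all(axis=1)]  # drop the essential (infinite) class

# Hypothetical posterior samples for two MEK variants (3 kinetic parameters each).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=[0.10, 0.20, 0.5], scale=0.02, size=(500, 3)).T
sample_b = rng.normal(loc=[0.12, 0.18, 0.7], scale=0.02, size=(500, 3)).T

lo, hi = [0.0, 0.0, 0.0], [0.3, 0.4, 1.0]
d_a = superlevel_diagram(sample_a, lo, hi)
d_b = superlevel_diagram(sample_b, lo, hi)
print("bottleneck distance:", gudhi.bottleneck_distance(d_a, d_b))
```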

Biological Result

The topological data analysis shows that the Linear ERK model parameter posteriors differ most between the WT and SSDD data. The kinetics of the SSDD variant, which mimics phosphorylated MEK, have the largest topological distance from those of all the other MEK variants.

This biological result raises the question of whether the SSDD variant is a suitable approximation for wild-type MEK activated by Raf, and suggests further experimental studies are needed. While the previous analysis by Yeung et al. (2019) compared the variants by the inferred kinetics of each parameter, here we complement that analysis by comparing the three parameters together as a point cloud.

Our aim is to showcase how systematic algebraic, geometric and topological approaches can be applied to a biologically relevant model with state-of-the-art experimental time-course data. Each of these approaches incorporates the structure of the mathematical model, the experimental setup and observations (e.g. experimental initial conditions, observable species, number of experimental replicates, number of time points collected, etc.), as well as known biological information (e.g. published parameter values). Due to the multiple disciplines and different notation conventions (as well as standard abbreviations), we include a glossary of symbols at the start of the paper. The framework we present is not limited to this case study and may enhance the analysis of similar models in systems and synthetic biology.

2 From ERK Biochemical Reactions to a Polynomial Dynamical System

Protein phosphorylation alters protein function in signalling pathways and plays a crucial role in cellular decisions and homeostasis. Phosphorylation is the addition of a phosphate group by an enzyme known as a kinase, and dephosphorylation is the removal of a phosphate group by an enzyme known as a phosphatase. Multisite phosphorylation is the phosphorylation of a protein at multiple sites, which increases the number of ways in which protein function can be altered. The algebra, geometry, combinatorics and dynamics of multisite phosphorylation have been a source of interesting mathematical problems (Dickenstein 2016; Manrai and Gunawardena 2008; Conradi and Pantea 2019). A protein with q phosphorylation sites has \(2^q\) phospho-states, and its sites can be phosphorylated in \(q!\) possible orders (Thomson and Gunawardena 2009b). One of the simplest multisite phosphorylation systems arises when a protein has two phosphorylation sites. We focus on the sequential dual phosphorylation of the extracellular signal-regulated kinase (ERK) by its activated (dually phosphorylated) kinase MEK. The model developed by Yeung et al. (2019) encodes a mixed phosphorylation mechanism (i.e. distributive and processive) by changes in parameter values rather than by separate models (see, for example, Conradi and Shiu 2015; Gunawardena 2007 and references therein). This enabled them to quantify the extent to which a MEK variant is processive or distributive. We remark that the model presented by Yeung et al. does not include dephosphorylation mechanisms, since the experimental setup omitted the addition of phosphatases.

Next, we introduce the model and the experimental data published by Yeung et al. (2019).

2.1 The Model

The protein substrate ERK is activated through dual phosphorylation by its activated enzyme kinase MEK. As shown in the chemical reaction network (see Fig. 1), unphosphorylated ERK (\(S_0\)) binds reversibly with its kinase MEK (E) to form an intermediate complex \(C_1\). The complex becomes \(C_2\) when a phosphate group is added. Complex \(C_2\) can then dissociate to form MEK (E) and ERK phosphorylated on the tyrosine site (\(S_1\)), or a second phosphate group can be added to \(C_2\), resulting in the products \(E+S_2\). The six species and six rate constants are given in the following chemical reaction network (Fig. 1).

Fig. 1 The reaction network associated with dual phosphorylation of ERK by its activated enzyme kinase MEK

We can translate this reaction network into a dynamical system \({\dot{x}} = f(x,\theta )\). Here, f is a vector-valued function of the vector of species concentrations \(x= (S_0, C_1, C_2, S_1, S_2,E)\) and of the vector of rate constants, referred to as parameters, \(\theta = (k_{f_1},k_{r_1},k_{c_1},k_{f_2}, k_{r_2},k_{c_2})\). The kinetics assumption for f is a modelling choice; here we assume that the law of mass action holds (Klipp et al. 2016, §2.1.1), as for the original model (Yeung et al. 2019). The resulting dynamical system of ODEs is given in Eqs. (1).

$$\begin{aligned} \frac{\mathrm dS_0}{\mathrm dt}&= -k_{f_1}E\cdot S_0+k_{r_1}C_1, \end{aligned}$$
(1a)
$$\begin{aligned} \frac{\mathrm dC_1}{\mathrm dt}&= k_{f_1}E\cdot S_0 -(k_{r_1}+k_{c_1})C_1, \end{aligned}$$
(1b)
$$\begin{aligned} \frac{\mathrm dC_2}{\mathrm dt}&= k_{c_1}C_1-(k_{r_2}+k_{c_2})C_2 + k_{f_2}E\cdot S_1, \end{aligned}$$
(1c)
$$\begin{aligned} \frac{\mathrm dS_1}{\mathrm dt}&= -k_{f_2}E\cdot S_1+k_{r_2}C_2, \end{aligned}$$
(1d)
$$\begin{aligned} \frac{\mathrm dS_2}{\mathrm dt}&=k_{c_2}C_2, \end{aligned}$$
(1e)
$$\begin{aligned} \frac{\mathrm dE}{\mathrm dt}&= -k_{f_1}E\cdot S_0+k_{r_1}C_1-k_{f_2}E\cdot S_1 +(k_{r_2}+k_{c_2})C_2. \end{aligned}$$
(1f)

We assume that initially all species are zero, except for \(S_0(t=0)=S_{tot}\) and \(E(t=0)=E_{tot}\). Equations (2)–(3) define two conserved quantities that constitute a basis for the linear space of conservation relations of the model:

$$\begin{aligned} S_{tot}&=S_0+S_1+S_2+C_1+C_2, \end{aligned}$$
(2)
$$\begin{aligned} E_{tot}&=E+C_1+C_2, \end{aligned}$$
(3)

where the total amounts of substrate ERK (\(S_{tot}\)) and enzyme MEK (\(E_{tot}\)) are constant and known from the initial conditions.
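For readers who wish to explore the Full ERK model numerically, the following sketch integrates Eqs. (1) with scipy and checks the conservation relations (2)–(3). The rate constants are placeholder values chosen only for illustration (they satisfy \(k_{M_i}=25\mu M\)); the initial conditions and time points match the experimental setup described in Sect. 2.2.

```python
import numpy as np
from scipy.integrate import solve_ivp

def full_erk_rhs(t, x, kf1, kr1, kc1, kf2, kr2, kc2):
    """Right-hand side of the Full ERK model, Eqs. (1); x = (S0, C1, C2, S1, S2, E)."""
    S0, C1, C2, S1, S2, E = x
    dS0 = -kf1 * E * S0 + kr1 * C1
    dC1 = kf1 * E * S0 - (kr1 + kc1) * C1
    dC2 = kc1 * C1 - (kr2 + kc2) * C2 + kf2 * E * S1
    dS1 = -kf2 * E * S1 + kr2 * C2
    dS2 = kc2 * C2
    dE = -kf1 * E * S0 + kr1 * C1 - kf2 * E * S1 + (kr2 + kc2) * C2
    return [dS0, dC1, dC2, dS1, dS2, dE]

# Placeholder rate constants (illustrative only); kf in 1/(uM*min), others in 1/min,
# chosen so that k_Mi = (kci + kri)/kfi = 25 uM as reported in Sect. 2.2.2.
theta = dict(kf1=2.0, kr1=25.0, kc1=25.0, kf2=2.0, kr2=25.0, kc2=25.0)
S_tot, E_tot = 5.0, 0.65                      # uM, from the experimental setup
x0 = [S_tot, 0.0, 0.0, 0.0, 0.0, E_tot]       # S0(0)=S_tot, E(0)=E_tot, rest zero
t_obs = [0.5, 2, 3.25, 3.75, 5, 10, 20]       # minutes (WT measurement times)

sol = solve_ivp(full_erk_rhs, (0, 20), x0, args=tuple(theta.values()),
                t_eval=t_obs, method="LSODA", rtol=1e-8, atol=1e-10)
S0, C1, C2, S1, S2, E = sol.y
# The conservation relations (2)-(3) hold along the numerical trajectory:
assert np.allclose(S0 + S1 + S2 + C1 + C2, S_tot, atol=1e-6)
assert np.allclose(E + C1 + C2, E_tot, atol=1e-6)
```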

We aim to study the relationship between the species x, parameters \(\theta \), conserved quantities and available biological information (previous knowledge and experimental observations). The emphasis in this paper is not to analyse the steady-state variety as in Gross et al. (2016), rather here we focus on the transient dynamics of the model and algebraic approaches to analyse ERK kinetics in light of the available biological information.

2.2 The Data

The data are published; details of the measurement techniques and experimental methods can be found in Yeung et al. (2019). Here we present the experimental setup for the data we analyse.

2.2.1 Experimental Setup and Data

In all experiments, \(0.65\mu M\) of free (activated) enzyme MEK (E) was added to \(5\mu M\) of unphosphorylated ERK substrate \((S_0)\) along with ATP; therefore, \(S_{tot}=5\mu M\) and \(E_{tot}=0.65\mu M\). ERK was measured in three states: unphosphorylated \((S_0+C_1)\), mono-phosphorylated \((S_1+C_2)\) and dually phosphorylated \((S_2)\), at 7 time points, with r experimental replicates. The sample space for each MEK variant is \(X={{\,\mathrm{{\mathbb {R}}}\,}}^{3\times 7\times r}\), where for human wild-type (WT) MEK, \(r=11\); for the phosphomimetic MEK variant (SSDD), \(r=6\); and for the activating mutants, \(r=5\). The three activating mutants of MEK are known to be involved in human cancer (E203K) or developmental abnormalities (F53S and Y130C). The ERK observations were collected at seven time points \(t=\{0.5, 2, 3.25, 3.75, 5, 10, 20\}\) minutes for all MEK variants except SSDD, for which they were collected at \(t= \{1, 2, 3.25, 5, 10, 20, 40\}\) minutes.
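For concreteness, the experimental design can be encoded as follows (a bookkeeping sketch only; the variant labels and array layout are our own choices):

```python
# Measurement time points (minutes) and replicate counts for each MEK variant.
WT_TIMES = [0.5, 2, 3.25, 3.75, 5, 10, 20]
TIME_POINTS = {"WT": WT_TIMES, "Y130C": WT_TIMES, "F53S": WT_TIMES,
               "E203K": WT_TIMES, "SSDD": [1, 2, 3.25, 5, 10, 20, 40]}
REPLICATES = {"WT": 11, "SSDD": 6, "Y130C": 5, "F53S": 5, "E203K": 5}
S_TOT, E_TOT = 5.0, 0.65   # uM: total ERK substrate and total activated MEK

# One dataset per variant lives in R^(3 x 7 x r): three observed ERK states
# (S0+C1, S1+C2, S2) at seven time points in r replicates.
data_shape = {v: (3, 7, r) for v, r in REPLICATES.items()}
print(data_shape["WT"])   # (3, 7, 11)
```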

2.2.2 Known Biological Information

The relationship between some kinetic rate constants is known. When a substrate binds reversibly to an enzyme to form an enzyme-substrate complex, which then reacts irreversibly to form a product and release the enzyme, one can define the Michaelis–Menten constant \(k_{M}\). In the reaction network given by Eqs. (1), there are two Michaelis–Menten constants \(k_{M_i}=(k_{c_i}+k_{r_i})/k_{f_i}\) for \(i=1,2\). Measurements show that in our experimental setup \(k_{M_i}\approx 25\mu M\) for \(i=1,2\) (Taylor 2019b). While the reaction rates \(k_{c_i}\) and \(k_{r_i}\) for \(i=1,2\) cannot be measured directly, they have been shown to be of the same order of magnitude (Bar-Even 2011). We will use these insights to treat \(S_0\), \(S_1\) and \(S_2\) as the measured quantities (rather than the compound quantities \(S_0+C_1\) and \(S_1+C_2\)). We justify this mathematically in Sect. 3.3.4.

3 Algebraic Model Reduction

The first step in studying a model of this kind typically involves model reduction, which decreases the number of dependent variables and constant parameters. For many chemical reactions, there are time scales on which the rate of change of some variables is negligible and their dynamics are dominated by those of the remaining variables. This observation motivates the quasi-steady-state approximation (QSSA).

In recent years, algebraic approaches to reducing polynomial ODE models have been extended by Walcher and colleagues. Pantea et al. (2014) used Galois theory to characterise chemical reaction networks for which no explicit QSSA reduction is possible, and provided computational tools for determining the feasibility of an explicit reduction. Subsequently, Sweeney (2017) proved that the non-solvability of polynomials poses no issue for the CRNs most commonly encountered in practice, and derived a more efficient algorithm for determining explicit reducibility by translating algebraic structures into graphs. Goeke and Walcher (2014) provided an explicit formula for obtaining a reduced QSSA model using a subset of an algebraic variety defined by the slow manifold. Subsequently, Goeke et al. (2017) characterised parameter values at which a QSSA reduction is accurate, using algebraic varieties and bounds on the polynomials governing the ODE system on a bounded parameter and variable domain. Most recently, Feliu et al. (2019) derived necessary and sufficient conditions for purely algebraic reductions of a CRN model to agree with model reductions derived via classical singular-perturbation theory (Keener and Sneyd 2011; Segel 1988).

In this section, we briefly review QSSA using classical singular-perturbation theory as well as the algebraic approaches developed by Goeke et al. (2017) and Goeke and Walcher (2014). We then apply both methods to the Full ERK model (Eqs. (1)). We show that both approaches can generate the same QSSA reduction of our ERK model, which we call the Rational ERK model. Additionally, the algebraic method can yield a linear QSSA reduction of our ERK model in a single step (which we call the Linear ERK model). By contrast, the singular-perturbation-theory approach requires additional assumptions on parameter values to arrive at the Linear ERK model (see Sect. 3.3). We show that the Linear ERK model approximates the Full ERK model (Eqs. (1)) with similar accuracy to the Rational ERK model in the context of the experimental setup, data and known biological information (see Sect. 3.3 and Appendix A.2 for details).

With the algebraic method, we provide a rigorous mathematical justification of the Linear ERK model presented by Yeung et al. (2019). By comparing the singular perturbation method with the algebraic method and the two resulting model reductions, we illustrate how the algebraic methods form a well-structured approach for arriving at a QSS reduction and for assessing the accuracy of such reductions systematically.

3.1 Notation for Model Reduction

Throughout, we will assume we have an ODE system in variables \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^n\) and parameters \(\theta \in {{\,\mathrm{{\mathbb {R}}}\,}}^m\). If the system dynamics are governed by f, a vector of polynomials in \({{\,\mathrm{{\mathbb {R}}}\,}}[x,\theta ]^n\), then our ODE system is given by

$$\begin{aligned} \frac{\textrm{d} x}{\textrm{d}t} = f(x,\theta ). \end{aligned}$$
(4)

For \(1\le q<n\), we may define

$$\begin{aligned} x^{[1]}&=(x_1,\ldots ,x_q),&\quad f^{[1]}&=(f_1,\ldots ,f_q),\\ x^{[2]}&=(x_{q+1},\ldots ,x_n),&\quad f^{[2]}&=(f_{q+1},\ldots ,f_n). \end{aligned}$$

We wish to retain the variables \(x^{[1]}\) in the reduced model and seek to eliminate variables \(x^{[2]}\) as part of our model reduction.

For the full ERK model (Eqs. (1)), we choose \(x^{[1]}:=(S_0, S_1, S_2)\) and \(x^{[2]}:=(C_1, C_2)\). Analogously, \(f^{[1]}\) are the polynomials governing the rates of change of \(S_0\), \(S_1\) and \(S_2\) (Eqs. (1a), (1d) and (1e)) and \(f^{[2]}\) are the polynomials governing the rates of change of \(C_1\) and \(C_2\) (Eqs. (1b) and (1c)).

Remark

In the current section, we treat the (non-zero) initial conditions of the ODE systems as parameters (and include them in the parameter count m), as they are central to determining the goodness of a model reduction. In Sect. 4 (Identifiability) and Sect. 5 (Inference & Comparison), we will not include the initial conditions in the set of parameters, as they are given by the experimental setup and, as such, do not need to be identified or inferred.

3.2 The Algebraic QSSA Approach

The algebraic approach to QSSA, as presented by Goeke, Walcher and Zerz in Goeke et al. (2017), differs from the classical approach in several ways. Most notably, an a priori separation of time scales is not needed. On the other hand, we require a choice of fast and slow variables (i.e., a choice of which variables we eliminate, and which we retain in the reduced model).

Remark

To the best of our knowledge, all existing algebraic approaches to QSSA, including (Goeke et al. 2011; Goeke and Walcher 2014; Boulier et al. 2011; Goeke et al. 2017), require a choice, explicit or implicit, of slow and fast variables. In Goeke et al. (2011) the relevant choice is made by expanding f (in Goeke et al. (2011): h) at different orders of \(\varepsilon \).

First, we characterise points in parameter space, i.e. parameter values, at which the fast variables are exactly determined by the slow variables, which yields a reduced model. For a fixed parameter value, the vanishing set of the polynomials governing the ODEs of the fast variables defines an algebraic variety in the space of species concentrations; a parameter value is of the desired type when this variety is invariant under the dynamics. Typically, the ODE system will be degenerate at such parameter values. Secondly, we characterise neighbourhoods of these values in parameter space, as well as time scales, for which the reduction is a good approximation to the original model.

To describe the characterisation from Goeke et al. (2017), we use \(x^{[1]}\), \(x^{[2]}\), \(f^{[1]}\) and \(f^{[2]}\) as before. In addition, we denote the partial derivative with respect to \(x^{[i]}\) by \(D_i\). For a fixed \(\theta ^*\in {{\,\mathrm{{\mathbb {R}}}\,}}^m\), we let \(Y_{\theta ^*}\) denote the algebraic variety defined by \(f^{[2]}(\,\cdot \,,\theta ^*)\).

Definition 2

Let \(y\in Y_{\theta ^*}\) be such that the \((n-q)\times (n-q)\) matrix \(D_2 f^{[2]}\) has full rank at \((y,\theta ^*)\). Then we denote by \(V_{\theta ^*}\subseteq Y_{\theta ^*}\) a relatively Zariski-open neighbourhood of y in which this rank is maximal. We call \(V_{\theta ^*}\) a quasi-steady-state (QSS) variety in the sense of Goeke et al. (2017) and may assume without loss of generality that it is irreducible.

If, furthermore, \(V_{\theta ^*}\) is an invariant set of the ODE system \({\mathrm dx/\mathrm dt}=f(x,\theta ^*)\), then we call \(\theta ^*\) a QSS parameter value. Recall that, in dynamical systems theory, a set \(V_{\theta ^*}\subseteq {{\,\mathrm{{\mathbb {R}}}\,}}^n\) is an invariant set of an ODE system if, whenever the initial condition at \(t=0\) lies in \(V_{\theta ^*}\), the corresponding trajectory remains in \(V_{\theta ^*}\) for all \(t>0\).

Remark

Note that the steady-state variety (see Gross et al. 2016) and the QSS variety at a parameter value \(\theta ^*\) are not as closely related as one may first think. Indeed with our notation, the steady state variety is the zero set in \({{\,\mathrm{{\mathbb {R}}}\,}}^n\times {{\,\mathrm{{\mathbb {R}}}\,}}^m\) of the ideal \(\langle f^{[1]}(x,\theta ),f^{[2]}(x,\theta )\rangle \) of \({{\,\mathrm{{\mathbb {R}}}\,}}[x,\theta ]\), while the QSS variety at \(\theta ^*\) is contained in the zero set in \({{\,\mathrm{{\mathbb {R}}}\,}}^n\times \{\theta ^*\}\) of the ideal \(\langle f^{[2]}(x,\theta ^*)\rangle \) of \({{\,\mathrm{{\mathbb {R}}}\,}}[x]\). That is, we have both \({\mathcal {V}}_{{{\,\mathrm{{\mathbb {R}}}\,}}^n\times {{\,\mathrm{{\mathbb {R}}}\,}}^m}(f^{[1]}(x,\theta ),f^{[2]}(x,\theta ))\subset {\mathcal {V}}_{{{\,\mathrm{{\mathbb {R}}}\,}}^n\times {{\,\mathrm{{\mathbb {R}}}\,}}^m}(f^{[2]}(x,\theta ))\) and \(V_{\theta ^*}\subseteq Y_{\theta ^*}={\mathcal {V}}_{{{\,\mathrm{{\mathbb {R}}}\,}}^n\times \{\theta ^*\}}(f^{[2]}(x,\theta ^*))\subset {\mathcal {V}}_{{{\,\mathrm{{\mathbb {R}}}\,}}^n\times {{\,\mathrm{{\mathbb {R}}}\,}}^m}(f^{[2]}(x,\theta ))\), but the steady-state variety and \(V_{\theta ^*}\) are not contained in one another in general.

To apply the theory of Goeke, Walcher and Zerz in Goeke et al. (2017), we assume that the initial condition of our ODE system (Eq. (4)) lies in \(V_{\theta ^*}\). As \(D_2f^{[2]}\) has full rank on \(V_{\theta ^*}\), we have that \(x^{[2]}=\Psi \left( x^{[1]}\right) \) for some continuous \(\Psi \) by the Implicit Function Theorem. Hence, writing \(x=(x^{[1]}, x^{[2]})\), we obtain a reduced model:

$$\begin{aligned} \frac{\mathrm dx^{[1]}}{\mathrm dt}=f^{[1]}\left( \left( x^{[1]}, \Psi \left( x^{[1]}\right) \right) , {\theta ^*}\right) \end{aligned}$$
(5)

on some open set in \({{\,\mathrm{{\mathbb {R}}}\,}}^q\) over which \(V_{\theta ^*}\) is locally the graph of \(\Psi \). This corresponds to determining the fast variables in terms of the slow variables, as is done in classical QSSA by setting their rates of change equal to zero on the short timescale, with the difference that on \(V_{\theta ^*}\) the above yields an exact solution rather than an approximation. As a caveat, we note that, in both settings, it may not be possible to find an algebraic expression for \(\Psi \); this was pointed out and completely characterised by Pantea et al. (2014) in terms of Galois theory. Because of this possible non-solvability issue with Eq. (5), we require a more general methodology (Proposition 3) to study the accuracy of a model reduction (Proposition 4).

Goeke, Walcher and Zerz showed that locally, in the variable \(x^{[1]}\), the reduced system given by Eq. (5) has the same solution as the following ODE system

$$\begin{aligned} \frac{\mathrm dx^{[1]}}{\mathrm dt}=f^{[1]}\left( x, {\theta ^*}\right) ,\qquad \frac{\mathrm dx^{[2]}}{\mathrm dt}=-D_2 f^{[2]}(x, {\theta ^*})^{-1}D_1f^{[2]}(x, {\theta ^*})f^{[1]}(x, {\theta ^*}). \end{aligned}$$
(6)

Proposition 3

(Lemma 1 & Proposition 1 in Goeke et al. (2017)) Let \(V_{\theta ^*}\) be a QSS-variety. Then \(V_{\theta ^*}\) is an invariant set of Eq. (6). Moreover, any solution of Eq. (6) with initial condition in \(V_{\theta ^*}\) locally solves Eq. (5). Conversely, any solution of Eq. (5) with initial condition in \(V_{\theta ^*}\) locally solves Eq. (6). In addition, \(V_{\theta ^*}\) is an invariant set of Eq. (4) if and only if the solutions of Eqs. (4) and (6) are equal for all initial conditions in \(V_{\theta ^*}\).

This proposition equips us with a method to obtain a solution for \(x^{[1]}\) in an algebraic QSSA without explicitly determining \(\Psi \). In Sects. 4 and 5, we will use Eq. (5) as our model reduction.

First, however, we assess the accuracy of Eq. (6) as an approximation to the full system, for parameter values \(\theta \) in some neighbourhood of \(\theta ^*\). For convenience, we abbreviate system (6) as \({\mathrm dx/\mathrm dt}=f_\textrm{red}(x,\theta ^*)\).

Proposition 4

(Outline of Proposition 2 in Goeke et al. (2017)) Let \(K^*\subseteq {{\,\mathrm{{\mathbb {R}}}\,}}_+^n\times {{\,\mathrm{{\mathbb {R}}}\,}}_+^m\) be a compact domain in the product of the variable and parameter spaces which satisfies a number of conditions (we refer the interested reader to Appendix A.2 for details). Let \({\theta ^*}\) be given such that \(V_{\theta ^*}\times \{{\theta ^*}\}\) has non-empty intersection with \(\textrm{int}\,K^*\), let \((y,{\theta ^*})\) be a point in this intersection, and let \(V'_{\theta ^*}\) be an open neighbourhood of y such that \((V_{\theta ^*}\cap V'_{\theta ^*})\times \{{\theta ^*}\}\subseteq K^*\). Additionally, let \(t^*>0\) be such that the solution of Eq. (4), with initial condition y, remains in \(V'_{\theta ^*}\) for \(t\in [0,t^*]\).

Then there exists a compact neighbourhood \(A_{\theta ^*}\subseteq V_{\theta ^*}\) of y such that:

(i):

For every \(z\in A_{\theta ^*}\), the solution of Eq. (4) with initial condition z exists and remains in \(V'_{\theta ^*}\) for \(t\in [0, t^*]\).

(ii):

For every \(\varepsilon '>0\), there exists a \(\delta _1>0\) such that for every \(z\in V'_{\theta ^*}\cap A_{\theta ^*}\) the solution of Eq. (6), with initial condition z, exists and remains in \(V'_{\theta ^*}\) for \(t\in [0,t^*]\) whenever \(\Vert f-f_\textrm{red}\Vert <\delta _1\) on \(V_{\theta ^*}\).

(iii):

For every \(\varepsilon '>0\), there exists a \(\delta \in (0,\delta _1]\) such that, for any \(z\in V_{\theta ^*}\cap A_{\theta ^*}\), the difference between the solutions of Eqs. (6) and (4), with initial condition z, is at most \(\varepsilon '\) for \(t\in [0,t^*]\) whenever \(\Vert f-f_\textrm{red}\Vert <\delta \) on \(V'_{\theta ^*}\). Here, \(\Vert \,\cdot \,\Vert \) denotes the infinity-norm over the interval \([0, t^*]\) for a fixed parameter value.

In summary, given some technical assumptions on the variables and the domain \(K^*\), we can bound the difference between the solutions of Eqs. (4) and (6) in terms of \(\Vert f-f_\textrm{red}\Vert \) up to some time \(t^*>0\). The full statement of this proposition also includes lower bounds on this difference. Note that we do not assume that \(\theta ^*\) is a QSS-parameter value, but the assumptions on \(K^*\) (as detailed in Appendix A.2) require it to be close to some QSS-parameter value.

3.3 Reducing the ERK Model Algebraically

We now apply the theory from Sect. 3.2 to the Full ERK model (Eqs. (1)) in two different ways to derive two reduced models (the Linear ERK and Rational ERK models). The full details of the derivations can be found in Appendix A.1. We also give a brief biological explanation of why both systems explain the phenomena underlying the given experimental data equally well.

3.3.1 Reduction via Conservation Laws

We can exploit the conservation laws (2) and (3) to eliminate a variable before using the analytic or algebraic QSSA approach. First, we choose to eliminate E and note that there are two choices:

$$\begin{aligned} E=E_{tot}-C_1-C_2 \end{aligned}$$
(7)

or

$$\begin{aligned} E=E_{tot}-S_{tot}+S_0+S_1+S_2. \end{aligned}$$
(8)

Subsequently, we choose to eliminate the variables \(C_1\) and \(C_2\) via (algebraic) QSSA. For the Rational ERK model, using (7) to eliminate E, we obtain

$$\begin{aligned} f_\textrm{rat}^{[2]}=\begin{pmatrix}k_{f_1}(E_{tot}-C_1-C_2)\cdot S_0 -(k_{r_1}+k_{c_1})C_1\\ k_{c_1}C_1 + k_{f_2}(E_{tot}-C_1-C_2)\cdot S_1 -(k_{r_2}+k_{c_2})C_2\end{pmatrix}, \end{aligned}$$

while for the Linear ERK model, employing substitution (8), we have

$$\begin{aligned} f^{[2]}_\textrm{lin}=\begin{pmatrix}k_{f_1}(E_{tot}-S_{tot}+S_0+S_1+S_2)\cdot S_0 -(k_{r_1}+k_{c_1})C_1\\ k_{c_1}C_1 + k_{f_2}(E_{tot}-S_{tot}+S_0+S_1+S_2)\cdot S_1 -(k_{r_2}+k_{c_2})C_2\end{pmatrix}. \end{aligned}$$

3.3.2 Reduction via an Algebraic QSSA

To reduce the model further, we apply an algebraic QSSA, as described in Sect. 3.2. We start by identifying QSS parameter values. For \(f^{[2]}_\textrm{rat}\), we have

$$\begin{aligned} D_2f^{[2]}_\textrm{rat}=\begin{bmatrix}-k_{f_1}S_0-(k_{r_1}+k_{c_1}) & -k_{f_1}S_0 \\ -k_{f_2}S_1+k_{c_1} & -k_{f_2}S_1-(k_{r_2}+k_{c_2})\end{bmatrix}, \end{aligned}$$

while for \(f^{[2]}_\textrm{lin}\) we have

$$\begin{aligned} D_2f^{[2]}_\textrm{lin}=\begin{bmatrix}-(k_{r_1}+k_{c_1}) & 0 \\ k_{c_1} & -(k_{r_2}+k_{c_2})\end{bmatrix}. \end{aligned}$$

In both cases, assuming that \((k_{r_i}+k_{c_i})>0\) for \(i=1,2\) (otherwise, the reaction network would be degenerate, meaning some or all variables would remain constant), and given that \(S_0\) and \(S_1\) are non-constant, we deduce that these matrices are invertible. Hence, both substitutions (7) and (8) are good candidates for an algebraic QSSA reduction.
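These Jacobians, and the reduced right-hand side prescribed by Eq. (6), can be checked symbolically. The following sketch does this for the linear substitution (8) using sympy; it is a verification aid, not part of the original analysis.

```python
import sympy as sp

S0, S1, S2, C1, C2 = sp.symbols("S0 S1 S2 C1 C2", nonnegative=True)
kf1, kr1, kc1, kf2, kr2, kc2, Etot, Stot = sp.symbols(
    "k_f1 k_r1 k_c1 k_f2 k_r2 k_c2 E_tot S_tot", positive=True)

E = Etot - Stot + S0 + S1 + S2            # substitution (8)
x1 = sp.Matrix([S0, S1, S2])              # slow variables x^[1]
x2 = sp.Matrix([C1, C2])                  # fast variables x^[2]

f1 = sp.Matrix([-kf1*E*S0 + kr1*C1,                      # dS0/dt
                -kf2*E*S1 + kr2*C2,                      # dS1/dt
                 kc2*C2])                                # dS2/dt
f2 = sp.Matrix([kf1*E*S0 - (kr1 + kc1)*C1,               # dC1/dt
                kc1*C1 + kf2*E*S1 - (kr2 + kc2)*C2])     # dC2/dt

D1, D2 = f2.jacobian(x1), f2.jacobian(x2)
print(D2)                  # lower triangular; det = (k_r1+k_c1)(k_r2+k_c2) > 0
# Fast-variable right-hand side of the reduced system, Eq. (6):
f2_red = -D2.inv() * D1 * f1
print(sp.simplify(f2_red))
```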

We note that the assumption \(E_{tot}=0\) is required to ensure that the initial condition lies in \(V_{\theta ^*}\). This is not physically realistic, as the absence of free enzyme would make the reaction rates negligible; however, in parameter space this assumption is close to the experimental setup (\(E_{tot}\approx 0.65\mu M\)). In fact, unlike the rate parameters, we know the value of \(E_{tot}\) and can, therefore, bound the error associated with such an idealisation (cf. Appendix A.2). The assumption that \(E_{tot}=0\) is similar to the classical singular-perturbation theory approach, where the short timescale is set by the rate \(E_{tot}k_{f_1}\) and one subsequently assumes \(\varepsilon = E_{tot}/S_{tot}\rightarrow 0\).

As \(E_{tot}=0\) will yield a stationary model and ensure that \(V_{\theta ^*}\) contains the initial condition, we find that any parameter value \(\theta ^*\) satisfying \((k_{r_i}+k_{c_i})>0\) for \(i=1,2\) and \(E_{tot}=0\) is a QSS-parameter-value for both the Rational and Linear ERK model.

For both models, we have

$$\begin{aligned} Y_{\theta ^*}=\left\{ x=(S_0,S_1,S_2,C_1,C_2)\in {{\,\mathrm{{\mathbb {R}}}\,}}^5\,\vert \,f^{[2]}(x,\theta ^*)=0\right\} . \end{aligned}$$

For the Linear ERK model, we can show that \(Y_{\theta ^*}^\textrm{lin}\) is irreducible (at generic parameter values) and thus its QSS-variety is \(V_{\theta ^*}^\textrm{lin}=Y_{\theta ^*}^\textrm{lin}\). For the Rational ERK model, we have that \(Y_{\theta ^*}^\textrm{rat}\) decomposes as

$$\begin{aligned} Y_{\theta ^*}^\textrm{rat}=(Y_{\theta ^*}^\textrm{rat}\cap {\mathcal {V}}(\langle C_1+C_2\rangle ))\cup (Y_{\theta ^*}^\textrm{rat}\cap {\mathcal {V}}(\langle \lambda (k_{r_2}+k_{c_2})+S_0+\lambda k_{f_2}S_1\rangle )) \end{aligned}$$

where \(\lambda :=-k_{r_1}/(k_{f_1}(k_{c_1}-k_{c_2}-k_{r_1}))\). At generic parameter values, only the first irreducible component will contain the initial condition. Hence, the natural choice for the QSS-variety is

$$\begin{aligned} V^\textrm{rat}_{\theta ^*}=\left\{ x=(S_0,S_1,S_2,C_1,C_2)\in {{\,\mathrm{{\mathbb {R}}}\,}}^5\,\vert \,C_1=0,\,C_2=0\right\} . \end{aligned}$$

The substitution (7) yields the Rational ERK model given by

$$\begin{aligned} \frac{\mathrm dS_0}{\mathrm dt}&= \frac{-\kappa _1S_0}{ \gamma _1 S_0+\gamma _2S_1+1}, \end{aligned}$$
(9a)
$$\begin{aligned} \frac{\mathrm dS_1}{\mathrm dt}&=\frac{-\kappa _2S_1+(1-\pi )\kappa _1S_0}{ \gamma _1S_0+\gamma _2S_1+1}, \end{aligned}$$
(9b)
$$\begin{aligned} \frac{\mathrm dS_2}{\mathrm dt}&=\frac{\pi \kappa _1S_0 + \kappa _2S_1}{ \gamma _1S_0+\gamma _2S_1+1}, \end{aligned}$$
(9c)

while the substitution (8) gives the Linear ERK model:

$$\begin{aligned} \frac{\mathrm dS_0}{\mathrm dt}&= {-\kappa _1S_0}, \end{aligned}$$
(10a)
$$\begin{aligned} \frac{\mathrm dS_1}{\mathrm dt}&={-\kappa _2S_1+(1-\pi )\kappa _1S_0}, \end{aligned}$$
(10b)
$$\begin{aligned} \frac{\mathrm dS_2}{\mathrm dt}&= \pi \kappa _1S_0 + \kappa _2S_1. \end{aligned}$$
(10c)

Here, for \(i=1,2\), we use the newly introduced quantities

$$\begin{aligned} \kappa _i=E_{tot}\frac{k_{f_i}k_{c_i}}{ k_{c_i}+k_{r_i}},\qquad \pi =\frac{k_{c_2}}{ k_{c_2}+k_{r_2}},\qquad \gamma _i=k_{f_i}\frac{k_{c_1}+k_{c_2}}{\left( k_{c_1}+k_{r_1}\right) \left( k_{c_2}+k_{r_2}\right) }. \end{aligned}$$
(11)

Both models are reductions obtained via the ODE system (5). The processivity parameter, i.e. the probability that both phosphorylations are carried out by the same enzyme, is represented by \(\pi \) in the reduced models, while \(\kappa _1\) and \(\kappa _2\) represent the kinetic efficiencies of the first and second phosphorylation steps, respectively (Yeung et al. 2019).

It should be noted that the Rational ERK model is the system we would obtain via the classical singular perturbation approach (Keener and Sneyd 2011).
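Since Eqs. (10) are linear with constant coefficients, they can be integrated in closed form. Under the experimental initial conditions \(S_0(0)=S_{tot}\), \(S_1(0)=S_2(0)=0\) and assuming \(\kappa _1\ne \kappa _2\), a short calculation gives

$$\begin{aligned} S_0(t)&=S_{tot}\,e^{-\kappa _1 t},\\ S_1(t)&=\frac{(1-\pi )\kappa _1 S_{tot}}{\kappa _2-\kappa _1}\left( e^{-\kappa _1 t}-e^{-\kappa _2 t}\right) ,\\ S_2(t)&=S_{tot}-S_0(t)-S_1(t). \end{aligned}$$

These expressions are used again in Sect. 4, where the existence of analytic solutions underpins the identifiability analysis of the Linear ERK model.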

3.3.3 Assessing Accuracy

We can use the algebraic framework of Goeke, Walcher and Zerz and, in particular, Proposition 4 to bound the error of the Linear ERK model reduction relative to the full model. Given the measurements of the Michaelis–Menten constants \(k_{M_i}\), we can derive simple expressions which bound the approximation error (see Appendix A.2 for both the Rational and Linear ERK models). Unfortunately, the bound on the approximation error depends on parameters with unknown values. However, we can compare the bounds derived for the Linear ERK model to those for the Rational ERK model and show that, in the regime where \(k_{M_i}\approx 25\mu M\), both approximate the full model equally well (see Appendix A.2).

Recall that we can also derive the Rational ERK model via singular perturbation theory. When using perturbation theory, it is uncommon to bound the approximation error as explicitly as we do via the algebraic methods of Goeke et al. (2017). However, we can still show that the Linear ERK model is a good approximation of the Rational ERK model when \(0\le \gamma _1,\gamma _2\ll 1\). Again, we can use knowledge of the Michaelis-Menten constant to show that in our experimental setup, \(\gamma _1\) and \(\gamma _2\) are small. Indeed, we can rewrite

$$\begin{aligned} \gamma _1=\frac{1}{k_{M_1}}\frac{k_{c_1}+k_{c_2}}{ k_{c_2}+k_{r_2}}, \qquad \gamma _2=\frac{1}{ k_{M_2}}\frac{k_{c_1}+k_{c_2}}{ k_{c_1}+k_{r_1}}. \end{aligned}$$

Since \(k_{M_i}\approx 25\mu M\) and the parameters \(k_{c_i}\) and \(k_{r_i}\) are of similar magnitude (see Bar-Even 2011), we conclude that \(\gamma _1,\gamma _2\approx 1/25\;(1/\mu M)\). Since \(S_0+S_1\le S_{tot}=5\mu M\), this gives \(\gamma _1 S_0+\gamma _2 S_1\lesssim 0.2\ll 1\).

We reiterate that, by employing an algebraic approach, we can derive a reduced model (without taking further limits) that approximates the Full ERK model as well as the reduction obtained via singular perturbation theory, but has several advantages: it has fewer parameters, is interpretable as a chemical reaction network, and is identifiable, as discussed in the next section.

3.3.4 Choice of Output Variables

Recall from Sect. 2.2 that the experimental measurements correspond to the following linear combinations of variables: \(S_0+C_1\), \(S_1+C_2\) and \(S_2\). Here we argue that, in the context of the available data, \(S_0\), \(S_1\) and \(S_2\) are sufficient approximations of the output variables, which simplifies both the identifiability analysis and the parameter inference.

We argued in Sect. 3.3.3 that, in the context of the experimental data, the Linear ERK model is as good an approximation to the Full ERK model as the Rational ERK model. On the long timescale, the substitutions for \(C_1\) and \(C_2\) from the Linear ERK model give approximately

$$\begin{aligned} C_1&=\frac{1}{k_{M_1}}E_{tot}\cdot S_0, \\ C_2&=\frac{1}{k_{M_2}}E_{tot}\cdot S_1 + \frac{k_{c_1}}{k_{c_2}+k_{r_2}}\frac{1}{k_{M_1}}E_{tot}\cdot S_0. \end{aligned}$$

Recall that \(k_{M_i}\approx 25\mu M\) and \(E_{tot}=0.65\mu M\). We then find that the measurements of \(S_i + C_{i+1}\) will be dominated by \(S_i\). Henceforth we will use \(S_i\) interchangeably with our measurements \(S_i+C_{i+1}\).
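To see why this is justified numerically, combine the expressions above with the measured values \(k_{M_i}\approx 25\mu M\) and \(E_{tot}=0.65\mu M\):

$$\begin{aligned} \frac{C_1}{S_0}\approx \frac{E_{tot}}{k_{M_1}}\approx \frac{0.65}{25}\approx 0.026, \end{aligned}$$

and, using that \(k_{c_1}\) and \(k_{c_2}+k_{r_2}\) are of the same order of magnitude, \(C_2\) is a similarly small multiple of \(S_0\) and \(S_1\). Thus each complex contributes only a few per cent to the corresponding measured quantity \(S_i+C_{i+1}\).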

4 Identifiability

One of the goals of this ERK study is to determine the kinetic parameters of the models given the data. Each model and experimental setup induces a map from the space of model parameters to observable model solutions (here, the measurements of the 3 species at the 7 time points over the course of r experimental replicates, i.e. a subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^{21r}\)). We call this map \(\phi _{t_1,\ldots ,t_7}:\Theta \rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}^{21r}\) the model prediction map (see Dufresne et al. 2018). Here, the parameter space \(\Theta \) is a subset of the positive octant \({{\,\mathrm{{\mathbb {R}}}\,}}_{\ge 0}^6\) for the Full ERK model, \({{\,\mathrm{{\mathbb {R}}}\,}}_{\ge 0}^5\) for the Rational ERK model, and \({{\,\mathrm{{\mathbb {R}}}\,}}_{\ge 0}^3\) for the Linear ERK model. One can think of the data as a point \(z^*\) in the space of observable model solutions, i.e. \({{\,\mathrm{{\mathbb {R}}}\,}}^{21r}\), and parameter estimation corresponds to attempting to compute the inverse image \(\phi _{t_1,\ldots ,t_7}^{-1}(z^*)\) of this map at that point. Structural identifiability generally corresponds to the model prediction map \(\phi _{t_1,\ldots ,t_7}\) being injective. Real-world observations are noisy; hence, the data point \(z^*\) may not lie in the image of the map \(\phi _{t_1,\ldots ,t_7}\). Thus, when performing parameter estimation, we instead search for parameters yielding model predictions close to the data point \(z^*\). Practical identifiability broadly corresponds to the set of parameters whose model predictions are close to the data point \(z^*\) being bounded. In Sect. 4.1 we show that the Linear ERK model is structurally identifiable on its whole parameter space, while the Rational ERK model and the Full ERK model are structurally identifiable on some open dense subset of their parameter space. In Sect. 4.2 we show that the Linear ERK model is practically identifiable for our experimental data, providing the proof of Theorem 1. By contrast, we provide evidence that the Rational ERK model and Full ERK model are not practically identifiable.

4.1 Structural Identifiability

First, we study the structural identifiability of our ODE models, that is whether the model prediction map \(\phi _{t_1,\ldots ,t_7}:\Theta \rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}^{21r}\) is one-to-one, or at least locally one-to-one. We start by providing a formal definition of structural identifiability for models given by ODE systems with specific time points. Suppose we have a rational ODE system in variables \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^n\) and parameters \(\theta \in {{\,\mathrm{{\mathbb {R}}}\,}}^m\), given by

$$\begin{aligned} \frac{\textrm{d} x}{\textrm{d}t} = f(x,\theta ), \end{aligned}$$
(12)

where f is a vector of rational functions in \({{\,\mathrm{{\mathbb {R}}}\,}}(x,\theta )^n\). We assume that the measurable output is \(y=g(x,\theta )\) where g is also a vector of rational functions. Let \({\hat{x}}(\theta ,t)\) be a solution of (12) for the parameter value \(\theta \in \Theta \) and then let \({\hat{y}}(\theta , t)=g({\hat{x}}(\theta ,t),\theta )\) be the observable solution for the same parameter value. Then, supposing that there are r replicates of the experiment, for the specific time points \(t_1,\ldots ,t_l\) the model prediction map is given by

$$\begin{aligned} \phi _{t_1,\ldots ,t_l}(\theta )=\underbrace{({\hat{y}}(\theta ,t_1),\ldots ,{\hat{y}}(\theta ,t_l),\dots ,{\hat{y}}(\theta ,t_1),\ldots ,{\hat{y}}(\theta ,t_l))}_{r {\text {times}}}. \end{aligned}$$

The model prediction map then induces an equivalence relation \(\sim _{t_1,\ldots ,t_l}\) on the parameter space \(\Theta \) via

$$\begin{aligned} \theta \sim _{t_1,\ldots ,t_l}\theta ' {\text { if and only if }} \phi _{t_1,\ldots , t_l}(\theta )=\phi _{t_1,\ldots , t_l}(\theta '), \end{aligned}$$

for any \(\theta ,\theta '\in \Theta \).

Definition 5

(c.f. Definition 2.8 in Dufresne et al. (2018)) Suppose we have a model given by a system of rational ODEs (as above) with parameter space \(\Theta \) and model prediction map \(\phi _{t_1,\ldots ,t_l}\). We say a model is:

  • globally identifiable if every equivalence class of \(\sim _{t_1,\ldots ,t_l}\) on \(\Theta \) has size exactly 1.

  • generically identifiable if for almost all \(\theta \in \Theta \) the equivalence class of \(\theta \) has size exactly 1.

  • locally identifiable if for almost all \(\theta \in \Theta \) the equivalence class of \(\theta \) is finite.

  • generically non-identifiable if for almost all \(\theta \in \Theta \) the equivalence class of \(\theta \) is infinite.

Here “almost all” means everywhere except possibly in a closed subvariety (i.e. the set of common zeroes of some polynomials).

There are several approaches to assessing structural identifiability. All identifiability methods involve a certain number of genericity assumptions, although these are not always made explicit (see, for example, discussions in Ovchinnikov et al. (2021), Hong et al. (2020), Joubert et al. (2021), Saccomani et al. (2003), Villaverde et al. (2018), Villaverde et al. (2019)). First, all methods assume that one has access to the whole trajectory of the observable output, and so they consider the size of the equivalence classes of the equivalence relation \(\sim _\infty \) on \(\Theta \), defined as

$$\begin{aligned} \theta \sim _\infty \theta ' {\text { if and only if }} {\hat{y}}(\theta ,t)= {\hat{y}}(\theta ',t) {\text { for all }} t\ge 0. \end{aligned}$$

For rational ODE models with time-series data, as considered here, a result of Sontag (2002) proves that if at least \(2m+1\) generic time points are observed, where m is the dimension of the parameter space, then the equivalence relation \(\sim _{t_1,\ldots ,t_{2m+1}}\) coincides with the equivalence relation \(\sim _\infty \). If there are fewer time points, or if they are not generic, it could be that almost all equivalence classes of \(\sim _\infty \) have size 1 while those of \(\sim _{t_1,\ldots ,t_l}\) are larger. For the Linear ERK model, the parameter space has dimension 3, so we have enough time points, although we do not know a priori whether they are generic. In fact, this model admits analytic solutions (see Sect. 4.2), so we can build the model prediction map explicitly and determine its identifiability directly. By a straightforward computation, we can show that for any choice of three distinct non-zero time points, the model prediction map \(\phi _{t_1,t_2,t_3}\) of the Linear ERK model is injective, and so the model is globally structurally identifiable (see Appendix A.4 for details). In particular, it follows that any choice of three distinct time points is generic. For the Rational ERK model and the Full ERK model, the parameter space has dimension 5 and 6, respectively; hence, we may not have enough time points, and we cannot determine whether the structural identifiability results below hold for these specific model prediction maps. Indeed, these two models are non-linear and do not admit analytic solutions that would allow us to make the same argument as for the Linear ERK model. This is an instance of a more general open problem:

Open Problem 6

Find and implement an algorithm to determine structural identifiability of a rational ODE model with time series data at specific given time points \(\{t_1,\ldots ,t_l\}\).
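For the Linear ERK model, this difficulty does not arise because of the closed-form solution recorded at the end of Sect. 3.3.2. As a sketch of the kind of computation behind the statement above (the full argument is in Appendix A.4), assume \(\kappa _1\ne \kappa _2\) and three distinct non-zero time points \(t_1,t_2,t_3\); then

$$\begin{aligned} \kappa _1=-\frac{1}{t_1}\log \frac{S_0(t_1)}{S_{tot}},\qquad \frac{S_1(t_2)}{S_1(t_3)}=\frac{e^{-\kappa _1 t_2}-e^{-\kappa _2 t_2}}{e^{-\kappa _1 t_3}-e^{-\kappa _2 t_3}}, \end{aligned}$$

so \(\kappa _1\) is determined by \(S_0\) at a single time point, the second relation involves only \(\kappa _2\), and, once \(\kappa _2\) is known, \(\pi \) is recovered linearly from \(S_1(t_2)\).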

Methods to assess the structural identifiability of ODE models include the classical approach via Taylor series (Pohjanpalo 1978) and generating series (Grewal and Glover 1976), and, more recently, approaches based on differential algebra (Audoly 2001; Saccomani et al. 2003; Hong et al. 2020). In this paper, we use SIAN (Hong et al. 2019), an approach based on differential algebra implemented in Maple (2019).

Similar to other methods based on differential algebra (for example, the method implemented in DAISY (Bellu et al. 2007)), SIAN is based on the differential Nullstellensatz (Ritt 1950, Chapter 1) or (Seidenberg 1952, Section 4). For a differentially closed field \({\mathbb {K}}\), this theorem establishes a correspondence between radical differential ideals and differentially closed subsets of \({\mathbb {K}}^n\). In the context of an ODE system, this implies that the solutions of the ODE system are completely determined by a prime differential ideal in a differential ring (see below). Criteria for identifiability can then be extracted from the ideal (or the quotient ring). The requirement that \({\mathbb {K}}\) is differentially closed then means that the solutions in question are possibly complex-valued, and the identifiability results will be about complex parameters, whether this is stated explicitly or not. For this reason, Hong et al. (2020) state their definition for complex parameters.

Remark

As mentioned above, the first difference between our definition of identifiability and Hong et al.’s is that their parameter space is a subset of \({{\,\mathrm{{\mathbb {C}}}\,}}^n\) instead of \({{\,\mathrm{{\mathbb {R}}}\,}}^n\). A second difference to note is that what Hong et al. (2020) call “globally identifiable” corresponds to what we call generically identifiable. Finally, Hong et al.’s (2020) definition is written for components of the parameters and makes the notion of “almost all” more precise.

The starting point is an ODE system of the same form as in Eq. (12) together with the initial condition \(x(0)=x_0\). Let Q be the least common multiple of all the polynomials appearing in the denominators in f and g. Then we have \(f=F/Q\) and \(g=G/Q\) where F and G are polynomial functions. Note that SIAN usually views the initial conditions as additional unknown components of the parameter that one may want to identify. The differential ring of interest is the differential ring \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}\) (the differential ring in indeterminates x and y over the fraction field \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\), i.e. the field of complex rational functions in the parameters). We can think of this ring as a polynomial ring in infinitely many indeterminates: \(\theta \), x, y and the infinitely many higher derivatives of x and y (i.e. \(x^{(i)}\) and \(y^{(i)}\) for \(i\ge 1\)). We are interested in the differential ideal \(I_{\Sigma }\) of \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}\) given by

$$\begin{aligned} I_{\Sigma }:=\left( (Q\dot{x_i}-F_i)^{(j)},\,(Qy_k-G_k)^{(j)}\mid 1\le i\le n,\, 1\le k\le m,\, j\ge 0\right) :Q^\infty , \end{aligned}$$
(13)

where for non-empty subsets \(T, S\) of a ring R, the set \(T:S^\infty \) is defined as follows:

$$\begin{aligned} T:S^\infty :=\{r\in R\mid {\text {there exist }} s\in S, \,n\in {{\,\mathrm{{\mathbb {Z}}}\,}}_{\ge 0} {\text { such that }} s^nr\in T\}. \end{aligned}$$

Note that for polynomial systems like the Full ERK model and the Linear ERK model, we have \(Q=1\), and so the colon (saturation) operation is not needed and the ideal \(I_{\Sigma }\) is simply the differential ideal generated by the equations defining the ODE system and their derivatives. The ideal \(I_{\Sigma }\) is the ideal of all differential polynomials in \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}\) that vanish on the solutions of the ODE system (12) (Saccomani et al. 2003; Hong et al. 2020).

The ideal \(I_{\Sigma }\) is prime (Hong et al. 2020) and so the quotient ring \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}/I_{\Sigma }\) is an integral domain. Let \({\mathbb {K}}\) be the field of fractions of the domain \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}/I_{\Sigma }\), and let \(\Bbbk \) be the subfield of \({\mathbb {K}}\) generated by the image of \({{\,\mathrm{{\mathbb {C}}}\,}}\{y\}\), that is, the subfield generated by the elements of the form \(y_i+I_{\Sigma }\). We can now state the non-constructive algebraic criterion for structural identifiability:

Proposition 7

(c.f. Proposition 3.4 in Hong et al. 2020) Suppose we have a model given by a system of rational ODEs as described above.

  • If the fields \(\Bbbk \) and \(\Bbbk (\theta )\) coincide, then the model is generically identifiable.

  • If the field extension \(\Bbbk \subseteq \Bbbk (\theta )\) is algebraic, then the model is locally identifiable.
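To illustrate the criterion on a toy example of our own (not one taken from Hong et al. 2020), consider the one-dimensional model \(\dot{x}=-\theta x\) with output \(y=x\). In the quotient ring we have \(\dot{y}=\dot{x}=-\theta x=-\theta y\), so that

$$\begin{aligned} \theta =-\frac{\dot{y}}{y}\in \Bbbk . \end{aligned}$$

Hence \(\Bbbk =\Bbbk (\theta )\) and the model is generically identifiable; this is the algebraic counterpart of reading off \(\theta \) from the observed exponential decay rate.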

Remark

Note that Proposition 3.4 in Hong et al. (2020) implies that \(\Bbbk \) and \(\Bbbk (\theta )\) coincide (respectively the field extension \(\Bbbk \subseteq \Bbbk (\theta )\) is algebraic) if and only if the model is globally identifiable (respectively, locally identifiable) in the sense of Hong et al. (2020). We are interested in something weaker; we only wish to identify parameters in the parameter space \(\Theta \), which is a subset of the real positive octant.

The criterion provided by the proposition above is not constructive, as it involves the field of rational functions of an infinitely generated \({{\,\mathrm{{\mathbb {C}}}\,}}\)-algebra. Hong et al. (2020) go on to provide a constructive version of the criterion (Hong et al. 2020, Section 3). The software SIAN (Hong et al. 2019), which we use here, is in turn based on a probabilistic version of the criterion (Hong et al. 2020, Section 4). Note that local identifiability is determined via the Taylor series approach.

We now consider the issue of initial conditions. As mentioned above, by default, SIAN considers the initial conditions as parameters that one may wish to identify. Other methods, like the differential algebra method as implemented in DAISY (Bellu et al. 2007), do not explicitly address initial conditions. Ovchinnikov et al. show in (Ovchinnikov et al. 2021, Theorem 19) that input-output identifiability corresponds to what they call multiple experiment identifiability, that is, identifiability from sufficiently many generic initial conditions. DAISY and COMBOS verify input-output identifiability (Meshkat et al. 2014).

Using SIAN (Hong et al. 2019), we verify that all three models are generically identifiable. In particular, in all three models all parameters are generically globally identifiable. Recall that this result is valid under the assumption that we have measurements at sufficiently many generic time points, and for generic initial conditions. Inspired by the discussion in Saccomani et al. (2003), in Appendix A.3 we show that the set of differential polynomials in \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}\) vanishing on those solutions of system (12) with initial conditions \(S_0(0)=5\mu M\) and \(S_1(0)=S_2(0)=0\mu M\) for all three models, as well as \(C_1(0)=C_2(0)=0\mu M\) and \(E(0)=0.65\mu M\) for the Full ERK model, coincides with the ideal \(I_\Sigma \). This means that the set of solutions with initial conditions corresponding to our experimental setup is dense in the set of all solutions for the Kolchin topology (induced by the differential ideals of \({{\,\mathrm{{\mathbb {C}}}\,}}(\theta )\{x,y\}\)). We can, therefore, conclude that the initial conditions specific to the experimental setup are indeed generic, and so our structural identifiability results hold for them.

Remark

Using SIAN, we can show that the Full ERK model is also generically identifiable with the measurable outputs \(S_0+C_1\), \(S_1+C_2\) and \(S_2\), which is what was actually measured experimentally (see Sect. 3.2.4).

4.2 Practical Identifiability

If a model is generically identifiable then, generically, distinct parameters produce distinct data points. However, if there are parameter values that are arbitrarily far from one another but produce data points close to each other, parameter estimation will not be meaningful in practice. Practical (non-)identifiability aims to categorise models exhibiting such undesirable behaviour. For example, the notions of sloppiness (Gutenkunst 2007), uncertainty quantification (Smith 2013) and filtering problems (Shi et al. 1999) address mathematical models with a similar aim. We use a definition of practical identifiability introduced in Dufresne et al. (2018), which was adapted from the definition given in Raue et al. (2009).

Practical identifiability depends on more than the defining equations and the specification of the input and output of the model. Practical identifiability is influenced by the precise choice of time points, the method used for parameter estimation, the assumptions on the measurement noise of the data, and the way we measure distances in parameter space. It may also vary across regions of data space. A data point \(z^*\) is an experimental observation in the form of an N-dimensional vector whose entries are the observed values of the measured variables at each of the specific time points for each replicate of the experiment. We focus on practical identifiability for maximum likelihood estimation (MLE), one of the most widely used methods for parameter estimation (see, for example, Ljung et al. 1987). Accordingly, in the remainder of this section, we consider models \(({\mathcal {M}},\phi _{t_1,\ldots ,t_s},\psi ,d_\Theta )\) with a precise choice of model prediction map \(\phi _{t_1,\ldots ,t_s}\) with specific time points \({t_1,\ldots ,t_s}\), a specific assumption for the probability distribution \(\psi \) of the measurement noise and a choice of reference metric \(d_\Theta \) on the parameter space \(\Theta \). We also assume that the model considered is at least generically identifiable, so that MLEs exist and are unique for generic data (see (Dufresne et al. 2018, Proposition 4.15)). We write \({\hat{\theta }}(z^*)\) to denote the MLE for \(z^*\), that is, \({\hat{\theta }}(z^*):={\text {argmax}}_{\theta \in \Theta }\psi (\theta ,z^*)\).

We define a \(\delta \)-confidence region \({U}_\delta (z^*)\) as follows:

$$\begin{aligned} U_{\delta }(z^*) := \{\theta \in \Theta \mid - \log \psi (\theta ,z^*) < \delta \}. \end{aligned}$$

The set \(U_{\delta }(z^*)\), often known as a likelihood-based confidence region (Vajda et al. 1989; Casella and Berger 2002), is intimately connected with the likelihood ratio test. Specifically, suppose we had a null hypothesis \({\textbf{H}}_0\) that data point \(z^*\) has true parameter \(\theta ^*\), and we wished to test the alternative hypothesis \({\textbf{H}}_1\) that \(z^*\)’s true parameter is something else. By definition, a likelihood ratio test would reject the null hypothesis when

$$\begin{aligned} \Lambda (\theta ^*,z^*):= \frac{\psi (\theta ^*,z^*)}{\psi ({\hat{\theta }}(z^*),z^*)} \le k^*, \end{aligned}$$

where \(k^*\) is a critical value, with the significance level \(\alpha \) equal to the probability \({\text {Pr}}(\Lambda (\theta ^*,z^*)\le k^* | {\textbf{H}}_0)\) of rejecting the null hypothesis when it is in fact true. The set of parameters such that the null hypothesis is not rejected at significance level \(\alpha \) is

$$\begin{aligned} \{ \theta ' \in \Theta \mid -\log \psi (\theta ',z^*)<-\log \psi ({\hat{\theta }}(z^*),z^*)-\log k^*\}, \end{aligned}$$

that is, \(U_\delta (z^*)\), where \(\delta =-\log \psi ({\hat{\theta }}(z^*),z^*)-\log k^*\).

Definition 8

(Dufresne et al. 2018, Definition 4.17) The model \(({\mathcal {M}},\phi _{t_1,\ldots ,t_s},\psi ,d_{\Theta })\) is practically identifiable for a data point \(z^*\in {{\,\mathrm{{\mathbb {R}}}\,}}^N\) at significance level \(\alpha \) if and only if the confidence region \(U_{\delta }(z^*)\) is bounded with respect to \(d_\Theta \), where

$$\begin{aligned} \delta =-\log \psi ({\hat{\theta }}(z^*),z^*)-\log k^* \end{aligned}$$

and

$$\begin{aligned} \alpha ={\text {Pr}}\left( \frac{\psi ({\hat{\theta }}(z^*),{\hat{z}})}{{\text {max}}_{\theta \in \Theta }\psi (\theta ,{\hat{z}})}<k^* \mid {\hat{z}} {\text { is data with true parameter }} {\hat{\theta }}(z^*)\right) . \end{aligned}$$
(14)

For our analysis, we make the common assumption that the measurement noise is additive Gaussian with covariance matrix equal to a multiple of the identity matrix. This assumption is implicit when computing the MLE via a least-squares fit. In our setup, we measure 3 substances at 7 time points with r replicates, so our assumption on the measurement noise means that the probability distribution of the data is given by

$$\begin{aligned} \psi (\theta ,z)=(2\pi \sigma ^2)^{\frac{-21r}{2}}e^{-\frac{1}{2\sigma ^2}\Vert z-\phi _{t_1,\ldots ,t_7}(\theta )\Vert _2^2}, \end{aligned}$$

where \(\sigma ^2 I_{21}\) is the covariance. It then follows that

$$\begin{aligned} \delta&=-\log \psi ({\hat{\theta }}(z^*),z^*)-\log k^*\\&=\frac{21r}{2}\log (2\pi \sigma ^2)+\frac{1}{2\sigma ^2}\Vert z^*-\phi _{t_1,\ldots ,t_7}({\hat{\theta }}(z^*))\Vert _2^2-\log k^*, \end{aligned}$$

and

$$\begin{aligned} -\log \psi (\theta ',z^*)=\frac{21r}{2}\log (2\pi \sigma ^2)+\frac{1}{2\sigma ^2}\Vert z^*-\phi _{t_1,\ldots ,t_7}(\theta ')\Vert _2^2. \end{aligned}$$

Therefore, we have that

$$\begin{aligned} U_\delta (z^*)&=\{\theta '\in \Theta \mid \Vert z^*-\phi _{t_1,\ldots ,t_7}(\theta ')\Vert _2^2 <\Vert z^*-\phi _{t_1,\ldots ,t_7}({\hat{\theta }}(z^*))\Vert _2^2-2\sigma ^2\log k^*\}\\&=\phi _{t_1,\ldots ,t_7}^{-1}({B}_{\rho }(z^*)), \end{aligned}$$

where \({B}_{\rho }(z^*)\) is the Euclidean open ball of radius \(\rho :=\sqrt{(\Vert z^*-\phi _{t_1,\ldots ,t_7}({\hat{\theta }}(z^*))\Vert _2^2-2\sigma ^2\log k^*)}\) around the data point \(z^*\). It follows that under our assumptions, determining whether the various models we study are practically identifiable corresponds to determining whether the preimages under the model prediction map of small open balls around data points are bounded in parameter space. The size of the balls will depend on the data point and the significance level \(\alpha \) (or equivalently the critical value \(k^*\)).
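To make the preceding criterion concrete, the following is a minimal Python sketch (not the authors' code) of the membership test \(\theta '\in U_\delta (z^*)\) under the additive Gaussian noise assumption; the function model_prediction, returning the stacked predictions \(\phi _{t_1,\ldots ,t_7}(\theta )\), and the precomputed MLE theta_hat are hypothetical placeholders for a numerical ODE solver and a least-squares fit.

import numpy as np

def in_confidence_region(theta_prime, z_star, theta_hat, sigma2, neg_log_k_star,
                         model_prediction):
    # Check whether theta_prime lies in U_delta(z*), i.e. whether its
    # prediction phi(theta_prime) falls inside the open ball B_rho(z*).
    residual_mle = z_star - model_prediction(theta_hat)
    # rho^2 = ||z* - phi(theta_hat)||^2 - 2 sigma^2 log k*
    rho_sq = np.dot(residual_mle, residual_mle) + 2.0 * sigma2 * neg_log_k_star
    residual = z_star - model_prediction(theta_prime)
    return np.dot(residual, residual) < rho_sq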

4.3 Algorithm for Testing Practical Identifiability

The Rational ERK model and the Full ERK model do not admit analytic solutions, hence we do not have access to an explicit model prediction map \(\phi _{t_1,\ldots ,t_s}\). Therefore, we must approximate \(\phi _{t_1,\ldots ,t_s}\), and thus also \(U_\delta \), using numerical methods and repeated sampling.

First, we assume that our measurements have been corrupted by Gaussian noise with mean 0 and variance \(\sigma ^2\), identical across measured quantities, time points and replicates, and independent across measurements.

As we have assumed that the measurement noise is additive Gaussian with covariance matrix equal to a multiple of the identity matrix, we can obtain an MLE, given some data \(z^*\), by solving a least-squares problem. This gives us \({\hat{\theta }}(z^*)\). We use this parameter to calculate the sample variance, assuming that the mean of each quantity is the model trajectory at each time point. This gives us an estimate of the variance \(\sigma ^2\).

Recall that \(\delta \) is defined to be \(-\log \psi ({\hat{\theta }}(z^*),z^*)-\log k^*\). The log-likelihood is easy to compute, as we already know \(z^*\) and \({\hat{\theta }}(z^*)\), and can estimate \(\phi _{t_1,\ldots , t_7}\) using a numerical solution to the ODE system. We use the following procedure to approximate \(-\log k^*\):

Algorithm 1 (approximation of \(-\log k^*\) by repeated sampling; displayed as a figure in the published version)

This follows the definition of \(k^*\) in Eq. (14): it approximates \(-\log k^*\) by repeatedly sampling likelihood ratios under our given noise assumptions and then taking a \((1-\alpha )\)-quantile (as \(-\log (\,\cdot \,)\) is a monotonically decreasing function).
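The following is a minimal Python sketch of this sampling procedure, reconstructed from the description above rather than taken from the authors' implementation; model_prediction and fit_mle are hypothetical placeholders for the numerical model prediction map and the least-squares MLE computation.

import numpy as np

def approximate_neg_log_kstar(theta_hat, sigma2, alpha, model_prediction,
                              fit_mle, n_samples=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mean = model_prediction(theta_hat)          # predicted data under the MLE
    neg_log_ratios = np.empty(n_samples)
    for i in range(n_samples):
        # simulate a data set z_hat whose true parameter is theta_hat
        z_hat = mean + rng.normal(scale=np.sqrt(sigma2), size=mean.shape)
        theta_refit = fit_mle(z_hat)            # MLE for the simulated data
        ss_null = np.sum((z_hat - model_prediction(theta_hat)) ** 2)
        ss_mle = np.sum((z_hat - model_prediction(theta_refit)) ** 2)
        # -log Lambda(theta_hat, z_hat) under additive Gaussian noise
        neg_log_ratios[i] = (ss_null - ss_mle) / (2.0 * sigma2)
    # -log(.) is decreasing, so the (1 - alpha)-quantile of -log Lambda
    # approximates -log k*
    return np.quantile(neg_log_ratios, 1.0 - alpha)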

Remark

In a situation where the number of replicates r is large, an approximate \(\delta \) can be computed from \(\alpha \) that depends primarily on the distance between the data point \(z^*\) and the predicted data point \(\phi _{t_1,\ldots ,t_7}({\hat{\theta }}({z^*}))\) corresponding to the MLE.

From the definition, we have \(\delta =-\log \psi ({\hat{\theta }}(z^*),z^*)-\log k^*\), meaning that \(k^*=e^{-\delta }/\psi ({\hat{\theta }}(z^*),z^*)\), and so we can describe \(\alpha \) in terms of \(\delta \) directly:

$$\begin{aligned} \alpha ={\text {Pr}}\left( \frac{\psi ({\hat{\theta }}(z^*),{\hat{z}})}{{\text {max}}_{\theta \in \Theta }\psi (\theta ,{\hat{z}})}<e^{-\delta }/\psi ({\hat{\theta }}(z^*),z^*) \mid {\hat{z}} {\text { is data with true parameter }} {\hat{\theta }}(z^*)\right) . \end{aligned}$$

This is equivalent to

$$\begin{aligned} \alpha ={\text {Pr}}\left( -\log \left( \frac{\psi ({\hat{\theta }}(z^*),{\hat{z}})}{{\text {max}}_{\theta \in \Theta }\psi (\theta ,{\hat{z}})}\right) >-\log \left( e^{-\delta }/\psi ({\hat{\theta }}(z^*),z^*)\right) \mid {\hat{z}} {\text { has true parameter }} {\hat{\theta }}(z^*)\right) , \end{aligned}$$

and so

$$\begin{aligned} \alpha&={\text {Pr}}\left( -\log \psi ({\hat{\theta }}(z^*),{\hat{z}})+\log {{\text {max}}_{\theta \in \Theta }\psi (\theta ,{\hat{z}})}>\delta \right. \\&\quad \left. +\log \psi ({\hat{\theta }}(z^*),z^*) \mid {\hat{z}} {\text { has true parameter }} {\hat{\theta }}(z^*)\right) . \end{aligned}$$

Note that for each value of \({\hat{z}}\), the MLE \({\hat{\theta }}({\hat{z}})\) maximises \(\psi (\theta ,{\hat{z}})\), so that \({\text {max}}_{\theta \in \Theta }\psi (\theta ,{\hat{z}})=\psi ({\hat{\theta }}({\hat{z}}),{\hat{z}})\). Multiplying both sides of the inequality by 2, it follows that

$$\begin{aligned} \alpha&={\text {Pr}}\left( 2(\log \psi ({\hat{\theta }}({\hat{z}}),{\hat{z}})-\log \psi ({\hat{\theta }}(z^*),{\hat{z}}))>2\delta \right. \\&\quad \left. +2\log \psi ({\hat{\theta }}(z^*),z^*) \mid {\hat{z}} {\text { has true parameter }} {\hat{\theta }}(z^*)\right) . \end{aligned}$$

Wilks’ theorem (Fan et al. 2000) implies that \(2(\log \psi ({\hat{\theta }}({\hat{z}}),{\hat{z}})-\log \psi ({\hat{\theta }}(z^*),{\hat{z}}))\) is asymptotically \(\chi ^2\)-distributed with three degrees of freedom. If F is the asymptotic cumulative distribution function of \(2(\log \psi ({\hat{\theta }}({\hat{z}}),{\hat{z}})-\log \psi ({\hat{\theta }}(z^*),{\hat{z}}))\), then \(\alpha \) is approximately equal to

$$\begin{aligned} \alpha&=1-{\text {Pr}}\left( 2(\log \psi ({\hat{\theta }}({\hat{z}}),{\hat{z}})-\log \psi ({\hat{\theta }}(z^*),{\hat{z}}))<2\delta \right. \\&\quad \left. +2\log \psi ({\hat{\theta }}(z^*),z^*) \mid {\hat{z}} {\text { has true parameter }} {\hat{\theta }}(z^*)\right) \\&\approx 1-F(2\delta +2\log \psi ({\hat{\theta }}(z^*),z^*)). \end{aligned}$$

Therefore, asymptotically we have that

$$\begin{aligned} \delta =F^{-1}(1-\alpha )/2-\log \psi ({\hat{\theta }}(z^*),z^*). \end{aligned}$$

Unfortunately, this is not applicable here, as the number of replicates is 5, 6 or 11, which are not large numbers. Indeed, the \(\delta \) obtained by applying Wilks’ theorem and the \(\delta \) obtained via Algorithm 1 are notably different. For example, for the wild-type data and the Linear ERK model, we approximate \(-\log k^*\) as 0.477, while Wilks’ theorem approximates it as 3.907.
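For reference, the Wilks approximation quoted above is a one-line computation: with three degrees of freedom and \(\alpha =0.05\), \(F^{-1}(1-\alpha )/2\) gives approximately 3.907.

from scipy.stats import chi2

alpha = 0.05
# F^{-1}(1 - alpha)/2 for the chi-squared distribution with 3 degrees of freedom
neg_log_kstar_wilks = chi2.ppf(1.0 - alpha, df=3) / 2.0   # ~3.907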

In order to demonstrate practical non-identifiability for the Full and Rational ERK models, we pick two parameters from each model for which the confidence regions, marginalised to these two parameters, illustrate the non-identifiability clearly. This choice of parameters is informed by first performing an (ill-posed) Bayesian parameter inference (see the next section). The procedure is described here for the Rational ERK model, but works similarly for the Full ERK model:

Algorithm 2 (approximation of the confidence area marginalised to two parameters; displayed as a figure in the published version)

While we do not know the values of \(\kappa _1\) and \(\gamma _1\), previous experimental work has provided bounds for \(\kappa _1\) and \(\gamma _1\), which we pass to the algorithm above. The list returned by the algorithm is a discrete approximation of the confidence area, marginalised to the pair of parameters \(\kappa _1\) and \(\gamma _1\). We plot these points for visual inspection, which can be seen in Fig. 2. The blue area reaching the upper and leftmost boundary of the plot indicates that the confidence region is very unlikely to be bounded and that this model is very unlikely to be practically identifiable.
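The sketch below illustrates one way to realise the marginalisation step, based on our reading of the description above rather than on the authors' published code: a pair \((\kappa _1,\gamma _1)\) belongs to the marginalised confidence area precisely when it extends, by some choice of the remaining parameters, to a point of \(U_\delta (z^*)\). The helper residual_sq, returning \(\Vert z^*-\phi _{t_1,\ldots ,t_7}(\theta )\Vert _2^2\) for a full parameter vector, is hypothetical.

import numpy as np
from scipy.optimize import minimize

def marginalised_confidence_area(z_star, rho_sq, bounds_k1, bounds_g1,
                                 residual_sq, other_init, n_grid=50):
    kept = []
    for k1 in np.linspace(*bounds_k1, n_grid):
        for g1 in np.linspace(*bounds_g1, n_grid):
            # optimise the remaining parameters with (kappa_1, gamma_1) fixed
            res = minimize(lambda rest: residual_sq(k1, g1, rest, z_star),
                           x0=other_init, method="Nelder-Mead")
            if res.fun < rho_sq:
                kept.append((k1, g1))
    return kept  # discrete approximation of the marginalised confidence area

# With the bounds of Fig. 2:
# marginalised_confidence_area(z_star, rho_sq, (0.0, 1000.0), (0.0, 1000.0),
#                              residual_sq, other_init)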

Fig. 2

(Color figure online) Marginalised confidence area following Algorithm 2 at significance level 0.05 for the Rational ERK model for the wild-type data point \(z^*\), with \(\kappa _1^{{\min }}=0\,(1/\min )\), \(\kappa _1^{{\max }}=1000\,(1/\min )\), \(\gamma _1^{{\min }}=0\,(1/\mu \hbox {M})\) and \(\gamma _1^{{\max }}=1000\,(1/\mu \hbox {M})\)

The source of this practical non-identifiability of the Full ERK model and the Rational ERK model is not completely clear. One possible source of non-identifiability could be the choice of time points. Indeed, as mentioned in Sect. 4.1, in both cases we do not know whether the time points are sufficiently generic. There are reasons to believe that not all of the practical non-identifiability can be explained by an insufficient number of time points. Indeed, as part of earlier work during the preparation of Yeung et al. (2019), additional time-point data were simulated for the Full ERK model, but the confidence regions still appeared unbounded. Another possible source of non-identifiability could be that, for the given experimental data, there is a valid quasi-steady-state approximation resulting in a lower-dimensional parameter space. At quasi-steady-state parameter values the reduction is exact, and so for these parameters the equivalence class of \(\sim _{t_1,\ldots ,t_7}\) is positive dimensional. Intuitively, since the solutions of the Full ERK model and the Rational ERK model are close to those of the Linear ERK model near quasi-steady-state parameter values, the confidence regions should contain the equivalence class of the nearby quasi-steady-state parameter value, which in this case is unbounded. This might be an example of a more widespread phenomenon.

4.4 The Practical Identifiability of the Linear ERK Model

We now consider the practical identifiability of the Linear ERK model. What distinguishes the Linear ERK model from the Full ERK model and the Rational ERK model is that an analytic solution to the ODE system is available and so we can construct an explicit model prediction map. The solution to the ODE system (10) with initial conditions \(S_0(0)=5\mu M\) and \(S_1(0)=S_2(0)=0\) is given by:

$$\begin{aligned} {S}_0(t)&=5e^{-\kappa _1t},\\ {S}_1(t)&={\left\{ \begin{array}{ll} 5\kappa _1(1-\pi )\,t\, e^{-\kappa _1t} &{} {\text {if }}\; \kappa _1=\kappa _2,\\ 5\kappa _1(1-\pi )(e^{-\kappa _2t}-e^{-\kappa _1t})/(\kappa _1-\kappa _2) &{} {\text {otherwise,}}\end{array}\right. }\\ {S}_2(t)&=5-{S}_0(t)-{S}_1(t). \end{aligned}$$
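This analytic solution translates directly into an explicit model prediction map; the following is a straightforward transcription of the formulas above (concentrations in \(\mu M\), time in minutes), and stacking its values at the seven measurement times yields \(\phi _{t_1,\ldots ,t_7}(\theta )\).

import numpy as np

def linear_erk_solution(t, kappa1, kappa2, pi):
    S0 = 5.0 * np.exp(-kappa1 * t)
    if np.isclose(kappa1, kappa2):
        S1 = 5.0 * kappa1 * (1.0 - pi) * t * np.exp(-kappa1 * t)
    else:
        S1 = (5.0 * kappa1 * (1.0 - pi)
              * (np.exp(-kappa2 * t) - np.exp(-kappa1 * t)) / (kappa1 - kappa2))
    S2 = 5.0 - S0 - S1
    return S0, S1, S2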

As we did for the Rational ERK model in Sect. 4.3, for a given data point \(z^*\), we obtain an MLE \({\hat{\theta }}(z^*)\) by solving a least-squares problem. We then use Algorithm 1 to approximate \(-\log k^*\), and then \(\delta \), using the explicit model prediction map we construct based on the analytic solutions. In Fig. 3 we plot the boundary of the confidence regions at significance level \(\alpha =0.05\) for the data points corresponding to the wild-type and each mutant. All five confidence regions are seen to be bounded, and we conclude that the model is practically identifiable for those data points.

Fig. 3

(Color figure online) Boundary of the confidence regions for wild-type and each mutant at significance level 0.05 for the Linear ERK model

5 Topological Data Analysis for Kinetic Parameter Inference

Since the Linear ERK model is practically identifiable, we now infer the parameters of this model using data from the wild-type and mutant experiments. First, we briefly review the Bayesian approach for inferring the parameters of the Linear ERK model, as already computed by Yeung et al. (2019). We then introduce topological data analysis (TDA) and present previous results that enable us to analyse the parameter samples drawn from the posteriors of the wild-type (WT) and the four mutants. Specifically, we exploit a theorem by Bobrowski et al. (2017) for hypothesis testing of topological distances in noisy settings. We implement their theoretical result and compare the topological distances between the WT and the mutants.

5.1 Bayesian Inference

Given experimental data and a mathematical model, we seek to infer parameters for which the model accurately fits the data. We choose to do this via Bayesian inference. Bayesian statistics captures, in the language of probability theory, how our belief in the true values of these parameters changes as we make observations (in this case, measurements). Most importantly, Bayesian inference does not infer a single value for each parameter, as a frequentist approach would; rather, it infers a probability distribution of parameter values expressing how strongly we believe a certain set of parameter values to be correct.

Formally, we are given a parameter space \(\Theta \) and observations x from some sample space \({\mathcal {X}}\). Combining the mathematical model with noise assumptions on available measurements, we obtain an expression for \(p(x|\theta )\), the likelihood of observing x assuming that the parameter of the model is \(\theta \in \Theta \). In addition, we need to specify a measure of belief in the parameter values before we observe any data, expressed through a probability density \(p(\theta )\), called the prior distribution. Theoretically, we want to inform a Bayesian inference only through observations. Consequently, we do not want to inform the inference by placing strong prior beliefs on certain parameter values. In practice, however, a trade-off between neutral prior beliefs (which should only account for substantive prior knowledge and possibly scientific conjectures), analytical convenience and computational tractability is commonplace (Gelman and Shalizi 2013, 11-12).

Having selected a mathematical model and a prior distribution, upon making observations \(x\in {\mathcal {X}}\) our belief in the parameter values is updated to

$$\begin{aligned} p(\theta |x)\propto p(x|\theta )\cdot p(\theta ) \end{aligned}$$

The probability density \(p(\theta |x)\) is called the posterior distribution. The proportionality in the above equation indicates that we have omitted a normalisation constant which is independent of \(\theta \). As one can approximately sample from \(p(\theta |x)\) without normalising, this constant is not needed for our application.

For the Linear ERK model (Eqs. (10)), the parameter is \(\theta =(\kappa _1, \kappa _2, \pi , \sigma )\in {{\,\mathrm{{\mathbb {R}}}\,}}^4=\Theta \). Here, the first three components are the parameters of the Linear ERK model, while \(\sigma \), the standard deviation of the measurement noise, must be inferred in order to construct a Bayesian model and is subsequently marginalised (i.e. integrated out). The observations are measurements of \(S_0\), \(S_1\) and \(S_2\). As measurements of each MEK type are taken from r replicates, at 7 different times, for 3 phosphorylation states of the substrate, we formally have \({\mathcal {X}}={{\,\mathrm{{\mathbb {R}}}\,}}^{r\cdot 3\cdot 7}={{\,\mathrm{{\mathbb {R}}}\,}}^{r\cdot 21}\). We have \(r=11\) for the wild-type, \(r=6\) for SSDD and \(r=5\) for all other variants.

To construct a statistical model on top of the mechanistic Linear ERK model, we set the prior distributions to

$$\begin{aligned} \kappa _1,\kappa _2 \sim {\textit{Unif}}(0\,(1/{\min }),10\,(1/{\min })),\quad \sigma \sim {\textit{Unif}}(0\,(\mu M),10\,(\mu M)), \end{aligned}$$

a uniform distribution over values we deem biologically feasible for these parameters (Yeung et al. 2019), and \(\pi \sim {\textit{Unif}}(0,1)\), as \(\pi \) can only take values within this range by definition.

Given samples \(S_0^*\), \(S_1^*\) and \(S_2^*\), we assume that

$$\begin{aligned} \left( S_0^*\right) _{t,i}\sim {\mathcal {N}}\left( S_0(\kappa _1,\kappa _2,\pi ,t),\sigma \right) ,\\ \left( S_1^*\right) _{t,i}\sim {\mathcal {N}}\left( S_1(\kappa _1,\kappa _2,\pi ,t),\sigma \right) ,\\ \left( S_2^*\right) _{t,i}\sim {\mathcal {N}}\left( S_2(\kappa _1,\kappa _2,\pi ,t),\sigma \right) , \end{aligned}$$

where t denotes the respective measurement time and i indexes the sample. Here, \(S_j(\kappa _1,\kappa _2,\pi ,t)\) is a solution to the ODE system at time t for parameters \(\kappa _1\), \(\kappa _2\) and \(\pi \). For the Linear ERK model, we can construct an analytic solution to the governing equations, but generally, a numerical solution suffices. Such ODE solutions give rise to an expression for the likelihood \(p(x\vert \theta )\).
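For illustration, the following is a minimal PyStan (version 2 interface) sketch of the statistical model just described; the Stan code, variable names and data layout are our own illustrative assumptions rather than the authors' original script, and the analytic solution is written only for the generic case \(\kappa _1\ne \kappa _2\). The uniform priors above are imposed implicitly through the declared parameter bounds.

import pystan

stan_code = """
data {
  int<lower=1> r;                  // number of replicates
  int<lower=1> n_t;                // number of time points (7 here)
  vector[n_t] ts;                  // measurement times (min)
  matrix[r, n_t] S0_obs;           // measured S0, S1, S2 (muM)
  matrix[r, n_t] S1_obs;
  matrix[r, n_t] S2_obs;
}
parameters {
  // the declared bounds impose the uniform priors described in the text
  real<lower=0, upper=10> kappa1;  // 1/min
  real<lower=0, upper=10> kappa2;  // 1/min
  real<lower=0, upper=1>  prc;     // processivity parameter pi
  real<lower=0, upper=10> sigma;   // muM
}
model {
  for (j in 1:n_t) {
    // analytic Linear ERK solution, generic case kappa1 != kappa2
    real S0 = 5 * exp(-kappa1 * ts[j]);
    real S1 = 5 * kappa1 * (1 - prc)
              * (exp(-kappa2 * ts[j]) - exp(-kappa1 * ts[j])) / (kappa1 - kappa2);
    real S2 = 5 - S0 - S1;
    for (i in 1:r) {
      S0_obs[i, j] ~ normal(S0, sigma);
      S1_obs[i, j] ~ normal(S1, sigma);
      S2_obs[i, j] ~ normal(S2, sigma);
    }
  }
}
"""

# Example usage for the wild-type (r = 11), with times and the S*_obs arrays
# taken from the experimental data:
# model = pystan.StanModel(model_code=stan_code)
# fit = model.sampling(data=dict(r=11, n_t=7, ts=times, S0_obs=S0_obs,
#                                S1_obs=S1_obs, S2_obs=S2_obs),
#                      iter=2000, chains=4)
# samples = fit.extract(permuted=True)   # draws from the posterior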

We note that in the above Bayesian model, some standard simplifying assumptions were made. First, in the given setup, negative measured values of \(S_0\), \(S_1\) and \(S_2\) are assigned strictly positive likelihood, even though concentrations cannot be negative. Second, we assume that \((S_0^*)_{t,i}\), \((S_1^*)_{t,i}\) and \((S_2^*)_{t,i}\) are independent random variables for all t and i and that they have the same standard deviation. Despite these assumptions, we obtained good fits to the data. In particular, performing an inference with three different standard deviation parameters \(\sigma _0\), \(\sigma _1\) and \(\sigma _2\) for \(S_0\), \(S_1\) and \(S_2\), respectively, did not significantly improve the fits to the data.

This Bayesian inference framework can also be applied to other ODE models describing the measurements, including the Rational ERK model (Eqs. (9)) and the Full ERK model (Eqs. (1)). In these cases, we employ numerical solutions and adapt priors to the larger parameter spaces.

We note that for the Full ERK model and the Rational ERK model, the choice of prior distributions significantly changes both the location and the prominence of the modes of the posterior distributions. In particular, the modes tend to lie near the endpoints of the prior supports. This is linked to the practical non-identifiability of these models; it prevents us from interpreting the parameter modes and from conducting a sensible topological comparison that is not highly dependent on the choice of prior distribution.

In order to compute posterior distributions of the involved parameters, we used PyStan, the Python interface to the statistical software STAN (Carpenter et al. 2017). While analytical expressions for the posterior distributions are too complex to interpret directly, PyStan enables us to sample approximately from them via Hamiltonian Monte Carlo. The resulting samples (visualised in Fig. 4) form the basis of our further analysis.

Fig. 4

(Color figure online) a Random samples from the posterior distributions for the WT and all mutants (2000 points each); b–d approximate marginal densities for \(\kappa _1\) (b), \(\kappa _2\) (c) and \(\pi \) (d), in the same colour scheme

5.2 Topological Analysis

To analyse the topology of the samples of the resulting posterior distributions, we introduce notation and methodology from Topological Data Analysis (TDA).

Definition 9

Let \({\textbf{v}}\) be a finite set of vertices. A subset of the power-set of \({\textbf{v}}\), \({{\,\mathrm{{\mathcal {K}}}\,}}\subseteq {\mathcal {P}}({\textbf{v}})\), is called a simplicial complex if for any \(\tau \in {{\,\mathrm{{\mathcal {K}}}\,}}\) the relation \(\tau '\subseteq \tau \) implies \(\tau '\in {{\,\mathrm{{\mathcal {K}}}\,}}\).

We write \({{\,\mathrm{{\mathcal {K}}}\,}}_i=\{\tau \in {{\,\mathrm{{\mathcal {K}}}\,}}\,\vert \,|\tau |=i+1\}\) and call the elements of \({{\,\mathrm{{\mathcal {K}}}\,}}_i\) the i-simplices. A map \(h:{\textbf{v}}\rightarrow {\textbf{v}}'\) which extends to a map \(h:{\mathcal {K}}\rightarrow {\mathcal {K}}'\) by \(h(\tau ):=\{h(v)\,\vert \, v\in \tau \}\) for each \(\tau \in {{\,\mathrm{{\mathcal {K}}}\,}}\) is called a simplicial map.

Fig. 5

(Color figure online) a An example of a simplicial complex on vertices \(v_0\), \(v_1\), \(v_2\) and \(v_3\) (left) and its geometrical realisation (right); b An example of a filtration of a simplicial complex, visualised on geometric realisations

We can view a simplicial complex as a combinatorial description of a topological space. Given a simplicial complex \({{\,\mathrm{{\mathcal {K}}}\,}}\), we can investigate its geometric realisation

$$\begin{aligned} \left| {{\,\mathrm{{\mathcal {K}}}\,}}\right| :=\bigcup _{\tau \in {{\,\mathrm{{\mathcal {K}}}\,}}}\textrm{cvx}(\tau )\subseteq {{\,\mathrm{{\mathbb {R}}}\,}}\langle {\textbf{v}}\rangle , \end{aligned}$$

where \(\textrm{cvx}\) denotes the convex hull in the real free vector space generated by the vertices \({\textbf{v}}\). The realisation \(|{{\,\mathrm{{\mathcal {K}}}\,}}|\) is endowed with the subspace topology in \({{\,\mathrm{{\mathbb {R}}}\,}}\langle {\textbf{v}}\rangle \). An example of a simplicial complex and its geometric realisation can be found in Fig. 5a. Since \({{\,\mathrm{{\mathcal {K}}}\,}}\) is a discrete and combinatorial entity, one can compute meaningful topological information about topological spaces (or datasets) described by simplicial complexes.

5.2.1 Homology

One topological invariant we can compute from simplicial complexes is homology. In each dimension k, the dimension of the k-th homology group can be thought of as the number of voids in a simplicial complex enclosed by a k-dimensional boundary. We restrict our definition of homology to coefficients in the field with two elements, \({{\,\mathrm{{\mathbb {F}}}\,}}_2\), which is the setting for our computations. For a simplicial complex, the homology groups coincide with those of its geometric realisation (viewed as a topological space).

Definition 10

Let \({{\,\mathrm{{\mathcal {K}}}\,}}\) be a simplicial complex. We define its chain complex \({\mathcal {C}}_\bullet ({{\,\mathrm{{\mathcal {K}}}\,}})\) over \({{\,\mathrm{{\mathbb {F}}}\,}}_2\) to be the collection of vector spaces \({\mathcal {C}}_i={{\,\mathrm{{\mathbb {F}}}\,}}_2\langle {{\,\mathrm{{\mathcal {K}}}\,}}_i\rangle \), together with the collection of linear maps \(\partial _i:{\mathcal {C}}_i\rightarrow {\mathcal {C}}_{i-1}\) induced by

$$\begin{aligned} \partial _i: \tau \mapsto \sum _{v\in \tau }\tau \backslash \{v\} \end{aligned}$$

for all \(\tau \in {{\,\mathrm{{\mathcal {K}}}\,}}_i\).

We observe that \(\partial _i\circ \partial _{i+1}=0\) for all i. Furthermore, we note that any simplicial map \(h:{{\,\mathrm{{\mathcal {K}}}\,}}\rightarrow {{\,\mathrm{{\mathcal {K}}}\,}}'\) induces a collection of maps on corresponding chain complexes \({\mathcal {C}}_\bullet \) and \({\mathcal {C}}'_\bullet \), denoted \(\{{\hat{h}}_i:{\mathcal {C}}_i\rightarrow {\mathcal {C}}'_i\}_i\), which are defined as

$$\begin{aligned} {\hat{h}}_i(\tau ):={\left\{ \begin{array}{ll}h(\tau ) &{} {\text {if }}\; \dim h(\tau )=\dim \tau \\ 0 &{} {\text {otherwise}}\end{array}\right. }. \end{aligned}$$

We call such a collection of maps a chain map from \({\mathcal {C}}_\bullet \) to \({\mathcal {C}}'_\bullet \). It satisfies \(\partial '_i\circ {\hat{h}}_i={\hat{h}}_{i-1}\circ \partial _i\) for all i.

Definition 11

Let \({{\,\mathrm{{\mathcal {K}}}\,}}\) be a simplicial complex and let \({\mathcal {C}}_\bullet ({{\,\mathrm{{\mathcal {K}}}\,}})\) be its associated chain complex over \({{\,\mathrm{{\mathbb {F}}}\,}}_2\). Then the k-th homology group of \({{\,\mathrm{{\mathcal {K}}}\,}}\) is defined to be the quotient of vector spaces

$$\begin{aligned} H_k({{\,\mathrm{{\mathcal {K}}}\,}}):=\frac{\ker \partial _k}{\textrm{im}\,\partial _{k+1}}. \end{aligned}$$

Note that for \({\hat{h}}:{\mathcal {C}}_\bullet ({{\,\mathrm{{\mathcal {K}}}\,}})\rightarrow {\mathcal {C}}_\bullet ({{\,\mathrm{{\mathcal {K}}}\,}}')\) the induced map \(h^*:H_k({{\,\mathrm{{\mathcal {K}}}\,}})\rightarrow H_k({{\,\mathrm{{\mathcal {K}}}\,}}')\) given by \(h^*:[c]\mapsto [{\hat{h}}_k(c)]\), where \(c\in \ker \partial _k\) and the brackets denote equivalence up to translation by \(\textrm{im}\,\partial _{k+1}\) and \(\textrm{im}\,\partial '_{k+1}\) respectively, is well defined for all k (Otter 2017). Moreover, for simplicial maps \(h:{{\,\mathrm{{\mathcal {K}}}\,}}\rightarrow {{\,\mathrm{{\mathcal {K}}}\,}}'\) and \(h':{{\,\mathrm{{\mathcal {K}}}\,}}'\rightarrow {{\,\mathrm{{\mathcal {K}}}\,}}''\) we have \((h'\circ h)^*= (h')^*\circ h^*\). This property is called the functoriality of homology and will be used when we introduce persistence.
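As a small worked example of Definitions 10 and 11, the following self-contained Python computation finds the \({{\,\mathrm{{\mathbb {F}}}\,}}_2\) Betti numbers (the dimensions of the homology groups) of the hollow triangle, i.e. the complex with vertices \(\{0\},\{1\},\{2\}\) and edges \(\{0,1\},\{0,2\},\{1,2\}\); all helper names are ours, and the expected output is \(b_0=b_1=1\), as for a circle.

import numpy as np
from itertools import combinations

def rank_gf2(M):
    # rank of a 0/1 matrix over F_2 via Gaussian elimination
    M = M.copy() % 2
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for col in range(cols):
        pivot = next((i for i in range(rank, rows) if M[i, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for i in range(rows):
            if i != rank and M[i, col]:
                M[i] = (M[i] + M[rank]) % 2
        rank += 1
    return rank

def boundary_matrix(simplices_k, simplices_km1):
    # matrix of the boundary map C_k -> C_{k-1} over F_2
    index = {s: i for i, s in enumerate(simplices_km1)}
    D = np.zeros((len(simplices_km1), len(simplices_k)), dtype=int)
    for j, s in enumerate(simplices_k):
        for face in combinations(s, len(s) - 1):
            D[index[face], j] = 1
    return D

vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]

r1 = rank_gf2(boundary_matrix(edges, vertices))   # rank of the boundary map d_1
b0 = len(vertices) - r1        # dim ker d_0 - rank d_1, with d_0 = 0
b1 = (len(edges) - r1) - 0     # dim ker d_1 - rank d_2 (no 2-simplices)
print(b0, b1)                  # 1 1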

5.2.2 Persistence

We view a point cloud as a discrete subset of a continuous geometric object embedded in Euclidean space, and this underlying continuous space is the primary subject of interest. In order to obtain information about this geometric object, we wish to inflate our discrete points to a continuous space, or to capture the relative offsets between points in this space. In practice, we usually do not know the adequate inflation resolution. Persistence theory offers an elegant way to overcome this difficulty by scaling the resolution from fine to coarse and tracking how the homology of the resulting spaces evolves via their canonical inclusions.

Definition 12

Let \({{\,\mathrm{{\mathcal {K}}}\,}}\) be a simplicial complex and let \(g:{{\,\mathrm{{\mathcal {K}}}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) be a function such that \(\tau \subseteq \tau '\) implies \(g(\tau )\le g(\tau ')\) for any \(\tau ,\tau '\in {\mathcal {K}}\). A filtration of the simplicial complex \({{\,\mathrm{{\mathcal {K}}}\,}}\) by g is then defined to be the sequence of simplicial complexes \(\{{{\,\mathrm{{\mathcal {K}}}\,}}_L\}_{L\in {{\,\mathrm{{\mathbb {R}}}\,}}}\), where

$$\begin{aligned} {{\,\mathrm{{\mathcal {K}}}\,}}_L:=\{\tau \in {{\,\mathrm{{\mathcal {K}}}\,}}\,\vert \,g(\tau )\le L\}, \end{aligned}$$

together with the canonical inclusions \(\iota _L^{L'}:{{\,\mathrm{{\mathcal {K}}}\,}}_L\hookrightarrow {{\,\mathrm{{\mathcal {K}}}\,}}_{L'}\) whenever \(L\le L'\). An example of a filtration is visualised in Fig. 5b. In the same spirit, let \({\mathcal {T}}\) be a topological space and \(g:{\mathcal {T}}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) be a continuous function. A filtration of the topological space \({\mathcal {T}}\) is then defined to be the sequence of topological spaces \(\{{\mathcal {T}}_L\}_{L\in {{\,\mathrm{{\mathbb {R}}}\,}}}\), where

$$\begin{aligned} {\mathcal {T}}_L:=\{x\in {\mathcal {T}}\,\vert \,g(x)\le L\}, \end{aligned}$$

together with the canonical inclusions \(\iota _L^{L'}:{\mathcal {T}}_L\hookrightarrow {\mathcal {T}}_{L'}\) whenever \(L\le L'\).

A common way of constructing a filtration from a point cloud \({\textbf{v}}\subset {{\,\mathrm{{\mathbb {R}}}\,}}^d\) is to set \({{\,\mathrm{{\mathcal {K}}}\,}}={\mathcal {P}}({\textbf{v}})\) and \(g(\tau )=\max \{d(x,y)\,\vert \,x,y\in \tau \}\). This is called the Vietoris–Rips filtration, and \({{\,\mathrm{{\mathcal {K}}}\,}}_L\) is a good approximation to an inflation of \({\textbf{v}}\) obtained by placing balls of radius L/2 at each point (Oudot 2015). We will consider the following alternative filtration. For a fixed \(L\in {{\,\mathrm{{\mathbb {R}}}\,}}\) and a map \(p:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\), we set \({{\,\mathrm{{\mathcal {K}}}\,}}':={{\,\mathrm{{\mathcal {K}}}\,}}_L\) in the Vietoris–Rips sense and consider the filtration by the map \(g':{{\,\mathrm{{\mathcal {K}}}\,}}'\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) defined by \(g'(\tau ):=\max \{p(x)\,\vert \,x\in \tau \}\).

Definition 13

Let \({{\,\mathrm{{\mathbb {F}}}\,}}_2[t]\) be the ring of polynomials in the indeterminate t with coefficients in \({{\,\mathrm{{\mathbb {F}}}\,}}_2\). Let \(\{{{\,\mathrm{{\mathcal {K}}}\,}}_L\}_{L\in {{\,\mathrm{{\mathbb {R}}}\,}}}\) be a filtration of a simplicial complex. Moreover, define \(\textrm{Crit}:=\{L\in {{\,\mathrm{{\mathbb {R}}}\,}}\vert \iota ^L_{L-\varepsilon }\ne \textrm{id}\,\forall \varepsilon >0\}\), the set of all L at which \({{\,\mathrm{{\mathcal {K}}}\,}}_L\) changes (which is a finite set, as \({{\,\mathrm{{\mathcal {K}}}\,}}\) is finite). Define the function \(c:{\mathbb {N}}_0\rightarrow \textrm{Crit}\cup \{\inf \textrm{Crit}-1\}\) by mapping 0 to \(\inf \textrm{Crit}-1\) and \(n>0\) to the n-th smallest element of \(\textrm{Crit}\) (by convention, we map integers bigger than the cardinality of \(\textrm{Crit}\) to the largest element of \(\textrm{Crit}\)).

For a fixed integer k, let \(H_k(\,\cdot \,)\) denote the k-th simplicial homology with coefficients in \({{\,\mathrm{{\mathbb {F}}}\,}}_2\). Define

$$\begin{aligned} M_k:=\bigoplus _{n\in {{\,\mathrm{{\mathbb {N}}}\,}}_0}H_k\left( {{\,\mathrm{{\mathcal {K}}}\,}}_{c(n)}\right) \end{aligned}$$
(15)

together with the action of \({{\,\mathrm{{\mathbb {F}}}\,}}_2[t]\) on \(M_k\) induced by \(t^a\cdot x = \left( \iota _{c(n)}^{c(n+a)}\right) ^*(x)\in H_k({{\,\mathrm{{\mathcal {K}}}\,}}_{c(n+a)})\) for \(x\in H_k({{\,\mathrm{{\mathcal {K}}}\,}}_{c(n)})\) and a non-negative integer a. Then \(M_k\) is a (graded) \({{\,\mathrm{{\mathbb {F}}}\,}}_2[t]\)-module, called the persistence module of the filtration.

The definition works analogously for a filtration of a topological space (assuming that the homology of the spaces changes at only finitely many filtration values). It can be shown that the operation of taking a persistence module of a filtration of a simplicial complex (or a topological space) is functorial. Hence, persistence modules are algebraic invariants of filtrations.

Since \({{\,\mathrm{{\mathcal {K}}}\,}}\) is finite, the persistence module \(M_k\) is finitely generated as an \({{\,\mathrm{{\mathbb {F}}}\,}}_2[t]\)-module. As \({{\,\mathrm{{\mathbb {F}}}\,}}_2[t]\) is a principal ideal domain, \(M_k\) decomposes into summands each generated by a single element, uniquely up to (graded) isomorphism and permutation of summands. Hence, we can write

$$\begin{aligned} M_k\cong \left( \bigoplus _{a\in G_F}{{\,\mathrm{{\mathbb {F}}}\,}}_2[t]\right) \oplus \left( \bigoplus _{b\in G_T}{{\,\mathrm{{\mathbb {F}}}\,}}_2[t]/\langle t^{d_b}\rangle \right) , \end{aligned}$$

where \(G_F\) is the subset of chosen generators that are free and \(G_T\) is the subset of generators that are torsion. In particular, any element in \(G_F\) or \(G_T\) will have a non-zero entry in exactly one summand of the decomposition in Eq. (15). We call the integer n indexing this entry the degree of that element.

Definition 14

Let \(M_k\) be a persistence module that decomposes as above. Let \(\textrm{deg}:G_F\cup G_T\rightarrow {{\,\mathrm{{\mathbb {N}}}\,}}_0\) be the function mapping each element to its degree. The barcode of \(M_k\) is defined to be the multiset

$$\begin{aligned} {\mathcal {B}}:=\{(c(\textrm{deg}(a)),\infty )\,\vert \,a\in G_F\}\cup \{(c(\textrm{deg}(a)),c(\textrm{deg}(a)+d_a))\,\vert \,a\in G_T\}. \end{aligned}$$

We call the elements of \({\mathcal {B}}\) bars, the first coordinate of each bar its birth-value, the latter coordinate its death-value and the absolute difference of the coordinates its persistence.

A matching of barcodes \({\mathcal {B}}\) and \({\mathcal {B}}'\) is a partial injection \(\varpi :{\mathcal {B}}\hookrightarrow {\mathcal {B}}'\). The bottleneck distance between \({\mathcal {B}}\) and \({\mathcal {B}}'\) is defined to be

$$\begin{aligned} d_{BD}\left( {\mathcal {B}},{\mathcal {B}}'\right) :=\inf _\varpi \,\max \left\{ \max _{a\in \textrm{dom}\,\varpi }\left\| a-\varpi (a)\right\| _\infty ,\max _{(x,y)\not \in \textrm{dom}\,\varpi }\frac{y-x}{ 2},\max _{(x,y)\not \in \textrm{im}\,\varpi }\frac{y-x}{ 2}\right\} , \end{aligned}$$

where the infimum is taken over all possible matchings and elements of a barcode are viewed as elements of \({{\,\mathrm{{\mathbb {R}}}\,}}^2\) (we assume \(\infty -\infty =0\)). Here, \(\textrm{dom}\, \varpi \) is the domain of \(\varpi \), i.e. the set of inputs at which \(\varpi \) is defined.

The bottleneck distance defines a metric on the space of barcodes (Oudot 2015). This metric is stable in the following sense:

Theorem 15

(e.g. Corollary 3.6 in Oudot 2015) Let \({{\,\mathrm{{\mathcal {K}}}\,}}\) be a simplicial complex and let \(g,g':{{\,\mathrm{{\mathcal {K}}}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) be functions defining filtrations of \({{\,\mathrm{{\mathcal {K}}}\,}}\), and subsequently persistence modules \(M_k\) and \(M'_k\), and barcodes \({\mathcal {B}}\) and \({\mathcal {B}}'\). Then

$$\begin{aligned} d_{BD}\left( {\mathcal {B}},{\mathcal {B}}'\right) \le \left\| g-g'\right\| _\infty . \end{aligned}$$

Henceforth, we write \(\textrm{PH}_k(g)\) to denote the k-dimensional persistent homology (which can equivalently be summarised by a barcode or a persistence module) of a simplicial complex or a topological space filtered by a function g.

5.2.3 Persistent Homology of Random Data

In this section, we study the persistent homology of the posterior distributions of the parameter inferences of Sect. 5.1. Note that simplicial complexes, filtrations and persistent homology can also be employed to compare biological models a priori (i.e. with no dependence on measurement data) (Vittadello and Stumpf 2020).

We demonstrate that filtering a Vietoris–Rips complex for a fixed value L by a function \(g'\), as described in Sect. 5.2.2, yields more discriminative power. Here, we pick \(g'\) to be an estimated probability density function. These filtrations turn out to be highly discriminative between the mutants and offer novel insight at the biological level. While a Vietoris–Rips filtration is based entirely on distances, the construction we employ, using a Vietoris–Rips complex at a fixed parameter value and then filtering it by a probability density function (pdf), places an emphasis on density. The information encoded is directly related to the probability distribution, and the resulting barcodes will stabilise as the sample size increases (Theorem 3.5.1 in Rabadan and Blumberg 2020). Furthermore, the chosen construction is stable with respect to outliers. By contrast, in a Vietoris–Rips filtration, the bars in the resulting barcodes converge towards zero length as the sample size increases, and a single outlier, even in a large sample, can change a barcode drastically.

Initially, we assume that we are given a probability density function \(p:{{\,\mathrm{{\mathbb {R}}}\,}}^m\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\). This pdf defines a filtration \({\mathcal {T}}\) of \({{\,\mathrm{{\mathbb {R}}}\,}}^m\) by the sub-level sets of \(-p\) (equivalently, the super-level sets of p), via \({\mathcal {T}}_L=\left\{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^m\,\vert \, -p(x)\le L\right\} \). For \(L'\le L\) we then have \({\mathcal {T}}_{L'}\subseteq {\mathcal {T}}_{L}\). Such a filtration is visualised for the case \(m=1\) in Fig. 6. By analogy with filtrations of simplicial complexes, we can theoretically compute a barcode for each such topological filtration and investigate the resulting bottleneck distances.

For each (homological) dimension, these barcodes provide a topological signature of a posterior distribution. We point out that although this signature is not a sufficient statistic, it is effective at distinguishing between posteriors corresponding to distinct mutants in our application. In particular, for any pdf \(p_1:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\), the pdf \(p_2(x)=p_1(x-x_0)\) gives rise to the same topological signature for any constant \(x_0\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\). Thus, rather than comparing the location of probability density in parameter space, in the context of a Bayesian inference, this topological signature captures the quality of the certainty we have in parameter values, irrespective of their location.

For example, bars in the \(H_0\)-barcode encode the density (as negative of the birth-value) and the prominence (as the persistence) of the modes of a pdf. Similarly, Morse Theory tells us that for a (smooth) pdf on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\), the \((d-1)\)th barcode captures local minima by their density (as death-value) and the depth of their basin of attraction (as persistence).

In order to conduct such a topological analysis, two questions must be addressed:

  1. How can we approximate the topology of a graph of a probability density combinatorially (i.e. in a manner amenable to the application of discrete computational methods) if only point samples are available?

  2. Can we test the statistical significance of the resulting bottleneck distances?

To resolve the first question, we will employ a result from Bobrowski et al. (2017) that relies on the concept of kernel density estimation (KDE). In order to test the significance of the resulting bottleneck distance, we will use an empirical p-value estimate.

Definition 16

Let \({\textbf{v}}=\{v_1,\dots ,v_N\}\subseteq {{\,\mathrm{{\mathbb {R}}}\,}}^m\) be a set of N samples drawn independently from a probability distribution governed by the density function \(p:{{\,\mathrm{{\mathbb {R}}}\,}}^m\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\). Let \(K:{{\,\mathrm{{\mathbb {R}}}\,}}^m\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) be a smooth, unimodal, symmetric probability density function whose support is contained in the unit ball centred at 0. Then

$$\begin{aligned} {\hat{p}}_b(x)=\frac{1}{ Nb^m}\sum _{i=1}^NK\left( \frac{x-v_i }{ b}\right) \end{aligned}$$

is called a kernel density estimate (KDE) of p with bandwidth b.

On each sample \(v_i\) we place a copy of the kernel and average these copies; the bandwidth b controls the width of each copy, that is, how tightly the probability mass is concentrated around \(v_i\). Loosely speaking, if b is too large, then the resulting estimate underfits a histogram of the data, while if it is too small, then the estimate overfits the histogram (see Fig. 7). The optimal bandwidth decreases as the sample size increases, and there are standardised ways of picking an optimal bandwidth when p is unknown (Henderson and Parmeter 2012).
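One such standardised recipe, sketched below under the assumption that sklearn is used for the density estimation (as later in this section), is to choose the bandwidth maximising the cross-validated log-likelihood of the kernel density estimate; the candidate grid and the number of folds are arbitrary illustrative choices.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def select_bandwidth(samples, candidates=np.linspace(0.05, 2.0, 40)):
    # samples: (N, m) array of draws from the unknown density p
    search = GridSearchCV(KernelDensity(kernel="epanechnikov"),
                          {"bandwidth": candidates}, cv=5)
    search.fit(samples)
    return search.best_params_["bandwidth"]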

Fig. 6

(Color figure online) An example of a super-level-set filtration of the graph of a density function \(p:{{\,\mathrm{{\mathbb {R}}}\,}}^1\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\). This is equivalent to a sub-level-set filtration of \(-p\)

Fig. 7

(Color figure online) A probability density function on \({{\,\mathrm{{\mathbb {R}}}\,}}^1\) (black line) with 1000 samples (black dashes). We show kernel density estimates with bandwidths 0.6 (blue line) and 1.4 (green line). The ideal bandwidth is approximately 1

Given such an i.i.d. sample \({\textbf{v}}=\{v_1,\dots ,v_N\}\subseteq {{\,\mathrm{{\mathbb {R}}}\,}}^m\) from our probability density function p and an optimal bandwidth b, we can construct a Vietoris–Rips complex with fixed parameter b (equalling the bandwidth)

$$\begin{aligned} \textrm{VR}_b({\textbf{v}}):=\left\{ \{v_0,\dots ,v_k\}\subseteq {\textbf{v}}\,\vert \,\Vert v_i-v_j\Vert \le b\,\forall i,j\right\} . \end{aligned}$$

For the sake of brevity, let \({{\,\mathrm{{\mathcal {K}}}\,}}={\text {VR}}_b({\textbf{v}})\). The KDE \({\hat{p}}_b\) of p based on \({\textbf{v}}\) then extends to a function on \({\mathcal {K}}\) via

$$\begin{aligned} {\hat{p}}_b(\{v_0,\dots ,v_k\}):=\min \left\{ {\hat{p}}_b(v_0),\dots ,{\hat{p}}_b(v_k)\right\} . \end{aligned}$$

In turn, the extended function \({\hat{p}}_b\) defines a filtration \(\{{{\,\mathrm{{\mathcal {K}}}\,}}_L\}_{L\in {{\,\mathrm{{\mathbb {R}}}\,}}}\) of \({{\,\mathrm{{\mathcal {K}}}\,}}\) by

$$\begin{aligned} {{\,\mathrm{{\mathcal {K}}}\,}}_L:=\left\{ \{v_0,\dots ,v_k\}\,\vert \,-{\hat{p}}_b(\{v_0,\dots ,v_k\})\le L\right\} . \end{aligned}$$

We seek to relate the persistent homology of the filtration of simplicial complexes \({{\,\mathrm{{\mathcal {K}}}\,}}_L\) to the persistent homology of the filtration of topological spaces \({\mathcal {T}}_L\).

In order to use results from Bobrowski et al. (2017), we introduce some notation. For a function \(f:{{\,\mathrm{{\mathcal {K}}}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) and \(\eta >0\) define \(f_{\lfloor \eta \rfloor }(\sigma ):=2\eta \lfloor f(\sigma )/(2\eta )\rfloor \). Then

Theorem 17

(Theorem 3.7 in Bobrowski et al. 2017) Let \(p:{{\,\mathrm{{\mathbb {R}}}\,}}^m\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) be a smooth bounded pdf with finitely many critical points. Let \({\hat{p}}\) be a KDE with bandwidth b based on N i.i.d. samples of p and let \({{\,\mathrm{{\mathcal {K}}}\,}}\) be a simplicial complex as above. Assume \(b\rightarrow 0\) and \(Nb^m\rightarrow \infty \). Then for any \(0\le k \le m\), we have

$$\begin{aligned} \textrm{Pr}\left( d_{BD}\left( \textrm{PH}_k\left( p\right) , \textrm{PH}_k\left( {\hat{p}}_{\lfloor \eta \rfloor }\right) \right) \le 5\eta \right) \ge 1-3\eta ^* Ne^{-C_\eta Nb^m}, \end{aligned}$$

where for \(p_{\max }:=\sup _{x\in {{\,\mathrm{{\mathbb {R}}}\,}}^m}p(x)\) we define

$$\begin{aligned} \eta ^*:=\left\lceil p_{\max }/2\eta \right\rceil \quad {\text {and}}\quad C_\eta :=\frac{(\eta /2)^2}{ 3p_{\max }+\eta /2}. \end{aligned}$$

Theoretically, the above theorem can be exploited for testing the null hypothesis \({\textbf{H}}_0:\textrm{PH}_k\left( p\right) =\textrm{PH}_k\left( p'\right) \) for two distributions P and \(P'\) with associated densities p and \(p'\), as the result enables us to establish a bound on how large a bottleneck distance can be explained by sampling noise at a given significance level. However, we estimate that to use this theorem for showing that the bottleneck distances between posterior distributions associated with the wild-type and the four mutants are significant, we must sample at least \(1.5\times 10^7\) points per distribution. This makes persistent homology computation infeasible.

At the same time, we observe that there is little change in the bottleneck distances between the barcodes resulting from the wild-type’s and the four mutants’ posterior distributions when resampling point clouds containing as few as 200 points. This leads us to think that the true p-value associated with the null hypothesis \({\textbf{H}}_0:\textrm{PH}_k\left( p\right) =\textrm{PH}_k\left( p'\right) \), where p and \(p'\) are posterior densities corresponding to the wild-type and a mutant is possibly much lower than the upper bound derived by appealing to Theorem 17. One factor that may explain this discrepancy is that while our distributions are technically distributions on \({{\,\mathrm{{\mathbb {R}}}\,}}^3\), they have compact support. Similarly, major sources of instability for KDE, and subsequently for the filtration of density functions, are modes linked to outliers, while repeated simulations suggest that in our case all density functions are unimodal. Together, these aspects imply that the computed barcodes could converge to the barcode obtained by filtering the unknown density function at a faster rate than in the general setting of Theorem 17.

Henceforth, we use the method of constructing a filtration from a point cloud proposed in Bobrowski et al. (2017), which is provably well behaved asymptotically, but we estimate significance differently. To do this, we opt for a Monte Carlo p-value estimate, also known as the empirical p-value (see, e.g., Davison and Hinkley 1997). For each mutant, we sample \(\beta \) additional point clouds of size n from the posterior distribution. For the first mutant (or the wild-type) under investigation, we call the original point cloud \({\textbf{v}}\) and let \({\textbf{v}}_i\), \(i=1,\ldots ,\beta \), denote the \(\beta \) additional point clouds of size n obtained by repeated sampling. Define \({\textbf{v}}'\) and \({\textbf{v}}'_i\) analogously for a distinct mutant. Let \(d_i=d_{BD}\left( \textrm{PH}_k\left( {\hat{p}}\right) ,\textrm{PH}_k\left( {\hat{p}}_i\right) \right) \), where \({\hat{p}}_i\) is the density estimate obtained from \({\textbf{v}}_i\), and define \(d_i'\) analogously. Suppose \(d=d_{BD}\left( \textrm{PH}_k\left( {\hat{p}}\right) ,\textrm{PH}_k\left( {\hat{p}}'\right) \right) \) is the j-th largest element in the multiset \(\left\{ d_i\right\} _{i=1}^\beta \cup \{d\}\) and the \(j'\)-th largest element in \(\left\{ d'_i\right\} _{i=1}^\beta \cup \{d\}\); then

$$\begin{aligned} {\hat{\pi }}=\min \left\{ \frac{\beta +1-j}{ \beta +1},\frac{\beta +1-j' }{ \beta +1}\right\} \end{aligned}$$

is a p-value estimate for the hypothesis test \({\textbf{H}}_0:\textrm{PH}_1\left( p\right) =\textrm{PH}_1\left( p'\right) \). The resulting p-value estimates, for each pair of mutants and the wild-type, can be found in Table 1. It is likely that these p-value estimates overestimate the actual values, but they allow us to reject all null hypotheses at a significance level of 0.05 (North et al. 2002).
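The estimate \({\hat{\pi }}\) can be computed directly from the resampled distances; the snippet below is a direct transcription of the formula above, using the ranking convention stated there (ties ignored).

import numpy as np

def empirical_p_value(d, d_i, d_i_prime):
    # d: observed bottleneck distance between the two groups;
    # d_i, d_i_prime: the beta resampled distances within each group
    beta = len(d_i)
    j = 1 + int(np.sum(np.asarray(d_i) > d))             # rank of d, 1 = largest
    j_prime = 1 + int(np.sum(np.asarray(d_i_prime) > d))
    return min((beta + 1 - j) / (beta + 1),
               (beta + 1 - j_prime) / (beta + 1))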

Table 1 Bottleneck distances between the \(H_1\) barcodes obtained from the super-level-set filtrations of the KDEs (top) and their respective p-value estimates (bottom)

The results of the topological data analysis (Table 1) quantify the differences between the Linear ERK model parameter posteriors for the WT and the mutants, and show that the SSDD mutant kinetics are the most different from the WT and the other mutants. This biological result raises the question of the suitability of using the SSDD variant as a replacement for wild-type MEK activated by Raf; we suggest this should be investigated with further experimental studies. The previous work by Yeung et al. (2019) found that \(\pi \), the processivity parameter, of E203K differed the most from WT MEK. Here we extend and complement their analysis by comparing the three parameters together as a point cloud.

It remains to address the practical computability of all the constructions involved. As mentioned in the previous section, we use the statistical software STAN (in particular, PyStan) to sample from the posterior distributions. The sampling is approximate, via Hamiltonian Monte Carlo (Carpenter et al. 2017), but we can verify via output summaries and trace plots of the Markov chains involved that all chains converged close to their stationary distributions during the warm-up phase.

In order to construct the KDE, we used the KernelDensity class of the Python package sklearn. We used the Epanechnikov kernel, which satisfies the hypotheses on the kernel in Theorem 17. As an initial guess for the bandwidth, this package uses a rule of thumb proportional to Silverman’s method, which we then cross-validate and plot against a histogram of our samples for each marginal distribution. Given experimental data, we construct a Vietoris–Rips complex with radius b, equal to the bandwidth from the KDE, using the Python package dionysus (version 2). We compute the resulting bottleneck distances using the package persim.
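A sketch of this pipeline, assuming the packages and interfaces named above (sklearn, dionysus version 2 and persim), is given below; the function names, the maximal simplex dimension and the conversion of the diagrams to arrays are our own illustrative choices.

import numpy as np
import dionysus as d
import persim
from sklearn.neighbors import KernelDensity

def density_filtration_diagram(points, bandwidth, max_dim=2, hom_dim=1):
    # points: (N, m) array sampled from a posterior distribution
    kde = KernelDensity(kernel="epanechnikov", bandwidth=bandwidth).fit(points)
    dens = np.exp(kde.score_samples(points))   # estimated density at each sample
    # Vietoris-Rips complex at the fixed radius b equal to the bandwidth
    rips = d.fill_rips(points.astype(np.float32), max_dim, bandwidth)
    # refilter each simplex by the maximum of -density over its vertices
    filt = d.Filtration()
    for s in rips:
        filt.append(d.Simplex(list(s), max(-dens[v] for v in s)))
    filt.sort()
    pers = d.homology_persistence(filt)
    dgms = d.init_diagrams(pers, filt)
    return np.array([[pt.birth, pt.death] for pt in dgms[hom_dim]])

# Comparison of two posterior samples wt and mutant, with a common bandwidth b:
# dist = persim.bottleneck(density_filtration_diagram(wt, b),
#                          density_filtration_diagram(mutant, b))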

6 Conclusion

We presented an exhaustive mathematical analysis that supports the three main findings presented in Yeung et al. (2019): model reduction, analysis of the model parameters and comparison of mutant kinetics. Yeung et al. observed that certain values of parameter combinations from the Full ERK model fit the data, which in turn motivated the creation of a reduced model, the Linear ERK model. We confirmed the derivation of the Linear ERK model using algebraic QSS and the validity of the QSS approximation using the QSS variety. We performed systematic identifiability analyses on all three models, which is a prerequisite for meaningful parameter estimation. We found that the Full, Rational and Linear ERK models are structurally identifiable. We then improved a previous definition of practical identifiability and showed that the Linear ERK model is practically identifiable but that the Rational and Full ERK models are not, which is consistent with Yeung et al. (2019). Hitherto, testing structural identifiability has been limited to small models due to computational costs; however, recent work significantly improves the computation of structural identifiability, enabling the analysis of larger models (Dong et al. 2021; Villaverde et al. 2019). We remark that there are many realistic models, such as this ERK study or those by the group of Marisa Eisenberg, that benefit from existing methods and motivate the development of new identifiability tools.

We reproduced the parameter inference for the wild-type and mutant MEK experiments. While Yeung et al. visually inspected samples of the posteriors, here we quantified these point clouds with computational algebraic topology. In future, it would be interesting to further explore the relationship between topological analysis and practical identifiability, and how they may be used to inform experimental design (Apgar et al. 2010; Hagen et al. 2013). Throughout, we showcase the potential role of algebra, geometry and topology in systems and synthetic biology. Complementary to the analysis here is the inference of models in systems and single-cell biology that relies on algebra and topology (Wang et al. 2019; Vittadello and Stumpf 2021; Rizvi 2017). We believe that topological data analysis in combination with modelling and parameter estimation is a promising area for the sciences (Thorne et al. 2022; Carriere et al. 2018; Suzuki 2021). We hope our analysis of this ERK case study will motivate other systems biologists to partner with algebraists and topologists to analyse dynamical systems together with their experimental setup and data.