Learning system parameters from Turing patterns

The Turing mechanism describes the emergence of spatial patterns due to spontaneous symmetry breaking in reaction–diffusion processes and underlies many developmental processes. Identifying Turing mechanisms in biological systems is a challenging problem. This paper introduces an approach to the prediction of Turing parameter values from observed Turing patterns. The parameter values correspond to a parametrised system of reaction–diffusion equations that generate Turing patterns as steady states. The Gierer–Meinhardt model with four parameters is chosen as a case study. A novel invariant pattern representation based on resistance distance histograms is employed, along with Wasserstein kernels, in order to cope with the highly variable arrangement of local pattern structure, which depends on the initial conditions that are assumed to be unknown. This enables us to compute physically plausible distances between patterns, to compute clusters of patterns and, above all, to predict model parameters based on training data that can be generated by numerical model evaluation with random initial data. For small training sets, classical state-of-the-art methods including operator-valued kernels outperform neural networks that are applied to raw pattern data, whereas for large training sets the latter are more accurate. A prominent property of our approach is that only a single pattern is required as input data for model parameter prediction. Excellent predictions are obtained for single parameter values, and reasonably accurate results for jointly predicting all four parameter values.


Introduction

Motivation and Overview
Reaction-diffusion models of the form of Eq. (2.2) are used to describe the dynamic behaviour of interacting and diffusing particles in various disciplines, including biochemical processes [Mur01], ecology [HLBV94], epidemiology [Mar15] and tumor growth [GG96]. Here, we are interested in systems where the interaction of particles can give rise to spontaneous symmetry breaking of a homogeneous system by means of the so-called Turing mechanism, first described by Alan Turing in 1952 [Tur52]. It describes the scenario where a stable steady state of a non-spatial system of ordinary differential equations becomes unstable due to diffusion [Mur82, Per15]. This phenomenon is hence also referred to as diffusion-driven instability. Such instabilities typically give rise to stable non-homogeneous spatial patterns. In two spatial dimensions, for example, these patterns can take various forms such as spots, stripes or labyrinths [Mur01]. This variety of patterns is generated by a reaction-diffusion model that is parametrised by a few parameters representing physical quantities of the system, such as reaction and diffusion rate constants. Certain versions of the well-known Gierer-Meinhardt model, for example, comprise four effective parameters (cf. Section 2.4) [GM72, Mur01].
It was not until almost four decades after Alan Turing's seminal work that the first experimental observation of a Turing pattern was realised, in a chemically engineered system [CDBDK90]. Recently, as a first practical application, a chemical Turing system was engineered to manufacture a porous filter that can be used in water purification [TCP+18]. Turing patterns are regarded as the main driving mechanism in the formation of spatial structures in various biological systems, including the patterning of palatal ridges and digits, hair follicle distribution, feather formation, and patterns on the skins of animals such as fish and zebras [EOP+12, RMRS14, SRTS06, JFWJ+98, NTKK09]. For a recent survey, see [LJDM20]. However, biological and mathematical complexity has often prevented identification of the precise molecular mechanisms and parameter values underlying biological systems. One difficulty in fitting models to experimental data stems from the high sensitivity of the arrangement of local structure to the initial conditions, which are usually unknown in practice.
In this paper, we focus on the inverse problem: given a non-homogeneous spatial pattern and a reaction-diffusion model, the task is to predict the parameter values that generate the pattern as a steady state of the reaction-diffusion equation. To this end, we introduce a novel pattern representation in terms of resistance distance histograms that effectively captures the spatial structure of patterns, irrespective of the local variability stemming from different initial conditions. This enables us to compute almost invariant distances between patterns, to compute clusters of patterns and, above all, to predict model parameter values.
Specifically, we focus on the Gierer-Meinhardt model as a case study and apply and compare state-of-the-art machine learning techniques and neural network architectures for prediction. The former approaches comprise nonlocal image features for the invariant representation of spatial patterns and kernel-based methods for parameter prediction. The neural networks are either applied to the aforementioned nonlocal image features, or are applied directly to the raw pattern data and perform feature extraction and parameter prediction simultaneously. All predictors are trained using various parameter values that generate diffusion-driven instabilities, together with the corresponding spatial patterns that result from numerically solving the reaction-diffusion equations.

Related Work
The problem studied in this paper, parameter prediction from observed Turing patterns, has been studied in the literature from various angles. We distinguish three categories and briefly discuss a few relevant papers.
Turing parameter estimation by optimal control. The work [GMT10] presents an approach for estimating parameter values by fitting the solution of the reaction-diffusion equation to a given spatial pattern. This gives rise to a PDE-constrained optimal control problem requiring sophisticated numerics; see also [SPM16]. A similar approach is studied in [SLB19].
The authors of [GMT10] demonstrate that the proposed control problem is solvable, which indicates that the task studied in our work, i.e. learning the pattern-to-parameter mapping directly, is not unrealistic. The approach of [GMT10] has been generalized by [GT14] in order to also handle non-constant, spatially distributed parameters. In our work, we only consider constant parameter values.
A strong property of approaches employing PDE-based control is that they effectively cope with noisy observed patterns, provided that the type of noise is known, so that a suitable objective function can be set up. A weak point is that in some of these papers the initial conditions are assumed to be known, which is not the case in realistic applications, and that sophisticated and expensive numerics are required.
The problem of controlling PDEs that generate time-varying travelling wave patterns has been studied recently [UKYK17, KUK20, SS20]. In these works, however, the focus is on fitting the trajectory of the evolving pattern in function space, rather than on estimating model parameter values, which are assumed to be given.
Turing parameter estimation by statistical inference. The paper [CFVM19] presents a Bayesian approach to parameter estimation using the reaction–diffusion equation as forward mapping and a data likelihood function corresponding to additive zero-mean Gaussian noise superimposed on observed patterns. Given a pattern, the posterior distribution on the parameter space is explored using expensive MCMC computations. A weak point of this approach, shared with the works discussed in the former category, is that the initial conditions are assumed to be known. This assumption is not required in our approach presented below.
Closer to our work is the recent paper [KH20]. These authors also study model parameter identification from steady-state pattern data only, without access to initial conditions or the transient pattern evolving towards the steady state. The key idea is to statistically model steady-state patterns of 'the same class', i.e. patterns whose spatial structure differs considerably due to different initial conditions. This is achieved by adopting a Gaussian model for the empirical distribution of discretized L2 distances between spatial patterns, which can be justified theoretically in the large-sample limit. This approach requires a few dozen to hundreds of novel test patterns to estimate model parameters.
In our work, we proceed differently: an almost invariant representation of patterns of 'the same class' is developed.This is advantageous in practice since parameter prediction can be done for each single novel test pattern.
Turing parameter estimation: other approaches. The work [MVM18] focuses on the identification of parameter values through a linear stability analysis on various irregular domains, assuming that the corresponding predicted pattern is close to a desired or observed pattern. However, the authors note that, in many cases, the steady-state pattern may not be an eigenfunction of the Laplacian on the given domain, since the nonlinear terms contribute to the resultant steady-state pattern. In our work, we solely focus on steady-state patterns and the inverse parameter value prediction problem.
A recent account of the broad variety of Turing pattern generating mechanisms and corresponding identifiability issues is given by [WKG21]. In our work, we focus on the well-known Gierer-Meinhardt model and study the feasibility of predicting points in the four-dimensional parameter space from given steady-state patterns.

Contribution and Organisation
We introduce a novel representation of the spatial structure of Turing patterns, which is achieved by computing resistance distances within each pattern, followed by discretization and using the empirical distribution of resistance distances as class representative. Discretization effects are accounted for by using the Wasserstein distance and a corresponding kernel function for comparing pairs of patterns. Based on this representation, we present the results of a feasibility study regarding the prediction of model parameter values from observed patterns. To our knowledge, this is the first paper that applies machine learning methods to the problem of directly mapping Turing patterns to model parameter values of a corresponding system of reaction-diffusion equations that generates the pattern as steady state. Adopting the Gierer-Meinhardt model as a case study, we demonstrate that about 1000 data points suffice for highly accurate prediction of single model parameter values using state-of-the-art kernel-based methods. The accuracy decreases for predictions of all four model parameter values but is still sufficiently good in terms of the normalized root-mean-square error and the corresponding pattern variation. In the large data regime (≥ 20,000 data points), predictions by neural networks trained directly on raw pattern data outperform kernel-based methods.
Our paper is organized as follows. Section 2 summarizes basics of Turing patterns that are required in the remainder of the paper: the definition of diffusion-driven patterns; discretization and a numerical algorithm for solving a system of semi-linear reaction-diffusion equations whose steady states correspond to the patterns that are used as input data for model parameter prediction; and the Gierer-Meinhardt PDE and its parametrization. Section 3 details the features that are extracted from spatial patterns in order to predict model parameter values. Key features are histograms of resistance distances, which represent spatial pattern structure in a properly invariant way. Section 4 introduces four methods for model parameter prediction from observed patterns: two kernel-based methods (basic SVM regression and operator-valued kernels) and neural networks, applied either to nonlocal pattern features or directly to the raw pattern data. Numerical results are reported and discussed in Section 5. We conclude in Section 6.

Basic Notation
We set [n] = {1, 2, . . . , n} and 1_n = (1, . . . , 1) ∈ R^n for n ∈ N. The Euclidean inner product is denoted by ⟨p, q⟩ for vectors p, q ∈ R^n, with corresponding norm ‖q‖ = √⟨q, q⟩. The ℓ_∞-norm is denoted by ‖p‖_∞ = max{|p_i| : i ∈ [n]}. ⟨A, B⟩ = tr(A⊤B), with trace tr(·), denotes the inner product of two matrices A, B. The spectral matrix norm is written as ‖A‖. The symbol λ with a matrix argument denotes an eigenvalue λ(A) of the matrix A. R^n_+ is the nonnegative orthant, and u > 0 means u_1 > 0, . . . , u_n > 0 for u ∈ R^n. Diag(u) is the diagonal matrix that has the components of a vector u as entries; similarly, Diag(A_1, . . . , A_n) is the block-diagonal matrix with matrices A_i, i ∈ [n], as blocks. The probability simplex is denoted by Δ_n = {p ∈ R^n_+ : ⟨1_n, p⟩ = 1}.

Turing Patterns: Definition and Computation
This section provides the required background on Turing patterns: reaction-diffusion systems (Section 2.1), Turing instability and patterns (Section 2.2), and a numerical algorithm for computing Turing patterns (Section 2.3). We refer to, e.g., [Mur01, Per15] for comprehensive expositions and to [KM10, LJDM20] for recent reviews.

Reaction-Diffusion Models
Consider a system of N interacting species described by the state vector u(t) = (u_1(t), . . . , u_N(t)), where u_i(t) ∈ R_+ is the time-dependent concentration of the ith species. We assume that the dynamics is governed by an autonomous system of ordinary differential equations

d/dt u(t) = f(u(t)),   u(0) = u_0,   (2.1)

where the initial condition u_0 ∈ R^N_+ is assumed to be positive and f : R^N → R^N encodes the interactions of the species. We further assume that the functions f_i, i ∈ [N], are continuously differentiable with bounded derivatives. 'Autonomous' means that f does not explicitly depend on the time t.
Next, model (2.1) is extended to a spatial scenario including diffusion. Concentrations u_i(t), i ∈ [N], are replaced by space-dependent concentration fields u(r, t) = (u_1(r, t), . . . , u_N(r, t)), where r = (r_1, . . . , r_M) denotes a point in a region S ⊂ R^M. The dynamics of these fields is described by a system of reaction-diffusion equations

∂/∂t u(r, t) = f(u(r, t)) + D Δ u(r, t),   (2.2)

where D = Diag(δ_1, . . . , δ_N) contains the diffusion constants and the Laplacian Δ acts componentwise on u. System (2.2) has to be supplied with boundary conditions in order to be well-posed. A common choice are homogeneous Neumann conditions. We choose periodic boundary conditions, however, because this considerably speeds up the generation of training data by numerical simulation (Section 2.3), yet does not facilitate or change in any essential way the learning problem studied in this paper.

Turing Patterns
We characterize the Turing instabilities that cause Turing patterns. Suppose u* ∈ R^N_+ is an equilibrium point of (2.1) satisfying f(u*) = 0. In order to assess the stability of u*, we write u(t) = u* + ũ(t) with a small perturbation ũ(t) and compute a first-order expansion of the system (2.1),

d/dt ũ(t) = J(u*) ũ(t),   (2.4)

with the Jacobian

J(u*) = ( ∂f_i/∂u_j (u*) )_{i,j∈[N]}.   (2.5)

Assuming that u* is asymptotically stable, we next consider the extended system (2.2) that involves spatial diffusion. Let u* = u*(r) denote the spatially constant extension of the equilibrium point, which depends neither on the time t nor on the spatial variable r: u*(r, t) = u*(r) = u*(r′) for all r, r′ ∈ S; hence Δu* = 0. Due to the diffusion terms, however, this equilibrium of f may no longer be stable for the system (2.2). To assess the stability of u*, a linear stability analysis is conducted using the ansatz

u(r, t) = u* + ũ(t) e^{i⟨q,r⟩},   (2.6)

where q denotes the wavevector of the perturbation ũ(t) e^{i⟨q,r⟩}, which conforms to the eigenvalue problem of the linearized spatial system (2.2). Substituting this ansatz into (2.2) and expanding to first order with respect to ũ yields a linear system of equations for ũ(t), similar to Eq. (2.4), but with the Jacobian J̃ given by

J̃(u*, q) = J(u*) − |q|² D.   (2.7)

Let λ̃_1(q) = λ_1(J̃(u*, q)), . . . , λ̃_N(q) = λ_N(J̃(u*, q)) be the eigenvalues of J̃(u*, q). For q = 0, we have λ̃_i(0) = λ_i, i ∈ [N] (the eigenvalues of J(u*) given by (2.5)), and hence Re(λ̃_i(0)) < 0 for all i ∈ [N], since u* = u*(r) equals the asymptotically stable equilibrium of the non-spatial system (2.1) for every r.
We say u* is a Turing instability of the system (2.2) if there exist a finite q > 0 and some i ∈ [N] for which Re(λ̃_i(q)) > 0, i.e. the steady state u* becomes unstable for a certain wavevector q. Here, we are interested in the additional condition Re(λ̃_j(q)) < 0 as q → ∞ for all j ∈ [N], such that Re(λ̃_i(q)) has a global maximum at some finite q. This type of instability typically leads to a stable pattern with a wavelength corresponding to q [Mur01]. In summary, the conditions for a Turing instability read

Re(λ̃_j(0)) = Re(λ_j) < 0 for all j ∈ [N],   (2.8a)
there exist q > 0 and i ∈ [N] for which Re(λ̃_i(q)) > 0,   (2.8b)
Re(λ̃_j(q)) < 0 as q → ∞ for all j ∈ [N].   (2.8c)

For other types of instabilities, we refer the reader to [SSIS19]. Figure 2.1 shows Re(λ̃_i(q)) for a two-species system (N = 2) with a Turing instability, together with the Turing pattern resulting from numerically solving the system of equations (2.2).
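Conditions (2.8a)–(2.8b) can be checked numerically by scanning the dispersion relation, i.e. the largest real part of the eigenvalues of J̃(u*, q) over a range of wavenumbers. The following sketch uses an illustrative activator–inhibitor Jacobian and diffusion matrix, not the Gierer-Meinhardt values used in the paper:

```python
import numpy as np

def turing_dispersion(J, D, q_values):
    """Largest real part of the eigenvalues of J~(u*, q) = J(u*) - |q|^2 D
    (Eq. 2.7), evaluated for each wavenumber q in q_values."""
    return np.array([
        np.max(np.linalg.eigvals(J - (q ** 2) * D).real) for q in q_values
    ])

def is_turing_unstable(J, D, q_values):
    """Check conditions (2.8a)-(2.8b): the equilibrium is stable without
    diffusion (q = 0) but unstable for some finite wavenumber q > 0."""
    stable_at_zero = np.max(np.linalg.eigvals(J).real) < 0
    return stable_at_zero and bool(np.any(turing_dispersion(J, D, q_values) > 0))

# Illustrative activator-inhibitor system (NOT the paper's model/parameters):
J_example = np.array([[1.0, -1.0],
                      [3.0, -2.0]])   # stable without diffusion: trace < 0, det > 0
D_example = np.diag([1.0, 10.0])      # the inhibitor diffuses much faster
```

For this example the growth rate is negative at q = 0 but becomes positive for intermediate wavenumbers, i.e. the homogeneous steady state is Turing unstable.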

Numerical Simulation
This section describes the numerical algorithm used to simulate the system of PDEs (2.2).

Discretization
We consider reaction-diffusion systems of the form (2.2) with two spatial dimensions (M = 2) on a square domain S = [0, n_r] × [0, n_r], and their solutions within the time interval t ∈ [0, T]. We further assume doubly periodic boundary conditions, i.e. u_i(0, r_2, t) = u_i(n_r, r_2, t) for each r_2 ∈ [0, n_r], and analogously in the second coordinate. The domain is discretized into a regular grid graph G = (V, E), where each node v ∈ V indexes a point r_v ∈ S. The edge set E represents the adjacency of each node to its four nearest neighbors on the grid and takes into account the doubly periodic boundary conditions. Each of these edges has length 1, corresponding to uniform sampling of size 1 along each coordinate r_1 and r_2, respectively.

Figure 2.1: (a) The eigenvalue λ̃_i(q) of the Jacobian J̃(u*, q) in Eq. (2.7) that gives rise to the Turing instability. The real part of λ̃_i(q) is shown, evaluated at an asymptotically stable equilibrium point u* of the Gierer-Meinhardt model of Eq. (2.26), as a function of q². The parameters of the model were set to a = 0.01, b = 1.2, c = 0.7, δ = 40, and the scaling parameter to s = 1. One has Re(λ̃_i(q))|_{q=0} < 0 due to the asymptotically stable equilibrium, as explained after Eq. (2.5). Re(λ̃_i(q)) becomes positive for an intermediate range of q values, which indicates a Turing instability. (b) The first species field u_1(r, t) of the solution to the system (2.2), computed on a 128 × 128 grid as described in Section 2.3.

Algorithm
Case N = 1. For simplicity, consider first the case of a single species, N = 1, u(r, t) = u_1(r, t), with diffusion constant δ, and let v(t) denote the vector obtained by stacking the rows of the two-dimensional array of function values (u(r_v, t))_{v∈V} evaluated on the grid. We discretize time into equal intervals of length h > 0 and write v^(k) = v(kh) for k ∈ N. The discretized PDE of the single-species case

∂/∂t u(r, t) = f(u(r, t)) + δ Δ u(r, t),   u(r, 0) = u_0(r),   (2.11)

of Eq. (2.2) is solved by the implicit Euler scheme

v^(k+1) = v^(k) + h ( f(v^(k+1)) + δ L v^(k+1) ),   (2.12)

where the matrix L is the Laplacian discretized using the standard 5-point stencil. To perform a single time-step update according to (2.12), we rewrite this equation as the fixed point iteration

v^(k,l+1) = (I − hδL)^{-1} ( v^(k) + h f(v^(k,l)) ),   (2.13)

with inner iteration index l, where I is the identity matrix. This fixed point equation is iterated until the convergence criterion

‖v^(k,l+1) − v^(k,l)‖ ≤ ε_l   (2.14)

is met for some constant 0 < ε_l ≪ 1, followed by updating the outer iteration

v^(k+1) = v^(k,l+1).   (2.15)

The outer iteration is terminated when

‖v^(k+1) − v^(k)‖ ≤ ε_k   (2.16)

for some constant 0 < ε_k ≪ 1.
Due to the doubly periodic boundary conditions, the matrix I − hδL = W Λ W* is sparse and block-circulant, and can hence be diagonalized using the unitary Fourier matrix W corresponding to the two-dimensional discrete Fourier transform of doubly periodic signals defined on the grid graph G. As a result, using the fast Fourier transform (2D-FFT), the multiplication of the inverse matrix by some vector b,

(I − hδL)^{-1} b = W Λ^{-1} W* b,   (2.17)

can be efficiently computed due to the convolution theorem by
• transforming b using the 2D-FFT: W* b,
• pointwise multiplication with the inverse eigenvalues of the matrix: Λ^{-1}(W* b), where Λ^{-1} is a diagonal matrix and hence is inverted elementwise,
• transforming back using the inverse 2D-FFT: W(Λ^{-1} W* b).
The eigenvalues Λ of the matrix I − hδL result from applying the 2D-FFT to the convolution stencil of the block-circulant matrix (2.18).

Case N > 1. This procedure applies almost unchanged to the case of multiple species (N > 1), because the diffusion operator acts on each species field separately. For N = 2, let v = (v_1, v_2) denote the stacked subvectors v_1, v_2 that result from stacking the rows of the two-dimensional arrays of function values (u_1(r_v, t))_{v∈V}, (u_2(r_v, t))_{v∈V} evaluated on the grid. The fixed point iteration (2.13) then applies componentwise, where the mappings (I − hδ_i L)^{-1}, i = 1, 2, can be applied in parallel to the corresponding subvectors.
Note that the vector f(v) couples the species concentrations. The general case N > 2 is handled similarly.
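For a single species, one outer update of the scheme (2.12)–(2.15) can be sketched as follows; the FFT-based solve exploits that I − hδL is diagonalized by the 2D discrete Fourier transform. The function names and tolerance handling are our own sketch, not the authors' implementation:

```python
import numpy as np

def make_laplacian_symbol(n):
    """Eigenvalues of the doubly periodic 5-point Laplacian on an n x n grid,
    obtained as the 2D-FFT of its convolution stencil (cf. Eq. 2.18)."""
    stencil = np.zeros((n, n))
    stencil[0, 0] = -4.0
    stencil[1, 0] = stencil[-1, 0] = stencil[0, 1] = stencil[0, -1] = 1.0
    return np.fft.fft2(stencil).real  # symmetric stencil -> real eigenvalues

def implicit_euler_step(v, f, h, delta, lap_eigs, eps_l=1e-10, max_inner=100):
    """One outer update of the fixed point iteration (2.13):
    w_{l+1} = (I - h*delta*L)^{-1} (v_k + h f(w_l)), solved via 2D-FFT."""
    inv_eigs = 1.0 / (1.0 - h * delta * lap_eigs)  # Laplacian eigenvalues <= 0
    w = v.copy()
    for _ in range(max_inner):
        rhs = v + h * f(w)
        w_new = np.fft.ifft2(np.fft.fft2(rhs) * inv_eigs).real
        if np.linalg.norm(w_new - w) <= eps_l * (1.0 + np.linalg.norm(w)):
            return w_new
        w = w_new
    return w
```

As a simple correctness check: for the reaction term f(u) = −u and a spatially constant field, the Laplacian term vanishes and the iteration converges to the implicit Euler update v/(1 + h).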

Step Size Selection
The step size h has to be selected such that two conditions are fulfilled: the matrix I − hδL_N should be invertible, where δL_N denotes the block-diagonal matrix Diag(δ_1 L, . . . , δ_N L), and the fixed point iteration (2.13) should converge. We discuss these two conditions in turn. The first condition holds if h satisfies a bound (2.21) involving λ_max(L); this also yields an estimate (2.22) of the norm of the inverse matrix. λ_max(L) may easily be computed beforehand using the power method [HJ13, p. 81], or replaced by the upper bound due to Gerschgorin's circle theorem [HJ13, Section 6.1]. Now consider the fixed point iteration (2.13). Due to our assumptions stated after Eq. (2.1), the mapping f is Lipschitz continuous, i.e. there exists a constant L_f > 0 such that ‖f(u) − f(v)‖ ≤ L_f ‖u − v‖ for all u, v (2.23). Combining this with the estimate (2.22), we obtain that both above-mentioned conditions hold if h is chosen small enough to satisfy (2.21) and the contraction condition (2.25). Then the mapping T_h defined by the right-hand side of (2.13) is a contraction and, by Banach's fixed point theorem, the iteration converges.
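The Gerschgorin estimate mentioned above is cheap to evaluate; a minimal sketch, assuming the symmetric positive semidefinite sign convention for the graph Laplacian:

```python
import numpy as np

def gerschgorin_upper_bound(L):
    """Upper bound on the largest eigenvalue of a symmetric matrix L via
    Gerschgorin's circle theorem:
        lambda_max <= max_i ( L_ii + sum_{j != i} |L_ij| )."""
    L = np.asarray(L, dtype=float)
    radii = np.sum(np.abs(L), axis=1) - np.abs(np.diag(L))
    return float(np.max(np.diag(L) + radii))
```

For the 5-point graph Laplacian with unit weights (diagonal entry 4, four off-diagonal entries −1 per row), the bound evaluates to 8 and can replace the power-method estimate of λ_max(L) when choosing h.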

The Gierer-Meinhardt model
As a concrete example, we consider the Gierer-Meinhardt model [GM72] comprising two species: a slowly diffusing activator that promotes its own and the other species' production, and a fast diffusing inhibitor that suppresses the production of the activator. Regarding the representation of the model by means of a PDE as in Eq. (2.2), several different variants have been proposed in the literature [GM72, Mur01]. Here, we use the dimensionless version analysed in [Mur82], given by the system (2.26), with parameters a, b, c, δ, s > 0 and the shorthands (2.27) (cf. Eq. (2.2)). Since only the ratio between the diffusion constants of the two species affects the stability of the system, the diffusion constant of the first species in (2.26) is normalised, and we set δ_1 = 1, δ_2 = δ. The overall scaling of D, however, determines the wavelength of an emerging pattern, and we accordingly multiply the diffusion matrix D in Eq. (2.26) by an additional scaling factor s > 0.
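For concreteness, the reaction terms and analytic Jacobian of one common dimensionless Gierer-Meinhardt variant can be written down as follows. The exact form of Eq. (2.26) is not reproduced in this excerpt; the saturated variant below (with saturation constant c) is an assumption, chosen because it matches the parameter set (a, b, c):

```python
import numpy as np

def gierer_meinhardt(u1, u2, a, b, c):
    """Reaction terms of a dimensionless Gierer-Meinhardt model.
    ASSUMED variant (may differ from Eq. (2.26) of the paper):
        f1 = a - b*u1 + u1^2 / (u2 * (1 + c*u1^2)),  f2 = u1^2 - u2."""
    f1 = a - b * u1 + u1 ** 2 / (u2 * (1.0 + c * u1 ** 2))
    f2 = u1 ** 2 - u2
    return f1, f2

def gm_jacobian(u1, u2, a, b, c):
    """Analytic Jacobian J(u*) of the assumed reaction terms above."""
    denom = 1.0 + c * u1 ** 2
    df1_du1 = -b + 2.0 * u1 / (u2 * denom ** 2)
    df1_du2 = -u1 ** 2 / (u2 ** 2 * denom)
    return np.array([[df1_du1, df1_du2],
                     [2.0 * u1, -1.0]])
```

The Jacobian is the ingredient needed for the linear stability analysis of Section 2.2, i.e. for evaluating J̃(u*, q) = J(u*) − |q|²D.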
Figure 2.1 displays the eigenvalues of the Jacobian of this model in the context of Turing instabilities, as described in Section 2.2, for one choice of parameters a, b, c, δ, together with the corresponding Turing pattern emerging when simulating the model numerically as described in Section 2.3. Figure 2.2 shows simulation results for various parameter values c and δ. The model gives rise to different types of patterns, ranging from spots to labyrinths. The characteristic length scale, or 'wavelength', of the pattern varies with these parameters. This limits the ranges of parameter values that we analyse in the experiments: a too small wavelength leads to numerical artefacts when simulating the corresponding PDEs, due to discretization errors; a too large wavelength, on the other hand, only yields a small section of the spatial pattern as a 'close-up view'.
We hence only consider parameter values that exclude both extreme cases relative to a fixed grid graph that was used for numerical computation.

Extracting Features from Turing Patterns
We extract two types of features from Turing patterns: resistance distance histograms (Section 3.1) efficiently encode the spatial structure of patterns due to their stability under spatial transformations. This almost invariant representation also exhibits a few symmetries, however, which may reduce the accuracy of parameter prediction in certain scenarios. Hence, two additional features are extracted that remove some of these symmetries (Section 3.2).

Resistance Distance Histograms (RDHs)
Resistance distance histograms (see Definition 3.1 below) require two standard preprocessing steps described subsequently: representing Turing patterns as weighted graphs and computing pairwise resistance distances.
Graph-based representation of Turing patterns. Let u_{i,j}, i, j ∈ [n_r], be the concentration values of one species of a reaction-diffusion system at time t = T on the regular torus grid graph G = (V, E) of Section 2.3, i.e. u_{i,j} is the concentration value at node v = (i, j), obtained by simulating the system of PDEs (2.2) as described in Section 2.3, and denote by ū the mean concentration. We assign weights ω_{vv′} to the edges according to (3.2): edges between adjacent nodes receive the unit weight 1 if both concentrations are either larger or smaller than the mean concentration ū, and a smaller weight otherwise.

Resistance distances and histograms. Based on (3.2), we define the weighted adjacency matrix Ω_G and the graph Laplacian L_G of G (3.5), where 1_m is an m-dimensional column vector with all entries equal to 1. Using (3.5), in turn, we define the Gram or kernel matrix K (3.6). Each entry R_{vv′} of the matrix defined by (3.7) is the resistance distance between v and v′; it satisfies the bound R_{vv′} ≤ d_G(v, v′), where d_G denotes the length of the shortest weighted path connecting v and v′ in G. The bound is tight if this path is unique. Conversely, if multiple paths connect v and v′, then the resistance distance is strictly smaller than d_G(v, v′). This sensitivity to the connectivity between nodes in graphs explains its widespread use, e.g. for cluster and community detection in graphs [For10].
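The pipeline above (weighted adjacency, graph Laplacian, resistance distances) can be sketched compactly via the Moore-Penrose pseudoinverse of the Laplacian; using the pseudoinverse in place of the specific construction (3.5)–(3.6) is an assumption of this sketch, but it yields the standard resistance distances of the form (3.7):

```python
import numpy as np

def resistance_distances(W):
    """Pairwise resistance distances on a weighted graph with symmetric
    adjacency matrix W, via the pseudoinverse K of the graph Laplacian:
        R[v, v'] = K[v, v] + K[v', v'] - 2 K[v, v']  (cf. Eq. 3.7)."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W   # graph Laplacian
    K = np.linalg.pinv(L)            # Gram/kernel matrix (cf. Eq. 3.6)
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K
```

As a sanity check, a triangle with unit weights gives the effective resistance 2/3 between any two nodes (one direct unit edge in parallel with a path of two edges), and a path of two unit edges gives resistance 2 between its endpoints.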
A probabilistic interpretation of the resistance distance is as follows. Consider a random walk on G performing jumps along the edges in discrete time steps, and assume that the probability of jumping along an edge is proportional to the edge's weight. Then R_{vv′} is inversely proportional to the probability that the random walk starting at v visits v′ before returning to v [Bap14, Section 10.3]. In view of (3.2), this implies that the process jumps more likely between neighbouring nodes that both have large (or both small) concentrations than between nodes with differing concentrations.
We add a third interpretation of the resistance distance from the viewpoint of kernel methods [HSS08, SST14] and data embedding. Consider the space of functions on V, which we identify with real vectors of dimension m = |V|, and the bilinear form E induced by L_G. Since G is connected, the symmetric and positive semidefinite graph Laplacian L_G has a single eigenvalue 0, corresponding to the eigenvector 1_m. Consequently, using E, one defines the Hilbert space H_G with the inner product (3.11b). Since dim H_G < ∞, all norms are equivalent and the evaluation map f → f_v is continuous. Hence H_G is a reproducing kernel Hilbert space [PR16, Def. 1.1] with reproducing kernel given by the entries K_{vv′} of the Gram matrix (3.6), where K_v, K_{v′} denote the column vectors indexed by v, v′, interpreted as elements (functions) of H_G. The resistance distance (3.7) then takes the form

R_{vv′} = ‖K_v − K_{v′}‖²_G,

where ‖·‖_G denotes the norm induced by the inner product (3.11b). This makes explicit the nonlocal nature of the resistance distance R_{vv′} between nodes v, v′ ∈ V as the squared distance between the corresponding functions K_v, K_{v′} ∈ H_G. Overall, each of the three interpretations reveals how the resistance distance measures the nonlocal spatial structure of Turing patterns. Figure 3.1(b) visualises the resistance distances {R_{vv′}}_{v′∈V} with respect to one fixed node v. We condense these data extracted from Turing patterns into features, in terms of the histograms described next.

Definition 3.1 (resistance distance histogram (RDH)) Let V_t ⊂ V, t ∈ N, denote the nodes of the subgraph induced by the subgrid that is obtained from the underlying grid graph G = (V, E) by undersampling each coordinate direction with a factor t.
Define the set of resistance distance values R_{r,t} collected, for each node v ∈ V_t, from the nodes within radius r of v (3.14); the RDH H_{r,t} is the normalized histogram of these values. The radius parameter r specifies the spatial scale at which the local structure of Turing patterns is assessed through RDHs, in a way that is stable against spatial transformations of the local domains corresponding to (3.14). r is the only essential parameter, since RDHs are based on the data R_{r,t} collected from all nodes v ∈ V_t. This results in an averaging of local pattern structure and makes RDHs only weakly dependent on the spacing t. In addition, the representation becomes robust against local noise in the pattern caused by random initial conditions.
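A sketch of the histogram computation corresponding to Definition 3.1: it assumes that, for each node of the subsampled grid V_t, the resistance distances to all nodes within grid radius r are collected and binned into a normalized histogram. Bin count, distance range, and the handling of periodic boundaries (ignored here) are free choices not fixed by the text above:

```python
import numpy as np

def rdh(R, coords, t, r, bins=32, d_max=None):
    """Sketch of a resistance distance histogram (cf. Definition 3.1).
    R: (m, m) matrix of pairwise resistance distances,
    coords: (m, 2) integer grid coordinates of the nodes,
    t: subsampling factor defining V_t, r: spatial radius.
    Periodic wraparound of the grid is ignored in this sketch."""
    coords = np.asarray(coords, dtype=float)
    sub = [i for i, (x, y) in enumerate(coords)
           if x % t == 0 and y % t == 0]          # nodes of the subgrid V_t
    values = []
    for v in sub:
        dist = np.linalg.norm(coords - coords[v], axis=1)
        near = np.flatnonzero((dist <= r) & (dist > 0))
        values.extend(R[v, near])
    values = np.asarray(values)
    if d_max is None:
        d_max = values.max()
    hist, _ = np.histogram(values, bins=bins, range=(0.0, d_max))
    return hist / hist.sum()
```

Fixing `d_max` and `bins` across all patterns makes the histograms directly comparable, which is what the Wasserstein kernels of Section 4 operate on.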
Remark 1 Typically, the concentrations of different species in a Turing pattern are approximately scaled (and sometimes reflected) versions of each other, in particular for two-species systems like the Gierer-Meinhardt model studied here. See Figure 3.2 for an illustration. The RDHs defined in Definition 3.1 hence contain redundant information when computed for different species. Therefore, we only use the concentrations of one species to compute RDHs in the following.

Maximal Concentration and Connected Components
RDHs according to Definition 3.1 represent the spatial structure of Turing patterns in a compact way. However, this representation also exhibits a few symmetries, such that certain properties of patterns are not captured:
1. absolute concentration values: rescaling or shifting the concentration values of a pattern does not change the RDHs;
2. range-reflection symmetry: reflecting a pattern about any plane of constant concentration, i.e. inverting the total order that defines the weights (3.2), does not change the RDHs.
To account for these two properties, we introduce the following two additional features.
1. Maximal concentration c_m: We aim to estimate the concentration level of areas of the pattern with high concentration, while disregarding local fluctuations that potentially arise from numerical inaccuracies. To this end, we bin the concentration values of a pattern into a histogram and define the maximal concentration c_m as the location of the right-most peak.
2. Number of connected components n_c: To account for the above-mentioned range-reflection symmetry, we define the graph G′ = (V, E′) with the same nodes V as the original graph G but with only a subset E′ ⊂ E of edges, between nodes of high concentration. We then compute the number n_c of connected components of G′.
This list of pattern properties not captured by RDHs is, of course, not exhaustive. For example, RDHs do not effectively measure the steepness of transitions between areas of high and low concentration. However, since RDHs turned out to be powerful enough for parameter estimation, as shown below, we did not use further features in this study.
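Both auxiliary features can be sketched in a few lines. The peak criterion for c_m and the thresholding rule defining the edge subset E′ are assumptions (they are not fixed in the text above), and periodic wraparound of the grid is ignored:

```python
import numpy as np
from collections import deque

def max_concentration(u, bins=64):
    """Feature c_m of Section 3.2: bin the concentrations into a histogram
    and return the location of the right-most peak. The peak criterion
    (local maximum of bin counts, boundaries included) is an assumption."""
    hist, edges = np.histogram(np.ravel(u), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([0], hist, [0]))
    peaks = [i for i in range(bins)
             if padded[i + 1] > 0
             and padded[i + 1] >= padded[i] and padded[i + 1] >= padded[i + 2]]
    return centers[peaks[-1]]

def n_high_components(u, threshold):
    """Feature n_c of Section 3.2: number of connected components among
    grid nodes with concentration above a threshold, connected by
    4-neighbourhood edges (BFS flood fill)."""
    high = np.asarray(u) > threshold
    seen = np.zeros_like(high, dtype=bool)
    rows, cols = high.shape
    n_c = 0
    for i in range(rows):
        for j in range(cols):
            if high[i, j] and not seen[i, j]:
                n_c += 1                       # found a new component
                queue = deque([(i, j)])
                seen[i, j] = True
                while queue:
                    r, c = queue.popleft()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols \
                                and high[rr, cc] and not seen[rr, cc]:
                            seen[rr, cc] = True
                            queue.append((rr, cc))
    return n_c
```

For a spot pattern, n_c roughly counts the spots; for a labyrinth, the high-concentration set is typically connected and n_c is small, which is what makes the feature discriminative.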

Learning Parameters from Spatial Turing Patterns
This section concerns the problem of learning the kinetic parameters of Turing patterns described in Section 2.2. We start by formulating the learning problem in Section 4.1. Sections 4.2 and 4.3 introduce the approaches for parameter prediction studied in this paper: kernel-based predictors and neural networks, respectively.

Learning Problem
We consider a multi-output learning problem with training set

D_n = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ X × Y,   Y ⊂ R^d,   (4.1)

where each x_i is a resistance distance histogram H_{r,t} (RDH) according to Definition 3.1, and the vectors y_i comprise parameter values of a model, such as the parameters a, b, c, δ of the Gierer-Meinhardt model (2.26). Our goal is to learn a prediction function f : X → Y that generalises well to points (x, y) ∉ D_n. We distinguish individual parameter prediction (Section 4.2.1), corresponding to dimension d = 1, where a predictor function f is learned separately for each model parameter, and joint parameter prediction (Section 4.2.2), corresponding to d > 1, where a single vector-valued predictor function f is learned. In each of these cases, the output training data y_i in D_n (4.1) have to be interpreted accordingly.
Regarding individual parameter prediction, we employ basic support vector regression [EPP00, SS04] in Section 4.2.1 and specify suitable kernel functions for resistance distance histograms as input data in Section 4.2.3. Regarding joint parameter prediction, we employ regression using an operator-valued kernel function in Section 4.2.2. For background reading concerning reproducing kernel Hilbert spaces (RKHS) and their use in machine learning, we refer to [BCR84, BTA04, PR16] and [EPP00, CS01, HSS08], respectively. We also utilize neural networks in Section 4.3 for both individual and joint parameter prediction.

Accuracy Measure
A commonly used measure for the accuracy of an estimator f : X → Y is the normalised root-mean-squared error (NRMSE), which we use throughout this paper.
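For concreteness, a minimal sketch of the NRMSE is given below. Normalising the RMSE by the range of the true values is one common convention and is an assumption here, since the precise normalisation is fixed by the equation in the text.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Normalised root-mean-squared error: RMSE divided by the range of
    the true values (one common normalisation; an illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())
```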

Individual Parameter Prediction Using SVMs
In this section, we focus on the case d = 1 where, along with a finite sample of RDHs x i , values y i ∈ R of some model parameter are given as training set (4.1). Individual parameter prediction means that, for each model parameter, an individual prediction function specific to this particular parameter is determined. We apply standard support vector regression. Given a symmetric and nonnegative kernel function (see Section 4.2.3 for concrete examples), the corresponding reproducing kernel Hilbert space (RKHS) with inner product ⟨•, •⟩ k is denoted by H k . 'Reproducing' refers to the property (4.6b). Using the training set D n and a corresponding loss function, our objective is to determine a prediction function f ∈ H k that maps an RDH x extracted from an observed pattern to a corresponding parameter value y = f (x). We employ the ε-insensitive loss function [EPP00] to define the training objective function, where the regularizing parameter controls the size of the set of prediction functions f in order to avoid overfitting. Since the ε-insensitive loss is continuous and the regularizing term monotonically increases with the RKHS norm of f, the representer theorem [Wha90, SHS01] applies and implies that the minimizing function f * of (4.8) lies in the span of the functions generated by the training set. Introducing nonnegative slack variables ξ i , i ∈ [n], to represent the piecewise linear summands subject to ξ i ≥ 0 (4.11b), and using that K n is positive definite, this becomes a convex quadratic program that can be solved using standard methods. Substituting the minimizing vector α * into (4.9) determines the desired prediction function f * . This procedure is repeated to obtain prediction functions f * j , j ∈ [d], for each parameter to be predicted, using the training sets {(x i , y i,j )} i∈[n] for j ∈ [d].
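As an illustrative sketch (not the authors' implementation), the procedure can be reproduced with scikit-learn's ε-insensitive SVR and a precomputed kernel matrix. The data below are random stand-ins for RDHs and parameter values, and scikit-learn's C parameter plays the role of the inverse regularisation strength:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)

# Toy stand-ins for RDH input vectors x_i and one scalar model parameter y_i.
X_train = rng.random((40, 12))       # 40 histograms with B = 12 bins
y_train = X_train.mean(axis=1)       # hypothetical target values

# epsilon-insensitive support vector regression with a precomputed
# exponential chi^2 kernel matrix (gamma is a hyperparameter).
K_train = chi2_kernel(X_train, X_train, gamma=1.0)
model = SVR(kernel="precomputed", epsilon=0.01, C=10.0)
model.fit(K_train, y_train)

# Prediction for unseen histograms uses the kernel against the training set.
X_test = rng.random((5, 12))
y_pred = model.predict(chi2_kernel(X_test, X_train, gamma=1.0))
```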

Joint Parameter Prediction Using Operator-Valued Kernels
In this section, we consider the general case d ≥ 2: each vector y i of the training set (4.1) comprises the values of a fixed set of model parameters, and x i is the RDH extracted from the corresponding training pattern. Our aim is to exploit dependencies between input variables {x i } i∈[n] and output variables {y i } i∈[n] as well as dependencies among the output variables. To this end, we use the method proposed by [KGP13] for vector-valued parameter prediction that generalizes vector-valued kernel ridge regression as introduced by [EMP05, MP05]. See [ÁRL12] for a review of vector-valued kernels and [BSdB16, MBM16] for general operator-valued kernels and prediction.
In addition to a kernel function k (4.5) for the input data, we employ a kernel function for the output data with corresponding RKHS (H l , ⟨•, •⟩ l ) and properties analogous to (4.6), and an operator-valued kernel function K : X × X → L(H l ) (4.13) that takes values in the space L(H l ) of bounded self-adjoint operators from H l to H l . K enjoys the same properties as the more common scalar-valued kernel functions k, l, viz., it is the reproducing kernel of a corresponding RKHS of H l -valued functions. The properties analogous to (4.6) carry over accordingly. In particular, the operator-valued kernel matrix is positive semidefinite: Σ i,j∈[m] ⟨ϕ i , K(x i , x j )ϕ j ⟩ l ≥ 0 (4.17) for all m ∈ N, x 1 , . . ., x m ∈ X , ϕ 1 , . . ., ϕ m ∈ H l .
In order to capture dependencies among the output variables as well as between input and output variables, the prediction function f : X → Y (4.18) is not learned directly, unlike the individual predictors (4.4) in the preceding section (cf. (4.9)). Rather, the optimal mapping (4.18) is parametrized via (4.19), where φ l : Y → H l is the feature map corresponding to the output kernel function (4.12) (see, e.g., [CS01, Section 3]) satisfying ⟨φ l (y i ), φ l (y j )⟩ l = l(y i , y j ) (4.20), and g * ∈ H k is determined by regularized least-squares on the transformed training data, that is, by solving (4.21). Invoking again the representer theorem, valid for the present more general scenario [MP05], g * admits the representation (4.22), which makes explicit how the approach generalizes the individual parameter predictors (4.9). It remains to specify a kernel function K and the computation of φ −1 l in order to evaluate the prediction map (4.19). As for K, our choice (4.23) involves the input kernel function k (4.5) and the conditional covariance operator C YY|X on H l . Since K is evaluated on the training data, C YY|X is replaced in practice by evaluating the empirical covariance operators on the right-hand side, i.e.
with output kernel function l (4.12), l y i = l(y i , •) and (l y i ⊗ l y j )ϕ = ⟨l y j , ϕ⟩ l l y i , and similarly for the remaining mappings on the right-hand side of (4.23b). The kernel function (4.23), together with the predictor (4.22) in the output feature space, reveals how dependencies both among the output variables and between the input and output variables are taken into account.
In order to obtain, for some test input (RDH) x, the predicted parameter vector from the predicted embedded output value g * (x), the mapping φ −1 l of (4.19) has to be evaluated. This is an instance of the so-called pre-image problem [SMB + 99, HR11]. In the present scenario, putting together (4.21), (4.22), (4.23) and (4.15), this yields after a lengthy computation [KGP13, Appendix] the optimization problem (4.26a) with L n = (l(y i , y j )) i,j∈[n] , l y = (l(y, y 1 ), . . ., l(y, y n )) ⊤ . (4.26e) Here, λ in (4.26b) is the regularization parameter of (4.21), ε in (4.26c) is a small constant regularizing the numerical matrix inversion, K n , L n are the input and output kernel matrices corresponding to the training data (4.1), x is a novel unseen test pattern represented as described in Section 3, and y is the parameter vector variable to be optimized. Unlike the input kernel function k that is applied to RDHs (see Section 4.2.3), the output kernel function l applies to the common case of parameter vectors, and hence choosing the smooth Gaussian kernel function as l is a sensible choice. Therefore, once the vector v(x; D n ) has been computed for a test pattern x, the optimization problem (4.26a) can be solved numerically by iterative gradient descent with adaptive step size selection by line search.
Regarding the computation of the vector (4.26b) that defines the objective function of (4.26a), the matrix T n given by (4.26c) can be directly computed for up to a few thousand data points using off-the-shelf solvers. This is not the case, however, for the linear system of (4.26b) involving the Kronecker product K n ⊗ T n , which is dense and has size n 2 × n 2 . Therefore, we solve the linear system in a memory-efficient way using the global-GMRES algorithm proposed by [BJ08], which iteratively constructs Krylov matrix subspaces and approximates the solution by solving a sequence of low-dimensional least-squares problems. Having computed u, the vector (4.26b) results from computing (4.28).
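The key memory saving comes from never forming K n ⊗ T n explicitly: its action on a vector follows from the identity (K ⊗ T) vec(U) = vec(K U Tᵀ) for row-major vectorisation. A hedged sketch with SciPy's generic GMRES and random stand-in matrices; the paper uses global-GMRES [BJ08], which this plain matrix-free GMRES merely approximates:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
n = 20
A = rng.random((n, n)); K = A @ A.T + n * np.eye(n)  # SPD stand-in for K_n
B = rng.random((n, n)); T = B @ B.T + n * np.eye(n)  # SPD stand-in for T_n
b = rng.random(n * n)                                # right-hand side vector

def kron_matvec(u):
    # (K ⊗ T) vec(U) = vec(K U T^T) for row-major vec: costs O(n^3) flops
    # and O(n^2) memory instead of forming the dense n^2 x n^2 matrix.
    U = u.reshape(n, n)
    return (K @ U @ T.T).reshape(-1)

op = LinearOperator((n * n, n * n), matvec=kron_matvec)
u, info = gmres(op, b)
```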

Kernels for Resistance Distance Histograms
In this section, we specify the kernel functions (4.5) that we evaluated for parameter prediction. Below, x, x′ ∈ H r,t denote two RDHs.
Symmetric χ 2 -kernel. This kernel is a member of a family of kernels generated by Hilbertian metrics on the space of probability measures on X [HB05] and is defined by (4.29).
Wasserstein kernel. We define a cost matrix C and the squared discrete Wasserstein distance (4.32) between x and x′, where M * solves the discrete optimal transport problem [PC19], M is a doubly stochastic matrix, and the minimizer M * is the optimal transport plan for transporting x to x′ with respect to the given costs C. The Wasserstein kernel is defined by (4.34) and can be shown to be a valid kernel for generating an RKHS and embedding [BGLV18]. For measures defined on the real line R, it is well known that the distance d W between two distributions can be evaluated in terms of the corresponding cumulative distributions. This carries over to the discrete measures x, x′ and the distance d W (x, x′) considered here, provided the implementation takes care of monotonicity and hence invertibility of the discrete cumulative distributions; we refer to [San15, Section 2] for details.
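For intuition, a minimal sketch of the cumulative-distribution evaluation for histograms on a common one-dimensional bin grid. The unit bin spacing, the W1-type cost, and the Gaussian-type form of the kernel are our assumptions for illustration, not necessarily the paper's exact definitions (4.32), (4.34):

```python
import numpy as np

def wasserstein_sq(x, xp):
    """Squared 1-D Wasserstein distance between two histograms on a common
    grid of unit-spaced bins, computed via cumulative distributions
    (W1-type cost; an illustrative assumption, not the exact cost C)."""
    cdf_x = np.cumsum(x / np.sum(x))       # normalise to probability vectors
    cdf_xp = np.cumsum(xp / np.sum(xp))
    return float(np.sum(np.abs(cdf_x - cdf_xp)) ** 2)

def wasserstein_kernel(x, xp, gamma=1.0):
    # Gaussian-type kernel on the squared distance (assumed functional form).
    return float(np.exp(-gamma * wasserstein_sq(x, xp)))
```

For two point masses at bins 0 and 2, the distance is 2 and the squared distance 4, which the CDF formula reproduces.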

Neural Networks
Feed-forward neural networks. Let n k and n d be the dimensions of the input and output space, respectively. A feed-forward neural network (FFNN) of depth L ∈ N is a function f : R n k → R n d that can be written as the composition of L − 1 hidden layers, where n (1) i = n k , and a final output layer. Each hidden layer is in turn the composition of a linear transformation and an activation function. We use the rectified linear unit (ReLU) as activation function, defined as ReLU(x) = max(0, x), x ∈ R. (4.35) The hidden layers can accordingly be written as (4.36) for an input vector x ∈ R n (j) i , weight matrix W (j) and bias b (j) ∈ R n (j) o . The ReLU function in Eq. (4.36) acts independently on each element of its argument. For the final output function f (o) we use a linear transformation without an activation function. The numbers n (j) o are called the numbers of "hidden units" or "neurons" of the jth layer. Since for a general W (j) all neurons of the (j − 1)th layer are connected to all neurons of the jth layer, hidden layers as in Eq. (4.36) are also called "fully-connected layers". To characterise a FFNN, we specify the numbers of hidden units of its layers.
Convolutional neural networks. Convolutional neural networks (CNNs) have been used for learning problems on image data in various different applications, in particular for classification tasks [GWK + 18]. We will consider CNNs that were trained on raw pattern data to learn the kinetic parameters of the model, as a benchmark for the results obtained from training models on resistance distance histograms.
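The fully-connected forward pass described above can be written out in a few lines; the weights below are hypothetical toy values, chosen only to make the ReLU composition concrete:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def ffnn_forward(x, weights, biases):
    """Forward pass of a fully-connected ReLU network with a linear
    output layer, i.e. the composition of affine maps and ReLUs."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)              # hidden layer: affine map + ReLU
    return weights[-1] @ h + biases[-1]  # output layer: linear, no activation

# Toy network: one hidden layer with two neurons, scalar input and output.
W1 = np.array([[1.0], [-1.0]]); b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]]);    b2 = np.array([1.0])
y = ffnn_forward([2.0], [W1, W2], [b1, b2])
```

For input 2.0, the hidden layer yields ReLU([2, −2]) = [2, 0] and the output is 2 + 0 + 1 = 3.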
Various different CNN architectures have been used in the literature. The majority consist of three basic types of layers: convolutional, pooling, and fully connected layers. The convolutional layer's function is to learn feature representations of the inputs. This is achieved by convolving the inputs with learnable kernels, followed by applying an element-wise activation function. Convolutional layers are typically followed by pooling layers, which reduce the dimension by combining the outputs of clusters of neurons into a single neuron in the next layer. Local pooling combines small clusters, typically of size 2 × 2, while global pooling acts on all neurons of the previous layer. A sequence of convolutional and pooling layers is then typically followed by one or several fully-connected layers as in Eq. (4.36). These are then followed by a final output layer chosen according to the specific learning task, such as a softmax layer for classification tasks [GWK + 18].
For the applications in this paper, we found the best performance for minimalistic CNNs consisting of only one convolutional and one fully-connected layer. The ReLU activation function in Eq. (4.35) was applied to the output of both layers. We denote the architecture by (n k /n p /n f ), where n k is the number of kernels of size n p × n p used and n f denotes the number of neurons in the fully-connected layer.

Implementation Details
Simulation details. According to Section 2.3.3, setting the step size h properly requires estimating (an upper bound of) the Lipschitz constant of f . It turned out, however, that applying standard calculus [RW09, Ch. 9] to the concrete mappings f (2.26) yields too loose upper bounds of L f and hence quite small step sizes h, which slows down the numerical computations unnecessarily. Therefore, in practice, we set h to a value that is 'reasonable' for the backward Euler method and monitored the fixed point iteration (2.13), checking every few iterations whether the method diverges, in which case h was replaced by h/2. We found h = 0.2 to be a reasonable choice for all applications studied here.
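The step-size safeguard can be sketched as follows; `advance` is a hypothetical callback (our own naming) wrapping one backward-Euler step via the fixed-point iteration (2.13) and reporting whether it converged:

```python
def safe_step(advance, h0=0.2, max_halvings=10):
    """Run one implicit time step with step size h, halving h whenever the
    inner fixed-point iteration diverges (illustrative sketch; `advance(h)`
    is assumed to return a pair (state, converged))."""
    h = h0
    for _ in range(max_halvings):
        state, converged = advance(h)
        if converged:
            return state, h       # keep the first step size that works
        h /= 2.0                  # divergence detected: halve the step size
    raise RuntimeError("fixed-point iteration kept diverging")
```

For instance, a step that only converges for h ≤ 0.05 is retried at h = 0.2, 0.1 and finally succeeds at 0.05.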
The threshold ε l for the convergence criterion of the inner iteration in Eq. (2.14) was set to ε l = 0.001. The outer iteration was terminated if either the convergence criterion in Eq. (2.16) was met with threshold ε k = 10 −6 , which we checked after time intervals of δt = 100, or when a fixed maximal time T f was reached. We chose T f = 2000 for domain sizes of 32 × 32 and 64 × 64, and T f = 5000 for a domain size of 128 × 128. We found that the patterns do not change substantially beyond these time values even if the convergence criterion in Eq. (2.16) was not satisfied.
Resistance distance histograms. As pointed out in Remark 1, the resistance distance histograms (RDHs) for different species are typically redundant. We hence used only the first species' simulation results for computing the RDHs from simulations of the Gierer-Meinhardt model in Eq. (2.26) studied here.
For computing RDHs, we had to specify the edge weight parameter of (3.2), which penalises paths from high to low concentrations and vice versa. Choosing this parameter too small (corresponding to a large penalty) leads to a saturation effect of large resistance values between nodes at large distances, preventing the geometry of a pattern from being resolved on such larger scales. Similarly, too large a value fails to resolve the geometry on small scales. We empirically found the value 0.003 to be a good compromise.
For the parameter t in Definition 3.1, which determines the undersampling of the graph, we found t = 1 to give the most accurate results. Thus, all results presented in this study were produced using t = 1. Note that t = 1 means that the original graph was used without undersampling.
When simulating a model for varying parameters, we found that some patterns led to a few occurrences of very large resistance values, while the majority of patterns had maximal resistance values substantially below these outliers. We believe that these large values arise from numerical inaccuracies in the PDE solver. Rather than including all resistance values, which would cause most RDHs to contain only zeros in the bins for large values, we disregarded values beyond a certain threshold. To specify this threshold, we computed the 99% quantile of the resistance values for each pattern and picked the maximal value across all patterns.
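Under our reading of the procedure (per-pattern quantiles whose maximum becomes the global cut-off), the thresholding step amounts to:

```python
import numpy as np

def resistance_threshold(resistances_per_pattern, q=0.99):
    """Global cut-off for resistance values: the q-quantile is computed for
    each pattern's resistance values, and the maximum over all patterns is
    taken (a sketch of one interpretation of the described procedure)."""
    return max(np.quantile(r, q) for r in resistances_per_pattern)
```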
Finally, we set the bin number B introduced in Definition 3.1 to the value B = 12. We found empirically that smaller bin numbers give more accurate results for small data sets, while larger numbers perform better for larger data sets. B = 12 appears to be a good tradeoff between these two regimes.
Additional features. In Section 3.2 we discussed the maximal concentration as an additional feature. Due to numerical inaccuracies when simulating a system, a few pixels might have an artificially large concentration. We aim here to disregard such values and instead estimate the concentration value of the highest plateau in a given pattern. To this end, we collected the concentration values of the pattern into a histogram of 25 bins and defined the maximal concentration as the location of the peak with the highest concentration value.
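One way to implement this robust estimate is sketched below; the exact peak definition is not specified above, so the local-maximum criterion and the significance threshold `min_frac` are our assumptions for illustration:

```python
import numpy as np

def max_concentration(pattern, bins=25, min_frac=0.05):
    """Estimate the highest concentration plateau: histogram the pattern
    values and return the centre of the right-most local peak, ignoring
    sparsely populated bins (e.g. single-pixel numerical outliers)."""
    values = np.asarray(pattern, dtype=float).ravel()
    counts, edges = np.histogram(values, bins=bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([-1], counts, [-1]))  # lets boundary bins peak
    is_peak = (counts >= padded[:-2]) & (counts >= padded[2:])
    significant = counts >= min_frac * counts.max()
    return float(centres[np.flatnonzero(is_peak & significant)[-1]])
```

A pattern with a plateau near 1.0 and a single outlier pixel at 10.0 is then assigned a maximal concentration near 1.0 rather than 10.0.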
Data splitting. Consider the learning problem described in Section 4.1.1: given a data set D m = {(x i , y i )} i∈[m] ⊂ X × Y with RDHs x i and vectors y i comprising parameter values of the model, we aimed to learn a prediction function f : X → Y. In practice, we did not use the whole data set D m for training, but split it into mutually disjoint training, test and validation sets, which respectively comprised 60%, 20% and 20% of D m . The training set was used to train a model for given hyperparameters, while the validation set was in turn used to optimise the hyperparameters. The NRMSE of the trained model was subsequently computed on the test set.
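The 60/20/20 split can be sketched index-based as follows (the helper name and random seed are our own):

```python
import numpy as np

def split_indices(m, seed=0):
    """Mutually disjoint 60%/20%/20% train/validation/test index split."""
    idx = np.random.default_rng(seed).permutation(m)
    n_tr, n_va = int(0.6 * m), int(0.2 * m)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```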
For small data sets D m with m ≤ 500, we observed large variations in the resulting NRMSE values. To obtain more robust estimates, we split a total data set of 1000 points into subsets D m of size m for m ≤ 500, performed training and computed the NRMSE value for each D m as described above, and took the average over these NRMSE values. For example, if m = 100, then we averaged over 1000/100 = 10 data sets. For data sets of size m > 500, the procedure above was performed on the single data set without averaging.
For convolutional neural networks trained on raw pattern data, we found the results to be substantially more noisy than for the other learning methods trained on RDHs. Accordingly, we here averaged the NRMSE values over more training sets: for m ≤ 1000 points, we split a total set of 5000 points into data sets D m . For m = 2000, 5000 and 10000 we used total data sets of 20000 points. Finally, for m = 20000 we trained the models three times with random initialization on the same data set and subsequently averaged.
Target variable preprocessing. We normalised each component of the target variable by its maximal value over the whole data set, where d is the dimension of the target variable corresponding to the number of parameters to be learned, and n is the number of data points. We use these normalised target variables both for the regression task in Section 5.3 and for clustering in Section 5.4.
Support-Vector Regression. The training procedure for learning a single parameter, i.e. a scalar-valued target variable, using support-vector regression (SVR) is described in Section 4.2.1. We choose the hyperparameters γ (c.f. Eq. (4.30) for the exponential χ 2 -kernel and Eq. (4.34) for the Wasserstein kernel) and λ (c.f. Eq. (4.11a)) by minimising the NRMSE on the validation set on a grid in the two parameters (note that the χ 2 -kernel of Eq. (4.29) does not contain any hyperparameter). In some cases, we performed a second optimisation over a finer grid centered around the optimal parameters from the first run. We found this to lead to only minor improvements, however. The model was then evaluated on the test set for the optimal parameters and the resulting NRMSE value is reported.
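A sketch of the hyperparameter grid search: random stand-in data replace the RDHs, and scikit-learn's exponential χ2 kernel and C parametrisation of the regularisation are used in place of the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
# Toy train/validation data standing in for RDHs and one scalar parameter.
X_tr = rng.random((60, 12)); y_tr = X_tr.mean(axis=1)
X_va = rng.random((20, 12)); y_va = X_va.mean(axis=1)

def nrmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2)) / (y.max() - y.min())

# Grid search over the kernel scale gamma and the regularisation strength,
# keeping the pair that minimises the validation NRMSE.
best = (np.inf, None)
for gamma in [0.1, 1.0, 10.0]:
    K_tr = chi2_kernel(X_tr, X_tr, gamma=gamma)
    K_va = chi2_kernel(X_va, X_tr, gamma=gamma)
    for C in [0.1, 1.0, 10.0]:
        model = SVR(kernel="precomputed", epsilon=0.01, C=C).fit(K_tr, y_tr)
        err = nrmse(y_va, model.predict(K_va))
        if err < best[0]:
            best = (err, (gamma, C))
```

A second, finer grid centred on `best[1]` would then refine the optimum, as described above.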
For learning multiple parameters we applied the SVR approach separately to each target parameter and subsequently computed the joint NRMSE value.
Operator-Valued Kernels.For learning multiple parameters jointly, i.e. a vector-valued target variable, we used the operator-valued kernel method described in Section 4.2.2.In addition to the input kernel parameter γ and regression parameter λ used for support-vector regression, we here also had to optimise the scale parameter of the output kernel (c.f. the discussion after Eq. (4.26e)).Optimisation of these hyperparameters was performed on a grid as in the SVR case, but this time jointly for all target parameters.
Feed-forward neural networks. We employed feed-forward neural networks both for learning a single parameter and for learning multiple parameters jointly from RDHs. We used Mathematica's built-in NetTrain function with the Adam optimization algorithm and the mean-squared loss function for training [Res21]. The network architecture that gives the minimal loss on the validation set along the training trajectory was selected and evaluated on the test set to obtain the NRMSE value. Training was performed for T f training steps, with early stopping if the error on the validation set does not improve for more than T e steps. For the data set sizes 20 and 50 we used (T f , T e ) = (4 × 10 5 , 10 5 ), for data set sizes 100, 1000 and 2000 we used (T f , T e ) = (2 × 10 5 , 5 × 10 4 ), and for data set sizes ≥ 5000 we used (T f , T e ) = (10 5 , 2 × 10 4 ).
Convolutional neural networks. Convolutional neural networks were trained on the raw simulation data of the first species, i.e. not on RDHs. We used TensorFlow [A + 15] and Keras [C + 15] for this purpose. We employed the same training procedure as for feed-forward neural networks. For the number of training steps we used (T f , T e ) = (500, 100).

Robustness of Resistance Distance Histograms
Ideally, the RDHs should be characteristic of patterns of different types, while being robust to noise in the patterns due to noise in the initial conditions, and being invariant under immaterial spatial pattern transformations (translation, rotation).In other words, patterns arising from simulations of the same model for different initial conditions should give rise to RDHs that do not differ substantially, while patterns generated by different models should lead to larger deviations.
To assess these properties, we simulated the Gierer-Meinhardt model introduced in Section 2.4 for eight different values of c with the other parameters fixed. For each value of c, we simulated the model for five random initial conditions. We subsequently embedded the corresponding RDHs into two dimensions. Figure 5.1 shows the results for differing domain sizes. We observe that the points are reasonably well clustered for a domain size of 32 × 32 pixels, with a substantial improvement when increasing the domain size to 64 × 64 pixels. This is to be expected, since a larger snapshot of a pattern should reduce the noise in the corresponding RDHs. The clustering appears to improve slightly upon a further increase of the domain size to 128 × 128 pixels.
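The embedding step can be mimicked as follows. The histograms are toy stand-ins for RDHs of two parameter groups, and metric MDS on pairwise Wasserstein distances is one natural choice, since this excerpt does not fix the embedding method:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
bins = np.arange(12)

def noisy_hist(mu):
    """Toy normalised histogram peaked at bin mu, with additive noise."""
    h = np.exp(-0.5 * (bins - mu) ** 2) + 0.05 * rng.random(12)
    return h / h.sum()

# Two groups of five noisy 'RDHs', mimicking two parameter values.
hists = [noisy_hist(3) for _ in range(5)] + [noisy_hist(8) for _ in range(5)]

# Pairwise 1-D Wasserstein distances via cumulative distributions.
def w1(x, y):
    return np.abs(np.cumsum(x) - np.cumsum(y)).sum()

D = np.array([[w1(a, b) for b in hists] for a in hists])

# Metric MDS embedding of the precomputed distance matrix into 2-D.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(D)
```

In the embedding, the five histograms of each group land close together while the two groups stay well separated, which is the behaviour observed for the RDHs in Figure 5.1.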
The fact that the different noisy realisations of the patterns separate well indicates that the RDHs average out this noise while encoding the characteristic features of the patterns to a large degree.
In view of these results, we only consider domain sizes of 64 × 64 and 128 × 128 in the following.

Learning Parameters of the Gierer-Meinhardt Model
In this section we consider the prediction problem of learning a map from RDHs (potentially combined with the additional features described in Section 3.2), generated from simulations of the Gierer-Meinhardt model in Eq. (2.26) as described in Section 2.3, onto the corresponding kinetic parameters. The training data thus consist of pairs (x i , y i ) with x i being an RDH and y i being a set of kinetic parameters of the model. As outlined in Section 5.1, we split the data into a training, validation and test set, where the former two are used to train the models and learn hyperparameters, and the latter is used to evaluate the model's error in terms of the normalised root-mean-squared error (NRMSE) (c.f. Section 4.1.2). Figure 5.2 visualises how patterns vary for varying parameters corresponding to different NRMSE values. We observe that the patterns look reasonably similar for NRMSE values of 0.2, while they are hardly distinguishable anymore for values below 0.05.

Learning a Single Parameter
We start by varying the parameter c and fixing the other parameters to a = 0.02, b = 1 and δ = 100 (c.f. Eq. (2.26)). We randomly sample 2 × 10 4 values for c on the interval [0, 1.15], solve the corresponding PDE in Eq. (2.26) and compute the resulting RDHs as described in Section 3.1. Several different types of patterns emerge in this range of c values, as can be seen in Figure 5.1(b).
Support-vector regression. As one may expect, the NRMSE values are substantially larger here than in the scalar case of learning just one parameter for the same number of data points (c.f. Figure 5.4) and once again decrease for increasing data set sizes. We find that all three methods trained on RDHs perform very similarly. This implies, in particular, that no correlation among the output parameter values could be exploited for prediction. In contrast, including the maximal concentration c m as an additional feature leads to substantially improved NRMSE values, with an improvement of up to 35% for large data sets. We found the same optimal FFNN architectures both when training on the RDHs only and on the RDHs together with the maximal concentration c m .
Neural networks. Figure 5.4 shows the regression results for learning the parameter c using feed-forward neural networks (FFNN) for data sizes of up to 2 × 10 4 points. As one may expect, the FFNNs perform worse than support-vector regression for small data sets and better for larger data sets. Using FFNNs, it is feasible to use data sets beyond the maximum of 5000 points used for support-vector regression. However, we find that the NRMSE value appears not to improve any further beyond about 2000 points. This may be expected, since in the computation of the RDHs some information about the patterns is inevitably lost, meaning there is a lower bound on how accurately the parameter can be learned in the limit of an infinitely large data set. We point out, however, that this saturation effect happens at NRMSE values ≤ 0.03, which is very accurate (cf. Figure 5.2).
Additional features. In Section 3.2 we introduced two additional features to account for certain symmetries of the RDHs, namely the maximal concentration c m and the number of connected components n c of a pattern. Figure 5.4 shows the results obtained by training FFNNs on RDHs with c m as an additional feature. We find slightly larger NRMSE values for small data sets with less than 500 points, and slightly smaller NRMSE values for larger data sets. In contrast, using the number of connected components n c as an additional feature did not give rise to notably more accurate results (results not shown).
Benchmark: CNN on raw data. Figure 5.4 also shows the NRMSE values obtained from training convolutional neural networks (CNNs) directly on the raw pattern data obtained through simulation (c.f. Section 4.3). For data set sizes of ≤ 200 data points, we found substantially larger NRMSE values than from FFNNs trained on RDHs. This shows once again that the RDHs efficiently encode most of the relevant information while averaging out noise, allowing for more accurate parameter learning for data sets of small and medium size. As one might expect, the difference between the CNN and FFNN results becomes smaller for larger data set sizes, since the CNNs can effectively average out the noise themselves when a sufficiently large number of data points is provided. Consequently, we found that CNNs become more accurate than the other methods for large data sets of about 10 4 − 2 × 10 4 data points.

Jointly Predicting Four Parameters
We next consider the problem of learning all four kinetic parameters a, b, c and δ of the Gierer-Meinhardt model in Eq. (2.26). We vary a, b, c and δ uniformly on the intervals [0.01, 0.7], [0.4, 2], [0.02, 7] and [20, 200], respectively. The scaling parameter was set to s = 0.4 and assumed to be known. Parameter combinations for which the system does not possess a Turing instability, and hence does not produce a pattern in simulations, are disregarded. Figure 5.5 shows the NRMSE results for the separable SVR model, for the joint kernel-based model, and for feed-forward neural networks (FFNNs) trained on RDHs of radius r = 8 and a domain size of 64 × 64. We find that the three methods perform similarly, which means, in particular, that no correlation among the output parameters could be exploited for joint prediction in order to outperform separable parameter prediction (the SVR methods were trained only for data sets of up to 2 × 10 3 points). We further observe the NRMSE values to be substantially higher than in the scalar case (c.f. Figure 5.4) for the same number of data points, which is to be expected when learning more parameters. The NRMSE value again successively decreases for increasing data set sizes, with a minimal value of about 0.26 for FFNNs and 10 4 data points, which is substantially larger than the minimal value of 0.033 obtained in the scalar case for the same data set size and radius (c.f. Figure 5.4). However, here the curve does not appear to have levelled off yet at 10 4 data points, unlike in the scalar case, and increasing the data set size further should further improve accuracy.
We performed the same experiment as shown in Figure 5.5 but for RDHs of radius r = 32 and found similar results (results not shown).
Combining RDHs of different radii. In Section 3 we argued that the RDH radius r determines the scale at which the local structure of a Turing pattern is resolved. We used the two radii r = 8 and r = 32 in the results presented so far in Figures 5.3–5.5 and found similar results for the two. However, since RDHs with radius r = 8 should more accurately capture characteristics of patterns on small scales and r = 32 should be able to capture larger-scale characteristics, one might expect that combining the two should provide more information than each of them individually and might therefore give rise to more accurate results. We trained feed-forward neural networks on the RDHs of the two radii taken together as features, for the same setting as in Figure 5.5, but did not obtain notably more accurate results (results not shown).
Additional features. While we found in Section 5.3.1 that using the maximal concentration c m as an additional feature only slightly improved NRMSE values, and only for large data sets, when learning a single parameter (c.f. Figure 5.4), we here find a substantial improvement for all data set sizes, with improvements of up to 35% for large data sets, as can be seen in Figure 5.5. As in the scalar case, we find that using the number of connected components n c does not improve results (results not shown).

Cluster of Patterns
We explored the geometry of patterns represented by resistance distance histograms (RDHs) and the squared Wasserstein distance (4.32). Parameter c in the Gierer-Meinhardt model in Eq. (2.26) was varied as described in Section 5.3.1, and 1000 patterns and corresponding RDHs were computed as detailed in Section 2.3. Next, we examined the neighborhood graph in which two patterns x, x′ are connected by an edge if the corresponding squared Wasserstein distance satisfies d W (x, x′) 2 ≤ 0.05. As a result, about 84% of all patterns were contained in one of the six clusters corresponding to the connected components with the largest number of patterns.
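The clustering-by-thresholding step amounts to computing connected components of the thresholded distance matrix. A toy sketch with six patterns in two groups; the matrix entries below are made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy squared Wasserstein distances between 6 patterns: two tight groups.
D = np.array([
    [0.00, 0.01, 0.02, 0.90, 0.80, 0.95],
    [0.01, 0.00, 0.03, 0.85, 0.90, 0.80],
    [0.02, 0.03, 0.00, 0.90, 0.95, 0.85],
    [0.90, 0.85, 0.90, 0.00, 0.04, 0.02],
    [0.80, 0.90, 0.95, 0.04, 0.00, 0.01],
    [0.95, 0.80, 0.85, 0.02, 0.01, 0.00],
])

# Neighbourhood graph: connect patterns whose squared distance is <= 0.05,
# then read off the clusters as connected components of the graph.
adj = csr_matrix(D <= 0.05)
n_clusters, labels = connected_components(adj, directed=False)
```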
Figure 5.6 depicts 10 sample patterns taken from each cluster. This result shows that RDHs according to Definition 3.1, together with an appropriate distance function, are suited for clustering patterns into qualitatively different categories.

Conclusion
We introduced a novel learning-based approach to Turing pattern parameter prediction. A key difference to existing work is that any single observed pattern is directly mapped to a predicted model parameter value. Major ingredients of the method are (i) the (almost) invariant representation of Turing patterns by histograms of resistance distances computed within patterns, and (ii) a kernel-based pattern similarity measure based on the Wasserstein distance that properly takes into account minor but unavoidable binning effects.
We compared classical reproducing kernel Hilbert space methods using basic kernels for single parameter prediction and operator-valued kernels for jointly predicting all parameters. These methods performed best for small and medium-sized training data sets. In addition, we evaluated various feed-forward neural network architectures for prediction. For single parameter prediction, these methods performed best for larger training data sets, but were only on a par with kernel-based methods in the case of joint parameter prediction.
Finally, we applied convolutional neural networks to raw pattern data directly. We found that, for very large training data sets with ≥ 2 × 10 4 data samples, they outperformed all other methods. However, it remains unexplained what internal pattern representations they use.
Overall, we observed excellent prediction of single parameters even for small data sets with ≤ 1000 data samples, and fairly accurate joint prediction of all parameters for large data sets. Our results indicate that the latter predictions should further improve when even larger data sets can be used for training. We leave such experiments for future work.
We suggest focusing on two aspects in future work. In this paper, the Gierer-Meinhardt model was chosen in order to conduct a representative case study. In practical applications, selecting the model among various candidates, besides estimating its parameters, might also be desirable. Furthermore, our current approach cannot quantify the uncertainty of parameter prediction and in this respect falls short of statistical approaches like [CFVM19, KH20]. On the other hand, our approach can be applied to single patterns, rather than to ensembles of patterns as in [KH20], and further input data, like the initial conditions used in [CFVM19], which are unknown in practice, are not required. Resolving these pros and cons defines an attractive program for future research.

Figure 2.2: Simulation results of species u_1 in the Gierer-Meinhardt model defined by Eq. (2.26) on a 128 × 128 grid with final time T = 5000 for varying parameters c and δ and fixed parameters a = 0.02, b = 1.0 and s = 0.5. We observe that different parameter combinations give rise to different types of patterns and differing wavelengths. For c = 1.2 with δ = 50 and δ = 100 we find a homogeneous solution and no pattern, which illustrates that the system does not exhibit a Turing instability for these parameter values.
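For readers who wish to reproduce qualitatively similar patterns, the following is a minimal sketch of an explicit-Euler integration of classical Gierer-Meinhardt kinetics with periodic boundary conditions. The exact form and parametrisation of Eq. (2.26), in particular the roles of c and s, are defined in the paper; the kinetics, grid spacing, and time step below are illustrative assumptions only.

```python
import numpy as np

def laplacian_periodic(U):
    """Five-point discrete Laplacian with periodic boundaries (grid spacing 1)."""
    return (np.roll(U, 1, 0) + np.roll(U, -1, 0)
            + np.roll(U, 1, 1) + np.roll(U, -1, 1) - 4 * U)

def simulate_gm(n=64, a=0.02, b=1.0, delta=100.0, T=200.0, dt=0.002, seed=0):
    """Explicit-Euler integration of classical Gierer-Meinhardt kinetics:
        u_t = a - b*u + u^2/v + lap(u),   v_t = u^2 - v + delta*lap(v).
    Illustrative stand-in for Eq. (2.26); dt must satisfy the explicit
    stability bound dt <= 1/(4*delta) for the diffusion term."""
    rng = np.random.default_rng(seed)
    # Small random perturbation of the homogeneous state triggers the instability.
    u = 1.0 + 0.01 * rng.standard_normal((n, n))
    v = 1.0 + 0.01 * rng.standard_normal((n, n))
    for _ in range(int(T / dt)):
        lu, lv = laplacian_periodic(u), laplacian_periodic(v)
        u = u + dt * (a - b * u + u**2 / v + lu)
        v = v + dt * (u**2 - v + delta * lv)
    return u, v
```

Longer final times (the paper uses T = 5000) let the pattern settle into its steady state; random initial conditions produce different arrangements of the same pattern type.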

Figure 3.1: Each row of the figure shows, from left to right: a pattern, a resistance distance plot, and resistance distance histograms (RDHs) for radii 8 and 32 of the first species, obtained from simulating the Gierer-Meinhardt model in Eq. (2.26) on a 128 × 128 grid. Rows correspond to different values of parameter c. Weights are assigned to the edges of the corresponding grid graph according to Eq. (3.2). The resistance plots visualise resistance distances between all nodes and one central node marked in red. These plots result from partitioning the column of the resistance matrix R (3.7) corresponding to the central node into a 128 × 128 array. The darker the colour of a node, the larger its resistance towards the central node. One can observe how the resistance values vary depending on the local structure of the pattern. In particular, the resistance distance histograms (RDHs) (cf. Definition 3.1) differ substantially for the different types of patterns. The parameters are set to a = 0.02, b = 1, δ = 100 and s = 0.8 for all three rows, and the parameter c is set to 0.72, 0.47 and 0.11 for the three rows, respectively. The final simulation time is T = 5000. The RDHs are computed with B = 12 bins and a hypergraph spacing of t = 1.

Figure 3.2: The patterns depicted in Figure 3.1 are shown again, here however for both species. (a), (b) and (c) correspond to the three rows of Figure 3.1. It is apparent that the patterns of the two species are qualitatively very similar: basically, they are just rescaled and shifted versions of each other. Since resistance distance histograms (RDHs) are invariant under such operations, the resulting RDHs would be approximately equal. It is hence sufficient to use the patterns of only one species for computing RDHs and for subsequent analysis.
Figure 5.1: (a) Dimensionality reduction of simulation results of the Gierer-Meinhardt model defined in Eq. (2.26) for eight equally spaced values of parameter c on the interval [0.01, 1.15], indicated by different symbols, with five different random initial conditions each. The other parameters are fixed to a = 0.02, b = 1, δ = 100. The resulting RDHs are embedded into the two-dimensional plane using latent semantic analysis [BDO95]. Results are shown for domain sizes 32 × 32, 64 × 64 and 128 × 128 and radii 8 and 32. While points corresponding to different patterns do not separate into distinct clusters for a domain size of 32 × 32, they do so for a domain size of 64 × 64; for a domain size of 128 × 128, this separation is even more pronounced. This demonstrates that resistance distance histograms successfully encode characteristic features of patterns while averaging out noise, provided the domain size is chosen large enough. (b) Patterns for the eight different c values shown in (a), for one initial condition each.
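The embedding step amounts to a truncated singular value decomposition of the matrix whose rows are the RDHs, which is the linear-algebra core of latent semantic analysis [BDO95]. A minimal sketch:

```python
import numpy as np

def lsa_embed(H, dim=2):
    """Embed the rows of H (one RDH per row) into `dim` dimensions via
    truncated SVD, i.e. the projection used by latent semantic analysis.
    The preprocessing in the paper's experiments may differ."""
    U, S, Vt = np.linalg.svd(np.asarray(H, dtype=float), full_matrices=False)
    return U[:, :dim] * S[:dim]   # coordinates along the top singular directions
```

Plotting the two returned coordinates per pattern, with one symbol per c value, reproduces the kind of scatter plot shown in panel (a).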

Figure 5.2: Pattern accuracy. Simulation results of the Gierer-Meinhardt model in Eq. (2.26) for fixed parameters a = 0.02, b = 1 and δ = 100 and varying values of parameter c on a 64 × 64 domain. The c values are varied around a central value such that they correspond to a certain NRMSE value, and different rows correspond to different NRMSE values. We find that for an NRMSE value of 0.4 the patterns deviate quite substantially from each other, while they already look relatively similar for a value of 0.2. Decreasing the NRMSE value further successively decreases the deviations between the patterns. For NRMSE values smaller than 0.05, different patterns are hardly distinguishable anymore.

Figure 5.3: NRMSE values for support-vector regression of parameter c of the Gierer-Meinhardt model in Eq. (2.26). We vary c uniformly on the interval [0, 1.15], fix the other kinetic parameters to a = 0.02, b = 1 and δ = 100, and set the scaling parameter to s = 0.25. The figures show the NRMSE values for varying data set sizes and for the two RDH radii 8 and 32 (cf. Definition 3.1). (a) Results for both the exponential χ²- and the Wasserstein kernel (cf. Eqs. (4.29) and (4.34), respectively) and a domain size of 64 × 64. We find that even for small data sets of only 20 data points a reasonably good NRMSE value of about 0.2 is achieved (cf. Figure 5.2). This value successively decreases down to about 0.05 for increasing data set sizes. While the results for the two different RDH radii do not vary substantially, the Wasserstein kernel outperforms the χ² kernel for small data sets. (b) Results for the Wasserstein kernel in Eq. (4.34) and the two domain sizes 64 × 64 and 128 × 128. We observe about 10-50% better results for the 128 × 128 domain. These results show that about 1000 data points suffice to reach the NRMSE value 0.05, which is quite accurate (cf. Figure 5.2).
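A Wasserstein kernel on histograms can be sketched as follows for the one-dimensional case: the 1-Wasserstein distance between two normalised histograms on a common bin grid equals the L1 distance between their cumulative sums, and an exponential kernel is built on top of it. The exponential form and bandwidth below are illustrative assumptions; the paper's Eq. (4.34) fixes the exact kernel.

```python
import numpy as np

def wasserstein_1d(h1, h2):
    """1-Wasserstein distance between two normalised histograms on the same
    bin grid (unit bin width): the L1 distance between their CDFs."""
    return float(np.abs(np.cumsum(h1) - np.cumsum(h2)).sum())

def wasserstein_gram(H, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * W1(h_i, h_j)) between the rows of H.
    Such a precomputed Gram matrix can be passed to a kernel SVR solver."""
    C = np.cumsum(H, axis=1)
    D = np.abs(C[:, None, :] - C[None, :, :]).sum(axis=-1)  # pairwise W1 distances
    return np.exp(-gamma * D)
```

Each entry of the Gram matrix compares two RDHs; feeding this matrix to a support-vector regressor with a precomputed kernel yields the predictor evaluated in the figure.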

Figure 5.5: (a) NRMSE values for learning all four kinetic parameters a, b, c and δ of the Gierer-Meinhardt model in Eq. (2.26). The figures show the NRMSE values for varying data set sizes, for the RDH radius 8 (cf. Definition 3.1) and a domain size of 64 × 64, for both cases of kernel-based learning of the four parameters individually and jointly, as outlined in Sections 4.2.1 and 4.2.2, respectively, using the Wasserstein kernel (cf. Eq. (4.34)). In addition, the NRMSE values of feedforward neural networks (FFNNs) (cf. Section 4.3) are shown, once trained only on RDHs and once trained using the maximal concentration c_m as an additional feature (cf. Section 3.2). As one may expect, the NRMSE values are substantially larger here than in the scalar case of learning just one parameter for the same number of data points (cf. Figure 5.4), and once again decrease for increasing data set sizes. We find that all three methods trained on RDHs perform very similarly. This implies, in particular, that no correlation among the output parameter values could be exploited for prediction. In contrast, including the maximal concentration c_m as an additional feature leads to substantially improved NRMSE values, with an improvement of up to 35% for large data sets. (b) Architectures of the FFNNs that gave the best performance and whose results are shown in (a). We found the same optimal architectures both for training the FFNNs on the RDHs only and on the RDHs together with the maximal concentration c_m.

Figure 5.6: Clusters of patterns obtained by varying parameter c as described in Section 5.3.1. Each row shows 10 sample patterns from a cluster that comprises a large number of patterns within a small radius, measured by the squared Wasserstein distance (4.32) between the corresponding resistance distance histograms (RDHs). This result demonstrates how RDHs and a corresponding distance function represent distinct types of patterns.
be the eigenvalues of J at u*. The equilibrium u* is asymptotically stable if and only if Re(λ_i) < 0, i ∈ [N] [SC16, Cor. 6.1.2]; that is, a region of attraction U(u*) containing u* exists such that u(0) ∈ U(u*) implies u(t) → u* as t → ∞.
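This stability criterion, and the diffusion-driven loss of stability it is contrasted with in the Turing mechanism, can be checked numerically from the Jacobian alone. The sketch below is illustrative: the wavenumber grid and the example matrices in the usage are assumptions, not values taken from the paper.

```python
import numpy as np

def is_linearly_stable(J):
    """Asymptotic stability of an equilibrium from its Jacobian J:
    all eigenvalues must have strictly negative real part."""
    return bool(np.max(np.linalg.eigvals(J).real) < 0)

def diffusion_driven_instability(J, D, k_max=10.0, num=200):
    """Turing test: the equilibrium is stable without diffusion, but the
    linearisation J - k^2 D of the reaction-diffusion system acquires an
    eigenvalue with positive real part for some wavenumber k > 0."""
    if not is_linearly_stable(J):
        return False
    ks = np.linspace(1e-3, k_max, num)
    growth = [np.max(np.linalg.eigvals(J - k**2 * D).real) for k in ks]
    return bool(max(growth) > 0)
```

For example, a stable activator-inhibitor Jacobian combined with a sufficiently large inhibitor diffusion ratio exhibits a band of unstable wavenumbers, while equal diffusivities leave the equilibrium stable.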
Additional normalisation by the empirical mean value of the nonnegative target variables yields the normalised root-mean-square error (NRMSE). We use the NRMSE to measure the accuracy of predicted parameter values in this work. Figure 5.2 (page 23) illustrates visually the variation of patterns for various NRMSE values. We consider model parameter predictions with accuracy values NRMSE ≤ 0.2 as "good" and predictions with accuracy values NRMSE ≤ 0.05 as "excellent".
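A minimal implementation of this accuracy measure, the root-mean-square error divided by the empirical mean of the targets:

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root-mean-square error normalised by the empirical mean of the
    nonnegative target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / y_true.mean())
```

Dividing by the mean makes the error scale-free, so accuracies for parameters of very different magnitudes (such as c and δ) remain comparable.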
(4.10) and substituting f = ∑_{i∈[n]} α_i k_{x_i} yields the training objective function (4.8) in the form