CONSTRUCTING COARSE-SCALE BIFURCATION DIAGRAMS FROM SPATIO-TEMPORAL OBSERVATIONS OF MICROSCOPIC SIMULATIONS: A PARSIMONIOUS MACHINE LEARNING APPROACH

We present a three-tier data-driven machine learning approach for solving the inverse problem in complex systems modelling from spatio-temporal data produced by microscopic simulators. In the first step

machine learning schemes was compared against the reference bifurcation diagram obtained from a finite-difference discretization of the FHN PDEs.

Methodology
The pipeline of our computational framework for constructing bifurcation diagrams from data produced by detailed microscopic simulations consists of three tasks: (a) the identification of a set of coarse-scale variables from fine-scale spatio-temporal data using manifold learning, in particular parsimonious Diffusion Maps with leave-one-out cross-validation (LOOCV); (b) based on the parsimonious coarse-grained set of variables, the reconstruction of the right-hand-side of the effective PDEs using machine learning; and (c) based on the machine learning models, the construction of the coarse-scale bifurcation diagrams of the emergent dynamics using the numerical bifurcation analysis toolkit.
The assumption here is that the emergent dynamics of the complex system under study, on a domain $\Omega \times [t_0, t_{end}] \subseteq \mathbb{R}^d \times \mathbb{R}$, can be modelled by a system of, say, $m$ (parabolic) PDEs of the form
$$\frac{\partial u^{(i)}(x,t)}{\partial t} = F^{(i)}\left(x, u(x,t), Du(x,t), D^2 u(x,t), \ldots, D^{\nu} u(x,t), \varepsilon\right), \qquad i = 1, 2, \ldots, m,$$
where $u(x,t) = [u^{(1)}(x,t), \ldots, u^{(m)}(x,t)]$, $F^{(i)}$, $i = 1, 2, \ldots, m$, is a non-linear operator, $D^{\nu} u(x,t)$ is the generic multi-index $\nu$-th order spatial derivative at time $t$, i.e.
$$D^{\nu} u(x,t) = \frac{\partial^{|\nu|} u(x,t)}{\partial x_1^{\nu_1} \partial x_2^{\nu_2} \cdots \partial x_d^{\nu_d}}, \qquad |\nu| = \nu_1 + \nu_2 + \cdots + \nu_d,$$
and $\varepsilon$ denotes the (bifurcation) parameters of the system. The boundary conditions read
$$B_l\left[u(x,t)\right] = 0, \qquad x \in \partial\Omega_l,$$
where $\{\partial\Omega_l\}$ denotes an $l$-partition of the boundary of $\Omega$, and the initial conditions are
$$u(x, t_0) = u_0(x).$$
The right-hand-side of the $i$-th PDE depends on, say, $\gamma^{(i)}$ variables and bifurcation parameters from the set of variables $\{x, u(x,t), Du(x,t), D^2 u(x,t), \ldots, D^{\nu} u(x,t), \varepsilon\}$.
Let us denote the set of these variables as $S^{(i)}$, with cardinality $|S^{(i)}| = \gamma^{(i)}$. Hence, at each spatial point $x_q$, $q = 1, 2, \ldots, M$, and time instant $t_s$, $s = 1, 2, \ldots, N$, the set of features for the $i$-th PDE can be described by a vector $z_q(t_s) \in \mathbb{R}^{\gamma^{(i)}}$.
Here, we assume that such macroscopic PDEs in principle exist, but they are not available in closed form. Instead, we assume that we have detailed observations from microscopic simulations, from which we can compute the time and spatial derivatives of all the observables at N points in time and M points in space using, e.g., finite differences. Thus, we aim to (a) identify the intrinsic dimension of the manifold on which the coarse-grained dynamics evolve, i.e. for each PDE identify $\gamma^{(i)}$, and the coordinates that define the low-dimensional manifold, i.e. the sets $S^{(i)}$, and, based on them, (b) identify the right-hand-side (RHS) of the effective PDEs using machine learning.
To demonstrate the proposed approach, we have chosen to produce data from D1Q3 Lattice Boltzmann (LB) simulations of the coupled FitzHugh-Nagumo PDEs of activation-inhibition dynamics. Using the LB simulator, we produced data in time and space from different initial conditions and values of the bifurcation parameter. For the identification of an appropriate set of coarse-scale variables that define the low-dimensional manifold on which the emergent dynamics evolve, we performed feature selection using parsimonious Diffusion Maps [30,31]. Then, we trained the machine learning schemes to learn the right-hand-side of the coarse-grained PDEs on the low-dimensional manifold. Based on the constructed models, we performed numerical bifurcation analysis, employing the pseudo-arc-length continuation method. The performance of the proposed data-driven scheme for constructing the coarse-grained bifurcation diagram was validated against the one computed with the PDEs using finite differences. A schematic overview of the proposed framework for the case of two effective PDEs (as in the problem of the FitzHugh-Nagumo activation-inhibition dynamics) is shown in Figure 1.
In what follows, we first describe the parsimonious Diffusion Maps algorithm for feature selection. Then, we present the machine learning schemes used for identifying the right-hand-side of the effective PDEs from the microscopic simulations, and we show how one can couple the machine learning models with the pseudo-arc-length continuation method to construct the coarse-scale bifurcation diagrams. Finally, we present the numerical results and compare the performance of the proposed machine learning schemes.

Figure 1: Schematic of the three-stage workflow for constructing coarse-grained bifurcation diagrams from fine-scale observations, using the paradigm of two parabolic PDEs: 1) identify a set of parsimonious coarse-grained observables using Diffusion Maps from microscopic simulations (here, D1Q3 Lattice Boltzmann simulations) and compute their spatial and time derivatives using finite differences; 2) "learn" the right-hand-side of the PDEs using machine learning (here, shallow FNNs with 2 hidden layers and single-layer RPNNs); and 3) employ the numerical bifurcation analysis toolkit (here, the pseudo-arc-length continuation method) to construct the coarse-grained bifurcation diagrams.

Parsimonious Diffusion Maps
Diffusion Maps is a non-linear manifold learning algorithm [22,23,24] that identifies a low-dimensional representation $y_i \in \mathbb{R}^{\mu}$ of a point $z_i \in \mathbb{R}^{n}$, $i = 1, 2, \ldots, N$, in the high-dimensional space ($\mu \ll n$), using the diffusion distance among points as the preserved metric [24]. Diffusion Maps assume that the high-dimensional data are embedded in a smooth low-dimensional manifold. It can be shown that, in the limit of infinite data, the eigenvectors of the large normalized kernel matrices constructed from the data converge to the eigenfunctions of the Laplace-Beltrami operator on this manifold [22,24]. The Laplace-Beltrami operator is approximated by representing the weighted edges connecting nodes $i$ and $j$, commonly through a normalized diffusion kernel matrix $W$ with elements
$$w_{ij} = \exp\left(-\frac{\|z_i - z_j\|^2}{2\sigma^2}\right).$$
Then, one can define the $N \times N$ diffusion matrix $P$ by
$$P = D^{-1} W, \qquad D = \mathrm{diag}\left(\sum_{j=1}^{N} w_{ij}\right),$$
whose elements $p_{ij}$ correspond to the probability of jumping from one point to another in the high-dimensional space.
Raising the diffusion matrix $P$ to the power $t$ is essentially identical to observing $t$ steps forward of a Markov chain process $Z_t$ on the data points. Thus, the transition probability of moving from point $i$ to point $j$ in $t$ steps reads
$$p(Z_t = z_j \mid Z_0 = z_i) = p_{ij}^{t}.$$
On the weighted graph, the random walk can be defined as
$$p_{ij} = \frac{w_{ij}}{\deg(z_i)},$$
where $\deg(z_i)$ denotes the weighted degree of the point $i$, defined as
$$\deg(z_i) = \sum_{j=1}^{N} w_{ij}.$$
At the next step, it is easy to compute the graph Laplacian matrix $\tilde{P}$, defined as
$$\tilde{P} = D^{1/2} P D^{-1/2} = D^{-1/2} W D^{-1/2}.$$
The eigendecomposition of $\tilde{P}$ yields $\tilde{P} = U \Lambda U^{*}$, where $\Lambda$ is the diagonal matrix of eigenvalues and $U$ is the matrix whose columns are the eigenvectors of $\tilde{P}$. The eigenvalues of $P$ are the same as those of $\tilde{P}$, since $P$ is similar to the symmetric matrix $\tilde{P}$, while the left and right eigenvectors of $P$ (say $\Phi$ and $\Psi$) are related to those of $\tilde{P}$ as [32]
$$\Phi = U^{T} D^{1/2}, \qquad \Psi = D^{-1/2} U.$$
The embedding of the manifold in $\mu$ dimensions is realized by taking the first $\mu$ non-trivial/independent eigenvectors of $P$:
$$y_i = \left[\lambda_1^{t} \psi_{1,i},\ \lambda_2^{t} \psi_{2,i},\ \ldots,\ \lambda_{\mu}^{t} \psi_{\mu,i}\right],$$
where $t$ denotes the number of diffusion steps (usually $t = 1$) and $\lambda_1, \ldots, \lambda_{\mu}$ are the eigenvalues in descending order. For any pair of points $z_i$ and $z_j$, the diffusion distance at time step $t$ is defined as
$$D_t(z_i, z_j) = \left(\sum_{u=1}^{N} \frac{\left(p_{iu}^{t} - p_{ju}^{t}\right)^{2}}{\Phi_0(z_u)}\right)^{1/2},$$
where $\Phi_0$ denotes the stationary distribution of the random walk described by the diffusion matrix $P$ [33]:
$$\Phi_0(z_u) = \frac{\deg(z_u)}{\sum_{j=1}^{N} \deg(z_j)}.$$
In practice, the embedding dimension $\mu$ is determined by the spectral gap in the eigenvalue sequence of the final decomposition. Such a numerical gap implies that the first few eigenvalues are adequate for approximating the diffusion distance between all pairs of points [22,23]. Here, we retain only the $\mu$ parsimonious eigendirections of the final decomposition, as proposed in [30,31].
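To make the embedding computation concrete, the following is a minimal NumPy sketch of the steps above (kernel construction, symmetric normalization, eigendecomposition, embedding). The kernel width, the number of retained components, and all function names are illustrative assumptions; in practice, an optimized implementation such as the datafold package (used later in this work) is preferable.

```python
# A minimal sketch of the Diffusion Maps embedding described above.
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_maps(Z, sigma=10.0, n_components=3, t=1):
    """Z: (N, n) data matrix; returns the (N, n_components) embedding."""
    # Gaussian kernel matrix W
    W = np.exp(-cdist(Z, Z, "sqeuclidean") / (2.0 * sigma**2))
    deg = W.sum(axis=1)                           # weighted degrees deg(z_i)
    # Symmetric normalization: P~ = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    P_tilde = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Eigendecomposition of the symmetric matrix; sort eigenvalues descending
    lam, U = np.linalg.eigh(P_tilde)
    idx = np.argsort(lam)[::-1]
    lam, U = lam[idx], U[:, idx]
    # Right eigenvectors of P: Psi = D^{-1/2} U
    Psi = d_inv_sqrt[:, None] * U
    # Skip the trivial constant eigenvector (lambda_0 = 1)
    return (lam[1:n_components + 1] ** t) * Psi[:, 1:n_components + 1]
```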

Feature selection using Diffusion Maps with leave-one-out cross-validation (LOOCV)
Here, by identifying the coarse-scale spatio-temporal behaviour of a system of PDEs, we mean learning their right-hand-sides as black-box models. Hence, we first have to deal with the task of discovering a set of coarse-grained variables embedded in the high-dimensional input data space. For this task, one can employ various methods for feature selection, such as LASSO [34,35] and Random Forests [36,37]. In our framework, we used a method that extracts the dominant features based on manifold parametrization through output-informed Diffusion Maps [30]. The core assumption of this method is that, given a dataset in a high-dimensional space, we can parametrize it on a lower-dimensional manifold. For this purpose, given a set of Diffusion Maps eigenvectors $\phi_1, \phi_2, \ldots, \phi_{k-1} \in \mathbb{R}^{N}$, for each element $i = 1, 2, \ldots, N$ of $\phi_k$ we use a local linear regression model
$$\hat{\phi}_{k,i} = \alpha_{k,i} + \beta_{k,i}^{T} \Phi_{k-1,i}, \qquad \Phi_{k-1,i} = \left[\phi_{1,i}, \phi_{2,i}, \ldots, \phi_{k-1,i}\right]^{T},$$
to investigate whether $\phi_k$ is a dependent eigendirection; here $\alpha_{k,i} \in \mathbb{R}$ and $\beta_{k,i} \in \mathbb{R}^{k-1}$. The values of the parameters $\alpha_{k,i}$ and $\beta_{k,i}$ are found by solving an optimization problem of the form
$$\min_{\alpha_{k,i},\, \beta_{k,i}} \sum_{j \neq i} K\!\left(\Phi_{k-1,i}, \Phi_{k-1,j}\right)\left(\phi_{k,j} - \left(\alpha_{k,i} + \beta_{k,i}^{T} \Phi_{k-1,j}\right)\right)^{2},$$
where $K$ is a weighting kernel function, usually the Gaussian kernel
$$K\!\left(\Phi_{k-1,i}, \Phi_{k-1,j}\right) = \exp\left(-\frac{\left\|\Phi_{k-1,i} - \Phi_{k-1,j}\right\|^{2}}{\sigma^{2}}\right),$$
where $\sigma$ is the shape parameter. The final normalized leave-one-out cross-validation (LOOCV) error for this local linear fit is defined as
$$r_k = \sqrt{\frac{\sum_{i=1}^{N}\left(\phi_{k,i} - \left(\alpha_{k,i} + \beta_{k,i}^{T} \Phi_{k-1,i}\right)\right)^{2}}{\sum_{i=1}^{N} \phi_{k,i}^{2}}}.$$
For small values of $r_k$, $\phi_k$ is considered to be dependent on the other eigenvectors, and hence a harmonic or repeated eigendirection, while large values of $r_k$ suggest that $\phi_k$ can serve as a new independent eigendirection.
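A hedged sketch of the LOOCV residual $r_k$ described above follows: for each candidate eigenvector, a kernel-weighted local linear fit on the previous eigenvectors is solved pointwise with weighted least squares. All names are illustrative; this is not the exact code used for the experiments in this paper.

```python
# LOOCV residual r_k for detecting repeated/harmonic eigendirections [30].
import numpy as np

def loocv_residual(Phi_prev, phi_k, sigma):
    """Phi_prev: (N, k-1) previous eigenvectors; phi_k: (N,) candidate."""
    N = Phi_prev.shape[0]
    X = np.hstack([np.ones((N, 1)), Phi_prev])   # intercept alpha + slopes beta
    preds = np.empty(N)
    for i in range(N):
        d2 = np.sum((Phi_prev - Phi_prev[i]) ** 2, axis=1)
        w = np.exp(-d2 / sigma**2)               # Gaussian kernel weights
        w[i] = 0.0                               # leave point i out
        # Weighted least squares: (X^T W X) c = X^T W phi_k
        XtW = X.T * w
        coef = np.linalg.lstsq(XtW @ X, XtW @ phi_k, rcond=None)[0]
        preds[i] = X[i] @ coef
    # Small r_k -> phi_k is predictable from the previous eigenvectors
    return np.sqrt(np.sum((phi_k - preds) ** 2) / np.sum(phi_k**2))
```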
In our approach, we provide as inputs to the Diffusion Maps algorithm the combined input-output domain (the observables $z_i$ and their spatial and time derivatives). In principle, any subset that is capable of parametrizing the discovered embedding coordinates, chosen according to the methodology described above, can be considered as a new candidate input data domain for learning the right-hand-side of the effective PDEs. In practice, we find the subsets of variables of the input space that minimally parametrize the intrinsic embedding by quantifying the parametrization with a total regression loss $L_T$ based on the mean squared error:
$$L_T = \sum_{j=1}^{\mu} L_{\phi_j}.$$
Here, $L_{\phi_j}$ is defined as the regression loss for representing the intrinsic coordinate $\phi_j$ when using $s$ out of the $n$ selected input features:
$$L_{\phi_j} = \frac{1}{N}\sum_{i=1}^{N}\left(\phi_{j,i} - g\!\left(z_{i,1}, \ldots, z_{i,s}\right)\right)^{2},$$
where $g(\cdot)$ is the output of the regressor with inputs the values of the features in the ambient space and target values the eigenvectors $\phi_j$. Note that, in this procedure, we did not include the values of the bifurcation parameter in the dataset. We chose to employ the above method separately for every subset corresponding to the same value of the bifurcation parameter, and finally to select the subset of features with the minimum sum of total regression losses across all the embedding spaces.
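As an illustration of the subset search just described, the sketch below scores every candidate feature subset by the total regression loss $L_T$ over the embedding coordinates and keeps the minimizer. The choice of a k-nearest-neighbors regressor for $g(\cdot)$ is an assumption made for the sake of the example, not necessarily the regressor used in this work.

```python
# Exhaustive feature-subset scoring against the Diffusion Maps embedding.
from itertools import combinations
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def best_feature_subset(Z, Phi, subset_size):
    """Z: (N, n) candidate features; Phi: (N, mu) embedding coordinates."""
    best, best_loss = None, np.inf
    for idx in combinations(range(Z.shape[1]), subset_size):
        total = 0.0
        for j in range(Phi.shape[1]):    # total loss L_T = sum_j L_phi_j
            reg = KNeighborsRegressor(n_neighbors=10).fit(Z[:, idx], Phi[:, j])
            total += mean_squared_error(Phi[:, j], reg.predict(Z[:, idx]))
        if total < best_loss:
            best, best_loss = idx, total
    return best, best_loss
```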

Shallow Feedforward Neural Networks
It is well known that FNNs are universal approximators of any (piecewise) continuous (multivariate) function, to any desired accuracy [38,39,40,41,42]. This implies that any failure of a network must arise from an inadequate choice/calibration of weights and biases or an insufficient number of hidden nodes.
The output of a FNN with two hidden layers, with $H$ hidden units in each layer, that models the right-hand-side of the $i$-th PDE at each input point $z^{(i)}(x_q, t_s) \in \mathbb{R}^{\gamma^{(i)}}$, evaluated at each point in space $x_q$, $q = 1, \ldots, M$, and time $t_s$, $s = 1, 2, \ldots, N$, can be written as
$$\hat{F}^{(i)}\left(z^{(i)}(x_q, t_s)\right) = \omega^{o(i)}\, \psi\!\left(W^{2(i)}\, \psi\!\left(W^{1(i)} z^{(i)}(x_q, t_s) + b^{1(i)}\right) + b^{2(i)}\right) + b^{o(i)},$$
where $\psi(\cdot)$ is the activation function, applied component-wise (based on the above formulation, it is assumed to be the same for all nodes in the hidden layers), $\omega^{o(i)} \in \mathbb{R}^{1 \times H}$ are the external weights connecting the second hidden layer and the output, $b^{o(i)} \in \mathbb{R}$ is the bias of the output node, the matrix $W^{1(i)} \in \mathbb{R}^{H \times \gamma^{(i)}}$ with rows $\omega_l^{1(i)} \in \mathbb{R}^{\gamma^{(i)}}$ contains the weights connecting the input and the first hidden layer, $b^{1(i)} = (b_1^{1(i)}, \ldots, b_H^{1(i)})$ are the biases of the nodes of the first hidden layer, the matrix $W^{2(i)} \in \mathbb{R}^{H \times H}$ contains the weights $\omega_{lj}^{2(i)}$ connecting the $l$-th unit of the first hidden layer with the $j$-th unit of the second hidden layer, and $b^{2(i)} = (b_1^{2(i)}, \ldots, b_H^{2(i)})$ are the biases of the second hidden layer. In the same way, one can easily extend the above formula to more than two hidden layers. Then, a loss function for each one of the $m$ PDEs can be specified as
$$E^{(i)} = \sum_{q=1}^{M}\sum_{s=1}^{N}\left(u_t^{(i)}(x_q, t_s) - \hat{F}^{(i)}\left(z^{(i)}(x_q, t_s)\right)\right)^{2}.$$
The main task of a neural network is generalization. Foresee and Hagan [43] showed that adding the regularization term $E_{\omega} = \sum_{j} \omega_j^2$ (the sum of squares of the network weights) to the cost function maximizes the posterior probability based on Bayes' rule. Hence, the total cost function is
$$E = E^{(i)} + \lambda E_{\omega},$$
where $\lambda$ is the regularization parameter that has to be tuned. For our simulations, we used Bayesian-regularized back-propagation, updating the weight values with the Levenberg-Marquardt algorithm [44], as implemented in Matlab.
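For concreteness, a minimal NumPy sketch of the forward pass of the two-hidden-layer FNN and of the regularized cost described above is given below; the actual training was performed with Bayesian-regularized Levenberg-Marquardt in Matlab, which is not reproduced here, and all names are illustrative.

```python
# Forward pass and regularized sum-of-squares cost of a 2-hidden-layer FNN.
import numpy as np

def fnn_rhs(z, W1, b1, W2, b2, w_out, b_out, psi=np.tanh):
    """z: (gamma,) feature vector -> scalar approximation of the PDE RHS."""
    h1 = psi(W1 @ z + b1)          # first hidden layer, H units
    h2 = psi(W2 @ h1 + b2)         # second hidden layer, H units
    return w_out @ h2 + b_out      # linear output node

def regularized_cost(pred, target, weights, lam=0.01):
    """Data misfit E^(i) plus lambda * sum of squared weights E_omega."""
    e_data = np.sum((target - pred) ** 2)
    e_w = sum(np.sum(W ** 2) for W in weights)
    return e_data + lam * e_w
```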

Random Projection Neural Networks
Random Projection Neural Networks (RPNNs) are a family of neural networks including Random Vector Functional Links (RVFLs) [45,46], Reservoir Computing/Echo State Networks [47,48], Extreme Learning Machines [49] and Liquid State Machines [50]. The basic idea behind all these approaches, whose seed can be found in the pioneering work of Rosenblatt back in the '60s [51], is to use FNNs with fixed weights between the input and the hidden layer(s), fixed biases for the nodes of the hidden layer(s), and a linear output layer. With this configuration, the output is projected linearly onto the functional subspace spanned by the nonlinear basis functions of the hidden layer, and the only remaining unknowns are the weights between the hidden and the output layer. Their estimation is done by solving a regularized linear least-squares problem [52,53]. The universal approximation properties of RPNNs have been proved in a series of papers (see, e.g., [45,46,48,49]). In general, the universal approximation property of random projections can be rationalized by the celebrated Johnson and Lindenstrauss (JL) Theorem [54]:

Theorem 1 (Johnson and Lindenstrauss) Let $Z$ be a set of $N$ points in $\mathbb{R}^{n}$. Then, for any $0 < \delta < 1$ and $\mu \geq O\!\left(\frac{\ln N}{\delta^{2}}\right)$, there exists a map $G : \mathbb{R}^{n} \to \mathbb{R}^{\mu}$ such that, for all $z_i, z_j \in Z$,
$$(1 - \delta)\left\|z_i - z_j\right\|^{2} \leq \left\|G(z_i) - G(z_j)\right\|^{2} \leq (1 + \delta)\left\|z_i - z_j\right\|^{2}.$$

Note that, while the above theorem is deterministic, its proof relies on probabilistic techniques combined with Kirszbraun's theorem to yield a so-called extension mapping [54]. In particular, it can be shown that one of the many such embedding maps is simply a linear projection matrix $R$ with entries $r_{ij}$ that are i.i.d. random variables sampled from a normal distribution. In particular, the JL Theorem may be proved using the following lemma.
Lemma 2 Let $Z$ be a set of $N$ points in $\mathbb{R}^{n}$ and let $G(z)$ be the random projection defined by
$$G(z) = \frac{1}{\sqrt{\mu}} R z,$$
where $R = [r_{ij}] \in \mathbb{R}^{\mu \times n}$ has components that are i.i.d. random variables sampled from a normal distribution. Then, for $\mu \geq O\!\left(\frac{\ln N}{\delta^{2}}\right)$, $G$ satisfies the distance bounds of Theorem 1 with high probability. Similar proofs have been given for distributions different from the normal one (see, e.g., [55,56,57,58]).
The above is a feature mapping, which may result in a dimensionality reduction (µ < n) or, in analogy to the case of kernel-based manifold learning methods, a projection into a higher dimensional space (µ > n). We also note that while the above linear random projection is but one of the choices for constructing a JL embedding, it has been experimentally demonstrated and/or theoretically proven that appropriately constructed nonlinear random embeddings may outperform simple linear random projections. For example, in [59] it was shown that deep networks with random weights for each layer result in even better approximation accuracy than the simple linear random projection.
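Lemma 2 is easy to probe numerically: the snippet below (with arbitrary dimensions chosen purely for illustration) projects a point cloud with a Gaussian random matrix and reports the range of pairwise-distance distortions, which concentrates around 1 as predicted.

```python
# A quick numerical illustration of the JL random projection of Lemma 2.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
N, n, mu = 200, 1000, 100
Z = rng.standard_normal((N, n))       # point cloud in high dimension n
R = rng.standard_normal((mu, n))      # Gaussian random projection matrix
G = (Z @ R.T) / np.sqrt(mu)           # G(z) = R z / sqrt(mu), applied row-wise

ratios = pdist(G) / pdist(Z)          # distance distortion over all pairs
print(f"distortion in [{ratios.min():.3f}, {ratios.max():.3f}]")  # ~ 1 +- delta
```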
Here, for learning the right-hand-side of the set of PDEs, we used RPNNs (in the form of an Extreme Learning Machine [49]) with a single hidden layer [52]. For $H$ hidden units in the hidden layer, the output of the proposed RPNN can be written as
$$\hat{u}_t^{(i)}(x_q, t_s) = \hat{F}^{(i)}\left(z^{(i)}(x_q, t_s)\right) = \sum_{j=1}^{H} \omega_j^{o(i)}\, \psi\!\left(\omega_j^{(i)T} z^{(i)}(x_q, t_s) + b_j^{(i)}\right),$$
where $\omega_j^{(i)} \in \mathbb{R}^{\gamma^{(i)}}$ and $b_j^{(i)}$ are the weights connecting the input and the hidden layer and the biases of the hidden layer, respectively. In the proposed RPNN scheme, $\omega_j^{(i)}$ and $b_j^{(i)}$ are random variables drawn from appropriate uniform distributions, the output bias $b^{o}$ is set to zero, and the output weights $\omega^{o(i)} \in \mathbb{R}^{1 \times H}$ are determined by solving the linear least-squares problem
$$\omega^{o(i)} A^{(i)} = u_t^{(i)}, \tag{25}$$
where $u_t^{(i)} \in \mathbb{R}^{1 \times MN}$ collects the target values $u_t^{(i)}(x_q, t_s)$ of the RPNN for $q = 1, \ldots, M$ and $s = 1, \ldots, N$, and $A^{(i)} \in \mathbb{R}^{H \times MN}$ is the matrix whose elements $A_{j,k}$ are given by
$$A_{j,k} = \psi\!\left(\omega_j^{(i)T} z^{(i)}(x_q, t_s) + b_j^{(i)}\right), \qquad k = (s-1)M + q,$$
with $k$ running over all space-time points. Regarding the regression problem, since generally $H \ll MN$, the system in Eq. (25) is over-determined. As the resulting projection matrix $A$ is not guaranteed to be of full row rank, the solution can be computed with the Singular Value Decomposition (SVD). Given the SVD $A = U \Sigma V^{T}$, the pseudo-inverse $A^{+}$ is
$$A^{+} = V \Sigma^{+} U^{T},$$
where $U \in \mathbb{R}^{H \times H}$ and $V \in \mathbb{R}^{MN \times MN}$ are the unitary matrices of left and right singular vectors, respectively, and $\Sigma \in \mathbb{R}^{H \times MN}$ is the diagonal matrix of the $H$ singular values $\sigma_j$. Finally, in order to regularize the problem, we can keep just the $\tilde{H} < H$ singular values $\tilde{\sigma}$ that are greater than a given tolerance, i.e., $\tilde{\sigma} \in \{\sigma_j \mid \sigma_j > \text{tol},\ j = 1, \ldots, H\}$. Hence, the output weights $\omega^{o(i)}$ are computed as
$$\omega^{o(i)} = u_t^{(i)}\, \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^{T},$$
where $\tilde{U} \in \mathbb{R}^{H \times \tilde{H}}$, $\tilde{V} \in \mathbb{R}^{MN \times \tilde{H}}$ and $\tilde{\Sigma} \in \mathbb{R}^{\tilde{H} \times \tilde{H}}$ are restricted to the retained $\tilde{\sigma}$'s.
For the regression problem, we aim at learning the right-hand-side of the PDEs from spatio-temporal data with single-layer RPNNs with $H$ random basis functions
$$\psi_j^{(i)}\left(z^{(i)}\right) = \psi\!\left(\omega_j^{(i)T} z^{(i)} + b_j^{(i)}\right), \qquad j = 1, \ldots, H.$$
Then, the approximated function $\hat{F}^{(i)}$ is just a linear combination of the random basis functions $\psi_j^{(i)}$. For our computations, we selected as activation function the logistic sigmoid $\psi : y \in \mathbb{R} \to \psi(y) \in \mathbb{R}$, given by
$$\psi(y) = \frac{1}{1 + e^{-y}},$$
where $y$ is given by the linear combination $y = \omega_j^{(i)T} z^{(i)} + b_j^{(i)}$.
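Putting the pieces together, a hedged sketch of the single-hidden-layer RPNN fit follows: random internal weights and biases, a logistic-sigmoid basis matrix, and output weights from the truncated-SVD least-squares solution described above. Sampling ranges, the tolerance, and all names are illustrative assumptions; the manifold-informed sampling of the next subsection would replace the plain uniform sampling used here.

```python
# RPNN fit: fixed random hidden layer + truncated-SVD linear least squares.
import numpy as np

def fit_rpnn(Z, ut, H=200, tol=1e-8, rng=np.random.default_rng(0)):
    """Z: (MN, gamma) features; ut: (MN,) time-derivative targets."""
    gamma = Z.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(H, gamma))   # fixed internal weights
    b = rng.uniform(-1.0, 1.0, size=H)            # fixed biases
    A = 1.0 / (1.0 + np.exp(-(Z @ W.T + b)))      # (MN, H) basis matrix
    # Truncated-SVD pseudo-inverse: keep singular values above tol
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol
    w_out = Vt[keep].T @ ((U[:, keep].T @ ut) / s[keep])
    return W, b, w_out

def predict_rpnn(Z, W, b, w_out):
    return (1.0 / (1.0 + np.exp(-(Z @ W.T + b)))) @ w_out
```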

Random Sampling Procedure
For the construction of the appropriate set of random basis functions for the solution of the inverse problem (i.e. that of learning the effective PDEs from data), we suggest a random sampling procedure different from the one usually implemented in RPNNs, and in particular in Extreme Learning Machines [52,53,60,61,62,63], for the solution of the forward problem, i.e. the numerical solution of partial differential equations. Since in the inverse problem we aim at solving a high-dimensional over-determined system ($MN \gg H$), it is important to parsimoniously select the underlying basis functions $\psi_j^{(i)}$, i.e. to seek appropriate internal weights $W^{(i)}$ and biases $b^{(i)}$ that lead to non-trivial basis functions.
In general, the weights $\omega_j$ are uniformly randomly sampled from a subset of the input/feature space (e.g., a hypercube), in which case a high dimension $\gamma^{(i)}$ of the input/feature space leads to the curse of dimensionality: many basis functions ($H \propto 10^{\gamma^{(i)}}$) are then necessary to "explore" the input space adequately and obtain a good basis.
Hence, our goal is to construct weights $\omega_j$ that describe well the manifold $\mathcal{M}^{(i)}$ in which the data $z^{(i)}(x_q, t_s) \in \mathcal{M}^{(i)}$, $\forall q, \forall s$, are embedded. It is well known that the output of a neuron is given by a ridge function $f : \mathbb{R}^{n} \to \mathbb{R}$ such that $f(z_1, \ldots, z_n) = g(a^{T} z)$, where $g : \mathbb{R} \to \mathbb{R}$ and $a \in \mathbb{R}^{n}$. The inflection point of the logistic sigmoid is at $(y = 0, \psi(y) = 1/2)$. The points that satisfy the relation [52,53]
$$\omega_j^{(i)T} z + b_j^{(i)} = 0$$
form the hyperplane on which $\psi_j^{(i)}$ is constantly $1/2$. We sample the points (centers) $c_j$ from the inputs $z(x_q, t_s)$. Then, we construct the hidden weights as the difference of two randomly sampled input points,
$$\omega_j^{(i)} = z^{(i)}(x_{q_2}, t_{s_2}) - z^{(i)}(x_{q_1}, t_{s_1}),$$
in order to set the direction of $\omega_j$. By doing so, the ridge function is constant along any direction orthogonal to the connection between two points in the manifold $\mathcal{M}^{(i)}$ and changes in value along this line, so it is able to discriminate between the points lying along this direction. Thus, the biases are set as
$$b_j^{(i)} = -\omega_j^{(i)T} c_j,$$
which ensures that $c_j$ is a center of the ridge function.
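A compact sketch of this sampling procedure, under the construction stated above (directions as differences of randomly chosen data points, biases centering the sigmoid on a data point), might look as follows; all names are illustrative.

```python
# Manifold-informed sampling of the RPNN hidden layer.
import numpy as np

def sample_hidden_layer(Z, H, rng=np.random.default_rng(0)):
    """Z: (MN, gamma) input data on the manifold; returns W (H, gamma), b (H,)."""
    MN = Z.shape[0]
    i1 = rng.integers(0, MN, size=H)
    i2 = rng.integers(0, MN, size=H)
    W = Z[i2] - Z[i1]                  # directions along chords of the manifold
    centers = Z[rng.integers(0, MN, size=H)]
    b = -np.sum(W * centers, axis=1)   # b_j = -w_j^T c_j puts the inflection at c_j
    return W, b
```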

Coarse-grained numerical bifurcation analysis from spatio-temporal data
For assessing the performance of the proposed scheme, we selected the celebrated, well studied FitzHugh-Nagumo (FHN) model first introduced in [64] to simplify the Hodgkin-Huxley model into a two-dimensional system of ODEs to describe the dynamics of the voltage across a nerve cell. In particular, we consider the FHN equations which add a spatial diffusion term to describe the propagation of an action potential as a traveling wave. The bifurcation diagram of the one-dimensional set of PDEs is known to have a turning point and two supercritical Andronov-Hopf bifurcation points. In what follows, we describe the model along with the initial and boundary conditions and then we present the D1Q3 Lattice Boltzmann model.

The Macroscale model: the FitzHugh-Nagumo Partial Differential Equations
The evolution of the activation $u : [x_0, x_{end}] \times [t_0, t_{end}] \to \mathbb{R}$ and the inhibition $v : [x_0, x_{end}] \times [t_0, t_{end}] \to \mathbb{R}$ dynamics is described by the following two coupled nonlinear parabolic PDEs:
$$\frac{\partial u}{\partial t} = D^{u} \frac{\partial^{2} u}{\partial x^{2}} + u - u^{3} - v,$$
$$\frac{\partial v}{\partial t} = D^{v} \frac{\partial^{2} v}{\partial x^{2}} + \varepsilon\left(u - \alpha_1 v - \alpha_0\right),$$
with homogeneous von Neumann boundary conditions
$$\left.\frac{\partial u}{\partial x}\right|_{x = x_0,\, x_{end}} = 0, \qquad \left.\frac{\partial v}{\partial x}\right|_{x = x_0,\, x_{end}} = 0;$$
$\alpha_0$ and $\alpha_1$ are parameters, and $\varepsilon$ is the kinetic bifurcation parameter.
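For reference, a minimal method-of-lines sketch of this right-hand-side with homogeneous no-flux boundaries is given below; the parameter values are placeholders, not necessarily those used in our simulations.

```python
# Finite-difference RHS of the FHN PDEs on a 1D grid with no-flux boundaries.
import numpy as np

def fhn_rhs(u, v, dx, eps, a0=-0.03, a1=2.0, Du=1.0, Dv=4.0):
    """u, v: (M,) grid values; returns (ut, vt). Parameter values are placeholders."""
    def lap(w):
        # Second derivative; mirrored ghost points enforce dw/dx = 0 at the walls
        w_ext = np.concatenate(([w[1]], w, [w[-2]]))
        return (w_ext[2:] - 2.0 * w_ext[1:-1] + w_ext[:-2]) / dx**2
    ut = Du * lap(u) + u - u**3 - v
    vt = Dv * lap(v) + eps * (u - a1 * v - a0)
    return ut, vt
```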

The D1Q3 Lattice Boltzmann model
The Lattice Boltzmann model serves as our fine-scale simulator. The statistical description of the system at the mesoscopic level uses the concept of the distribution function $f(\vec{r}, \vec{c}, t)$, i.e. $f(\vec{r}, \vec{c}, t)\, d\vec{r}\, d\vec{c}\, dt$ is the infinitesimal probability of finding particles at location $\vec{r}$ with velocity $\vec{c}$ at a given time $t$, thereby reducing the otherwise huge number of equations and unknowns. At this level, a system without an external force is governed by the Boltzmann transport equation [66]:
$$\frac{\partial f}{\partial t} + \vec{c} \cdot \nabla_{\vec{r}} f = R(f), \tag{37}$$
where the term $R(f)$ describes the rate of collisions between particles. In 1954, Bhatnagar, Gross and Krook (BGK) [67] introduced an approximation model for the collision operator:
$$R(f) = \frac{1}{\tau}\left(f^{eq} - f\right), \tag{38}$$
where $\tau$ is the so-called relaxation time coefficient and $f^{eq}$ denotes the local equilibrium distribution function.
In the LBM, Eqs. (37)-(38) are collocated (assumed valid) along specific directions $c_i$ on a lattice:
$$\frac{\partial f_i}{\partial t} + \vec{c}_i \cdot \nabla_{\vec{r}} f_i = \frac{1}{\tau}\left(f_i^{eq} - f_i\right), \tag{39}$$
and then Eq. (39) is discretized with a time step $\Delta t$ as follows:
$$f_i(\vec{r} + \vec{c}_i \Delta t,\, t + \Delta t) = f_i(\vec{r}, t) + \frac{\Delta t}{\tau}\left(f_i^{eq}(\vec{r}, t) - f_i(\vec{r}, t)\right). \tag{40}$$
One common interpretation of Eq. (40) is to think of the distribution functions as fictitious particles that stream and collide along specified linkages of the lattice. Lattices are usually denoted by DnQm, where n is the spatial dimension of the problem and m refers to the number of connections of each node in the lattice. The nodes of the lattice coincide with the points of a spatial grid with spatial step $\Delta x$.
Here, in order to estimate the coarse-scale observables $u$ and $v$ of the FHN dynamics, we considered the D1Q3 implementation, i.e. we used the one-dimensional lattice with three velocities $c_i$: particles can stream to the right ($c_1 = \frac{\Delta x}{\Delta t}$), to the left ($c_{-1} = -\frac{\Delta x}{\Delta t}$), or stay still on the node ($c_0 = 0$). Also, we assume the coexistence of two different distribution functions, describing the distribution of the activator particles $f_i^u$ and the distribution of the inhibitor particles $f_i^v$, where the subscript $i$ refers to the associated direction. Therefore, one can picture that at each instant there are six fictitious particles on each node of the lattice: two resting on the node (with distributions $f_0^u$ and $f_0^v$), two moving to the left (with distributions $f_{-1}^u$ and $f_{-1}^v$) and two moving to the right (with distributions $f_1^u$ and $f_1^v$). The relation between the above distributions and the coarse-scale densities $u$ and $v$ is given by the zeroth moment (across the velocity directions) of the overall distribution function:
$$u(x, t) = \sum_{i=-1}^{1} f_i^u(x, t), \qquad v(x, t) = \sum_{i=-1}^{1} f_i^v(x, t). \tag{41}$$
The coexistence of multiple distributions makes it necessary to introduce weights $\omega_i$ for the connections in the lattice, which should satisfy the following properties: (a) normalization, $\omega_0 + \omega_1 + \omega_{-1} = 1$; (b) symmetry, $\omega_1 - \omega_{-1} = 0$; and (c) isotropy,
$$\sum_{i=-1}^{1} \omega_i c_i^2 = c_s^2,$$
where $c_s$ is the speed of sound in the lattice. Thus, the weights are equal to $\omega_{\pm 1} = 1/6$ for the moving particles and $\omega_0 = 4/6$ for the resting particles. The resulting speed of sound in the lattice is $c_s = \frac{\sqrt{3}\,\Delta x}{3\,\Delta t}$. As the BGK operator (38) suggests, one key step in applying the LBM to reaction-advection-diffusion PDEs is to determine the local equilibrium distribution function $f^{eq}$ associated with the given model. For particles with macroscopic density $\rho$ that move in a medium with macroscopic velocity $\vec{u}_m$, the Maxwell distribution is
$$f^{eq} = \frac{\rho}{(2\pi R T)^{d/2}} \exp\left(-\frac{(\vec{c} - \vec{u}_m)^2}{2RT}\right), \tag{42}$$
where $d$ is the spatial dimension of the problem, $T$ is the temperature and $R$ is the universal gas constant. The exponential in Eq. (42) can be expanded in a Taylor series, ignoring terms of order $O(u_m^3)$ and higher, thus obtaining
$$f^{eq} = \frac{\rho}{(2\pi R T)^{d/2}} \exp\left(-\frac{\vec{c}^{\,2}}{2RT}\right)\left(1 + \frac{\vec{c} \cdot \vec{u}_m}{RT} + \frac{(\vec{c} \cdot \vec{u}_m)^2}{2(RT)^2} - \frac{\vec{u}_m^{\,2}}{2RT}\right), \tag{43}$$
with $RT = c_s^2$, $c_s$ being the speed of sound. Now, since the FHN PDEs are purely diffusive, i.e. there are no advection terms, the medium is stationary ($\vec{u}_m = 0$) and the equilibrium distribution function, discretized along the lattice directions $c_i$, simplifies to
$$f_i^{l,eq}(x, t) = \omega_i\, l(x, t), \qquad l \in \{u, v\}. \tag{44}$$
Now, in the FHN model we also need to consider reaction terms $R_i^l$, so finally the time evolution of the microscopic simulator associated with the FHN model on a given D1Q3 lattice is
$$f_i^l(x + c_i \Delta t,\, t + \Delta t) = f_i^l(x, t) + \frac{\Delta t}{\tau_l}\left(f_i^{l,eq}(x, t) - f_i^l(x, t)\right) + \Delta t\, R_i^l(x, t), \tag{45}$$
where the superscript $l$ denotes the activator $u$ or the inhibitor $v$, and the reaction terms $R_i^l$ are directly derived from the FHN kinetics:
$$R_i^u = \omega_i\left(u - u^3 - v\right), \qquad R_i^v = \omega_i\, \varepsilon\left(u - \alpha_1 v - \alpha_0\right). \tag{46}$$
Finally, the relaxation coefficient $\frac{\Delta t}{\tau_l}$ is related to the macroscopic diffusion coefficient $D^l$ of the FHN model and in general depends on the speed of sound $c_s$ associated with the lattice [68]:
$$\frac{\Delta t}{\tau_l} = \frac{2}{1 + \dfrac{2 D^l}{c_s^2 \Delta t}}. \tag{47}$$
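A compact, hedged sketch of one collision-plus-streaming step of this D1Q3 scheme is given below; parameter values are placeholders, and the periodic wrap used for brevity would need to be replaced by bounce-back (or an equivalent rule) to realize the no-flux walls of the FHN problem.

```python
# One D1Q3 lattice Boltzmann step for the FHN model, following Eqs. (41)-(47).
import numpy as np

def lbm_fhn_step(fu, fv, dt, dx, eps, a0=-0.03, a1=2.0, Du=1.0, Dv=4.0):
    """fu, fv: (3, M) distributions for directions (-1, 0, +1); placeholders."""
    w = np.array([1/6, 4/6, 1/6])               # lattice weights for c = -1, 0, +1
    cs2 = dx**2 / (3 * dt**2)                   # lattice speed of sound squared
    u, v = fu.sum(axis=0), fv.sum(axis=0)       # zeroth moments, Eq. (41)
    om_u = 2.0 / (1.0 + 2.0 * Du / (cs2 * dt))  # relaxation coefficients, Eq. (47)
    om_v = 2.0 / (1.0 + 2.0 * Dv / (cs2 * dt))
    Ru, Rv = u - u**3 - v, eps * (u - a1 * v - a0)   # FHN kinetics, Eq. (46)
    for i, wi in enumerate(w):
        fu[i] += om_u * (wi * u - fu[i]) + dt * wi * Ru  # collide + react, Eq. (45)
        fv[i] += om_v * (wi * v - fv[i]) + dt * wi * Rv
    # Streaming: index 0 -> c = -1 (move left), index 2 -> c = +1 (move right)
    fu[0], fu[2] = np.roll(fu[0], -1), np.roll(fu[2], 1)
    fv[0], fv[2] = np.roll(fv[0], -1), np.roll(fv[2], 1)
    # NOTE: np.roll gives periodic wrap; no-flux walls would need bounce-back.
    return fu, fv
```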

Algorithm flow chart
Summarizing, the proposed three-tier algorithm for constructing bifurcation diagrams from data is provided in the form of a pseudo code in Algorithm 1. The first two steps are related to the identification of the effective coarse scale observables and the learning of the right-hand-side of the effective PDEs. The third step is where the pseudo-arc-length continuation method is applied for the tracing of the solution branch through saddle-node bifurcations.

Algorithm 1 Construct coarse-scale bifurcation diagrams from spatio-temporal data produced by microscopic (here, LB) simulations
Require: grid $N_\varepsilon$ of $\varepsilon$ values ⊲ set grid for the values of the bifurcation parameter ε
Require: x and t, the space and time grids
1. Use Diffusion Maps to identify a parsimonious set of coarse-scale observables from data (here produced by Lattice Boltzmann simulations)
for each value of ε in the grid do
    for each initial condition do
        Run the microscopic simulator and record u, v ⊲ Lattice Boltzmann simulator, see Eq. (41)
        Compute u_t, v_t, u_x, v_x, u_xx, v_xx ⊲ here, using central finite differences
        Compute the first µ Diffusion Maps (DM) eigenvectors ⊲ see the Parsimonious Diffusion Maps section
    end for
end for
z_u ← {q* : L_T^u(q*) = min(L_T^u)}
z_v ← {q* : L_T^v(q*) = min(L_T^v)} ⊲ extract the effective features
2. Based on the extracted set of coarse variables from Step 1, learn the right-hand-sides of the coarse-scale PDEs
Train the FNNs/RPNNs:
F̂_r^u ≡ û_{t,r} ← FNN/RPNN(z_u, ε)
F̂_r^v ≡ v̂_{t,r} ← FNN/RPNN(z_v, ε)
3. Wrap the numerical bifurcation analysis toolkit (here, the pseudo-arc-length continuation method) around the machine learning models to systematically study the emergent dynamics

Numerical Results

Numerical bifurcation analysis of the FHN PDEs
For comparison purposes, we first constructed the bifurcation diagram of the FHN PDEs using central finite differences. The discretization of the one-dimensional PDEs with second-order central finite differences on $M$ points in the interval $0 \leq x \leq 20$ leads to the following system of $2(M-2)$ non-linear algebraic equations for the interior points:
$$D^u \frac{u_{q+1} - 2u_q + u_{q-1}}{(\Delta x)^2} + u_q - u_q^3 - v_q = 0,$$
$$D^v \frac{v_{q+1} - 2v_q + v_{q-1}}{(\Delta x)^2} + \varepsilon\left(u_q - \alpha_1 v_q - \alpha_0\right) = 0, \qquad q = 2, \ldots, M-1.$$
At the boundaries, we imposed homogeneous von Neumann boundary conditions. The above set of $2(M-2)$ non-linear algebraic equations is solved iteratively using Newton's method. The non-null elements of the Jacobian matrix are given by:
$$\frac{\partial F_q^u}{\partial u_{q\pm1}} = \frac{D^u}{(\Delta x)^2}, \qquad \frac{\partial F_q^u}{\partial u_q} = -\frac{2D^u}{(\Delta x)^2} + 1 - 3u_q^2, \qquad \frac{\partial F_q^u}{\partial v_q} = -1,$$
$$\frac{\partial F_q^v}{\partial v_{q\pm1}} = \frac{D^v}{(\Delta x)^2}, \qquad \frac{\partial F_q^v}{\partial v_q} = -\frac{2D^v}{(\Delta x)^2} - \varepsilon\alpha_1, \qquad \frac{\partial F_q^v}{\partial u_q} = \varepsilon.$$
To trace the solution branch past the critical points, we used the pseudo-arc-length continuation method [69,70,71]. This involves the parametrization of $u(x)$, $v(x)$ and $\varepsilon$ by the arc-length $s$ on the solution branch. The solution is sought in terms of $\tilde{u}(x,s)$, $\tilde{v}(x,s)$ and $\tilde{\varepsilon}(s)$ in an iterative manner, by solving until convergence the augmented system consisting of the $2(M-2)$ discretized equations plus the pseudo-arc-length condition
$$N(\tilde{u}, \tilde{v}, \tilde{\varepsilon}) \equiv \left(\tilde{u} - \tilde{u}_{-1}\right)^{T} \frac{\tilde{u}_{-1} - \tilde{u}_{-2}}{ds} + \left(\tilde{v} - \tilde{v}_{-1}\right)^{T} \frac{\tilde{v}_{-1} - \tilde{v}_{-2}}{ds} + \left(\tilde{\varepsilon} - \tilde{\varepsilon}_{-1}\right) \frac{\tilde{\varepsilon}_{-1} - \tilde{\varepsilon}_{-2}}{ds} - ds = 0,$$
where $(\tilde{u}_{-2}, \tilde{v}_{-2})$ and $(\tilde{u}_{-1}, \tilde{v}_{-1})$ are two previously found consecutive solutions for $\tilde{\varepsilon}_{-2}$ and $\tilde{\varepsilon}_{-1}$, respectively, and $ds$ is the arc-length step for which a new solution around the previous solution $(\tilde{u}_{-2}, \tilde{v}_{-2}, \tilde{\varepsilon}_{-2})$ along the arc-length of the solution branch is sought. The corresponding reference bifurcation diagram is shown in Figure 2. In this range of values, there is an Andronov-Hopf bifurcation point at $\varepsilon \approx 0.018497$ and a fold point at $\varepsilon \approx 0.95874$.
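For illustration, a generic pseudo-arc-length step around a steady-state residual $F(y, \varepsilon)$ (with $y$ stacking the discretized $u$ and $v$) may be sketched as below: a secant predictor from the two previous solutions, followed by Newton iterations on the augmented system. The finite-difference Jacobian and all names are simplifying assumptions for the sake of the example.

```python
# One pseudo-arc-length continuation step with a secant predictor.
import numpy as np

def arclength_step(F, y1, e1, y2, e2, ds, tol=1e-8, max_iter=20, h=1e-6):
    """(y2, e2) is the most recent solution, (y1, e1) the one before it."""
    ty, te = (y2 - y1) / ds, (e2 - e1) / ds        # secant tangent
    y, e = y2 + ds * ty, e2 + ds * te              # predictor
    n = y.size
    for _ in range(max_iter):
        r = F(y, e)
        N = ty @ (y - y2) + te * (e - e2) - ds     # arc-length constraint
        if np.linalg.norm(r) < tol and abs(N) < tol:
            break
        # Finite-difference Jacobian of the augmented system
        J = np.zeros((n + 1, n + 1))
        for k in range(n):
            dy = np.zeros(n); dy[k] = h
            J[:n, k] = (F(y + dy, e) - r) / h
        J[:n, n] = (F(y, e + h) - r) / h
        J[n, :n], J[n, n] = ty, te
        step = np.linalg.solve(J, -np.concatenate([r, [N]]))
        y, e = y + step[:n], e + step[n]
    return y, e
```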

Numerical bifurcation analysis from microscopic simulations
We collected transients of $u(x,t)$ and $v(x,t)$ with a sampling rate of 1 s, from 10 different randomly sampled initial conditions, for 40 different values of the bifurcation parameter $\varepsilon$. In particular, we created a grid of 40 different values of $\varepsilon$ in $[0.005, 0.955]$ using Gauss-Chebyshev-Lobatto points, while the 10 initial conditions were sampled according to Eq. (36). On these data, we trained two FNNs and two single-layer RPNNs (one for each one of the variables $u$ and $v$). The FNNs were constructed using two hidden layers with 12 units in each layer. The hidden units employed the hyperbolic tangent sigmoid activation function, while the regularization parameter was tuned and set to $\lambda = 0.01$. For the training of the FNNs, we used the Deep Learning toolbox of MATLAB 2021a on an Intel Core i5-8265U with up to 3.9 GHz frequency and 8 GB of memory. Table 1 summarises the performance of the two schemes on the training and test data sets. As shown, for any practical purpose, both schemes resulted in equivalent numerical accuracy for all metrics. For the FNNs, the training phase (using the deep-learning toolbox in Matlab R2020b) required ∼1000 epochs and ∼4 hours, with the minimum tolerance set to 1e−07.
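For the record, the Gauss-Chebyshev-Lobatto grid of parameter values described above can be constructed as follows (a straightforward sketch; variable names are illustrative):

```python
# 40 Gauss-Chebyshev-Lobatto points mapped to the interval [0.005, 0.955].
import numpy as np

n_eps, lo, hi = 40, 0.005, 0.955
k = np.arange(n_eps)
nodes = np.cos(np.pi * k / (n_eps - 1))          # Chebyshev-Lobatto on [-1, 1]
eps_grid = lo + (hi - lo) * (1.0 - nodes) / 2.0  # map to [lo, hi], ascending
```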

Numerical bifurcation analysis without feature selection
Differences between the predictedû t (x, t) andv t (x, t) and the actual values of the time derivatives u t (x, t) and v t (x, t) for three different values of ε are shown in Figure 5 when using FNNs and in Figure 6 when using RPNNs.

Numerical bifurcation analysis with feature selection
We used Diffusion Maps (setting the width parameter of the Gaussian kernel to $\sigma = 10$) to identify the three parsimonious leading eigenvectors, as described in section 2.1. For our computations, we used the datafold package in python [72]. We denote them as $\phi_1, \phi_2, \phi_3$. The three parsimonious Diffusion Maps coordinates for different values of the parameter $\varepsilon$ are shown in Figure 8. For $\varepsilon = 0.114$, which is close to the Andronov-Hopf point, the embedded space is a two-dimensional "carpet" in the three-dimensional space. The oscillatory behaviour leads to different values of the time derivative, which can be effectively parametrized, as shown by the coloring of the manifold (Figures 8(a), 8(b)). The total regression losses for the candidate subsets of the input data domain are presented in Table 2. As expected, the best candidate features are $(u, v, u_{xx})$ for $u_t$ and $(u, v, v_{xx})$ for $v_t$, which are the only features that indeed appear in the closed form of the FHN PDEs.

Table 2: The "best" sets of variables that effectively parametrize the intrinsic coordinates ($(\phi_1^u, \phi_2^u, \phi_3^u)$ and $(\phi_1^v, \phi_2^v, \phi_3^v)$) and the corresponding sums of total losses across all the values of the bifurcation parameter $\varepsilon$.
Finally, we repeated the same steps, but now using as inputs to the FNNs and RPNNs the reduced input domain obtained from the feature selection process. Table 1 summarizes the performance of the schemes on the training and test sets. Figures 9 and 10 illustrate the norms of the differences between the time derivatives predicted by the FNNs and RPNNs and the actual ones, for both variables. Hence, as shown, the proposed feature selection approach based on the parsimonious Diffusion Maps correctly revealed the structure of the embedded PDEs in the form:
$$\frac{\partial u(x,t)}{\partial t} = \hat{F}^u\left(u(x,t), v(x,t), u_{xx}(x,t), \varepsilon\right),$$
$$\frac{\partial v(x,t)}{\partial t} = \hat{F}^v\left(u(x,t), v(x,t), v_{xx}(x,t), \varepsilon\right),$$
where $\hat{F}^u$ and $\hat{F}^v$ are the outputs of the FNNs (or the RPNNs). The constructed bifurcation diagram with feature selection is shown in Figure 7. Using the FNNs, we estimated the Andronov-Hopf point at $\varepsilon \approx 0.0195$ and the turning point at $\varepsilon \approx 0.9762$. Using the RPNNs, we estimated the Andronov-Hopf point at $\varepsilon \approx 0.0192$ and the turning point at $\varepsilon \approx 0.9752$.

Conclusions
Building on previous efforts [29], we present a machine learning methodology for solving the inverse problem in complex systems modelling and analysis, i.e. identifying effective PDEs from data and constructing the corresponding coarse-scale bifurcation diagrams. The proposed approach is a three-tier one. In the first step, we used non-linear manifold learning, and in particular Diffusion Maps, to select an appropriate set of coarse-scale observables that define the low-dimensional manifold on which the emergent dynamics evolve in the parameter space. In the second step, we learned the right-hand-side of the effective PDEs with respect to the coarse-scale observables; here, we used shallow FNNs with two hidden layers and single-layer RPNNs whose basis functions were constructed using appropriately designed random sampling. Finally, based on the constructed black-box machine learning models, we built the coarse-grained bifurcation diagrams exploiting the arsenal of the numerical bifurcation analysis toolkit. To demonstrate the approach, we used D1Q3 Lattice Boltzmann simulations of the FitzHugh-Nagumo PDEs and compared the bifurcation diagram constructed via machine learning with the one obtained by discretizing the PDEs with central finite differences.
The results show that the proposed machine learning framework was able to identify the "correct" set of coarse-scale variables required to model the emergent dynamics in the form of PDEs and, based on them, to systematically study the coarse-scale dynamics by constructing the emergent macroscopic bifurcation diagram. In terms of approximation accuracy of the macroscopic dynamics, both schemes (the two-hidden-layer FNNs and the single-hidden-layer RPNNs) performed, for all practical purposes, equivalently. However, in terms of the computational cost of the training phase, the RPNNs were 20 to 30 times faster than the two-hidden-layer FNNs. Hence, the proposed RPNN scheme is a promising alternative to deep learning for solving the inverse problem for high-dimensional PDEs from big data [1,73,74].
Here, we have focused on the construction of black-box machine learning models of the emergent dynamics of complex systems that can be described by (parabolic) PDEs over the parametric space. The proposed approach can be extended to construct "gray-box" models by incorporating information from the physics into the machine learning architecture [75,1]. Furthermore, based on previous efforts aiming at extracting normal forms of ODEs from data [76], the proposed approach can be extended to discover normal forms of high-dimensional PDEs.