A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs

Traditional reduced order modeling techniques such as the reduced basis (RB) method (relying, e.g., on proper orthogonal decomposition (POD)) suffer from severe limitations when dealing with nonlinear time-dependent parametrized PDEs, because of the fundamental assumption of linear superimposition of modes they are based on. For this reason, in the case of problems featuring coherent structures that propagate over time such as transport, wave, or convection-dominated phenomena, the RB method usually yields inefficient reduced order models (ROMs) if one aims at obtaining reduced order approximations sufficiently accurate compared to the high-fidelity, full order model (FOM) solution. To overcome these limitations, in this work, we propose a new nonlinear approach to set up reduced order models by exploiting deep learning (DL) algorithms. In the resulting nonlinear ROM, which we refer to as DL-ROM, both the nonlinear trial manifold (corresponding to the set of basis functions in a linear ROM) and the nonlinear reduced dynamics (corresponding to the projection stage in a linear ROM) are learned in a non-intrusive way by relying on DL algorithms; the latter are trained on a set of FOM solutions obtained for different parameter values. In this paper, we show how to construct a DL-ROM for both linear and nonlinear time-dependent parametrized PDEs; moreover, we assess its accuracy on test cases featuring different parametrized PDE problems. Numerical results indicate that DL-ROMs whose dimension is equal to the intrinsic dimensionality of the PDE solutions manifold are able to approximate the solution of parametrized PDEs in situations where a huge number of POD modes would be necessary to achieve the same degree of accuracy.


Introduction
The solution of a parametrized system of partial differential equations (PDEs) by means of a full-order model (FOM), whenever dealing with real-time or multi-query scenarios, entails prohibitive computational costs if the FOM is high-dimensional. In the former case, the FOM solution must be computed in a very limited amount of time; in the latter one, the FOM must be solved for a huge number of parameter instances sampled from the parameter space. Reduced order modeling techniques aim at replacing the FOM by a reduced order model (ROM), featuring a much lower dimension, still able to express the physical features of the problem described by the FOM. The basic assumption underlying the construction of such a ROM is that the solution of a parametrized PDE, belonging a priori to a high-dimensional (discrete) space, lies on a low-dimensional manifold embedded in this space. The goal of a ROM is then to approximate the solution manifold - that is, the set of all PDE solutions when the parameters vary in the parameter space - through a suitable, approximated trial manifold.
A widespread family of reduced order modeling techniques relies on the assumption that the reduced-order approximation can be expressed by a linear combination of basis functions, built starting from a set of FOM solutions, called snapshots. Among these techniques, proper orthogonal decomposition (POD) - equivalent to principal component analysis in statistics [1], or Karhunen-Loève expansion in stochastic applications - exploits the singular value decomposition of a suitable snapshot matrix (or the eigen-decomposition of the corresponding snapshot correlation matrix), thus yielding linear ROMs, in which the ROM approximation is given by the linear superimposition of POD modes. In this case, the solution manifold is approximated through a linear trial manifold, that is, the ROM approximation is sought in a low-dimensional linear trial subspace.
Projection-based methods are linear ROMs in which the ROM approximation of the PDE solution, for any new parameter value, results from the solution of a low-dimensional (nonlinear, dynamical) system, whose unknowns are the ROM degrees of freedom (or generalized coordinates). Regardless of whether the PDE (and thus the FOM) is linear or not, the operators appearing in the ROM are obtained by imposing that the projection of the FOM residual evaluated on the ROM trial solution is orthogonal to a low-dimensional, linear test subspace, which might coincide with the trial subspace. Hence, no matter whether the PDE is linear or not, the resulting ROM is linear, since the reduced dynamics is obtained through a projection onto a linear subspace [2,3,4]. However, linear ROMs show severe computational bottlenecks when dealing with problems featuring coherent structures (possibly dependent on parameters) that propagate over time, namely in transport and wave-type phenomena, or convection-dominated flows. In these cases, for the sake of accuracy, the dimension of the linear trial manifold can easily become extremely large compared to the intrinsic dimension of the solution manifold, thus compromising the ROM efficiency. To overcome this bottleneck, ad-hoc extensions of the POD strategy have been considered, towards nonlinear approaches to build a ROM [5,6].
In this paper we propose a computational, non-intrusive approach based on deep learning (DL) algorithms to deal with the construction of efficient ROMs (which we refer to as DL-ROMs) in order to tackle parameter-dependent PDEs; in particular, we consider PDEs that feature wave-type phenomena. A comprehensive framework is presented for the global approximation of the map (t, µ) → u_h(t, µ), where t ∈ (0, T) denotes time, µ ∈ P ⊂ R^{n_µ} a vector of input parameters, and u_h(t, µ) ∈ R^{N_h} the solution of a large-scale dynamical system arising from the space discretization of a parametrized, time-dependent (non)linear PDE. Several recent works have shown possible applications of DL techniques to parametrized PDEs - thanks to their approximation capabilities, their extremely favorable computational performance during online testing phases, and their relative ease of implementation - both from a theoretical [7] and a computational standpoint. Regarding this latter aspect, artificial neural networks (ANNs), such as feedforward neural networks, have been employed to model the reduced dynamics in a data-driven [8], less intrusive way (avoiding, e.g., the costs entailed by projection-based ROMs), but still relying on a linear trial manifold built, e.g., through POD. For instance, in [9,10,11,12] the solution of a (nonlinear, time-dependent) ROM for any new parameter value has been replaced by the evaluation of ANN-based regression models; similar ideas can be found, e.g., in [13,14,15]. Few attempts have been made to describe the reduced trial manifold where the approximation is sought (avoiding, e.g., the linear superimposition of POD modes) through ANNs, see, e.g., [16,17].
For instance, a projection-based ROM technique has been introduced in [17], in which the FOM system is projected onto a nonlinear trial manifold identified by means of the decoder function of a convolutional autoencoder neural network. However, the ROM is derived by minimizing a residual formulation, for which the quasi-Newton method herein employed requires the computation of an approximated Jacobian of the residual at each time step. A ROM technique based on a deep convolutional recurrent autoencoder has been proposed in [16], where a reduced trial manifold is obtained by means of a convolutional autoencoder; the latter is then used to train a Long Short-Term Memory (LSTM) neural network modeling the reduced dynamics. However, no explicit parameter dependence in the PDE problem is considered, apart from µ-dependent initial data, and the LSTM is trained on reduced approximations obtained through the encoder function of the autoencoder. Another promising application of machine and deep learning techniques within a ROM framework deals with the efficient evaluation of reduced error models, see, e.g., [18,19,20,21].
Our goal is to set up nonlinear ROMs whose dimension is nearly equal (if not equal) to the intrinsic dimension of the solution manifold that we aim at approximating. Our DL-ROM approach combines and improves the techniques introduced in [16,17] by shaping an all-inclusive DL-based ROM technique, where we both (i) construct the reduced trial manifold and (ii) model the reduced dynamics on it, employing ANNs. The former task is achieved by using the decoder function of a convolutional autoencoder; the latter task is instead carried out by considering a feedforward neural network and the encoder function of a convolutional autoencoder. Moreover, we set up a computational procedure performing the training of both network architectures simultaneously, by minimizing a loss function that weights two terms, one dedicated to each single task. In this respect, we are able to design a flexible framework capable of handling parameters affecting both PDE operators and data, which avoids both the expensive projection stage of [17] and the training of a more expensive LSTM network. In our technique, the intrusive construction of a ROM is replaced by the evaluation of the ROM generalized coordinates through a deep feedforward neural network taking only (t, µ) as inputs. The proposed technique is purely data-driven, that is, it only relies on the computation of a set of FOM snapshots; in this respect, DL does not replace the high-fidelity FOM as, e.g., in the works by Karniadakis and coauthors [22,23,24,25,26]; rather, DL techniques are built upon it, to enhance the repeated evaluation of the FOM for different values of the parameters.
The structure of the paper is as follows. In section 2 we show how to generate nonlinear ROMs by reinterpreting the classical ideas behind linear ROMs for parametrized PDEs. In section 3 we detail the construction of the proposed DL-ROM, whose accuracy is numerically assessed in section 4 by considering three different test cases of increasing complexity (both with respect to the parametric dependence and to the nature of the PDE). Finally, some conclusions are drawn in section 5. A quick overview of useful facts about deep feedforward, convolutional and autoencoder neural networks is reported in Appendix A to make the paper self-contained.

From linear to nonlinear dimensionality reduction
Starting from the well-known setting of linear (projection-based) ROMs, in this section we generalize these ideas to the case of nonlinear ROMs.

Problem formulation
We formulate the construction of ROMs in algebraic terms, starting from the high-fidelity (spatial) approximation of nonlinear, time-dependent, parametrized PDEs. By introducing suitable space discretization techniques (such as, e.g., the Finite Element Method, Isogeometric Analysis or the Spectral Element Method) the high-fidelity, full order model (FOM) can be expressed as a nonlinear parametrized dynamical system. Given µ ∈ P, we aim at solving the initial value problem

    du_h(t; µ)/dt = f(t, u_h(t; µ); µ),   t ∈ (0, T),
    u_h(0; µ) = u_0(µ),                                                    (1)

where the parameter space P ⊂ R^{n_µ} is a bounded and closed set, u_h : [0, T) × P → R^{N_h} is the parametrized solution of (1), u_0 : P → R^{N_h} is the initial datum, and f : (0, T) × R^{N_h} × P → R^{N_h} is a (nonlinear) function encoding the system dynamics. The FOM dimension N_h is related to the finite dimensional subspaces introduced for the space discretization of the PDE - here h > 0 usually denotes a discretization parameter, such as the maximum diameter of elements in a computational mesh - and can be extremely large whenever the PDE problem shows complex physical behaviors and/or high degrees of accuracy are required for its solution. The parameter µ ∈ P may represent physical or geometrical properties of the system, like, e.g., material properties, initial and boundary conditions, or the shape of the domain. In order to solve problem (1), suitable time discretizations are employed, such as backward differentiation formulas [27]. Our goal is the efficient numerical approximation of the whole set

    S_h = { u_h(t; µ) | t ∈ [0, T), µ ∈ P } ⊂ R^{N_h}

of solutions to problem (1) when (t; µ) varies in [0, T) × P, also referred to as solution manifold (a sketch is provided in Figure 1). Assuming that, for any given parameter µ ∈ P, problem (1) admits a unique solution for each t ∈ (0, T), the intrinsic dimension of the solution manifold is at most n_µ + 1 ≪ N_h, where n_µ is the number of parameters (time plays the role of an additional coordinate). This means that each point u_h(t; µ) belonging to S_h is completely defined in terms of at most n_µ + 1 intrinsic coordinates, or, equivalently, the tangent space to the manifold at any given u_h(t; µ) is spanned by n_µ + 1 basis vectors.
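To fix ideas, the following is a minimal numpy sketch of a FOM of the form (1): a semi-discrete 1D heat equation u_t = µ u_xx with homogeneous Dirichlet conditions, advanced in time by backward Euler. The specific PDE, data and sizes are illustrative assumptions, not one of the paper's test cases.

```python
import numpy as np

# Toy FOM of the form (1): du_h/dt = f(t, u_h; mu) = mu * (discrete Laplacian) u_h
def fom_solve(mu, Nh=64, T=0.1, Nt=50):
    h = 1.0 / (Nh + 1)
    x = np.linspace(h, 1.0 - h, Nh)                 # interior grid points
    main = -2.0 * np.ones(Nh)
    off = np.ones(Nh - 1)
    A = (mu / h**2) * (np.diag(main) + np.diag(off, 1) + np.diag(off, -1))
    dt = T / Nt
    u = np.sin(np.pi * x)                           # initial datum u_0(mu)
    M = np.eye(Nh) - dt * A                         # backward Euler: (I - dt A) u^{k+1} = u^k
    S = np.empty((Nh, Nt + 1))
    S[:, 0] = u
    for k in range(Nt):
        u = np.linalg.solve(M, u)
        S[:, k + 1] = u
    return x, S                                     # grid and snapshot matrix

x, S = fom_solve(mu=1.0)
```

For this linear problem each backward Euler step is a single linear solve; for a nonlinear f a Newton iteration per time step would be required.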

Linear dimensionality reduction: projection-based ROMs
The most common way to build a ROM for the efficient approximation of problem (1) relies on the introduction of a reduced linear trial manifold, that is, of a subspace S̃_n = Col(V) of dimension n ≪ N_h, spanned by the n columns of a matrix V ∈ R^{N_h×n}. Hence, a linear ROM looks for an approximation ũ_h(t; µ) ≈ u_h(t; µ) in the form

    ũ_h(t; µ) = V u_n(t; µ),                                               (3)

where ũ_h : [0, T) × P → S̃_n. Here u_n(t; µ) ∈ R^n, for each t ∈ [0, T), µ ∈ P, denotes the vector of intrinsic coordinates (or degrees of freedom) of the ROM approximation; note that the map Ψ_h : R^n → R^{N_h}, s_n → s̃_h = V s_n that, given the (low-dimensional) intrinsic coordinates, returns the (high-dimensional) approximation of the FOM solution u_h(t; µ), is linear. Proper Orthogonal Decomposition (POD) is one of the most widely employed techniques to generate the linear trial manifold [4]. Considering a set of N_train instances of the parameter µ ∈ P, we introduce the snapshot matrix S ∈ R^{N_h×N_s} defined as

    S = [ u_h(t^1; µ_1) | ... | u_h(t^{N_t}; µ_1) | ... | u_h(t^1; µ_{N_train}) | ... | u_h(t^{N_t}; µ_{N_train}) ],   (4)

where we have introduced a partition of the time interval [0, T] in N_t time steps {t^k}_{k=1}^{N_t}, t^k = k∆t, of size ∆t = T/N_t, and N_s = N_train N_t. Moreover, let us introduce a symmetric and positive definite matrix X_h ∈ R^{N_h×N_h} encoding a suitable norm (e.g., the energy norm) on the high-dimensional space and admitting a Cholesky factorization X_h = H^T H. POD computes the Singular Value Decomposition (SVD) of HS,

    HS = U Σ Z^T,

where U = [ζ_1 | ... | ζ_{N_h}] ∈ R^{N_h×N_h}, Z = [ψ_1 | ... | ψ_{N_s}] ∈ R^{N_s×N_s} and Σ = diag(σ_1, . . . , σ_r) ∈ R^{N_h×N_s} with σ_1 ≥ σ_2 ≥ . . . ≥ σ_r and r ≤ min(N_h, N_s), and sets the columns of V in terms of the first n left singular vectors of HS, that is, V = [H^{-1}ζ_1 | . . . | H^{-1}ζ_n]. By construction, the columns of V are orthonormal (with respect to the scalar product (·, ·)_{X_h}) and, among all possible n-dimensional subspaces spanned by the columns of a matrix W ∈ R^{N_h×n}, V provides the best reconstruction of the snapshots, that is,

    V = arg min_{W ∈ V_n} Σ_{i=1}^{N_train} Σ_{k=1}^{N_t} ||u_h(t^k; µ_i) − W W^T X_h u_h(t^k; µ_i)||²_{X_h},

where V_n = {W ∈ R^{N_h×n} : W^T X_h W = I}.
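The POD construction above can be sketched in a few lines of numpy; the random snapshot matrix and the choice X_h = I (Euclidean norm) below are toy assumptions for illustration.

```python
import numpy as np

def pod_basis(S, Xh, n):
    """POD basis V = [H^{-1} zeta_1 | ... | H^{-1} zeta_n] from the SVD of H S,
    where Xh = H^T H is the matrix encoding the chosen norm."""
    H = np.linalg.cholesky(Xh).T        # upper-triangular factor: Xh = H^T H
    U, sigma, _ = np.linalg.svd(H @ S, full_matrices=False)
    V = np.linalg.solve(H, U[:, :n])    # columns H^{-1} zeta_i
    return V, sigma

# toy data: random snapshots of rank <= 10, Euclidean norm (Xh = identity)
rng = np.random.default_rng(0)
S = rng.standard_normal((30, 10)) @ rng.standard_normal((10, 40))
Xh = np.eye(30)
V, sigma = pod_basis(S, Xh, n=5)

# the columns of V are Xh-orthonormal by construction
print(np.allclose(V.T @ Xh @ V, np.eye(5)))   # True
```

With n equal to the snapshot rank (here 10), the POD reconstruction V V^T X_h S recovers the snapshots exactly, matching the optimality property stated above.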
For this reason, we refer to V V^T X_h u_h(t; µ) as the optimal-POD reconstruction of u_h(t; µ) onto a reduced subspace of dimension n < N_h. In order to model the reduced dynamics of the system, that is, the time evolution of the generalized coordinates u_n(t; µ), we can replace u_h(t; µ) by (3) in system (1), and impose that the residual associated to the first equation of (1), evaluated on the ROM trial solution, is orthogonal to an n-dimensional subspace spanned by the columns of a matrix Y ∈ R^{N_h×n}, that is, Y^T r_h(V u_n) = 0. This condition yields the following ROM:

    Y^T V du_n(t; µ)/dt = Y^T f(t, V u_n(t; µ); µ),   t ∈ (0, T),
    Y^T V u_n(0; µ) = Y^T u_0(µ).                                          (8)

In the case Y = V a Galerkin projection is performed, while the case Y ≠ V yields a more general Petrov-Galerkin projection. Note that choosing Y such that Y^T V = I ∈ R^{n×n} does not automatically ensure ROM stability on long time intervals. The RB method, under the form of either Galerkin-POD or Petrov-Galerkin-POD methods, has been successfully applied to a broad range of parametrized time-dependent (non)linear problems (see, e.g., [6,28]); however, it provides low-dimensional subspaces of dimension n ≫ n_µ + 1, much larger than the intrinsic dimension of the solution manifold: relying on a linear, global trial manifold thus represents a major bottleneck to computational efficiency [5,6]. This is the case, for instance, of hyperbolic problems, for which the RB method is not able in practice to significantly decrease the dimensionality of the problem. The same difficulty might also affect the use of hyper-reduction techniques, such as the (discrete) empirical interpolation [29,30], mandatory in order to assemble the operators appearing in the ROM (8) without relying on expensive N_h-dimensional arrays. See, e.g., [31] for further details.
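As a minimal sketch of the resulting Galerkin (Y = V) ROM, consider a linear toy FOM du_h/dt = A u_h: the operator A, the initial datum and the snapshots below are synthetic assumptions chosen only to illustrate the offline/online split.

```python
import numpy as np

Nh, n, Nt, dt = 50, 10, 100, 0.01
rng = np.random.default_rng(1)
Q = np.linalg.qr(rng.standard_normal((Nh, Nh)))[0]
A = -Q @ np.diag(np.linspace(0.1, 5.0, Nh)) @ Q.T   # stable linear dynamics
u0 = rng.standard_normal(Nh)

# offline: FOM snapshots via backward Euler, then POD basis V (here X_h = I)
M = np.eye(Nh) - dt * A
S = np.empty((Nh, Nt + 1)); S[:, 0] = u0
for k in range(Nt):
    S[:, k + 1] = np.linalg.solve(M, S[:, k])
V = np.linalg.svd(S, full_matrices=False)[0][:, :n]

# online: reduced Galerkin dynamics du_n/dt = (V^T A V) u_n, same time scheme
An = V.T @ A @ V
Mn = np.eye(n) - dt * An
un = V.T @ u0
for k in range(Nt):
    un = np.linalg.solve(Mn, un)

err = np.linalg.norm(V @ un - S[:, -1]) / np.linalg.norm(S[:, -1])
```

Since the snapshots of this smooth, diffusive trajectory are well captured by a 10-dimensional subspace, the relative error at the final time is small; for transport-dominated problems, as discussed above, no such small n exists.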

Nonlinear dimensionality reduction
A first attempt to overcome the computational bottleneck entailed by the use of a linear, global trial manifold is to build a piecewise linear trial manifold, using local reduced bases whose dimension is smaller than the one of the global linear trial manifold. Clustering algorithms applied on a set of snapshots can be employed to partition them into N_c clusters, from which POD can extract a subspace of reduced dimension; the ROM is then obtained by following the strategy described above on each cluster separately, see, e.g., [32,33]. An alternative approach based on classification binary trees has been introduced in [34]. These strategies have been employed (and compared) in [6] in order to solve parametrized problems in cardiac electrophysiology. Using a piecewise linear trial manifold partially overcomes the limitation of a linear dimensionality reduction technique such as POD, yet it employs local bases of dimension much higher than the intrinsic dimension of the solution manifold S_h. An approach based on a dictionary of solutions, computed offline, has been developed in [35] as an alternative to using a truncated reduced basis based on POD, together with an online L1-norm minimization of the residual.
Other possible options involving nonlinear transformations of modes might rely on a reconstruction of the POD modes at each time step using Lax pairs [36], on the solution of Monge-Kantorovich optimal transport problems [37], on a problem-dependent change of coordinates requiring the repeated solution of an optimization problem [38], on shifted POD modes [39] obtained after multiple transport velocities have been identified and separated, or again on basis updates derived by querying the full model at a few selected spatial coordinates [40]. Despite providing remarkable improvements compared to the classic (Petrov-)Galerkin-POD approach, all these strategies exhibit some drawbacks, such as: (i) high computational costs entailed during the online testing stage of the ROM, which are not restricted to the intensive offline training stage; (ii) performances and settings that are highly dependent on the problem at hand; (iii) the need to deal only with a linear superimposition of modes (which characterizes linear ROMs), yielding low-dimensional spaces whose dimension is still (much) higher than the intrinsic dimension of the solution manifold.
Motivated by the need of avoiding the drawbacks of linear ROMs and setting a general paradigm for the construction of efficient, extremely low-dimensional ROMs, we resort to nonlinear dimensionality reduction techniques. Similarly to [16,17], we build a nonlinear ROM to approximate u_h(t; µ) in the form

    ũ_h(t; µ) = Ψ_h(u_n(t; µ)),                                            (9)

where Ψ_h : R^n → R^{N_h}, with n ≪ N_h, is a nonlinear, differentiable function. As a matter of fact, the solution manifold S_h is approximated by a reduced nonlinear trial manifold

    S̃_n = { Ψ_h(u_n(t; µ)) | u_n(t; µ) ∈ R^n, t ∈ [0, T), µ ∈ P },       (10)

so that ũ_h : [0, T) × P → S̃_n. As before, u_n : [0, T) × P → R^n denotes the vector-valued function of two arguments representing the intrinsic coordinates of the ROM approximation. Our goal is to set up a ROM whose dimension n is as close as possible to the intrinsic dimension n_µ + 1 of the solution manifold S_h, i.e. n ≥ n_µ + 1, in order to correctly capture the solution of the dynamical system while containing the size of the approximation spaces [17].
To model the relationship (t, µ) → u_n(t, µ), and to describe the system dynamics on the reduced nonlinear trial manifold S̃_n in terms of the intrinsic coordinates, we consider a nonlinear map under the form

    u_n(t; µ) = Φ_n(t; µ),                                                 (11)

where Φ_n : [0, T) × P → R^n is a differentiable, nonlinear function. No additional assumptions, such as, e.g., the (exact or approximate) affine µ-dependence as in the RB method, are needed.

A deep learning-based reduced order model (DL-ROM)
We now detail the construction of the proposed nonlinear ROM. In this respect, we define the functions Ψ_h and Φ_n in (9) and (11) by means of deep learning (DL) algorithms, exploiting neural network architectures. This choice is motivated by the ability of neural networks to effectively approximate nonlinear maps, and to learn from data and generalize to unseen data. On the other hand, DL models enable us to build non-intrusive, completely data-driven ROMs, since their construction only requires access to the dataset (the parameter values and the snapshot matrix), but not to the FOM arrays appearing in (1).
The DL-ROM technique that we develop in this paper is composed of two main blocks, responsible, respectively, for the reduced dynamics learning and the reduced trial manifold learning (see Figure 2). Hereon, we denote by N_train, N_test and N_t the number of training-parameter instances, of testing-parameter instances and of time instances, respectively, and we set N_s = N_train N_t. The dimension of both the FOM solution and the ROM approximation is N_h, while n denotes the number of intrinsic coordinates, with n ≪ N_h. For the description of the system dynamics on the reduced nonlinear trial manifold (which we refer to as reduced dynamics learning), we employ a deep feedforward neural network (DFNN) with L layers, that is, we define the function Φ_n in definition (11) as

    Φ_n(t; µ, θ_DF) = φ_n^{DF}(t; µ, θ_DF),                                (12)

thus yielding the map (t; µ) → u_n(t; µ, θ_DF). Here φ_n^{DF}(·; ·, θ_DF) : [0, T) × P → R^n is a DFNN, resulting from the subsequent composition of a nonlinear activation function, applied to a linear transformation of the input, L times. Here µ ∈ P ⊂ R^{n_µ} and θ_DF denotes the vector of parameters of the DFNN, collecting all its weights and biases.
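A DFNN of this form can be sketched in plain numpy as follows; the layer widths and the random (untrained) He-initialized weights are assumptions for illustration only.

```python
import numpy as np

def elu(z):
    # ELU activation: z for z >= 0, exp(z) - 1 for z < 0
    return np.where(z >= 0, z, np.exp(z) - 1.0)

def dfnn(t, mu, params):
    """phi_n^{DF}: maps (t, mu) in R^{1 + n_mu} to intrinsic coordinates in R^n.
    params is a list of (W, b) pairs, one per layer; linear output layer."""
    a = np.concatenate(([t], np.atleast_1d(mu)))
    *hidden, last = params
    for W, b in hidden:
        a = elu(W @ a + b)          # activation of a linear transformation
    W, b = last
    return W @ a + b

# shapes for a (1 + n_mu) -> 50 -> 50 -> n network; weights would come from training
rng = np.random.default_rng(2)
n_mu, n = 1, 3
sizes = [1 + n_mu, 50, 50, n]
params = [(rng.standard_normal((m, k)) * np.sqrt(2.0 / k), np.zeros(m))
          for k, m in zip(sizes[:-1], sizes[1:])]
u_n = dfnn(0.5, 0.8, params)        # a point on the reduced manifold, in R^3
```

The composition of L affine maps and activations is exactly the structure described above; training adjusts params = θ_DF.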
Regarding instead the description of the reduced nonlinear trial manifold S̃_n defined in (10) (which we refer to as reduced trial manifold learning), we employ the decoder function of a convolutional autoencoder (AE), that is, we define the function Ψ_h appearing in (9) and (10) as

    Ψ_h(u_n(t; µ, θ_DF); θ_D) = f_h^D(u_n(t; µ, θ_DF); θ_D),               (13)

thus yielding the map u_n(t; µ, θ_DF) → ũ_h(t; µ, θ). Here f_h^D(·; θ_D) : R^n → R^{N_h} results from the composition of several layers, some of which of convolutional type, overall depending on the vector θ_D of parameters of the decoder function.
Combining the two former stages, the DL-ROM approximation is then given by

    ũ_h(t; µ, θ) = f_h^D(φ_n^{DF}(t; µ, θ_DF); θ_D),                       (14)

where φ_n^{DF}(·; ·, θ_DF) : [0, T) × P → R^n and f_h^D(·; θ_D) : R^n → R^{N_h} are defined in (12) and (13), respectively, and θ = (θ_DF, θ_D) are the parameters defining the neural network. The architecture of DL-ROM is shown in Figure 2. Computing the ROM approximation (14) for any new value of µ ∈ P, at any given time, requires to evaluate the map (t, µ) → ũ_h(t; µ, θ) at the testing stage, once the parameters θ = (θ_DF, θ_D) have been determined, once and for all, during the training stage. The training stage consists in solving an optimization problem (in the variable θ) after a set of snapshots of the FOM has been computed. More precisely, provided the parameter matrix M ∈ R^{(n_µ+1)×N_s} defined as

    M = [ (t^1, µ_1) | ... | (t^{N_t}, µ_1) | ... | (t^1, µ_{N_train}) | ... | (t^{N_t}, µ_{N_train}) ],   (15)

and the snapshot matrix S, defined in (4), we solve the problem: find the optimal parameters θ* solution of

    min_θ J(θ) = min_θ (1/N_s) Σ_{i=1}^{N_train} Σ_{k=1}^{N_t} L(t^k, µ_i; θ),   (16)

where

    L(t^k, µ_i; θ) = (1/2) ||u_h(t^k; µ_i) − ũ_h(t^k; µ_i, θ)||².          (17)

To solve the optimization problem (16)-(17) we use the ADAM algorithm [41], a stochastic gradient descent method [42] computing an adaptive approximation of the first and second momentum of the gradients of the loss function; in particular, it computes exponentially weighted moving averages of the gradients and of the squared gradients. We set the starting learning rate to η = 10^{-4}, the batch size to N_b = 20 and the maximum number of epochs to N_epochs = 10000. We perform cross-validation, in order to tune the hyperparameters of the DL-ROM, by splitting the data into training and validation sets following a proportion 8:2. Moreover, we implement an early-stopping regularization technique to reduce overfitting [43]; in particular, we stop the training if the loss does not decrease over 500 epochs. As nonlinear activation function we employ the ELU function [44], defined as

    σ(z) = z              if z ≥ 0,
    σ(z) = exp(z) − 1     if z < 0.

No activation function is applied at the last convolutional layer of the decoder neural network, as usually done when dealing with autoencoders. The parameters (weights and biases) are initialized through the He uniform initialization [45].
As we rely on a convolutional autoencoder to define the function Ψ_h, we also exploit the encoder function

    ũ_n(t; µ, θ_E) = f_n^E(u_h(t; µ); θ_E),                                (18)

which maps each FOM solution associated to the pairs (t; µ) ∈ Col(M), provided as inputs to the feedforward neural network (12), onto a low-dimensional representation ũ_n(t; µ, θ_E) depending on the parameter vector θ_E defining the encoder function. Indeed, the actual architecture of DL-ROM used during the training and validation phases, but not during testing, is the one shown in Figure 3. In practice, we add to the architecture of the DL-ROM introduced above the encoder function of the convolutional autoencoder. This produces an additional term in the per-example loss function (17), thus calling for the following optimization problem to be solved:

    min_θ J(θ) = min_θ (1/N_s) Σ_{i=1}^{N_train} Σ_{k=1}^{N_t} L(t^k, µ_i; θ),   (19)

where

    L(t^k, µ_i; θ) = (ω_h/2) ||u_h(t^k; µ_i) − ũ_h(t^k; µ_i, θ_DF, θ_D)||² + ((1 − ω_h)/2) ||ũ_n(t^k; µ_i, θ_E) − u_n(t^k; µ_i, θ_DF)||²,   (20)

and θ = (θ_E, θ_DF, θ_D), with ω_h ∈ [0, 1]. The per-example loss function (20) combines the reconstruction error (that is, the error between the FOM solution and the DL-ROM approximation) and the error between the intrinsic coordinates and the output of the encoder. This further term allows us to enhance the performance of the DL-ROM, as shown in Test 3 of section 4.
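The weighted per-example loss combining the two error terms can be sketched as follows; the network outputs are replaced here by plain arrays (an assumption: in practice they are produced by the decoder, the DFNN and the encoder).

```python
import numpy as np

def dl_rom_loss(u_h, u_h_tilde, u_n, u_n_tilde, omega_h=0.5):
    """Per-example DL-ROM loss: a weighted sum of the reconstruction error
    (FOM solution vs DL-ROM approximation) and of the mismatch between the
    encoder output u_n_tilde and the DFNN intrinsic coordinates u_n."""
    rec = 0.5 * np.sum((u_h - u_h_tilde) ** 2)       # 1/2 ||u_h - u_h_tilde||^2
    code = 0.5 * np.sum((u_n_tilde - u_n) ** 2)      # 1/2 ||u_n_tilde - u_n||^2
    return omega_h * rec + (1.0 - omega_h) * code

# perfect reconstruction and matching codes give zero loss
print(dl_rom_loss(np.ones(4), np.ones(4), np.zeros(2), np.zeros(2)))  # 0.0
```

Setting omega_h = 1 recovers the purely reconstruction-based loss of the architecture without the encoder term.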

Training and Testing Algorithms
Let us now detail the algorithms through which the training and testing phases of the networks are performed.
First of all, data normalization and standardization enhance the training phase of the network by rescaling all the values contained in the dataset to a common frame. For this reason, the inputs and the output of DL-ROM are normalized by applying an affine transformation in order to rescale them in the range [0, 1]. In particular, provided a training dataset X, we define

    X_max = max_{i,j} X_{ij},   X_min = min_{i,j} X_{ij},                  (21)

so that data are normalized by applying the following transformation:

    X_{ij} → (X_{ij} − X_min) / (X_max − X_min).                           (22)

Transformation (22) is applied also to the validation and testing sets, but considering as X_max and X_min the values computed over the training set. We point out that the input of the encoder function, the FOM solution u_h = u_h(t^k; µ_i) for a given (time, parameter) instance (t^k, µ_i), is reshaped into a matrix: starting from u_h ∈ R^{N_h}, we apply the transformation u_h → reshape(u_h) ∈ R^{√N_h × √N_h}. If N_h is not a square, the input u_h is zero-padded [43]. For the sake of simplicity, we continue to refer to the reshaped FOM solution as u_h. The inverse reshaping transformation is applied to the output of the last convolutional layer in the decoder function, i.e. the ROM approximation. Moreover, we highlight that applying one of the functions (12), (13), (18) to the matrix X means applying it row-wise.
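A minimal sketch of the normalization and reshaping steps above; the function names and the tiny training set are our own illustrative assumptions.

```python
import numpy as np

def fit_minmax(S_train):
    # X_min, X_max are computed once, over the training set only
    return S_train.min(), S_train.max()

def normalize(X, x_min, x_max):
    # affine map of all entries into [0, 1]
    return (X - x_min) / (x_max - x_min)

def to_square(u_h):
    """Reshape a snapshot u_h in R^{N_h} into a square matrix for the
    convolutional encoder, zero-padding when N_h is not a perfect square."""
    side = int(np.ceil(np.sqrt(u_h.size)))
    padded = np.zeros(side * side)
    padded[:u_h.size] = u_h
    return padded.reshape(side, side)

S_train = np.array([[0.0, 2.0], [4.0, 6.0]])
x_min, x_max = fit_minmax(S_train)
print(normalize(S_train, x_min, x_max))       # entries rescaled into [0, 1]
print(to_square(np.arange(3.0)).shape)        # (2, 2), one zero-padded entry
```

Validation and testing data would be normalized with the same (x_min, x_max) fitted on the training set, as stated above.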
The training algorithm referring to the architecture of DL-ROM depicted in Figure 3 is reported in Algorithm 1. During the training phase, the optimal parameters of the DL-ROM neural network are found by solving the optimization problem (19)-(20) through the back-propagation and ADAM algorithms.
At testing time, the encoder function is instead discarded; that is, the DL-ROM architecture is the one shown in Figure 2, and the testing algorithm is the one reported in Algorithm 2. The testing phase corresponds to a forward step of the DL-ROM neural network in Figure 2.

Numerical results
In this section, we report the numerical results obtained by applying the proposed DL-ROM technique to three parametrized, time-dependent PDE problems, namely (i) Burgers equation, (ii) a linear transport equation, and (iii) a coupled PDE-ODE system arising from cardiac electrophysiology, namely the monodomain equation; this latter is a system of time dependent, nonlinear equations, whose solutions feature a traveling wave behavior. For the time being, we deal with problems set in d = 1 (spatial) dimension featuring up to n µ = 2 parameters; we will consider the extension to differential problems in d = 2 and d = 3
in a forthcoming publication. For this reason, our focus is now on the numerical accuracy of our DL-ROM technique rather than on its computational efficiency and, therefore, on its comparison with linear ROMs such as the RB method featuring linear (possibly, piecewise linear) trial manifolds built through POD.
To evaluate the performance of DL-ROM we rely on the loss function (20) and on the following error indicator:

    ε_rel = (1/N_test) Σ_{i=1}^{N_test} sqrt( Σ_{k=1}^{N_t} ||u_h(t^k; µ_test,i) − ũ_h(t^k; µ_test,i)||² / Σ_{k=1}^{N_t} ||u_h(t^k; µ_test,i)||² ).   (23)

We implement the neural networks required by our DL-ROM technique by means of the TensorFlow deep learning framework [46], and the numerical simulations are performed on a workstation equipped with an Nvidia GeForce GTX 1070 8 GB GPU.
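Assuming the error indicator averages, over the testing-parameter instances, the time-accumulated relative l2 error, it can be computed in a vectorized way; the array layout below is our own convention.

```python
import numpy as np

def error_indicator(U_fom, U_rom):
    """Mean, over testing-parameter instances, of the time-accumulated
    relative l2 error. U_fom, U_rom: arrays of shape (N_test, N_t, N_h)."""
    num = np.sqrt(np.sum((U_fom - U_rom) ** 2, axis=(1, 2)))  # per test instance
    den = np.sqrt(np.sum(U_fom ** 2, axis=(1, 2)))
    return np.mean(num / den)

# sanity check on synthetic data: a uniform 1% perturbation gives ~0.01
U = np.ones((3, 5, 7))
assert np.isclose(error_indicator(U, 1.01 * U), 0.01)
```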

Test 1: Burgers Equation
Let us consider the parametrized one-dimensional nonlinear Burgers equation

    ∂u/∂t + u ∂u/∂x − (1/µ) ∂²u/∂x² = 0,   (x, t) ∈ (0, L) × (0, T),
    u(0, t) = u(L, t) = 0,                 t ∈ (0, T),
    u(x, 0) = u_0(x),                      x ∈ (0, L),                     (24)

where u_0(x) = x / (1 + sqrt(1/A_0) exp(µx²/4)), with A_0 = exp(µ/8), L = 1 and T = 2. System (24) has been discretized in space by means of linear finite elements, with N_h = 256 grid points, and in time by means of the Backward Euler scheme, with N_t = 100 time instances. The parameter space, to which the single (n_µ = 1) parameter belongs, is P = [100, 1000]. We consider N_train = 20 training-parameter instances uniformly distributed over P and N_test = 19 testing-parameter instances, each of them corresponding to the midpoint between two consecutive training-parameter instances. The configuration of the DL-ROM neural network used for this test case is the following: we choose a 12-layer DFNN equipped with 50 neurons per hidden layer and n neurons in the output layer, where n corresponds to the dimension of the reduced trial manifold. The architectures of the encoder and decoder functions are instead reported in Tables 1 and 2, and are similar to the ones used in [17]. Problem (24) does not represent a remarkably challenging task for linear ROMs: by applying POD to the snapshot matrix (the latter built by collecting the solution of (24) for N_s = N_train N_t training-parameter instances), it is sufficient to assemble a linear trial manifold of dimension 20 in order to capture more than 99.99% of the energy of the system [12,4]. In order to assess the performance of our DL-ROM technique, we compute the DL-ROM solution by fixing the dimension of the nonlinear trial manifold to n = 20.
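The energy criterion used above to size the linear trial manifold can be sketched as follows; the snapshot matrix here is synthetic, with a prescribed singular value decay, purely for illustration.

```python
import numpy as np

def pod_dimension(S, energy=0.9999):
    """Smallest n such that the first n POD modes retain the given fraction
    of the snapshot energy (sum of squared singular values)."""
    sigma = np.linalg.svd(S, compute_uv=False)
    ratio = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    return int(np.searchsorted(ratio, energy) + 1)

# synthetic snapshots with singular values sigma_i = 2^{-i}
rng = np.random.default_rng(3)
U = np.linalg.qr(rng.standard_normal((100, 100)))[0]
Z = np.linalg.qr(rng.standard_normal((60, 60)))[0]
sigma = 2.0 ** (-np.arange(60, dtype=float))
S = U[:, :60] * sigma @ Z.T
print(pod_dimension(S))   # → 7, since 1 - 4^{-7} >= 0.9999 > 1 - 4^{-6}
```

For transport-dominated problems the singular values decay slowly, and the same criterion returns a dimension far above the intrinsic one, which is exactly the limitation the DL-ROM targets.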
In Figure 4 we show the DL-ROM and the optimal-POD reconstructions, along with the FOM solution, for the time instance t = 0.02 and for the testing-parameter instance µ_test = 976.32, the testing value of µ for which the reconstruction task turns out to be the most difficult both for POD and DL-ROM, the diffusion term in (24) being smaller and the solution closer to the one of a purely hyperbolic system. In particular, for µ_test = 976.32, employing the DL-based ROM technique presented in this work allows us to halve the error indicator ε_rel associated to the optimal-POD approximation of the FOM solution. Referring to Figure 4, the DL-ROM reconstruction is more accurate than the optimal-POD one: it closely fits the FOM solution, even in correspondence of its maximum, as shown in the zooms of Figure 4. Moreover, it does not introduce oscillations where a large gradient of the FOM solution is observed, as happens instead by employing POD. In Figure 5 we show the same comparison of Figure 4, but this time considering both for POD and DL-ROM a reduced dimension n = 10; the difference in terms of accuracy provided by the two approaches is even more striking in this case. Finally, in Figure 6 we highlight the accuracy properties of both the DL-ROM and POD techniques by displaying the behavior of the error indicator ε_rel, defined in (23), with respect to the dimension n of the corresponding reduced trial manifold. For n < 20 the DL-ROM approximation is more accurate than the one provided by POD, and only for n = 20 do the two techniques provide almost the same accuracy.

Test 2: Linear Transport Equation
We consider two tests for this set of parametrized differential models.
Test 2.1: n_µ = 1. First, we consider the parametrized one-dimensional linear transport equation

    ∂u/∂t + µ ∂u/∂x = 0,   (x, t) ∈ (0, L) × (0, T),

whose exact solution is u(x, t) = u_0(x − µt). We set u_0(x) = (1/√(2πσ)) e^{−x²/(2σ)} and T = 1. The parameter (here n_µ = 1) represents the velocity of the travelling wave, and the parameter space is P = [0.775, 1.25]. The dataset is built by uniformly sampling the exact solution in the domain (0, L) × (0, T), with L = 1, and by considering N_h = 256 degrees of freedom for the space discretization and N_t = 200 time instances for the time one. We consider N_train = 20 training-parameter instances uniformly distributed in the parameter space P and N_test = 19 testing-parameter instances such that µ_test,i = (µ_train,i + µ_train,i+1)/2, for i = 1, . . . , N_test. This test case, and more in general hyperbolic problems, are examples in which the use of a linear approach to ROM generally yields poor performance in terms of accuracy: the dimension of the linear trial manifold must be very large, compared to the dimension of the solution manifold, in order to capture the variability of the FOM solution over the parameter space P. We set σ = 10^{−4} in order to assess the performance of DL-ROM in a scenario which is still remarkably challenging for ROMs built on linear trial manifolds. Figure 7 shows the exact solution, which here plays the role of the FOM solution, and the DL-ROM one for the testing-parameter instance µ_test = 0.8625; here, we set the dimension of the nonlinear trial manifold to n = 2, equal to the dimension n_µ + 1 of the solution manifold. Moreover, in Figure 7 we highlight the relative error ε_k ∈ R^{N_h}, for k = 1, . . . , N_t, associated to a given µ_test ∈ P ⊂ R^{n_µ} (in this case n_µ = 1), defined as

    ε_k = |u_h(t^k; µ_test) − ũ_h(t^k; µ_test)| / sqrt( (1/N_t) Σ_{k=1}^{N_t} ||u_h(t^k; µ_test)||² ),

which widens in proximity of the spike of the exact solution.
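As an illustrative sketch, the dataset described above (a travelling Gaussian sampled from the exact solution on a space-time grid for N_train parameter instances) can be generated as follows; the uniform grids and the column ordering of the snapshot matrix are assumptions of this sketch.

```python
import numpy as np

# Exact solution u(x, t) = u_0(x - mu*t), u_0 a narrow Gaussian of variance sigma
L, T, Nh, Nt, sigma = 1.0, 1.0, 256, 200, 1e-4

def u_exact(x, t, mu):
    s = x - mu * t
    return np.exp(-s**2 / (2.0 * sigma)) / np.sqrt(2.0 * np.pi * sigma)

x = np.linspace(0.0, L, Nh)                      # N_h spatial degrees of freedom
t = np.linspace(0.0, T, Nt)                      # N_t time instances
mu_train = np.linspace(0.775, 1.25, 20)          # N_train = 20 instances in P

# snapshot matrix: one column per (time, parameter) pair
S = np.column_stack([u_exact(x, tk, mu) for mu in mu_train for tk in t])
print(S.shape)    # (256, 4000) = (N_h, N_train * N_t)
```

Each column is one FOM (here, exact) solution; the narrow support of the Gaussian (σ = 10^{-4}) is precisely what makes the columns nearly mutually orthogonal and the linear POD subspace so large.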
In Figure 8 we report the exact solution and the DL-ROM one, obtained by setting n = 2, for three particular time instances. In order to compare the performance of the proposed nonlinear ROM with a linear approach, we perform the POD on the snapshot matrix and show, for the same testing-parameter instance, the optimal POD-reconstruction, i.e. the projection of the FOM (exact) solution onto the POD basis, in Figure 8. For example, by considering n = 2, the error indicator, defined in (23), is ε_rel = 8.74 · 10^−3. By considering a linear ROM technique instead, even with a reduced trial manifold of dimension n = 50, built by means of the POD, the reconstructed solution presents spurious oscillations which result in a poor approximation of the FOM solution (see Figure 8). Indeed, in order to achieve the same accuracy obtained through DL-ROM over the testing set, one has to select 90 basis functions, i.e. a linear trial manifold of dimension n = 90. Figure 9 shows the behavior of the error indicator (23) with respect to the reduced dimension n. By increasing the dimension of the nonlinear trial manifold, the performance of the DL-ROM neural network slightly improves, i.e. the error indicator decreases. This improvement is not particularly relevant because, by increasing n, the number of parameters of the DL-ROM neural network, i.e. weights and biases, increases only by a limited amount; in this way, the approximation capability of the neural network remains almost the same, and so does the error indicator (23).

Remark 1. (Hyperparameters Tuning). Hyperparameter tuning starts from the default configuration reported in Table 3. The best values are then found iteratively, by studying the impact of the variation of a single hyperparameter at a time on the validation loss. Once the best value of a hyperparameter is found, this value replaces the default value from that point on. For each hyperparameter, the tuning is performed in a range of values for which the training of the network remains computationally affordable.
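The one-hyperparameter-at-a-time procedure just described can be sketched as follows (a greedy search; function names and dictionary keys are illustrative):

```python
def tune_one_at_a_time(defaults, search_space, validation_loss):
    """Greedy hyperparameter tuning: vary one hyperparameter at a time
    over its candidate range, keeping the others fixed at their current
    best values; once the best value is found, it replaces the default
    from that point on. `validation_loss` maps a configuration to the
    loss measured on the validation set (training included)."""
    config = dict(defaults)
    for name, candidates in search_space.items():
        best_value, best_loss = config[name], float("inf")
        for value in candidates:
            loss = validation_loss({**config, name: value})
            if loss < best_loss:
                best_value, best_loss = value, loss
        config[name] = best_value  # freeze the tuned value
    return config
```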
In Figure 10 we show the impact on the loss over the validation and testing sets of the size of the convolutional kernels, of the number of hidden layers in the feedforward neural network, and of the number of neurons in each hidden layer, varying the reduced dimension n in order to find the best value of each hyperparameter over n. The final configuration of the DL-ROM neural network is the one provided in Table 4.

Table 4
Kernel Size   # Hidden Layers   # Neurons
[7,7]         4                 200

Test 2.2: n_µ = 2 Here we consider again the parametrized one-dimensional transport equation (27).
The exact solution of (27) is u(x, t) = u_0(x − t; µ), but this time the initial datum u_0(·; µ), given in (28), depends on the parameters µ = [µ_1, µ_2]^T. The n_µ = 2 parameters belong to the parameter space P = P_µ1 × P_µ2 = [0.025, 0.25] × [0.5, 1]. We build the dataset by uniformly sampling the exact solution in the domain (0, L) × (0, T), with L = 1 and T = 1, and by considering N_h = 256 grid points for the space discretization and N_t = 100 time instances for the time one. We collect, both for µ_1 and µ_2, N_train = 21 training-parameter instances uniformly distributed in the parameter space P and N_test = 20 testing-parameter instances, selected as in the other test cases. Equation (27), completed with the initial datum (28), stands as one of the most challenging problems for linear ROM techniques, because of the difficulty of accurately reconstructing the jump discontinuity of the exact solution, for a testing-parameter instance, as a linear combination of basis functions computed from the snapshots. The architecture of the DL-ROM neural network used here is the one presented in Test 2.1.
In Figure 11 we show the exact solution, which here again plays the role of the FOM solution, and the DL-ROM one, obtained by setting n = 3, equal to the dimension n_µ + 1 of the solution manifold, for the testing-parameter instance µ_test = (0.154375, 0.6375), along with the relative error ε_k, defined in (26), which is larger near the jump of the FOM solution. In Figure 12 we report the DL-ROM and optimal-POD reconstructions, together with the FOM solution, for the time instances t = 0.245, 0.495 and 0.745, and the testing-parameter instance µ_test = (0.154375, 0.6375). The dimensions of the reduced manifolds are n = 3 and n = 50 for the DL-ROM and POD techniques, respectively. With a linear ROM technique, even by setting the dimension of the reduced manifold equal to n = 50, the reconstructed solution presents spurious oscillations which lead to a poor approximation of the FOM solution; moreover, the optimal-POD solution is not able to sharply capture the discontinuity of the FOM solution. These oscillations are significantly mitigated by the use of our DL-ROM, and the jump discontinuity is accurately captured by the DL-ROM solution, as shown in Figure 12. Finally, in Figure 13 we highlight the accuracy properties of both the DL-ROM and POD techniques. In particular, the same conclusions observed in Test 2.1, namely those regarding the behaviour of the error indicator (23) with respect to the reduced dimension n, still hold. The DL-ROM technique yields an error indicator ε_rel = 2.85 · 10^−2 with n = 3, a value achieved by POD only by selecting 165 basis functions, i.e. by building a linear trial manifold of dimension n = 165.

Test 3: Monodomain Equation
We now consider the following one-dimensional coupled PDE-ODE nonlinear system (29), where L = 1, T = 2, γ = 2 and β = 0.5. The parameter µ (n_µ = 1) belongs to the parameter space P = 5 · [10^−3, 10^−2]. This system consists of a parametrized version of the Monodomain equation coupled with the FitzHugh-Nagumo cellular model, which describes the excitation-relaxation of the cell membrane in the cardiac tissue [47,48]. In such a model, the ionic current is a cubic function of the electrical potential v and linear in the recovery variable w. Eqs. (29) have been discretized in space through linear finite elements by considering N_h = 256 grid points. We use a one-step, semi-implicit, first-order scheme, similar to the one discussed in [6], for the time discretization and the treatment of the nonlinear term. The solution of this problem is a parameter-dependent travelling wave, which exhibits sharper and sharper fronts as the parameter µ gets smaller (see Figure 14). We consider N_train = 20 training-parameter instances uniformly distributed in the parameter space P and N_test = 19 testing-parameter instances, each of them corresponding to the midpoint between two consecutive training-parameter instances. Figure 15 shows the FOM solution and the DL-ROM one, obtained by setting n = 2, equal to the dimension of the solution manifold, for the testing-parameter instance µ_test = 0.0062. We also report in Figure 15 the relative error ε_k, defined in (26), which is higher in correspondence of the large gradients of the FOM solution.
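As an illustration of such a one-step, semi-implicit, first-order scheme, the sketch below treats the diffusion term implicitly and the ionic current explicitly, on a finite-difference grid rather than with finite elements; the cubic current and the linear recovery kinetics are an FitzHugh-Nagumo-type choice that may differ in details from the precise form of (29):

```python
import numpy as np

def semi_implicit_step(v, w, mu, dt, dx, gamma=2.0, beta=0.5):
    """One step of a semi-implicit, first-order scheme: the diffusion
    term is treated implicitly, the nonlinear ionic current explicitly.
    The ionic current below (cubic in v, linear in w) and the recovery
    kinetics are an illustrative FHN-type choice, not necessarily the
    exact form of (29)."""
    N = v.size
    i_ion = v * (v - 0.1) * (1.0 - v) - w          # cubic in v, linear in w
    c = dt * mu / dx**2
    # (I - dt*mu*D2) v_new = v + dt*i_ion, homogeneous Neumann BCs
    A = (np.diag(np.full(N, 1.0 + 2.0 * c))
         + np.diag(np.full(N - 1, -c), 1)
         + np.diag(np.full(N - 1, -c), -1))
    A[0, 1] = A[-1, -2] = -2.0 * c                 # reflecting boundaries
    v_new = np.linalg.solve(A, v + dt * i_ion)
    w_new = w + dt * (gamma * v - beta * w)        # explicit recovery update
    return v_new, w_new
```

Only a linear (tridiagonal) system is solved at each step, which is the point of treating the nonlinear term explicitly.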
The accuracy obtained by our DL-ROM technique, with n = 2, on the testing set is ε_rel = 3.42 · 10^−3. In order to assess the performance of DL-ROM with respect to a linear ROM technique, we report in Table 5 the maximum number of basis functions among all the clusters, i.e. the dimension of the largest linear trial manifold, required by the (local) RB method in order to achieve the same accuracy obtained through DL-ROM. By increasing the number of clusters, the dimension of the largest linear trial subspace decreases; however, this no longer holds once the number of clusters exceeds k = 32. Indeed, the dimensions of some linear subspaces become so small that the error increases with respect to the one obtained with fewer clusters; this behavior is illustrated in Figure 16. The convergence of the error indicator (23) as a function of the reduced dimension n is shown in Figure 18. For the (local) RB method, by increasing the dimension of the largest linear trial manifold, the error indicator decreases; the same occurs for the DL-ROM technique for n ≤ 20. The decay of the error indicator in the latter case is not so remarkable, for the same reason pointed out in Test 2.1. If we consider larger values of n, e.g. n = 40, overfitting occurs, meaning that the neural network model is too complex with respect to the amount of data provided to it; for this reason, by considering for example n = 40, the error indicator ε_rel increases. Finally, in Figure 19 we report the behavior of the loss function and of the error indicator (23) with respect to the number of training-parameter instances, i.e. the size of the training dataset. By providing more data to the DL-ROM neural network, its approximation capability increases, thus yielding a decrease in both the generalization error and the error indicator.
In particular, the decay of the loss function with respect to the number of training-parameter instances N_train is approximately proportional to 1/N_train³, and the one of the error indicator (23) to 1/N_train². Remark 2. (Hyperparameters Tuning). In order to perform hyperparameter tuning we follow the same procedure used for Test 2.1. We start from the default configuration and tune the size of the (transposed) convolutional kernels in the (decoder) encoder function, the number of hidden layers in the feedforward neural network and the number of neurons for each hidden layer. In Figure 20 we show the impact of the different hyperparameters on the validation and testing losses. The final configuration of the DL-ROM neural network is the one provided in Table 6.

Table 6
Kernel Size   # Hidden Layers   # Neurons
[7,7]         1                 200

We set the weight in the per-example loss function (20) equal to ω_h = 1/2. In order to justify this choice, we performed a sensitivity analysis for problem (29), as shown in Figure 21. For extreme values of ω_h, the error indicator (23) worsens by about one order of magnitude. In particular, not considering the encoder function f_E^n, which corresponds to the case ω_h = 1, yields worse performance of the DL-ROM neural network, as highlighted in Figure 21. Similarly, by taking ω_h = 0 we would neglect the reconstruction error (that is, the first term in the per-example loss function (20)); this is why the error indicator is large for ω_h = 0.1. All values of ω_h in the range [0.2, 0.9] do not yield significant differences in terms of error indicator, so we decided to set ω_h = 1/2, and, as a matter of fact, 1 − ω_h = 1/2.
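The role of ω_h can be made concrete with a small sketch of a weighted per-example loss in the spirit of (20); variable names are illustrative. The first term measures the reconstruction error of the decoder, the second the mismatch between the encoded snapshot and the reduced dynamics:

```python
import numpy as np

def per_example_loss(u_h, u_h_rec, u_n_enc, u_n_dyn, omega_h=0.5):
    """Convex combination of the reconstruction error (weight omega_h)
    and the reduced-dynamics error (weight 1 - omega_h). Setting
    omega_h = 1 discards the encoder term, omega_h = 0 discards the
    reconstruction term, mirroring the sensitivity analysis above."""
    rec = 0.5 * np.sum((u_h - u_h_rec) ** 2)
    dyn = 0.5 * np.sum((u_n_enc - u_n_dyn) ** 2)
    return omega_h * rec + (1.0 - omega_h) * dyn
```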

Conclusions
In this work we have proposed a novel technique to build low-dimensional ROMs exploiting deep learning models, in order to overcome the usual computational bottlenecks shown by classical, linear projection-based ROM techniques (such as the reduced basis method relying on proper orthogonal decomposition) when dealing with problems featuring coherent structures that propagate over time, such as transport and wave-type phenomena, or convection-dominated flows.
The proposed Deep Learning-based Reduced Order Model (DL-ROM) allows us to approximate both the solution manifold of a given parametrized nonlinear, time-dependent PDE, by means of a low-dimensional, nonlinear trial manifold, and the nonlinear dynamics of the generalized coordinates on such a reduced trial manifold, as a function of the time coordinate and the parameters. Both (i) the nonlinear trial manifold and (ii) the reduced dynamics are learnt in a non-intrusive way, thus avoiding any query to the arrays of the FOM; the former is learnt by means of the decoder function of a convolutional autoencoder neural network, whereas the latter is learnt through a (deep) feedforward neural network and the encoder function of the convolutional autoencoder.
The numerical results obtained for three different test cases show that the proposed DL-ROM technique provides sufficiently accurate solutions to parametrized PDEs featuring a low-dimensional solution manifold of dimension n_µ + 1. The proposed DL-ROM outperforms linear ROMs such as the RB method (relying on a global POD basis), as well as nonlinear approaches exploiting local POD bases, when applied both to (i) problems which are extremely challenging for linear ROMs, such as the linear transport equation or nonlinear diffusion-reaction PDEs coupled to ODEs, and (ii) problems which are more tractable by a linear ROM, such as the Burgers equation, which however requires POD bases of much higher dimension.
Regarding numerical accuracy, the proposed DL-ROM technique provides approximations that are orders of magnitude more accurate than the ones provided by linear ROMs of the same dimension. We do not obtain remarkable error decays when considering low-dimensional spaces of increasing dimension, thus making the accuracy of the two approximations comparable only when dealing with O(10²) POD basis functions, a dimension which however makes linear ROMs infeasible when moving to more involved parametrized problems in higher space dimensions. Regarding computational efficiency, we deem it not appropriate to perform comparisons on one-dimensional test cases (on meshes featuring no more than O(10³) degrees of freedom). We will assess the computational speedup of our DL-ROM technique compared to linear ROMs in future publications; we expect, however, to obtain remarkable computational gains when dealing with two- and three-dimensional problems, for which linear ROMs are not well-suited to approximate the solution of parametrized, nonlinear time-dependent PDEs. Numerical results showing that DL-ROM allows one to generate approximation spaces of dimension close to the intrinsic dimension of the solution manifold, while also providing remarkable improvements in terms of efficiency, will be published in a forthcoming paper.

Training a neural network requires computing the derivatives of the loss function with respect to its parameters. In particular, the gradient descent method requires evaluating the full gradient

∇_{θ_DF} L(θ_DF) = (1/M) Σ_{i=1}^{M} ∇_{θ_DF} L(y^i, y_L^i; θ_DF),    (A.3)

a task which might easily become prohibitive when the size M of the training dataset is very large, thus causing a single step of the gradient descent method to require a huge amount of time. The stochastic gradient descent (SGD) method reduces the computational cost associated to the computation of the gradient of the loss function by exploiting the fact that (A.3) can be regarded as an expectation over the entire training dataset.
Such an expectation can be approximated using a small set (or minibatch) of samples; hence, at each iteration the SGD method samples a minibatch of m < M data points, drawn (e.g., uniformly) from the training dataset [43], and approximates the gradient (A.3) of the loss function by the minibatch average (1/m) Σ_{i=1}^{m} ∇_{θ_DF} L(y^i, y_L^i; θ_DF).
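A minimal sketch of the minibatch approximation, on an illustrative least-squares problem (all names are ours):

```python
import numpy as np

def minibatch_sgd(per_example_grad, theta0, M, lr=0.05, m=8, n_steps=500, seed=0):
    """Minibatch SGD: at each iteration, the full-dataset gradient
    (an average over all M examples, cf. (A.3)) is replaced by the
    average over m < M uniformly sampled examples."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        batch = rng.choice(M, size=m, replace=False)  # uniform minibatch
        g = np.mean([per_example_grad(theta, i) for i in batch], axis=0)
        theta = theta - lr * g
    return theta

# Illustrative problem: per-example loss (theta - y_i)^2 / 2, whose
# exact minimizer is the empirical mean of the targets y_i.
y = np.linspace(0.0, 1.0, 40)
theta = minibatch_sgd(lambda th, i: th - y[i], [5.0], M=len(y))
```

With these settings, theta ends up fluctuating close to the empirical mean of y, the minimizer of the full-dataset loss.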
Appendix A.2. Convolutional neural network
Convolutional neural networks (CNNs) [56] are the standard neural network architecture in computer vision tasks, since they are well-suited to high-dimensional, spatially distributed data such as images. This is due to the local approach of convolutional layers, which enables them to exploit spatial correlations among pixels in order to extract low-level features of the input. The main ingredients of a convolutional layer are convolutional kernels, or filters, which consist of tensors of smaller dimension than the input. Each element of a feature map is obtained by sliding the kernel over the image and computing the discrete convolution, as shown in Figure A.23; the resulting feature maps have dimensions (N_i^1, N_i^2, N_i^3), with N_i^1 and N_i^2 depending on n_i^1 and n_i^2, respectively, on the padding and on the striding strategies, and with N_i^3 = K_i. Convolutional layers are characterized by shared parameters, that is, weights are shared by all the elements (neurons) in a particular feature map, and by local connectivity, that is, each neuron in a feature map is connected only to a local region of the input. Parameter sharing allows convolutional layers to enjoy another property: translation invariance or, more precisely, translation equivariance. This means that if the input is translated, the output is translated accordingly [43]. In particular, if we apply a translation to the input Y_0 and then compute the convolution, the result is the same we would obtain by computing the convolution first and then applying the translation to the output. The two properties above increase the efficiency of CNNs, both in terms of memory and of computational cost, with respect to DFNNs, thus making them preferable to the latter when dealing with extremely high-dimensional data.
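The dependence of N_i^1 and N_i^2 on the input size, padding and striding follows the standard output-shape formula for convolutions, which can be sketched as (names are ours):

```python
def conv_output_shape(n1, n2, kernel, stride=1, padding=0, n_filters=1):
    """Spatial output dimensions of a convolutional layer: N^1 and N^2
    depend on the input sizes n^1, n^2, on the padding and on the
    stride, while the depth N^3 equals the number of filters K."""
    N1 = (n1 + 2 * padding - kernel) // stride + 1
    N2 = (n2 + 2 * padding - kernel) // stride + 1
    return N1, N2, n_filters
```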

Appendix A.3. Autoencoder neural network
Autoencoders (AEs) [57,58] are a particular type of feedforward neural network aiming at learning, under suitable constraints, the identity function. Internally, an autoencoder has a hidden layer consisting of a code used to represent the input. We focus on undercomplete autoencoders [43], where the constraint imposed is the reduction of the dimension of the code with respect to the input and output dimensions. Considering the input y_0 = x_h ∈ R^{N_h} and the output y_L = x̃_h ∈ R^{N_h}, an autoencoder is composed of two main parts (see Figure A.24):
• the encoder function f_E^n(·; θ_E) : x_h ↦ x̃_n = f_E^n(x_h; θ_E), where f_E^n(·; θ_E) : R^{N_h} → R^n and n ≪ N_h, mapping the high-dimensional input x_h onto the low-dimensional code x̃_n. The encoder function depends on a vector of parameters θ_E ∈ R^{N_E} collecting all the weights and biases specifying the function itself;

• the decoder function f_D^h(·; θ_D) : x̃_n ↦ x̃_h = f_D^h(x̃_n; θ_D), where f_D^h(·; θ_D) : R^n → R^{N_h}, mapping the code x̃_n to an approximation x̃_h of the original high-dimensional input. Similarly to the encoder function, the decoder function depends on a vector of parameters θ_D ∈ R^{N_D} collecting all the weights and biases specifying the function itself.
The autoencoder is then defined as the composition of the two parts, x_h ↦ x̃_h = f_D^h(f_E^n(x_h; θ_E); θ_D). Autoencoder learning lies within the unsupervised learning paradigm [43], since its goal is to reconstruct the input, the target output being an approximation of the input itself. An autoencoder thus not only learns a low-dimensional representation of the high-dimensional input, but also learns how to reconstruct the input from the code, through the encoder and decoder functions.
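As a minimal illustration of this encoder/decoder structure, the sketch below builds a linear undercomplete autoencoder from the SVD of the data; the convolutional autoencoder used in DL-ROM replaces these linear maps with nonlinear ones (function names are ours):

```python
import numpy as np

def linear_autoencoder(X, n):
    """Linear undercomplete autoencoder: the encoder maps x_h in R^{N_h}
    to the code x_n in R^n (n << N_h), the decoder maps the code back.
    Taking both maps from the leading n left singular vectors of the
    data yields the optimal linear reconstruction."""
    V, _, _ = np.linalg.svd(X, full_matrices=False)
    W = V[:, :n]
    encode = lambda x_h: W.T @ x_h   # f_E^n : R^{N_h} -> R^n
    decode = lambda x_n: W @ x_n     # f_D^h : R^n -> R^{N_h}
    return encode, decode
```

For data lying exactly on an n-dimensional subspace, decode(encode(x)) reproduces x; a nonlinear autoencoder extends this to data lying on a nonlinear manifold.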
When dealing with large inputs, such as the ones arising from the discretization of system (1), the use of a feedforward autoencoder may become prohibitive, as the number of parameters (weights and biases) required may be very large. As pointed out in Appendix A.2, parameter sharing and local connectivity allow us to reduce the number of parameters of the network and the number of associated computations, both in the forward and in the backward pass; hence the idea of relying on convolutional autoencoders to build our DL-ROM technique.
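The saving granted by parameter sharing can be quantified with a simple count (illustrative sizes, matching the N_h = 256 grids used in the test cases):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv1d_params(kernel, in_channels, out_channels):
    """Weights plus biases of a 1D convolutional layer: the same kernel
    is shared across all spatial positions, so the count is independent
    of the input length."""
    return kernel * in_channels * out_channels + out_channels

# A dense 256 -> 256 layer stores 256*256 + 256 = 65792 parameters,
# while a 1D convolution with a single kernel of size 7 stores only 8.
```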