Error estimates for POD-DL-ROMs: a deep learning framework for reduced order modeling of nonlinear parametrized PDEs enhanced by proper orthogonal decomposition

POD-DL-ROMs have been recently proposed as an extremely versatile strategy to build accurate and reliable reduced order models (ROMs) for nonlinear parametrized partial differential equations, combining (i) a preliminary dimensionality reduction obtained through proper orthogonal decomposition (POD) for the sake of efficiency, (ii) an autoencoder architecture that further reduces the dimensionality of the POD space to a handful of latent coordinates, and (iii) a dense neural network to learn the map that describes the dynamics of the latent coordinates as a function of the input parameters and the time variable. Within this work, we aim at justifying the outstanding approximation capabilities of POD-DL-ROMs by means of a thorough error analysis, showing how the sampling required to generate training data, the dimension of the POD space, and the complexity of the underlying neural networks impact the solution accuracy. This decomposition, combined with the constructive nature of the proofs, allows us to formulate practical criteria to control the relative error in the approximation of the solution field of interest, and to derive general error estimates. Furthermore, we show that, from a theoretical point of view, POD-DL-ROMs outperform several deep learning-based techniques in terms of model complexity. Finally, we validate our findings by means of suitable numerical experiments, ranging from analytically defined parameter-dependent operators to several parametrized PDEs.


Introduction
Solutions to partial differential equations (PDEs) are not usually available in analytic form and need to be approximated by suitable high-fidelity methods, such as the Finite Element Method (FEM) [33,35]. The latter usually entails a suitable spatial discretization of the (bounded) computational domain Ω ⊂ R^d, d = 1, 2, 3, regulated by the step size h > 0 and yielding a set of N_h degrees of freedom, which in some cases might correspond to the vertices of the elements providing the domain discretization. High-fidelity methods are usually referred to as full order models (FOMs): they provide very accurate solutions, however resulting in computationally demanding strategies in terms of either time or resources. Within this work, we focus on a parametric setting, where in general the PDE solution u depends not only on the spatial coordinate x ∈ Ω and the time variable t ∈ T = [0, T], but also on a parameter vector µ ∈ P - the parameter space P ⊂ R^p being a compact set - namely u = u(x, µ, t). Once the problem has been discretized in space, we aim at exploring the solution manifold S_{N_h} = {u(µ, t) = [u(x_i, µ, t)]_{i=1}^{N_h} ∈ R^{N_h} : (µ, t) ∈ P × T}, evaluating the problem solution in multiple scenarios, for different parameter values. To carry out this task efficiently, as well as to tackle other multi-query tasks such as those involving uncertainty quantification and to perform real-time numerical simulations, FOMs must be replaced by efficient and reliable reduced order models (ROMs), a wide class of strategies providing very efficient results while retaining an adequate representation of the solution manifold S_{N_h}.
Linear projection-based ROMs, such as the reduced basis (RB) method relying on either greedy algorithms or Proper Orthogonal Decomposition (POD) to build a low-dimensional linear trial subspace, are widely used in the context of parametrized PDEs. Usually relying on a (Petrov-)Galerkin projection to generate the corresponding ROM, enforcing at the reduced order level the physical constraints expressed by the FOM, these strategies feature however several drawbacks, especially when dealing with time-dependent, nonlinear, and nonaffine problems, ultimately requiring suitable hyper-reduction strategies such as the Empirical Interpolation Method (EIM) [1,8,34] or the Discrete EIM (DEIM, [4]). Despite being very general and widely applied, hyper-reduction techniques usually feature an intrusive nature and require handling algebraic arrays extracted from the FOM, ultimately resulting in overwhelming computational costs when dealing with nonlinear time-dependent parametrized PDEs.
To overcome these limitations, data-driven Deep Learning-based ROMs (DL-ROMs) were recently proposed in [10,12] and similar works [26,29,32,43], exploiting the power of deep neural networks (DNNs) to both perform dimensionality reduction of a set of high-dimensional snapshot data (obtained by sampling the solution manifold) and learn parameter-to-solution maps nonintrusively. Unfortunately, these techniques involve complex architectures, which might become unfeasible to train as soon as the FOM dimension N_h increases, suffering from the curse of dimensionality in their vanilla version. To counter this issue, POD-DL-ROMs were then introduced in [14], leveraging the power of DL-ROMs and the physically-consistent dimensionality reduction achieved through POD, and training a DL-ROM network using FOM data projected onto a (possibly, large dimensional) POD space: overall, POD-DL-ROMs require lower training efforts in terms of both memory storage and computational time. The POD-DL-ROM paradigm has been tested against several problems, showing remarkable approximation capabilities in the numerical simulation of, e.g., fluid flows and fluid-structure interaction problems [15,14], cardiac electrophysiology [18], and micro-electromechanical systems [16], among others. However, a thorough numerical analysis of the POD-DL-ROM technique - connecting, e.g., the complexity of the NN architectures involved in a POD-DL-ROM, the sampling error entailed by the selection of training data, and the POD error generated while projecting those data onto a POD space, with the overall accuracy of the computed solution - is still lacking. Within this work, we aim at addressing these questions in light of a solid theoretical analysis, providing general error estimates for the POD-DL-ROM technique and assessing their validity in a series of numerical experiments involving different parametrized problems.

Literature review and existing results
Thanks to the flourishing and rapidly evolving literature of Approximation Theory, many Deep Learning-based approaches to reduced order modeling are now being justified with rigorous theoretical results and error estimates. The majority of these are grounded on a well-known result by Yarotski (2017) [40], which we report below.
Theorem (Yarotski [40]). Let b ∈ N, b ≥ 1 and 0 < ε < 1/2. Any f ∈ W^{s,∞}([0, 1]^b) can be approximated uniformly with an error of at most ε by a ReLU Deep Neural Network (DNN) having at most c log(1/ε) layers and cε^{-b/s} log(1/ε) weights, where c = c(s, b, f) is a constant.

Indeed, this result and its subsequent generalizations, see e.g. [41,20], constitute the foundation of many recent works, for instance: (i) in [9], the authors exploited these results to formulate an error analysis for general DL-ROMs; however, their analysis is limited to the time-independent case and does not resolve the curse of dimensionality, as it binds the complexity of DL-ROMs linearly to the FOM dimension N_h; (ii) Yarotski's Theorem was also considered in [11], where the authors investigated the approximation capabilities of Convolutional Neural Networks (CNNs), suggesting a strong connection between these architectures and the Fourier transform; (iii) similarly, the results in [40] are fundamental for the derivation of the approximation bounds reported in [25], which instead concern the DeepONet paradigm, an approach first proposed by Lu et al. in [28]; (iv) finally, Yarotski's Theorem and its generalizations were also employed to derive approximation bounds for deep learning-based ROM strategies that couple POD and feedforward neural networks, see, e.g., [2].
Here, we aim at proposing a similar analysis for POD-DL-ROMs, emphasizing the main differences between this approach and the existing literature.

Overall idea and paper structure
We analyze the overall approximation error entailed by the use of POD-DL-ROMs when dealing with the solution of both linear and nonlinear time-dependent parametrized PDEs, highlighting separate error contributions: those coming from the sampling of training data and the preliminary dimensionality reduction obtained through POD, and the one entailed by the use of neural networks.
In brief, the idea goes as follows. First, we show that, in the finite data regime, the overall error of a POD-DL-ROM, E_R, can be decomposed as

E_R ≤ E_S + E_POD + E_NN,

where E_S is the sampling error, E_POD is the POD projection error, and E_NN is the approximation error of the neural network model in the DL-ROM pipeline. Then, we address each of the three contributions separately.
For the first two, we rely on classical arguments that bind together the discrete and the continuous formulation of POD, see e.g. [8,34], ultimately showing that the sampling error vanishes as a function of the sample size, while E_POD is uniquely characterized by the eigenvalue decay of the data correlation matrix. In this sense, our analysis is strictly related to the one proposed in [25]. To study the neural network error, instead, we consider a specific construction that reflects the general philosophy of DL-ROM techniques. More precisely, we emphasize the fact that POD-DL-ROMs use a neural network architecture that is obtained through the combination of two networks: a feature map, φ, which captures the roughness in the parameter-to-solution operator, and a smoother decoder, ψ. In particular, we base our proof on a generalization of Yarotski's Theorem, due to Gühring et al. [20], which allows us to keep the approximation error under control during the composition step.
The paper is organized as follows: in Section 2 we formulate the problem, describing rigorously the POD-DL-ROM approach and the reducibility measures for the framework at hand; Section 3 contains the main results of this work, namely the error decomposition formula, a lower bound and an upper bound for the approximation error. Section 4 then demonstrates the advantages of POD-DL-ROMs when compared to similar deep learning-based frameworks, such as, e.g., POD+DNN and DeepONets. Finally, a series of numerical experiments validating the theoretical analysis is shown in Section 5, while the last section draws some conclusions and outlines possible further developments.

An overview of the POD-DL-ROM technique
POD-DL-ROMs provide a general-purpose ROM approach combining a data dimensionality reduction obtained through POD with the DL-ROM approach. After introducing the general class of problems we deal with, we overview the main building blocks of the POD-DL-ROM technique. For further details regarding, e.g., detailed algorithms for the offline (or training) and the online query (or testing) stages, the interested reader can refer to [14]. An extension of the POD-DL-ROM technique in view of time forecasts of the problem solution outside the training time window has been proposed in [13].

Problem formulation
Within this work, we consider time-dependent parametric PDEs of the following type:

∂u/∂t + L(µ)u + N(u; µ) = f    in Ω × T,
B(u) = 0                       on ∂Ω × T,        (1)
u|_{t=0} = u_0                 in Ω,

where:
• we highlight the explicit dependence of u on the time variable t ∈ T = [0, T] (for some T > 0) and on the input parameter vector µ ∈ P ⊂ R^p, P compact;
• L is a linear operator, whereas N is a nonlinear operator and B is the boundary operator; virtually, all these operators might be µ-dependent;
• Ω is the (bounded) spatial domain where the problem is set.
Depending on the nature of the problem, the input parameters can refer to either physical or geometrical properties of the problem at hand. We consider the formulation (1) as a general framework, since it describes a wide variety of problems arising in the fields of engineering, physics, and life sciences, to make a few examples. Introducing a computational mesh over Ω with mesh size h > 0 and a corresponding space discretization of problem (1) having N_h degrees of freedom (dofs), obtained through, e.g., the finite element method, the finite-dimensional counterpart of problem (1) provides our FOM and reads as follows:

M(µ) u̇(µ, t) + A(µ) u(µ, t) + N(u(µ, t); µ) = f(µ, t),   t ∈ T,
u(µ, 0) = u_0(µ),                                          (2)

where u(µ, t) ∈ R^{N_h} denotes the vector of the N_h dofs of the FOM solution, M(µ) ∈ R^{N_h×N_h} the mass matrix, A(µ) ∈ R^{N_h×N_h} the stiffness matrix, N(·; µ) : R^{N_h} → R^{N_h} a nonlinear map, f(µ, t) ∈ R^{N_h} the source term and u_0(µ) ∈ R^{N_h} the initial datum. The FOM (2) is then discretized in time, introducing a suitable time advancing scheme over a partition of T made by N_t time steps {t_k}_{k=1}^{N_t}. To explore efficiently the solution manifold S_{N_h} = {u(µ, t) : (µ, t) ∈ P × T} we employ the POD-DL-ROM technique, performing a two-step dimensionality reduction: first, POD (realized through randomized SVD) is applied to a set of FOM snapshots; then, a DL-ROM is built to approximate the map between (µ, t) and the POD generalized coordinates. This latter task can be achieved by relying on two neural network architectures: (i) a deep autoencoder - possibly involving convolutional layers - that extracts a set of few latent coordinates, ultimately representing the reduced-order coordinates of the ROM, and (ii) a deep feedforward neural network, to learn the map between (µ, t) and these latent coordinates. Below, we report the main building blocks of a POD-DL-ROM, originally proposed in [14]:

(i) for each parameter vector µ_j, j = 1, . . ., N_s, a snapshot matrix is collected, thus obtaining

U_j = [u(µ_j, t_1) | · · · | u(µ_j, t_{N_t})] ∈ R^{N_h×N_t};

(ii) the whole snapshot matrix is obtained stacking U_j, j = 1, . . ., N_s, namely

U = [U_1 | · · · | U_{N_s}] ∈ R^{N_h×N_s N_t};

(iii) a singular value decomposition (SVD) is performed on the snapshot matrix U, and the first N left singular vectors are retained, thus yielding U ≈ VΣW^T, where V ∈ R^{N_h×N} has orthonormal columns. Then, projecting U on the reduced linear subspace spanned by the columns of V, we obtain a snapshot matrix for the POD coefficients Q = V^T U;

(iv) the POD coefficient vectors q(µ_j, t_k), j = 1, . . ., N_s, k = 1, . . ., N_t, obtained from the columns of Q, along with the parameter vectors µ_j and the time instants t_k, are used to train a DL-ROM. This latter consists of a deep autoencoder Ψ ∘ Ψ′ and a deep feedforward neural network (to which we refer as reduced network) φ, defined as follows:

φ : R^{p+1} → R^n,   Ψ′ : R^N → R^n,   Ψ : R^n → R^N,

where φ, Ψ′, Ψ are the reduced network, the encoder and the decoder, respectively, while θ_DYN, θ_ENC, θ_DEC are their corresponding neural network weights and biases (they are omitted, hereon, for the sake of readability). The three networks are trained according to the per-example loss function

L(µ_j, t_k) = ω_N L_N(µ_j, t_k) + ω_n L_n(µ_j, t_k),    (2.1)

where

L_N(µ_j, t_k) = ∥q(µ_j, t_k) − Ψ(φ(µ_j, t_k))∥²,   L_n(µ_j, t_k) = ∥Ψ′(q(µ_j, t_k)) − φ(µ_j, t_k)∥²,

and n denotes the latent dimension of the architecture. As a matter of notation, from hereon we equip any finite dimensional space R^b (for some b ∈ N) with the ℓ² norm: thus, unless otherwise stated, we define ∥v∥ = (Σ_{i=1}^{b} v_i²)^{1/2}. It is worth remarking that L_N penalizes high reconstruction errors and L_n ensures a good representation in the latent space.
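As an illustration, the offline steps (i)-(iii) can be sketched in a few lines of NumPy; the toy map `fom_solution` below is a hypothetical stand-in for an actual FOM solver, and a plain truncated SVD replaces the randomized SVD used at scale:

```python
# Sketch of building blocks (i)-(iii): snapshot collection, SVD-based POD,
# and projection onto the POD space. The solution map below is a hypothetical
# stand-in for a FOM; at scale one would use a randomized SVD instead.
import numpy as np

rng = np.random.default_rng(0)
N_h, N_s, N_t = 200, 30, 20                      # dofs, parameter samples, time steps
x = np.linspace(0.0, 1.0, N_h)
mus = rng.uniform(0.5, 2.0, size=N_s)            # mu_j ~ U(P), P = [0.5, 2] (illustrative)
ts = np.linspace(0.05, 1.0, N_t)                 # time grid

def fom_solution(mu, t):
    # hypothetical parametrized solution playing the role of u(mu, t) in R^{N_h}
    return np.sin(np.pi * mu * x) * np.exp(-t) + 0.1 * np.cos(3 * np.pi * x) * t

# (i)-(ii): assemble the global snapshot matrix U in R^{N_h x (N_s N_t)}
U = np.column_stack([fom_solution(mu, t) for mu in mus for t in ts])

# (iii): truncated SVD -> POD basis V with orthonormal columns
V_full, S, _ = np.linalg.svd(U, full_matrices=False)
N = 10                                           # POD dimension
V = V_full[:, :N]                                # V in R^{N_h x N}

Q = V.T @ U                                      # POD coefficients, training data for the DL-ROM
print(Q.shape)
```

The columns of `Q`, paired with the corresponding `(mu, t)` inputs, are exactly the supervised data used in step (iv).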
Recalling that (µ, t) → V q̂(µ, t) ≈ u(µ, t) provides the POD-DL-ROM approximation, being q̂ the neural network approximation of the POD coefficients, the objective of the present work is to characterize the relative error

E_R = [ ∫_P ∫_T ∥u(µ, t) − V q̂(µ, t)∥² dt dµ / ∫_P ∫_T ∥u(µ, t)∥² dt dµ ]^{1/2}

in terms of the POD-DL-ROM complexity. Here, we choose to focus on E_R since it is a common accuracy measure in the ROM literature. Moreover, we highlight that the entire workflow yielding the error estimate we propose in this work is only based on the approximation error, without considering the contribution carried by the training error. The extension to more general vector energy norms, including the contribution of symmetric positive definite mass matrices to define the counterpart of norms in functional spaces like, e.g., L²(Ω) or H¹(Ω), is straightforward and is not considered here for the sake of simplicity.

POD: from the discrete to the continuous formulation
Before proceeding to the thorough analysis of E_R, we have to appropriately define the working setting, which depends on the linear dimensionality reduction. First, we notice that, even though within the POD-DL-ROM pipeline we compute the POD matrix V through the (randomized) SVD algorithm, thus using a fully data-driven procedure that employs a set of training data, the relative error E_R aims at measuring the approximation capabilities over the entire time-parameter space P × T, taking advantage of a continuous formulation. Within this section, we aim at filling the gap between the discrete and the continuous formulation of POD, highlighting links and bounds, focusing initially only on the source of error coming from the projection phase, rather than directly considering E_R: this allows us to set the ground upon which the more complex approximation results of POD-DL-ROMs are based.

We start by considering the (P × T)-discrete setting, and the fact that V results from the solution of a minimization problem; indeed, denoting by

K = (1/(N_s N_t)) Σ_{j=1}^{N_s N_t} u_j u_j^T ∈ R^{N_h×N_h}

the (discrete) correlation matrix and by σ_k² its eigenvalues (sorted in non-increasing order), it holds that [34]

(1/(N_s N_t)) Σ_{j=1}^{N_s N_t} ∥u_j − VV^T u_j∥² = min_{W ∈ R^{N_h×N} : W^T W = I} (1/(N_s N_t)) Σ_{j=1}^{N_s N_t} ∥u_j − WW^T u_j∥² = Σ_{k>N} σ_k²,

where N is the chosen POD dimension and u_j is the solution vector that corresponds to the tuple (µ, t)_j. We can proceed analogously for the (P × T)-continuous setting, by considering

K_∞ = (1/(|P| T)) ∫_P ∫_T u(µ, t) u(µ, t)^T dt dµ ∈ R^{N_h×N_h}

as the (continuous) correlation matrix and denoting by σ_{k,∞}² its eigenvalues; similarly, we can prove that there exists an optimal rank-N orthonormal matrix V_∞ ∈ R^{N_h×N} such that

(1/(|P| T)) ∫_P ∫_T ∥u − V_∞V_∞^T u∥² dt dµ = min_{W ∈ R^{N_h×N} : W^T W = I} (1/(|P| T)) ∫_P ∫_T ∥u − WW^T u∥² dt dµ = Σ_{k>N} σ_{k,∞}².

From the considerations above, we can infer that

Σ_{k>N} σ_{k,∞}² = (1/(|P| T)) ∫_P ∫_T ∥u − V_∞V_∞^T u∥² dt dµ ≤ (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ;

from the inequality above, we can remark that the data-driven POD matrix V is not optimal for the continuous formulation - which stems from the hypothesis of having infinite data samples - while being the best orthogonal matrix in terms of explained variability with respect to the training data at hand. In other words, even though V is optimal for the training data, we have no guarantee that it is optimal for the test data, too; however, since in practice we are not able to obtain the matrix V_∞, we must necessarily rely on V also in the online testing phase.

Finally, we show how the discrete and the continuous POD formulations are related: indeed, denoting by [·]_i the i-th entry of a vector, and extending this notation to matrices, we have that, ∀k, l = 1, . . ., N_h,

[K]_{kl} = (1/(N_s N_t)) Σ_{j=1}^{N_s N_t} [u_j]_k [u_j]_l,    [K_∞]_{kl} = (1/(|P| T)) ∫_P ∫_T [u(µ, t)]_k [u(µ, t)]_l dt dµ,

recalling that u_j is the solution vector that corresponds to the tuple (µ, t)_j. Upon requiring integrability (easily verified for non-trivial bounded solutions), we can use the Strong Law of Large Numbers [22] and obtain

∥K − K_∞∥_1  → 0  almost surely, as N_s, N_t → ∞,

being ∥Z∥_1 the 1-norm of the square matrix Z. By employing Bauer-Fike's theorem [35] with the 1-norm, we can state that, upon ordering, for any σ_{k,∞}², there exists σ_k² belonging to the spectrum of K such that

|σ_k² − σ_{k,∞}²| ≤ K_1(X) ∥K − K_∞∥_1,

where X is the matrix collecting the right eigenvectors of K, and K_1(X) denotes its condition number. Thus, we can conclude that, setting N as the POD dimension, it holds that

Σ_{k>N} σ_k²  → Σ_{k>N} σ_{k,∞}²  almost surely, as N_s, N_t → ∞.
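The discrete identity underlying this construction - the mean-square projection error onto the leading N POD modes equals the eigenvalue tail of K - can be checked numerically; the snapshot data below are synthetic:

```python
# Numerical check (sketch, synthetic data) of the discrete POD identity:
# (1/M) sum_j ||u_j - V V^T u_j||^2 = sum_{k>N} sigma_k^2,
# with sigma_k^2 the eigenvalues of the discrete correlation matrix K.
import numpy as np

rng = np.random.default_rng(1)
N_h, M = 80, 500                                  # dofs, number of snapshots (= N_s * N_t)
# snapshots with rapidly decaying spectrum (hypothetical data)
modes = np.linalg.qr(rng.standard_normal((N_h, N_h)))[0]
amps = 2.0 ** (-np.arange(N_h))
U = modes @ (amps[:, None] * rng.standard_normal((N_h, M)))

K = (U @ U.T) / M                                 # discrete correlation matrix
sigma2 = np.sort(np.linalg.eigvalsh(K))[::-1]     # eigenvalues sigma_k^2, decreasing

N = 10
V = np.linalg.svd(U, full_matrices=False)[0][:, :N]
proj_err = np.mean(np.sum((U - V @ (V.T @ U)) ** 2, axis=0))
tail = np.sum(sigma2[N:])
assert np.isclose(proj_err, tail, rtol=1e-6, atol=1e-12)
```

Rerunning the check with growing M gives a practical feel for the almost-sure convergence of the discrete eigenvalue tail to its continuous counterpart.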

An overlook over the reducibility measures for POD-DL-ROMs
POD-DL-ROMs couple POD, for the sake of a preliminary dimensionality reduction, with an autoencoder-based architecture to reconstruct the parameter-to-POD-coefficients map. Thus, it is at first evident that the projection-based nature of the paradigm invokes the definition of a linear reducibility measure to account for the FOM-to-POD dimensionality reduction task.
The linear Kolmogorov N-width of S_{N_h} is defined as

d_N(S_{N_h}) = inf_{V_N ⊂ R^{N_h} : dim(V_N) = N} sup_{u ∈ S_{N_h}} min_{v ∈ V_N} ∥u − v∥.

It is worth noticing that the linear Kolmogorov N-width is strictly related to the eigenvalue decay of the correlation matrix K_∞ ∈ R^{N_h×N_h}. In fact, following the same notation of Subsection 2.2, we have that

Σ_{k>N} σ_{k,∞}² = min_{W ∈ R^{N_h×N} : W^T W = I} (1/(|P| T)) ∫_P ∫_T ∥u − WW^T u∥² dt dµ ≤ d_N²(S_{N_h}).

The above relationship shows that the eigenvalue decay is an alternative (and more practical) measure of reducibility, with respect to a weaker norm. However, notice that in practice we can only approximate the quantity Σ_{k>N} σ_{k,∞}² ≈ Σ_{k>N} σ_k², which is consistent with the theory thanks to the convergence result presented in Subsection 2.2.
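In practice, the eigenvalue-decay criterion suggests a simple recipe for choosing the POD dimension N; a minimal sketch (the tolerance and the spectrum below are illustrative):

```python
# Practical reducibility criterion (sketch): pick the smallest POD dimension N
# whose discarded eigenvalue mass, relative to the total, is below a tolerance.
import numpy as np

def choose_pod_dim(singular_values, tol):
    """Smallest N such that the eigenvalue tail (modes discarded when
    retaining the first N) is at most tol^2 times the total mass."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    tail = np.cumsum(s2[::-1])[::-1]              # tail[N] = sum_{k >= N} s2[k]
    total = s2.sum()
    for N in range(len(s2) + 1):
        remaining = tail[N] if N < len(s2) else 0.0
        if remaining <= tol ** 2 * total:
            return N
    return len(s2)

svals = 2.0 ** (-np.arange(20))                   # hypothetical fast spectral decay
N = choose_pod_dim(svals, tol=1e-2)
print(N)                                          # -> 7 for this spectrum
```

Slowly decaying spectra force a much larger N for the same tolerance, which is exactly the situation where linear reducibility degrades.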
The autoencoder-based architecture of a POD-DL-ROM introduces a second level of dimensionality reduction, which operates a further compression of the information coming from the parameter-to-POD-coefficients map Q : (µ, t) → V^T u(µ, t). The nonlinear nature of the dimensionality reduction performed through the autoencoder Ψ ∘ Ψ′ (being Ψ′, Ψ the encoder and the decoder, respectively) induces a nonlinear analogue of the Kolmogorov n-width [7].
Definition 2. The nonlinear Kolmogorov n-width of the reduced manifold S_N = {V^T u(µ, t) : (µ, t) ∈ P × T} is defined as

δ_n(S_N) = inf_{Ψ′ ∈ C(R^N; R^n), Ψ ∈ C(R^n; R^N)} sup_{q ∈ S_N} ∥q − Ψ(Ψ′(q))∥.

Now, to deal with nonlinear approximation methods, we state another fundamental definition upon which the main results of this work are based.

Definition 3 (Perfect embedding). Let n ≤ N. We say that a pair of maps Ψ′ : R^N → R^n, Ψ : R^n → R^N realizes a perfect embedding of the reduced manifold S_N if Ψ(Ψ′(q)) = q for every q ∈ S_N.
In conclusion, as we did with the POD dimension N, we need to characterize the latent dimension n with a practical criterion. To do so, an extension of Theorem 3 of [9] shows that if the parameter-to-solution map G : (µ, t) → u(µ, t), and thus the parameter-to-POD-coefficients map Q : (µ, t) → V^T u(µ, t), are Lipschitz-continuous, then there exists n ≤ 2p + 3 such that δ_n(S_N) = 0.

Main results
Before stating the main result of this work, namely an upper bound that concerns only POD-DL-ROMs, we make some preliminary considerations that, instead, apply to any POD+DNN approach, i.e. we do not constrain the neural network q̂ that approximates the parameter-to-POD-coefficients map to be a DL-ROM. For this purpose, we briefly recall that the POD+DNN technique involves the reconstruction of the parameter-to-solution map through the approximation (µ, t) → V q̂(µ, t) ≈ u(µ, t), where q̂ is a generic (possibly dense) neural network.
In particular, we start by characterizing E_R through an error decomposition formula, which enables us to describe the various error contributions and formulate possible strategies to control them. Secondly, we state a lower bound result, which highlights how, regardless of the architecture of the neural network q̂, the relative error E_R can be bounded from below by a quantity depending on the POD projection. Then, we move to our upper bound result, where we quantify how complex a POD-DL-ROM should be in order to achieve a specific bound on the relative error E_R.
We initially remark that the computation of the error E_R and of other related quantities hinges upon the evaluation of integrals, possibly over high dimensional spaces, which can be effectively handled through Monte Carlo methods. In this respect, we make the following assumptions, which are assumed to hold true hereon.
Assumption 1 (Sampling criterion). Let p > 0, assume that P ⊂ R^p is compact, and denote T = [0, T] for some T > 0. We assume that the training (and testing) snapshots are sampled i.i.d. and uniformly in the parameter space, µ ∼ U(P), while a uniform grid is employed for the time variable, t ∈ {∆t, 2∆t, . . ., N_t ∆t}, where N_t ∈ N, N_t ≥ 2, and ∆t = T/N_t.
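A minimal sketch of the sampling strategy prescribed by Assumption 1 (the parameter box bounds below are hypothetical):

```python
# Sketch of Assumption 1: parameters drawn i.i.d. uniformly over a box P,
# times on a uniform grid {dt, 2dt, ..., Nt*dt}. All numeric values are
# illustrative placeholders.
import numpy as np

rng = np.random.default_rng(2)
p, N_s, N_t, T = 3, 100, 25, 2.0
P_min = np.array([0.1, -1.0, 0.5])                # lower bounds of the box P (hypothetical)
P_max = np.array([1.0,  1.0, 2.0])                # upper bounds (hypothetical)

mus = rng.uniform(P_min, P_max, size=(N_s, p))    # mu_j ~ U(P), i.i.d.
dt = T / N_t
times = dt * np.arange(1, N_t + 1)                # uniform time grid, t_k = k * dt

# training inputs: all (mu_j, t_k) pairs, as fed to the reduced network
inputs = np.array([np.concatenate([mu, [t]]) for mu in mus for t in times])
print(inputs.shape)
```

Non-uniform parameter distributions or adaptive sampling would alter the Monte Carlo arguments below, which is why the uniform i.i.d. structure is assumed explicitly.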
Assumption 2 (Parameter-to-solution map). Let G : P × T → R^{N_h} be the parameter-to-solution map, mapping (µ, t) → u(µ, t). We assume that G is Lipschitz-continuous with constant L > 0 and uniformly bounded, that is, there exist constants 0 < C_1 ≤ C_2 such that C_1 ≤ ∥G(µ, t)∥ ≤ C_2 for any (µ, t) ∈ P × T.

From these assumptions, one can easily derive a couple of auxiliary results, which will be of practical interest in the remainder and are reported below; for the sake of brevity, their proofs are postponed to Appendix A.
where the expectation is taken across all the possible realizations of the data sampling procedure.

The error decomposition formula
In the following, we state an error decomposition formula that is valid for any POD+DNN approach - and, in particular, for our POD-DL-ROM strategy - so that its formulation is not restricted to the technique at hand.
Theorem 3.1. Let G : (µ, t) → u(µ, t), for any (µ, t) ∈ P × T, be the parameter-to-solution map. Consider a POD+DNN approximation of G of the form G(µ, t) ≈ V q̂(µ, t), where q̂ : R^{p+1} → R^N is a neural network trained over a given training set, made by a collection of input parameters and the corresponding snapshot matrix, and V ∈ R^{N_h×N} is the POD projection matrix. Then, under Assumptions 1 and 2, we have

E_R ≤ E_S + E_POD + E_NN,    (4)

where, setting ū² := (1/(|P| T)) ∫_P ∫_T ∥u(µ, t)∥² dt dµ:

E_S = ū^{-1} | (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ − Σ_{k>N} σ_k² |^{1/2}

is the sampling error, which vanishes in the infinite data limit;

E_POD = ū^{-1} ( Σ_{k>N} σ_k² )^{1/2}

is the POD error, characterized by the eigenvalue decay of the discrete correlation matrix K;

E_NN = ū^{-1} ( (1/(|P| T)) ∫_P ∫_T ∥V^T u(µ, t) − q̂(µ, t)∥² dt dµ )^{1/2}

is the approximation error of the neural network, which is arbitrarily low depending on the approximation capabilities of the network q̂.
Proof. By means of the triangle inequality, we obtain

∥u(µ, t) − V q̂(µ, t)∥ ≤ ∥u(µ, t) − VV^T u(µ, t)∥ + ∥V(V^T u(µ, t) − q̂(µ, t))∥.    (5)

According to the notation of Section 2, let q(µ, t) := V^T u(µ, t). We define

E_NN := ū^{-1} ( (1/(|P| T)) ∫_P ∫_T ∥q(µ, t) − q̂(µ, t)∥² dt dµ )^{1/2},

and notice that E_NN is the only error component that depends on the neural network approximation; moreover, since V has orthonormal columns, ∥V(q − q̂)∥ = ∥q − q̂∥. We can then bound the remaining term in (5) as follows. Let K = (1/(N_s N_t)) Σ_j u_j u_j^T ∈ R^{N_h×N_h} be the discrete correlation matrix and let σ_k² be its eigenvalues. By employing the triangle inequality and the trivial inequality √(a + b) ≤ √a + √b, valid for a, b ≥ 0, we obtain

( (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ )^{1/2} ≤ | (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ − Σ_{k>N} σ_k² |^{1/2} + ( Σ_{k>N} σ_k² )^{1/2}.

In light of this, we define the sampling error as

E_S := ū^{-1} | (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ − Σ_{k>N} σ_k² |^{1/2},

and the POD error as

E_POD := ū^{-1} ( Σ_{k>N} σ_k² )^{1/2}.

Thus, we obtain the inequality in (4). In the last part of the proof we aim at showing the characteristic properties of E_S and E_POD; recalling that Σ_{k>N} σ_k² = (1/(N_s N_t)) Σ_j ∥u_j − VV^T u_j∥², we can write the sampling error in a slightly different form, namely as the (square root of the absolute) deviation of a Monte Carlo estimate from the corresponding integral. Moreover, thanks to the compactness hypothesis of Assumption 1 and the boundedness hypothesis of Assumption 2, the integrand is uniformly bounded. Finally, since E_S and E_POD depend on the number of samples and snapshots in the training set, it is natural to verify their behavior in the infinite data limit. Thanks to Assumption 1, by the Strong Law of Large Numbers, it is evident that E_S → 0 almost surely as N_s, N_t → ∞ and, by means of the results in Section 2, E_POD converges to its continuous counterpart.

Remark 1. The convergence rate for E_S can be improved by modifying Assumption 1. Indeed, Monte Carlo sampling could be replaced by other strategies: for instance, using Quasi-Monte Carlo techniques [30,3], and under suitable regularity assumptions, one obtains a faster decay of the sampling error with respect to the sample size.
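The decomposition can be verified numerically on synthetic data; the sketch below replaces the integrals over P × T by a large Monte Carlo test sample and the trained network by a small perturbation of the exact POD coefficients, and checks that E_R does not exceed the sum of the three contributions:

```python
# Numerical sanity check (sketch) of the error decomposition of Theorem 3.1,
# with the normalizations used in the proof. The parametrized solution and
# the "network" (exact coefficients plus noise) are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(3)
N_h, N = 60, 8

def solution(mu, t):                              # hypothetical parametrized solution
    k = np.arange(N_h)
    return np.cos(k * mu) * np.exp(-0.1 * k * t) / (1.0 + k)

train = np.column_stack([solution(m, t) for m in rng.uniform(0, 1, 40)
                         for t in np.linspace(0.1, 1, 10)])
test = np.column_stack([solution(m, t) for m in rng.uniform(0, 1, 400)
                        for t in np.linspace(0.1, 1, 10)])

V, S, _ = np.linalg.svd(train, full_matrices=False)
V = V[:, :N]
sigma2_tail = np.sum(S[N:] ** 2) / train.shape[1] # sum_{k>N} sigma_k^2 (discrete)

q = V.T @ test                                    # exact POD coefficients on test data
q_hat = q + 1e-3 * rng.standard_normal(q.shape)   # surrogate "network" output

u_bar2 = np.mean(np.sum(test ** 2, axis=0))       # normalization: mean of ||u||^2
proj2 = np.mean(np.sum((test - V @ (V.T @ test)) ** 2, axis=0))

E_R = np.sqrt(np.mean(np.sum((test - V @ q_hat) ** 2, axis=0)) / u_bar2)
E_NN = np.sqrt(np.mean(np.sum((q - q_hat) ** 2, axis=0)) / u_bar2)
E_POD = np.sqrt(sigma2_tail / u_bar2)
E_S = np.sqrt(abs(proj2 - sigma2_tail) / u_bar2)
assert E_R <= E_S + E_POD + E_NN + 1e-12
```

Shrinking the perturbation drives E_NN to zero while E_S and E_POD persist, which previews the lower bound of the next subsection.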

Lower bound for the relative error
POD-DL-ROMs couple classical projection-based methods, such as POD, with Deep Learning-based techniques that allow one to correctly reproduce the nonlinearity of the parameter-to-POD-coefficients map Q. This means that we still need to rely on the linear transformation represented by the POD matrix V ∈ R^{N_h×N} (or V_∞ in the infinite data limit) to expand the neural network approximation of the POD coefficients. This last consideration is crucial: indeed, the fact that the POD-DL-ROM technique hinges upon a linear decomposition forces the relative error to still depend on the eigenvalue decay of the correlation matrix; this dependence is highlighted in the lower bound provided by Theorem 3.2.
First, we derive a lower bound for E_S + E_POD: indeed, by trivially employing the definition of E_S, E_POD, and the optimality of V_∞, we immediately prove that

E_S + E_POD ≥ ū^{-1} ( (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ )^{1/2} ≥ ū^{-1} ( Σ_{k>N} σ_{k,∞}² )^{1/2} =: E_POD,∞.

It is worth remarking that E_POD,∞ only depends on the eigenstructure of the continuous correlation matrix K_∞, while it is independent of the data sampling. Thus, in the following, we aim at showing that E_POD,∞ represents a lower bound also for the relative error E_R.
Theorem 3.2. Under the same assumptions of Theorem 3.1, we have that

E_R ≥ E_POD,∞.

Proof. We immediately notice that, by optimality of the projection coefficients, for any (µ, t) ∈ P × T,

∥u(µ, t) − V q̂(µ, t)∥ ≥ ∥u(µ, t) − VV^T u(µ, t)∥,

where we recall that V is the POD matrix computed via SVD using the discrete formulation, while V_∞ is relative to the continuous formulation. Then, by the optimality of V_∞,

E_R ≥ ū^{-1} ( (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ )^{1/2} ≥ ū^{-1} ( (1/(|P| T)) ∫_P ∫_T ∥u − V_∞V_∞^T u∥² dt dµ )^{1/2} = E_POD,∞,

from which the thesis follows.
Remark 2. Since V_∞ is not available in practice, we cannot compute E_POD,∞ exactly. In practice we can use a stricter bound: leveraging quantities emerging from the proof, we actually employ the lower bound

E_R ≥ ū^{-1} ( (1/(|P| T)) ∫_P ∫_T ∥u − VV^T u∥² dt dµ )^{1/2} ≥ E_POD,∞,

which we either compute analytically (if possible) or estimate via Monte Carlo.
This result states that, no matter how accurate the neural network approximation is, the relative error E_R is still bounded from below by the variance that is not explained by the POD projection. Additionally, the lower bound does not depend on how much data we gather for the supervised training phase. Of note, this is in agreement with the results provided in the analysis of other linear decomposition-based techniques, such as DeepONets [25].
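The accuracy floor can be illustrated on synthetic data: replacing the network output by the exact POD coefficients attains the floor, while any perturbation of the coefficients can only increase the relative error (the data and the noise level below are illustrative):

```python
# Illustration (sketch) of the accuracy floor implied by the lower bound:
# the relative error cannot drop below the unexplained variance of the
# linear POD projection, regardless of coefficient accuracy. Synthetic data.
import numpy as np

rng = np.random.default_rng(4)
N_h, M, N = 50, 1000, 5
modes = np.linalg.qr(rng.standard_normal((N_h, N_h)))[0]
amps = 1.5 ** (-np.arange(N_h))                   # moderately slow spectral decay
U = modes @ (amps[:, None] * rng.standard_normal((N_h, M)))

V = np.linalg.svd(U, full_matrices=False)[0][:, :N]
u_bar2 = np.mean(np.sum(U ** 2, axis=0))
floor = np.sqrt(np.mean(np.sum((U - V @ (V.T @ U)) ** 2, axis=0)) / u_bar2)

def rel_err(q_hat):
    return np.sqrt(np.mean(np.sum((U - V @ q_hat) ** 2, axis=0)) / u_bar2)

E_R_exact = rel_err(V.T @ U)                      # perfect "network": floor attained
E_R_noisy = rel_err(V.T @ U + 0.05 * rng.standard_normal((N, M)))
assert np.isclose(E_R_exact, floor) and E_R_noisy >= floor
```

By orthogonality of the projection residual, any coefficient error adds on top of the floor rather than compensating for it.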

Upper bound for the relative error
On the basis of the error decomposition and the perfect embedding hypothesis, we now provide the main result of this work, which is contained in Theorem 3.3 and is endowed with a constructive proof founded on the approximation results of [40]. We remark that the present result is only valid for POD-DL-ROMs.

Theorem 3.3. Let G : (µ, t) → u(µ, t), for any (µ, t) ∈ P × T, be the parameter-to-solution map and suppose that Assumptions 1 and 2 hold. Let δ > 0 and 0 < ε < 1; suppose to have collected a sufficiently large amount of training data, that is, N_s and N_t such that the sampling error satisfies E_S ≤ ε/3 with probability greater than 1 − δ, and choose the POD dimension N so that E_POD ≤ ε/3. We define the parameter-to-POD-coefficients map Q : (µ, t) → q(µ, t) as Q(µ, t) = V^T G(µ, t) for any (µ, t) ∈ P × T, where V ∈ R^{N_h×N} is the reduced rank-N POD matrix computed via SVD. We assume that there exist n > 0 and maps Ψ* : R^n → R^N, Ψ′* : R^N → R^n, respectively s-times and s′-times differentiable (with s ≫ s′ ≥ 2), enjoying the perfect embedding assumption stated in Definition 3, namely

Ψ*(Ψ′*(q(µ, t))) = q(µ, t)   for any (µ, t) ∈ P × T.

We let φ* := Ψ′* ∘ Q : R^{p+1} → R^n. Then, there exist a constant c = c(P, T, L, C_1, C_2, p, n, s, s′) and a POD-DL-ROM architecture with underlying network q̂ = ψ ∘ φ : R^{p+1} → R^N, composed of a decoder ψ : R^n → R^N having at most:

• c log(ε^{-1}) layers,
• w_{n→N} = cN ε^{-n/(s−1)} log(ε^{-1}) active weights,

and a reduced map φ : R^{p+1} → R^n having at most:

• c log(ε^{-1}) layers,
• w_{p+1→n} = cn ε^{-(p+1)/(s′−1)} log(ε^{-1}) active weights,

such that, with probability greater than 1 − δ, E_R < ε.

Proof. We immediately notice that, choosing N as in the theorem statement, we derive E_POD ≤ ε/3. Then, we aim at bounding E_S = E_S(N_s, N_t): under Assumption 1, by the Weak Law of Large Numbers [22] we can infer that, for N_s and N_t sufficiently large, E_S ≤ ε/3 with probability greater than 1 − δ. Then, we are left to bound E_NN: by means of the Cauchy-Schwarz and Hölder inequalities, considering that ∥V∥_2 = 1 and that ∥u∥ ≥ C_1 by Assumption 2, it is trivial that

E_NN ≤ C_1^{-1} sup_{(µ,t)∈P×T} ∥q(µ, t) − q̂(µ, t)∥.    (6)

Therefore, we are left to bound the error due to the neural network approximation of the map Q, namely sup_{(µ,t)∈P×T} ∥q(µ, t) − q̂(µ, t)∥.
Moreover, let q̂ = ψ ∘ φ : R^{p+1} → R^N be the underlying neural network of the POD-DL-ROM. Then, by means of the triangle inequality, the perfect embedding assumption, the definition of φ*, and the Lipschitz-continuity of ψ, we derive

∥q(µ, t) − q̂(µ, t)∥ = ∥Ψ*(φ*(µ, t)) − ψ(φ(µ, t))∥ ≤ ∥Ψ*(φ*(µ, t)) − ψ(φ*(µ, t))∥ + Lip(ψ) ∥φ*(µ, t) − φ(µ, t)∥,

employing the bounds (7) and (8) to control the two terms on the right-hand side. Then, plugging the last inequality in (6), we can state that E_NN < ε/3. Finally, by means of the error decomposition formula, we derive the desired bound E_R < ε with probability greater than 1 − δ.
Remark 3. The DL-ROM paradigm proposed in [12], and applied to cardiac electrophysiology in [17], has been theoretically analyzed in [9], providing approximation bounds and a complexity analysis which shows that, in general, DL-ROMs suffer from the curse of dimensionality with respect to the number of high-fidelity dofs N_h. Relying on the present Theorem 3.3, we demonstrate how the preliminary dimensionality reduction through POD affects both the complexity of the POD-DL-ROM and its approximation capabilities. Indeed, POD-DL-ROMs avoid the curse of dimensionality of DL-ROMs at the cost of discarding the small scales contribution, which might however be relevant when considering, e.g., highly nonlinear problems showing a slow eigenvalue decay. On the other hand, POD-DL-ROMs provide a neural network architecture with a lower number of trainable weights, thus yielding a lighter training procedure in practice. Finally, we highlight that the a priori choice of employing DL-ROMs or POD-DL-ROMs must be based exclusively on the linear reducibility of the problem and the availability of computational resources.

Comparative analysis with deep learning-based existing strategies
On the basis of the results of the previous section, we comment on the advantages of POD-DL-ROMs when compared with other deep learning-based strategies present in the literature, namely:
• simple DNNs to approximate the POD (or kernel-POD) coefficients, which results in the widely used POD+DNN approach [6,21,36,39];
• the POD-DeepONet architecture, which was proposed in [27] and is based on the classical DeepONet approach [28];
• the technique presented in [31], which aims at reconstructing the parameter-to-solution map by coupling linear projection methods and residual networks, and which we will hereon refer to as lin+ResNets;
• the CNN architecture for operator learning proposed in [11], whose analysis is based on the Fourier decomposition.

POD-DL-ROMs vs POD+DNNs: a matter of regularity
The purpose of this subsection is to highlight how the POD-DL-ROM approach provides a suitable setting to establish tighter bounds on the model complexity when compared to generic POD+DNNs, especially when the parameter-to-solution map is not regular.
It is worth remarking that, under the hypotheses of Theorem 3.3, the number of layers L_POD-DL-ROM of the POD-DL-ROM network architecture and the total number of active weights w_POD-DL-ROM can be estimated explicitly. We expect that, in general, n ≪ N = N(ε); moreover, n ≤ 2p + 3 since the parameter-to-POD-coefficients map is Lipschitz-continuous, due to Assumption 2. Thus, it is evident that the majority of the neural network complexity is concentrated in the decoder, which has to perform the most difficult task, namely decoding the information provided by the latent coordinates. Instead, the reduced network only aims at providing an alternative representation of the time-parameters vector (µ, t), such that it is made as easy as possible for the decoder to reconstruct the POD coefficients. Noting that w_POD-DL-ROM depends exponentially on s, we can control the complexity of the POD-DL-ROM by choosing s as large as possible, namely s ≫ s′ ≥ 2.
Essentially, we aim at finding a representation (9) of the POD coefficients through the composition of an encoder Ψ′_* that absorbs all the irregularity of the identity map I = Ψ_* ∘ Ψ′_*, and a decoder Ψ_* that is extremely regular. We highlight that the perfect embedding Assumption stated in Definition 3 is critical; indeed, under the hypotheses of Theorem 3.3, leaving out only the perfect embedding assumption, we may be tempted to trivially use Yarotsky's Theorem [40] to construct a ReLU DNN with L_POD+DNN layers and w_POD+DNN active weights in order to control the relative error with E_R < ε. Notice that:
• the number of layers L_POD+DNN is of the same order as L_POD-DL-ROM;
• the estimate of the number of active weights w_POD+DNN can only take advantage of mild regularity assumptions on G (and Q), which is only Lipschitz-continuous.
However, it is evident that Theorem 3.3 only provides a theoretical result, offering a different perspective that enhances the complexity estimate of POD+DNNs. Indeed, within the framework stated by Theorem 3.3, given an accuracy level ε, one could take advantage of the POD-DL-ROM theoretical setting, and thus of the perfect embedding Assumption, to construct a proper architecture that approximates the parameter-to-solution map while keeping E_R < ε, and then notice that the resulting architecture is, in general, a POD+DNN. The difference in practice is represented by the training procedure. Indeed, notice that training a network like the one involved in a POD+DNN with the classical supervised loss formulation, by letting ω_n = 0 in (2.1) and thus without taking advantage of the encoder, does not ensure the recovery of an adequate representation in the latent space. Instead, if we train the network relying on the POD-DL-ROM paradigm, namely taking ω_n > 0 in (2.1), we actually employ the encoder to implicitly enforce the architecture to satisfy the perfect embedding Assumption, and then discard the encoder in the online testing phase.
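Schematically, the two training regimes just described (ω_n = 0 for plain POD+DNN supervision, ω_n > 0 for POD-DL-ROM training) differ only in the latent-consistency term of the loss. A minimal numpy sketch, where all function and weight names other than ω_n are illustrative stand-ins for the actual architecture:

```python
import numpy as np

def pod_dl_rom_loss(u_pod, mu_t, encoder, decoder, reduced_net,
                    omega_h=0.5, omega_n=0.5):
    """Sketch of a two-term POD-DL-ROM-style loss: a reconstruction term on
    the POD coefficients, plus a latent-consistency term that (for omega_n > 0)
    implicitly enforces the perfect embedding assumption."""
    phi = reduced_net(mu_t)      # reduced network: (mu, t) -> latent coordinates
    u_rec = decoder(phi)         # decoder: latent coordinates -> POD coefficients
    z = encoder(u_pod)           # encoder: POD coefficients -> latent coordinates
    rec_term = np.sum((u_pod - u_rec) ** 2)
    latent_term = np.sum((z - phi) ** 2)
    return omega_h * rec_term + omega_n * latent_term
```

Setting omega_n = 0 recovers a purely supervised POD+DNN-style loss; the encoder branch is used only during training and is discarded in the online phase.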
Suppose now that N ≫ n: trivially, we have that w_POD+DNN ≳ w_(p+1)→n; moreover, w_POD+DNN ≳ w_n→N upon requiring that n/(s−1) < p+1, which provides an estimate for the regularity of the decoder in the representation (9), that is s > n/(p+1) + 1. In practice, given that n ≤ 2p + 3, we can safely assume that s ≳ 3 + 1/(p+1), and finally s ≥ 4. Thus, if the parameter-to-solution map G is only Lipschitz-continuous and the perfect embedding Assumption is satisfied for s ≥ 4, POD-DL-ROMs achieve a tighter bound on the model complexity when compared to general POD+DNN approaches: this is due to the fact that there exists a better representation (in terms of regularity) φ_*(µ, t) of the time-parameters vector (µ, t) that can be recovered by the reduced network.
Until now, we considered the case where the parameter-to-solution map is only Lipschitz-continuous; however, it is interesting to consider cases where the map G exhibits higher regularity, and to see how this increased regularity affects the complexity of both POD-DL-ROMs and POD+DNNs in terms of number of active weights. Indeed, by means of arguments similar to the ones employed previously, and thanks to the Theorem due to Yarotsky [40] recalled in Section 1.1, assuming that G ∈ W^{r,∞}(P × T; R^{N_h}), we obtain improved complexity estimates. Thanks to the fact that the exact reduced map φ_* of Theorem 3.3 would now be min{r, s′}-times differentiable, and assuming that N ≫ n, it is trivial to verify that w_POD+DNN ≳ w_(p+1)→n; furthermore, w_POD+DNN ≳ w_n→N holds if n/(s−1) < (p+1)/r, that is s > nr/(p+1) + 1, which gives the estimate s ≳ (2 + 1/(p+1))r + 1 and finally s ≥ 3r + 1. The meaning of the last estimate is that, if the parameter-to-solution map is extremely regular (namely, r → ∞), it becomes more and more difficult for POD-DL-ROMs to guarantee lower complexity than simple POD+DNNs, since the perfect embedding Assumption should be verified for s → ∞. This is rather intuitive: indeed, if the parameter-to-solution map is extremely regular, we do not need to recover a better representation φ_*(µ, t) of the time-parameter vector (µ, t) in order to make it easier for the underlying neural network to learn the solution manifold.

POD-DeepONets and POD-DL-ROMs: a comparison
In this subsection, we aim at analyzing the POD-DeepONet architecture from a theoretical standpoint, showing its close relationship with POD-DL-ROMs when dealing with problems whose general formulation can be reduced to (2). We let X be a Banach space and consider a compact subset K_1 ⊂ X and a compact subset K_2 ⊂ R^d, where d denotes the number of spatial (or spatio-temporal) dimensions of the problem at hand. Defining W ⊂ C(K_1) as a compact subset, we suppose that we aim at learning the operator G_∞→∞ : W → C(K_2), where the subscript highlights that the considered operator is a map between infinite-dimensional spaces. We first consider a DeepONet architecture [28] employed to reconstruct G_∞→∞, which in its unstacked formulation combines the outputs of two different neural networks through a scalar product. In particular, we define the branch net b : W → R^N as the neural network that processes information about the input function φ ∈ W, and the trunk net τ : R^d → R^N, which aims at encoding the coordinate input y ∈ R^d in a set of basis functions. Then, we can define the DeepONet approximation as Ĝ(φ)(y) = Σ_{j<N} b_j(φ) τ_j(y) (10), and note that N describes the number of basis functions employed in the decomposition (10); thus, N plays the same role as the POD dimension in the POD-DL-ROM architecture. Based on the analysis proposed in [25], we can split the DeepONet operator Ĝ as the composition of E_m, A_{m→N} and R_τ, which are defined as follows:
• the encoder operator is the map E_m : C(K_1) → R^m such that, given a set of points {x_i}_{i=1}^m ⊂ K_1, E_m(φ) = [φ(x_1), …, φ(x_m)]; it is worth noticing that E_m is well defined since any continuous function can be evaluated pointwise;
• A_{m→N} : R^m → R^N is the approximation operator; thus, we can decompose the branch net of the DeepONet operator as b = A_{m→N} ∘ E_m;
• recalling that τ : R^d → R^N is the trunk net, we define the τ-induced reconstruction operator R_τ : R^N → C(K_2) as R_τ(β) = Σ_{j<N} β_j τ_j.
In a more compact formulation, we retrieve the classical architecture of DeepONets, namely Ĝ = R_τ ∘ A_{m→N} ∘ E_m. POD-DeepONets were recently introduced in [27], and
the test cases considered in that paper confirm better approximation accuracy when compared with classical DeepONets. The methodology consists in substituting the trunk net with the corresponding rows of the POD matrix. The drawback is that POD-DeepONets can only approximate operators defined as G_∞→N_h : W → R^{N_h}, losing the capability of mapping between infinite-dimensional spaces. Supposing initially to deal with stationary, time-independent problems and denoting by v_j ∈ R^N the j-th row of the POD matrix V ∈ R^{N_h×N}, we define the expansion operator L_v : R^N → R^{N_h} componentwise as (L_v q)_j = v_j · q, and the POD-DeepONet operator as Ĝ = L_v ∘ q, where q is the corresponding branch net, which now approximates the underlying POD coefficients. It is worth noticing that, by employing the vector formulation, we can write Ĝ(φ) = V q(E_m(φ)). Then, we need to adapt the POD-DeepONet framework to the problem considered within this work (2), where even the input parameter space is finite-dimensional, thus eliminating the need for the encoder operator E_m. Indeed, POD-DeepONets for finite-dimensional-input problems, involving the reconstruction of the map G_p→N_h : R^p → R^{N_h}, take the compact form Ĝ(µ) = V q(µ), where µ ∈ P ⊂ R^p, with P compact. It is worth noticing that in this case the branch net coincides with the approximation operator A_{m→N}. Finally, in order to include also the time-dependence, we could adopt two different strategies:
• we could treat the time t as a spatial coordinate in a DeepONet-like fashion, leading to a POD matrix of dimension N_h N_t × N; this increases the possible impact of the curse of dimensionality, while offering the opportunity to deal with time-dependent basis functions;
• alternatively, we may consider the time t as an additional parameter, a choice which reduces the computational requirements and is consistent with the POD-DL-ROM approach, leading to the construction of time-independent global spatial basis functions.
Within this comparison, for the sake of consistency, we choose to employ the latter approach. Thus, aiming at reconstructing the map (µ, t) → u(µ, t), we could employ different neural network architectures; for instance, choosing a DL-ROM architecture as the branch net of the POD-DeepONet retrieves the POD-DL-ROM approach, while employing a vanilla DNN as the branch net results in the POD+DNN approach. The comparison between POD-DL-ROMs and POD+DNNs is extensively treated in the previous subsection.
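The structural relation just discussed can be sketched in a few lines of numpy: a DeepONet output is the product of a branch vector with a (learned) trunk matrix, and the POD-DeepONet substitution replaces the trunk with the fixed POD matrix V. All names and the single-layer "networks" below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
N_h, N, p = 200, 8, 2          # FOM dofs, number of basis functions, parameters

def branch(mu, W=rng.standard_normal((N, p + 1))):
    """Hypothetical branch net q: (mu, t) -> N coefficients
    (a single linear layer stands in for a DNN or a DL-ROM)."""
    return W @ mu

def trunk(y):
    """Vanilla DeepONet trunk: learned basis functions on the mesh
    (an illustrative cosine feature map stands in for a trained net)."""
    return np.cos(np.outer(y, np.arange(1, N + 1)))     # shape (N_h, N)

# POD-DeepONet: the trunk is replaced by a fixed (orthonormal) POD matrix V.
V = np.linalg.qr(rng.standard_normal((N_h, N)))[0]

mu_t = rng.standard_normal(p + 1)
y = np.linspace(0.0, 1.0, N_h)
u_deeponet = trunk(y) @ branch(mu_t)      # sum_j b_j(mu, t) tau_j(y)
u_pod_deeponet = V @ branch(mu_t)         # sum_j q_j(mu, t) v_j
```

Swapping `trunk(y)` for `V` is exactly the substitution described in the text; the branch net is the only trainable component left.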
Finally, inspired by the DeepONet approach, we notice that extending the content of the present paper to the case of infinite-dimensional input parameters is straightforward and introduces an additional source of error, namely the encoding error, that ultimately depends on the variability of the input parameters and their spatial discretization; for a thorough discussion on the topic, we refer the reader to, e.g., [25].

Learning POD coefficients with ResNets
The ResNets-based approach proposed in [31] couples linear decompositions and residual networks (ResNets) to reconstruct field-to-solution maps, a strategy which is inherently close to POD-DL-ROMs. We start our analysis by examining the proposed architecture and by adapting it to the problem formulation considered within the present work.
Indeed, we immediately notice that the lin+ResNet architecture requires every residual layer to have input dimension equal to its output dimension: iterating, for a fully residual network, the input of the network must have the same dimension as the network output. Such a constraint on the architecture is managed in [31] by projecting both the input fields and the output targets onto two linear subspaces of equal dimension N ≪ N_h, where N_h is the FOM dimension; the output targets are numerically approximated on the same mesh and projected onto a subspace of dimension N as well. The approach results in a sequence of maps of the form R^{N_h} → R^N → R^N → R^{N_h}, where the linear projection is usually carried out by employing POD, Karhunen-Loève expansions [37] or active subspaces [42]. However, when dealing with finite-dimensional parameter inputs instead of fields (for instance (µ, t) ∈ R^{p+1} with p+1 < N), it may occur that the ResNet input dimension (p+1) is different from the output dimension N; to fill the gap, it is necessary to employ, for instance, a dense layer R^{p+1} → R^N as the first layer of the architecture. Thus, we will consider the sequence of maps R^{p+1} → R^N → R^N → R^{N_h}. The lin+ResNets approach ultimately aims at providing a constructive way to build a neural network in terms of breadth and depth.
The breadth, which may be intuitively defined as the maximum number of neurons per layer in the network, coincides with N, the characteristic dimension of the preliminary dimensionality reduction. In order to favour compressed representations, the authors of [31] suggest keeping the latent dimension k of the ResNet, which can be identified with the dimension of the nonlinearity added at each layer, as low as possible. Indeed, the residual map between the layers z_l ∈ R^N and z_{l+1} ∈ R^N can be identified with z_{l+1} = z_l + A_2 σ(A_1 z_l + b_1), where A_1 ∈ R^{k×N}, A_2 ∈ R^{N×k} and σ is the activation function; the total number of weights per layer is then O(Nk). However, in contrast to our approach, the authors did not propose a way to identify k: we remark that the discussion on the latent dimension n of the POD-DL-ROM architecture is fundamental, because it allows us to set a tighter bound on the complexity of the decoder network in terms of active weights. Furthermore, the authors developed approximation bounds on the underlying ResNet complexity in terms of its depth, employing the connection between ResNets, Neural ODEs and control flows [5]. The bound on the ResNet depth enables the user to control the ℓ^2 error on the solution (and, by extension, the relative error too) with a suitable bound ε by employing O(ε^{−1}) layers. Thus, on the basis of the complexity analysis, we can straightforwardly state that POD-DL-ROMs outperform the ResNets-based approach both in terms of number of layers and in terms of number of active weights, supposing for instance N ≫ n and s ≥ n + 1, which are reasonable assumptions. Indeed, N ≫ n is satisfied whenever the nonlinear Kolmogorov n-width decays much faster than the eigenvalues of the correlation matrix, a phenomenon usually encountered in applications; the condition s ≥ n + 1 is ensured by requiring s ≥ 2p + 4 (since n ≤ 2p + 3), that is, the decoder map must be sufficiently regular.
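A single residual layer with latent dimension k, in the form just described, can be sketched as follows (the exact parametrization in [31] may differ; matrix names and the LeakyReLU choice are illustrative):

```python
import numpy as np

def residual_layer(z, A1, A2, b1, alpha=0.1):
    """One lin+ResNet-style residual layer with latent dimension k:
    z_{l+1} = z_l + A2 @ sigma(A1 @ z_l + b1), with A1 in R^{k x N} and
    A2 in R^{N x k}, so the layer carries O(N k) weights."""
    sigma = lambda x: np.maximum(alpha * x, x)   # LeakyReLU as an example sigma
    return z + A2 @ sigma(A1 @ z + b1)

N, k = 20, 3
rng = np.random.default_rng(1)
A1 = rng.standard_normal((k, N))
A2 = rng.standard_normal((N, k))
b1 = rng.standard_normal(k)
z = rng.standard_normal(N)
z_next = residual_layer(z, A1, A2, b1)
n_weights = A1.size + A2.size + b1.size      # 2*N*k + k, i.e. O(N k) per layer
```

Note that with all weights set to zero the layer reduces to the identity, which is what makes appending layers to an already trained ResNet a small perturbation.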
Despite the disadvantage on the complexity front, we remark that ResNets constitute one of the most suitable paradigms to implement adaptive-depth architectures, since adding a layer to an already trained architecture produces an arbitrarily small perturbation of the network output; for a more detailed analysis of the lin+ResNets training, we refer the reader to [31].

The effect of the POD basis optimality on the network complexity
Within this subsection, our purpose is finally to show how choosing the POD basis as global spatial basis functions in the linear decomposition leads to a reduced complexity of the underlying neural network, comparing in detail CNNs for operator learning and POD-DL-ROMs. In particular, we notice that, within the POD-DL-ROM approach, the reconstruction of the approximated solution at the high-fidelity level depends on the decomposition assumption u(µ, t) ≈ Σ_{j<N} q̃_j(µ, t) v_j, where N denotes the POD dimension. Analogously, the recent work on approximation bounds for CNNs proposed in [11] strives to reconstruct a decomposition between global spatial basis functions that are strictly related to the Fourier modes, and a set of coefficients, that is, u(µ, t) ≈ Σ_{j<C} â_j(µ, t) f_j, where the sum is over C terms (the number of channels in the input and output is O(C)).
In the following, we assume that u(·, µ, t) ∈ C^α(Ω) for any (µ, t) ∈ P × T, where α ≥ 1 denotes the spatial regularity, and ε > 0 is the desired accuracy level; we then describe the three main differences between the CNN-based approach and the POD-DL-ROM technique: • The convolutional block is limited to uniformly spaced mesh points (h being the spacing parameter) in square domains, while POD-DL-ROMs are more versatile both in terms of domain shape and mesh properties.
• The architecture proposed in [11] consists of two different blocks: the dense block is devoted to the approximation of the parameter-dependent coefficients, while the convolutional block strives to reconstruct the spatial basis functions. Instead, POD-DL-ROMs compute the spatial basis before the training of the neural networks by means of SVD [34] or randomized SVD [38], through an unsupervised learning criterion: in principle, this means that POD-DL-ROMs do not need any active weights to reconstruct the spatial basis functions, while the CNN approach needs O(ε^{−2/(2α−1)} log(h^{−1})) weights to learn them (we refer the reader to Theorem 2 in [11]).
• In the decomposition employed in [11], C plays the role of the reduced dimension: it is the analogue of the POD dimension N employed within the POD-DL-ROM technique. In the following, we exploit an optimality result fulfilled by the POD basis to show that the complexity of the neural network approximating the parameter-to-coefficient map is lower in the case of POD-DL-ROMs when compared to the approach proposed in [11].
The quasi-optimality of the POD decomposition in its discrete formulation guarantees that, with an N-term truncation and provided a sufficient amount of data has been suitably sampled, no linear decomposition captures as much variance as the discrete formulation of the POD decomposition, so that the reduced dimension C of [11] satisfies the inequality C > N with probability 1 − δ (see Subsection 2.2 and Appendix A). Furthermore, we assume that: (i) N ≫ n as usual, since we expect the nonlinear Kolmogorov n-width to decay (much) faster than the linear reduced dimension N; (ii) u(·, µ, t) ∈ C^α(Ω) for any (µ, t) ∈ P × T for some α ≥ 1, to comply with the hypotheses of Theorem 2 of [11]; (iii) the parameter-to-solution map has regularity r, i.e. G ∈ W^{r,∞}(P × T; R^{N_h}); (iv) the decoder map is adequately regular, namely n/(s−1) < (p+1)/r (s ≥ 3r + 1 is sufficient, as in Subsection 4.1).
We recall that Theorem 2 in [11] provides the estimate C = O(ε^{−2/(2α−1)}). Therefore, in the worst-case scenario N = O(ε^{−2/(2α−1)}); however, depending on the singular value decay, which in some cases might even be exponential (e.g. stationary elliptic PDEs, analytic parameter-to-solution maps, see [34]), we actually obtain improved estimates. Thus, we can conclude that, if the above hypotheses are verified, the overall complexity of POD-DL-ROMs in terms of active weights is lower than (or equal to) the complexity of the CNN architecture proposed in [11].
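The role played by the singular value decay can be made concrete: choosing N as the smallest dimension whose discarded singular-value energy stays below a target ε^2 is a standard way to control E_POD. A minimal numpy sketch (function name hypothetical; a toy snapshot matrix with exponentially decaying singular values stands in for actual PDE data):

```python
import numpy as np

def pod_basis(S, eps):
    """Return the POD basis of smallest dimension N such that the discarded
    singular-value energy satisfies sum_{j>N} s_j^2 / sum_j s_j^2 <= eps^2,
    which bounds the relative Frobenius projection error by eps."""
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    N = int(np.searchsorted(energy, 1.0 - eps ** 2)) + 1
    return U[:, :N], N

# toy snapshot matrix with exponential singular value decay (illustrative)
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((100, 100)))
S = Q @ np.diag(2.0 ** -np.arange(100.0)) @ rng.standard_normal((100, 80))
V, N = pod_basis(S, 1e-3)
```

With exponential decay, N stays small even for tight tolerances; a slow (algebraic) decay instead forces N to grow like the worst-case estimate above.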

Numerical experiments
Within this section, we present different numerical tests aiming at validating the theoretical analysis proposed in the previous sections. In particular, we focus on (i) the error bounds of Theorems 3.2-3.3 and the error decomposition formula, (ii) the role of the reduced dimension N and of the total number of snapshots N_data, and (iii) the comparison against recent approaches proposed in the literature, in light of the theoretical results of Sections 3 and 4. In particular, the numerical experiments involve:
a) a benchmark test case with an analytically defined operator, which allows us to know a priori the properties of the parametric operator (like, e.g., the regularity of the parameter-to-solution map) in order to validate the theoretical estimates on the network complexity;
b) a linear 1D Initial Boundary Value Problem (IBVP), to show how to select N_data and N in order to minimize the a priori error (given by the sum of E_S and E_POD), then validating a posteriori the network complexity as a function of the relative error;
c) a nonlinear 2D time-dependent IBVP in a non-conventional domain, to show the effectiveness of the POD-DL-ROM approach when dealing with more complex problems, validating also the lower and upper bounds on the relative error E_R which stem from the theoretical analysis.
We remark that the complexity analysis of POD-DL-ROMs and related approaches is discussed from a theoretical point of view only in terms of the approximation error; however, when numerical experiments are addressed, we also have to take into account the training error, which plays a major role especially when the network is sufficiently deep or wide, or when data are limited. For the same reason, in our numerical experiments we mainly address the complexity study in terms of the number of active weights w, since the latter is less sensitive to the training error (when compared to the depth L). Thus, the experimental complexity analysis presented here may not exactly reflect the estimates provided in the previous sections, but it validates the theory qualitatively. Moreover, within the present section, aiming at mitigating the effect of the training error on the error estimates, we employ several ad hoc strategies:
• we employ early stopping to prevent overfitting;
• the approximation results in terms of network complexity are achieved in an error range [ε_1, ε_2] deemed appropriate for the chosen number of samples N_data: in practice, the training error depends on data availability;
• for a fixed number of active weights, we adjust the network architecture by randomly searching for the configuration that minimizes the training error; we keep the depth of the network as low as possible, in order to ensure convergence to a suitable minimum and avoid expensive training loops;
• starting from educated guesses, we look for the best training hyperparameters (namely, the learning rate and the learning rate decay).
Finally, we remark that, in order to comply with the hypotheses of the Theorems of Section 3, we limit the numerical experiments to generic dense layers and employ the LeakyReLU activation function ρ(x) = max{αx, x}; unless otherwise stated, we set α = 0.1. The optimization procedure is carried out by employing the Adam algorithm [24].
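For reference, the activation just introduced reads, in a minimal numpy form:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """LeakyReLU activation: rho(x) = x for x >= 0 and alpha * x otherwise,
    i.e. rho(x) = max{alpha * x, x} for 0 < alpha < 1."""
    x = np.asarray(x)
    return np.maximum(alpha * x, x)
```

The small negative slope α = 0.1 keeps gradients nonzero for negative inputs, while preserving the piecewise-linear structure assumed by the ReLU-type approximation results invoked in Section 3.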

Benchmark test case
We begin our experimental analysis by considering a benchmark test case similar to the one described in [11], involving the reconstruction of an analytically defined parametric operator u_β = u_β(x, µ). Within this numerical test we vary β ∈ {3/2, 7/3, 3} and analyze the three resulting cases independently. Notice that the hyperparameter β > 0 controls the regularity of the parameter-to-solution map. Indeed, u_{3/2}(x, ·) ∈ W^{1,∞}(P) \ W^{2,∞}(P), u_{7/3}(x, ·) ∈ W^{2,∞}(P) \ W^{3,∞}(P) and u_3(x, ·) ∈ W^{3,∞}(P) \ W^{4,∞}(P); thus β = 3/2, 7/3, 3 correspond to r = 1, 2, 3 respectively, where r denotes the regularity of the parameter-to-solution map, in agreement with the notation of this paper. Furthermore, the problem does not depend on the time variable, thus we set N_t = 1, N_data = N_s and p = 2 (instead of p = 3) to comply with the theoretical framework of the present work. Moreover, we discretize the problem in space by means of a uniform discretization with N_h = 1000. We select n = 5 ≤ 2p + 3 = 7, to ensure both a suitable compression and an adequate representation in the latent space, N_s = 500, and N so as to control the variability retained by the preliminary linear dimensionality reduction.
We then proceed with a complexity analysis, showing a comparison of the results against the CNN approach considered in [11], the POD+DNN framework and the lin+ResNets technique. We remark that, for the sake of fairness and consistency, we keep the batch size during training equal to B = 20 for every comparison considered in the benchmark test case. Then, for any r ∈ {1, 2, 3}, we estimate the approximation error E_R on the respective test set consisting of N_s^test = 10^4 samples. From a theoretical standpoint, we immediately notice that G : µ → u(µ) ∈ W^{r,∞}(P; R^{N_h}); then, from the findings of Section 4, since n ≪ N, we can infer the corresponding complexity bounds. Thus, owing to the fact that in the POD-DL-ROM approach the perfect embedding Assumption with coefficients s, s′ is enforced thanks to the peculiar loss formulation, we expect POD-DL-ROMs to yield a less steep increase (when compared to POD+DNNs) in the model complexity as the accuracy level decreases, whenever the decoder map is suitably regular, which is equivalent to requiring s > (5/3)r + 1. Figure 1 demonstrates that the latter behavior is more likely to occur as the regularity r of the parameter-to-solution map decreases.
We then compare POD-DL-ROMs against the lin+ResNets approach; for the latter, we limit the analysis to the case where the basis functions are yielded by POD, for the sake of consistency. We thus fix the latent space dimension of the residual layers to k = 5 and, from the estimates obtained in Section 4, we recall that the complexity bound of lin+ResNets in terms of the number of active weights is in general independent of the regularity of the parameter-to-solution map. We thus remark that the lin+ResNets approach does not take advantage of any regularity assumption on the parameter-to-solution map: we then expect a similar trend as r varies in {1, 2, 3}. Nonetheless, if the trained POD-DL-ROM architectures are able to find an adequate representation in the latent space inducing a very regular decoder, that is s > 6, we can ensure that POD-DL-ROMs outperform the lin+ResNets approach in terms of complexity: this behavior is indeed observed in Figure 2. Finally, we consider the comparison against the CNN approach of [11]: if the decoder map is sufficiently regular (from the theoretical analysis we derive the condition s > (5/3)r + 1), POD-DL-ROMs take advantage of the basis optimality to achieve a less steep increase of complexity as the error bound E_R < ε decreases: this behavior is indeed observed in Figure 3, in the cases where the regularity of the parameter-to-solution map is low (r = 1, 2). Moreover, differently from the CNN-based technique, the POD-DL-ROM algorithm does not require learning the basis functions, which therefore do not affect the overall complexity of the underlying network.

1D Initial Boundary Value Problem
The present test case is designed to highlight the advantages of POD-DL-ROMs when compared to the other considered approaches, even when dealing with time-dependent parametrized problems. Moreover, before starting the training process, we show a priori how to choose the hyperparameters N, N_s, N_t, based on the analysis of E_S and E_POD. In particular, we consider a parametrized 1D IBVP, complemented by a suitable initial condition. We start by analyzing the dependence of E_S on N_s, N_t and N; for the sake of clarity, we specify that the sampling criterion employed in the a priori analysis below is based on the theoretical analysis of the entire work: thus, we assume µ ∼ U(P) iid and that t is sampled from a uniform grid of step ∆t = 1/N_t. To analyze the effect of N_s on the sampling error, we fix N_t = 1000 and generate a group of datasets with N_s ∈ {2^k : k = 1, …, 7}: as shown in Figure 4, the decay has slope −1/4 and is independent of the chosen value of N. Conversely, we fix N_s = 100 and vary N_t ∈ {2^k : k = 3, …, 9}, validating experimentally in Figure 4 the expected decay, independently of N. We then move to the analysis of the projection error, showing in Figure 5 how E_POD decays with N while being mostly independent of N_s and N_t. We notice that the present analysis is carried out before the training of the underlying neural network, and allows us to know a priori how much variance is not accounted for due to the sampling (E_S) and to the initial dimensionality reduction (E_POD), letting us calibrate the values of N, N_s, N_t before starting the expensive training procedure. The idea is to choose N, N_s, N_t so as to guarantee that E_POD and E_S are suitably small, so that we can control the relative error E_R with a strict bound, which is provided by the error decomposition of Theorem 3.1. Thus, based on the results of the present a priori analysis, we choose N_s = 50, N_t = 20, N = 20.
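The sampling criterion of the a priori analysis (µ ~ U(P) iid, t on a uniform grid of step T/N_t) can be sketched as follows; the function name and signature are illustrative:

```python
import numpy as np

def sample_training_points(N_s, N_t, p, P_low, P_high, T=1.0, seed=0):
    """Draw N_s iid parameter samples mu ~ U(P) and pair each of them with
    N_t time instants on the uniform grid of step T / N_t, yielding the
    N_s * N_t (mu, t) pairs used to assemble the snapshot set."""
    rng = np.random.default_rng(seed)
    mus = rng.uniform(P_low, P_high, size=(N_s, p))   # iid uniform on P
    ts = np.arange(1, N_t + 1) * (T / N_t)            # uniform time grid
    MU = np.repeat(mus, N_t, axis=0)
    Tt = np.tile(ts, N_s)
    return np.column_stack([MU, Tt])
```

For instance, the final choice above (N_s = 50, N_t = 20) yields 1000 space-time training points per spatial dof.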
We then move our focus to the comparison of the POD-DL-ROM technique against the other approaches in terms of complexity, showing the relation between the relative error E_R and the number of active weights employed in the underlying neural network. Notice that, since the analytical solution of the IBVP is not available, here we are not provided with any information on the regularity of the parameter-to-solution map. Nonetheless, the experimental results of the complexity analysis confirm our theoretical expectations: when dealing with parameter-to-solution maps arising from parametric PDEs, the complexity of POD-DL-ROMs increases more slowly than that of POD+DNNs as the relative error decreases. Indeed, the latent representation of the POD-DL-ROM approach induces a decoder that is extremely regular, that is s ≫ 2, which enables a slow increase in network complexity, as suggested by the theoretical approximation bounds of Theorem 3.3 and validated in Figure 7. Similarly, we notice that the results relative to the comparison between POD-DL-ROMs and lin+ResNets are in agreement with the theory, demonstrating again how lin+ResNets are outperformed in terms of complexity by POD-DL-ROMs whenever the latter achieve an extremely regular decoder map thanks to an adequate latent representation. Finally, when compared to the Fourier-inspired CNN technique, the number of active weights of POD-DL-ROMs shows a slower increase as the relative error E_R decreases, as shown in Figure 7; as proved theoretically in Section 4, the magnitude of the slope is strongly linked to the optimality of the basis functions. Moreover, we validate how the burden of learning the set of basis functions weighs heavily on the underlying CNN complexity, which shows a remarkable difference with respect to the POD-DL-ROM approach in terms of the number of active weights, not only in the slope magnitude but also in the absolute sense. The observed behavior highlights how crucial it is, in terms of complexity, to consider a fixed set of optimal basis functions instead of a learnable one. This validates the theoretical considerations and concludes our comparison based on model complexity, demonstrating how POD-DL-ROMs outperform all the considered techniques when tackling more complex problems, for which the regularity of the parameter-to-solution map is low or unknown a priori.

2D nonlinear Initial Boundary Value Problem
The last test case involves a time-dependent nonlinear parametrized diffusion equation with a non-affine source term in an unconventional domain, with final time T = 0.05, where:
• h(x, y, µ) = 0.1 + 10y sin(µπx) represents a non-affine term, with µ ∈ P = [5, 7] the parameter regulating the spatial frequency of h = h(x, y, µ);
• letting E_{a,b}(x, y) be the ellipse of axes a and b and center (x, y), we set D_1 = E_{0.2,0.2}(0.5, 0.4) and D_2 = E_{0.3,0.1}(1.0, 0.2); the domain Ω is then obtained by removing D_1 and D_2 from a rectangle of base (0, 1).
The neural network architecture consists of:
• a reduced network of 3 hidden dense layers of 10 units each;
• an encoder and a decoder with 5 hidden dense layers of 25 units each.
We then evaluate the lower bound (m/M) Ẽ_POD, the upper bound given by the error decomposition formula E_NN + E_S + E_POD, and the relative error E_R, according to the theoretical framework of Section 3.
We show both the lower and the upper bound results in Figure 9, displaying as well the error contributions E_NN, E_S, E_POD, to assess the way they affect the relative error E_R. We then remark again that it is crucial for POD-DL-ROMs to provide both an adequate neural network approximation of the parameter-to-solution map and a suitably large POD dimension. Indeed, we notice that in the present test case, especially for low values of N, E_NN shows a marginal contribution to the upper bound value when compared to the sampling error E_S and the projection error E_POD. Furthermore, as expected, we observe the strong dependence of the lower bound (m/M) Ẽ_POD on the POD dimension, demonstrating again the importance of choosing an adequate value of N. Finally, we assess a posteriori that the number of samples in the training set is suitable, since the sampling error E_S does not heavily influence the upper bound on the relative error.

Conclusions
The main goal of this work is to suggest effective and practical strategies to set up a POD-DL-ROM, stemming from a rigorous analysis of the technique, in order to control the approximation accuracy, measured in terms of the relative error E_R, which is linked to relevant features and hyperparameters that can be effectively regulated. To accomplish this task, we analyze the error E_R, providing a lower bound that depends only on the projection-based nature of the method. Then, through the error decomposition formula and the upper bound result, we highlight the contributions of sampling, POD projection and neural network approximation; in particular: (i) on the basis of the analysis of the sampling error E_S, we propose a family of strategies to adopt in the data collection phase in order to ensure the convergence E_S → 0 in the limit of infinite data, providing also a decay estimate, through Monte Carlo analysis, in terms of the number of sampled snapshots; (ii) we determine a practical criterion based on the eigenvalue decay to control E_POD in terms of the reduced dimension N; (iii) starting from the approximation results proposed in [40], we estimate the complexity of the underlying neural network required to reach a given accuracy.
Then, relying on the aforementioned findings, we compare the POD-DL-ROM paradigm with other architectures that are widely used in the literature, namely DL-ROMs [9,12,17], POD+DNNs [6,21,36], POD-DeepONets [27], lin+ResNets [31], as well as CNNs [11], showing the strengths of the POD-DL-ROM strategy, especially when dealing with low-regularity maps. Ultimately, we demonstrate the outstanding approximation properties of POD-DL-ROMs, which motivate the excellent performance already observed in a variety of test cases analyzed in the recent literature [14,15] and in the present work. Several research directions could stem from the present paper; for instance, more efficient sampling criteria arising from Monte Carlo analysis could be implemented: we mention variance reduction techniques and Quasi Monte Carlo methods [3], among others. On the other hand, one could consider ad hoc layers to be employed in the reconstruction of parameter-to-POD-coefficients maps, instead of relying purely on dense layers; however, this latter option would require novel and precise approximation results for the layers considered. Moreover, an alternative formulation could split the time- and parameter-dependence, avoiding treating time as an additional parameter, similarly to what has been proposed in [23], in order to further sharpen the approximation bounds proposed in this paper.

A.2 Proof of Proposition 1
Thanks to Assumption 1, we trivially obtain ∆t = T N_t^{-1} = O(N_t^{-1}) and we set t_i = i∆t. Letting f = f(µ, t) be the (sufficiently regular) integrand of the integral that we want to approximate, we obtain
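The O(N_t^{-1}) quadrature behavior underlying Proposition 1 can be illustrated with a minimal sketch, assuming a composite right-rectangle rule on the grid t_i = i∆t, ∆t = T/N_t (the actual quadrature rule used in the proof may differ):

```python
import numpy as np

def time_average(f, T, Nt):
    """Right Riemann-sum approximation of (1/T) * int_0^T f(t) dt on the
    uniform grid t_i = i * dt, dt = T / Nt; first-order accurate, O(Nt^{-1})."""
    dt = T / Nt
    t = dt * np.arange(1, Nt + 1)   # grid points t_1, ..., t_Nt = T
    return np.sum(f(t)) * dt / T
```

Doubling N_t roughly halves the quadrature error, consistently with the O(N_t^{-1}) rate.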

A.3 Quasi-optimality of the discrete formulation of the POD decomposition
We base the following analysis on the results for the (P × T)-continuous problem proposed in [34]. We first recall that, by definition, V_∞ ∈ R^{N_h × N} (where N is the POD dimension) is optimal for the (P × T)-continuous formulation, that is, with respect to the L²(P × T; R^{N_h}) norm. Formally, we set δ, ε > 0 and, assuming u(µ, t) ∈ L²(P × T; R^{N_h}), we define T : L²(P × T) → R^{N_h} as

Tg := ∫_{P×T} u(µ, t) g(µ, t) d(µ, t)   ∀g ∈ L²(P × T).
The adjoint operator of T, namely T*, enjoys the standard duality property. Moreover, recall the definition of the (continuous) correlation matrix (3) and denote by (σ²_{k,∞}, ζ_k) its eigenpairs (where {ζ_k}_k denotes an orthonormal basis); the Hilbert–Schmidt (HS) norm of T is then defined accordingly. We denote by T_{N,∞} the rank-N Schmidt approximation of T, and by T_N = VV^T T its approximation by means of the discrete POD formulation. Theorem 6.2 and Proposition 6.3 in [34] show that the rank-N Schmidt operator, and therefore the basis V_∞, are optimal with respect to the HS-norm, namely they retain the most variability. Thus, since a.s. convergence implies convergence in probability, we derive that ∀δ > 0, ∀0 < ε < ε_max, ∃N_s, N_t :
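The discrete counterpart of this optimality is the Eckart–Young–Schmidt property of the SVD: the projector VV^T built from the N leading left singular vectors of the snapshot matrix is the best rank-N approximation in the Frobenius (discrete HS) norm, and the squared projection error equals the discarded spectral energy. A minimal sketch:

```python
import numpy as np

def discrete_pod(U, N):
    """Discrete POD of a snapshot matrix U (N_h x number of snapshots).
    Returns the POD basis V (N leading left singular vectors) and the
    squared Frobenius-norm projection error, which equals the sum of the
    discarded squared singular values."""
    V_full, S, _ = np.linalg.svd(U, full_matrices=False)
    V = V_full[:, :N]
    discarded_energy = float(np.sum(S[N:]**2))
    return V, discarded_energy
```

This identity, ||U − VV^T U||_F² = Σ_{k>N} σ_k², is what makes the eigenvalue-decay criterion for choosing N computable in practice.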

Figure 1 :
Figure 1: Benchmark test case: model complexity comparison between POD-DL-ROMs and POD+DNNs as the parameter-to-solution regularity r varies in {1, 2, 3}. The trends are displayed through solid lines, which fit the collected results in the least squares sense.

Figure 2 :
Figure 2: Benchmark test case: model complexity trend of POD-DL-ROMs and the lin+ResNets approach for different values of the regularity r of the parameter-to-solution map.
10(2µ³ − 3µ² + µ) cos(x) + (2|1 − 2µ| − 1) sin(x), while µ ∈ P = [0, 1] and T = 1. Thus, p = 1 and we can fix n = 5 = 2p + 3 to ensure an adequate representation in the latent space, according to the framework presented in this paper. We collected synthetic data generated with a high-fidelity model solved on a uniform grid of N_h = 100 points: we generate a test set of N_s^test = 100 samples of N_t^test = 200 snapshots each with a Matlab-based PDE solver, sampling µ ∼ U(P) iid and t from a uniform grid of step ∆t_test = T/N_t^test.
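The sampling scheme just described (µ drawn iid and uniformly from P, t on a uniform grid) can be sketched as follows. This is a sketch only: `u_fn` is a placeholder for the solution map (whose full expression is truncated above), and the spatial interval [0, 2π] is an assumption suggested by the cos(x), sin(x) terms.

```python
import numpy as np

def sample_snapshots(u_fn, Nh=100, Ns=100, Nt=200, T=1.0, seed=0):
    """Assemble a snapshot matrix: mu ~ U(0, 1) iid, t on a uniform grid
    of step T / Nt, x on a uniform grid of Nh points (assumed in [0, 2*pi]).
    `u_fn(x, mu, t)` is a hypothetical, vectorized-in-x solution map."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, Nh)
    mus = rng.uniform(0.0, 1.0, size=Ns)
    ts = np.arange(1, Nt + 1) * (T / Nt)
    # one column per (mu, t) pair, Ns * Nt columns in total
    return np.stack([u_fn(x, mu, t) for mu in mus for t in ts], axis=1)
```

The resulting N_h × (N_s · N_t) matrix is exactly the object fed to the discrete POD step.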

Figure 4 :
Figure 4: 1D IBVP test case: decay of the sampling error E_S with respect to N_s, N_t and N.

Figure 5 :
Figure 5: 1D IBVP test case: decay of the projection error E_POD varying N_s, N_t and N.

Figure 6 :
Figure 6: 1D IBVP test case: comparison between the "true" solution (solid black line) and the most accurate POD-DL-ROM prediction (dashed red line) to demonstrate that the variability of the solution manifold is correctly reproduced.
4) \ (D_1 ∪ D_2); • the Dirichlet and the Neumann boundaries are Γ_D = ∂D_1 ∪ ∂D_2 and Γ_N = ∂Ω \ (∂D_1 ∪ ∂D_2), respectively. Through this numerical experiment we aim at verifying the upper bound and lower bound results presented in Section 3. To do so, we generate the training set and the test set input-output pairs through the numerical solution of the discretized problem on a mesh of N_h = 1666 dofs by means of P1-FEM, employing a Forward Euler time-advancing scheme and the Newton method to handle nonlinearities. The training set consists of N_s = 20 samples, relative to µ ∼ U(P) iid, of N_t = 30 snapshots each, sampling t from a uniformly spaced time grid of step T/N_t. The test set consists of N_s^test = 30 samples, evaluated on the same time grid employed for the training set. Then, for each N ∈ {2^k, k = 0, . . ., 5}, we train a POD-DL-ROM of latent dimension n = 2p + 1 = 5, which is composed of:
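Once the six models (N = 1, 2, 4, 8, 16, 32) are trained, each is scored on the test set through the relative error E_R. A minimal sketch, assuming E_R is measured in the Frobenius norm over the matrix of test snapshots (the paper's precise definition may average over samples differently):

```python
import numpy as np

def relative_error(U_true, U_pred):
    """Relative test error: Frobenius norm of the mismatch between true
    and predicted snapshot matrices, normalized by the norm of the truth."""
    return float(np.linalg.norm(U_true - U_pred) / np.linalg.norm(U_true))
```

Evaluating this quantity for each N, alongside the bound contributions, yields the curves reported in Figure 9.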

Figure 7 :
Figure 7: 1D IBVP test case: comparison between POD-DL-ROMs and other techniques in terms of the number of active weights. The solid line represents the least squares fitting of the log-log data.

Figure 8 :
Figure 8: 2D IBVP test case: domain and boundary specification (upper left), comparison between the "true" solution (upper right) and the POD-DL-ROM prediction (lower left), and visualization of the absolute error (lower right), in the case N = 32.

Figure 9 :
Figure 9: 2D IBVP test case: error bounds analysis varying the POD dimension N.