Non-linear Manifold Reduced-Order Models with Convolutional Autoencoders and Reduced Over-Collocation Method

Non-affine parametric dependencies, nonlinearities and advection-dominated regimes of the model of interest can result in a slow Kolmogorov n-width decay, which precludes the realization of efficient reduced-order models based on linear subspace approximations. Among the possible solutions, there are purely data-driven methods that leverage autoencoders and their variants to learn a latent representation of the dynamical system, and then evolve it in time with another architecture. Despite their success in many applications where standard linear techniques fail, more has to be done to increase the interpretability of the results, especially outside the training range and not in regimes characterized by an abundance of data. Not to mention that none of the knowledge on the physics of the model is exploited during the predictive phase. In order to overcome these weaknesses, we implement the non-linear manifold method introduced by Lee and Carlberg (J Comput Phys 404:108973, 2020) with hyper-reduction achieved through reduced over-collocation and teacher–student training of a reduced decoder. We test the methodology on a 2d non-linear conservation law and a 2d shallow water models, and compare the results obtained with a purely data-driven method for which the dynamics is evolved in time with a long-short term memory network.


Introduction
In real world engineering scenarios, when performing outer loop applications such as optimization, uncertainty quantification, sensitivity analysis, parametric partial differential equations (PDEs) often need to be solved numerically numerous times. However, relying on the mathematical properties of some parametric PDEs, the computational cost for many query problems can be drastically reduced taking into account previous results on a set of training parameters: the procedure for the design of reduced-order models (MORs) is divided in an offline (training) stage, during which a set of training solutions is collected, and an online (testing or predictive) stage, which employs the compressed information from the previous step to predict the solutions of the PDE of interest for unseen parameters. This reduction is performed numerically defining a low-dimensional global basis devised in the offline stage, and can be carried out independently of the class of numerical methods chosen: finite element (FEM), spectral element (SEM), discontinuous Galerkin (DGM), and finite volumes method (FVM). One of the most employed model-order reduction method (MOR) is the reduced basis method [1,2].
Depending on the parametric dependency and mathematical nature of some PDEs, various issues may occur: the Kolmogorov n-width (KnW) is used to characterize the approximability of the solution manifold, that is the set of parameter-dependent solutions of the PDE, by a linear trial subspace. A slow decaying KnW is a symptom of the difficulties in the design of efficient ROMs: this results in the necessity of using a high number of reduced basis or proper orthogonal decomposition (POD) method's modes, corrupting the efficiency of the ROMs till the point that the gain into the computational cost becomes irrelevant. One class of PDEs where this behaviour is evident are time-dependent advection-dominated PDEs. Moreover, non-linear PDEs require hyper-reduction procedures to make the reduced equations independent of the number of degrees of freedom of the full-order model (FOM).
Recently, leveraging machine learning's advances in manifold learning, a class of ROMs that employ a non-linear trial manifold built with convolutional autoencoders (CAEs) [3] was developed by Carlberg et al. [4]. One of the benefits of this approach is the possibility to employ a small latent dimension of the ROMs, thus overcoming the slow decay of the KnW for some parametric PDEs, at the expense of introducing additional nonlinearities from the neural networks (NNs) and sometimes more substantial training costs in the offline stage. The properties of the non-linear manifold methods include the need of less stabilization mechanisms, the less intrusiveness on the FOM solvers-they are in fact equations-based rather than fully-intrusive-and the possibility to apply them for a much broader class of parametric PDEs, differently from ROMs devised specifically for advection-dominated problems.
A hyper-reduction scheme for non-linear manifold Least-Squares Petrov-Galerkin (NM-LSPG) and non-linear manifold Galerkin (NM-G), is introduced in [5]: it relies on Gauss-Newton tensor approximation (GNAT) [6] hyper-reduction method and shallow masked autoencoders to select only the degrees of freedom that explain the dynamics and therefore restrict efficiently the decoder and the discretized residuals. As we will see in our test cases, the reconstruction error of the autoencoder employed empirically bounds from below the errors of all the other non-linear manifold ROMs built upon. Therefore, we devise a method that is independent on the choice of the architecture: a sparse shallow autoencoder is not necessary anymore, and any NN architecture, like CAEs, could be in principle employed. This frees the way to imposing additional inductive biases that help to speed up the offline stage and to achieve accurate approximations of the discrete solution manifolds, a crucial requirement. Moreover, in some cases, reconstructing the residuals with GNAT is not efficient, still because of a slow decaying KnW, so we choose to employ the reduced over-collocation hyper-reduction method (ROC) [7]: in this case the equation's numerical residual is not reconstructed on a global basis, but collocated on some nodes of the mesh called collocation points.
Once the CAE reaches a satisfactory approximation of the discrete solution manifold, purely data-driven NN PDE can be trained to predict the latent dynamics: the gold standard that is being established for this task are long-short term memory networks (LSTM). Their online computational cost is low even w.r.t linear ROMs, but some new issues appear: their accuracy depends on the regularity of the latent dynamics, especially when predicting the solutions for parameters outside the training range, in the extrapolation regimes; they require hyperparameters tuning, and all the connections to the PDEs model are completely lost, resulting in a loss of interpretability of the results. The non-linear manifold hyper-reduced ROMs we develop solve these issues, at the expense of a higher computational cost in the online stage, since at each time step a physics-based residual is minimized. Moreover, a posteriori error estimates are available [4,5]. In our test cases, we compare these two approaches to enlight their differences, weak and strong points. The structure of this paper is as follows. In Sect. 2 we delve into the topic of manifold learning which has many connections with reduced-order modelling, especially since the recent entry of machine learning in the design of ROMs. We will proceed introducing the Kolmogorov n-with (KnW), and we will show that some classes of parametric PDEs suffer from the so called slow decaying Kolmogorov n-width. In Sect. 3, the non-linear manifold (NM) reduced-order models based on the work of Carlberg et al. [4] are introduced. We will focus on the non-linear manifold least-squares Petrov-Galerkin method (NM-LSPG). Afterwords, we describe our new hyper-reduced ROMs: NM-LSPG with reduced over-collocation (NM-LSPG-ROC) and NM-LSPG with reduced over-collocation and teacher-student training of a compressed decoder (NM-LSPG-ROC-TS). These two approaches, to the best of the authors' knowledge, are introduced here for the first time. In Sect. 4, the new model order reduction (MOR) methods are tested on a 2d parametric non-linear time-dependent conservation law model and a 2d parametric non-linear time-dependent shallow water equations model. In Sect. 5, a discussion on the results obtained follows, and in the conclusive Sect. 6 possible future directions of research are explored.

Manifold Learning
The subject of manifold learning, classified as a topic of machine learning, had its unique flavour in model order reduction even before nowadays breakout of scientific machine learning [8]. The workhorse of the model order reduction community is POD or SVD. A lot of real applications though, required new methods to approximate the solution manifold in a nonlinear fashion. The symptom of this behaviour is a slow decaying Kolmogorov n-width. Some approaches rely on the locality of the validity region of a linear approximation with POD, both in the parameter space and in the spatial and temporal domains, others implement non-linear dimensionality reduction methods from machine learning [9]: kernel principal component analysis (KPCA), Isomap, clustering algorithms. Non-linear MORs include approximations by rational functions, splines or other non-linear functions collected in a dictionary [10].
Interpolatory approaches of the solution manifold with respect to the parameters have been developed, sometimes combined with non-linear dimension reduction techniques like KPCA and its variants: interpolation with geodesics on the Grassmann manifold [11], inter-polation on the latent space obtained with Isomap dimensionality reduction method [12,13], interpolation with optimal transport [14], dictionary-based ROM that make use of clustering in the Grassmannian manifold and classification with neural networks (NN) [15], local kernel principal component analysis [16]. At the same time, domain decomposition approaches tackled locality in space [17,18].
One particular class of dimension reduction techniques is represented by autoencoders, and more generally by other architectures that rely on NNs. In the recent literature many achievements are brought by CAEs, and by extension by Generative Adversarial Networks (GANS), Variational Autoencoders (VAEs), Bayesian convolutional autoencoders [3]: in [19] convolutional autoencoders are utilized for dimensionality reduction and long-short Term Memory (LSTM) NNs or causal convolutional neural networks are used for time-stepping; in [20] the evolution of the dynamics and the parameter dependency is learned at the same time of the latent space with a forward NN and a CNNs on randomized SVD compressed snapshots, respectively; and in [21] spatial and temporal features are separately learned with a multi-level convolutional autoencoder.
In order to extend these architectures to datasets not structured in orthogonal grids, geometric deep learning [22] is called to the task. There are not many works on geometric deep learning applied to model order reduction for mesh-based simulations that achieve the same accuracy of CNNs, yet. Promising results are reached by an architecture that employs graph neural networks (GNNs) and a physics-informed loss [23]. In the future, the potential of GNNs will probably be leveraged extending the range of applicability of nowadays frameworks.
The setting we will base our studies on, does not depend directly on the numerical method used to discretize the parametric PDEs at hand (FVM, FEM, SEM, DGM), so the mathematical formulation will generically be founded on models represented by a parametric system of time-dependent (but also time-independent) PDEs, consisting of a non-linear parametric differential operator G and of the boundary differential operators B, B 0 that represent the boundary and initial conditions respectively, where P is the parameter space, U are the state variables and is the 2D or 3D spatial domain. This formulation includes also coupled systems of PDEs. We assume that the solutions belong to a certain Banach or Hilbert space Y , varying (µ, t) ∈ P × [0, T ]. The solution manifold M is represented by the set

Approximability by n-Dimensional Subspaces and Kolmogorov n-Width
We want to remark some results available in the literature, in order to state and comment, for our needs, the problem of solution manifold approximability. In particular, our benchmarks belong to a class of parametric PDEs for which the Kolmogorov n-width decays slowly. Thus, classical Petrov-Galerkin projection with POD needs to be overcome with non-linear methods in place of POD to achieve efficient ROMs.
Let (X , · ) X be a complex Banach space, and K ⊂ X a compact subspace, the Kolmogorov n-width (KnW) of K in X is defined as Let (Y , · Y ) be a complex Banach space and L : K ⊂⊂ X → Y . In our framework K is the parameter space, possibly infinite dimensional and L is the solution map of the system of parametric PDEs at hand, from the parameter space to the solution manifold. In order to define L we have to suppose that for each parameter in K there is a unique solution in Y . Following [24], it can be proved that for holomorphic L, thus not necessarily linear, the Kolmogorov n-width decay is one polynomial order below the Kolmogorov n-width of the parameter space K .

Theorem 1 [24, Theorem 1] Suppose u is a holomorphic mapping from an open set O ⊂ X into Y and u is uniformly bounded on O,
If K ⊂ O is a compact subset of X , then for any s > 1 and t < s − 1, In particular, if the hypothesis of the previous theorem are satisfied and if K is a finite dimensional linear subspace, the Kolmogorov n-width decay is exponential. In general, elliptic PDEs, affinely decomposable with respect to the parameters, satisfy the hypothesis of the previous theorem [25,26]. Not always nonlinearities cause a slow decaying KnW: using Theorem 1, in [24] they prove that the parametric PDE on a bounded Lipschitz domain ⊂ R 3 with homogeneous Dirichlet boundary conditions, where C α is the space of Hölder functions, satisfies the hypothesis of the previous theorem. Actually, for Hölder functions the KnW is bounded above by n −α/3 , which is not a fast convergence, but if instead a belongs to the Sobolev space W s,∞ ( ) then the upper bound is n s/3 [27]. We will consider a good KnW decay if it has a higher infinitesimal order than n −1 .
The same is not valid in the case of the simplest linear advection problem. We briefly report some results from the literature on classical hyperbolic PDEs, for (t, μ) ∈ K = [0, 1] 2 with the standard norm and Y = L 2 ([0, 1]), where, the results are respectively proven in [28] and [29]. This behaviour is not restricted only to advection-dominated problems. Intuitively also solution manifolds that are characterized by a parametrized locality in space suffer from slow decaying KnW, like elliptic problems with singular sources parametrically moving in the domain [30].
Our newly developed ROMs, should solve the issue of slow decaying KnW in the applications, guaranteeing a low latent or reduced dimension of the approximate solution manifold. This, because the linear trial manifold, frequently generated by the leading POD modes, is substituted with a non-linear trial manifold, parametrized by the decoder of an autoencoder. The test cases we present in Sect. 4, were chosen in order to be advection-dominated and particularly not suited for linear ROMs, as shown in [5] for the 2d Burgers' equation, that has a close relationship with the NCL problem.

Remark 1 (Extensions of the Kolmogorov n-width to non-linear approximations)
The autoencoders we implement in our test cases are at least continuous as composition of continuous activations and linear functions. In the literature there exist possible non-linear extensions of the KnW such as the manifold width [31], library widths [32], and entropy numbers, for more insights see [10].

Singular Values Decomposition and Discrete Spectral Decay
From the discrete point of view the same problematic in tackling the reduction of parametric PDEs with slow KnW decay is encountered: in this case the discrete solution manifold is actually the set of discrete solutions of the full-order model for a selected finite set of parameters. Singular Value Decomposition (SVD) or eigenvalue decomposition (for symmetric positive definite matrices), usually employed on the snapshots matrix for the evaluation of the reduced modes, are characterized by the fact that modes are linear combinations of the snapshots. This is not enough to approximate snapshots that are orthogonal with respect to the collection of the training set. In practice, looking at the residual energy retained by the discarded modes, is an indicator of approximability with linear subspaces. Let us assume that d is the total number of degrees of freedom of the discrete problem, N train is the number of training snapshots, and, r , the reduced dimension such that, r ≤ N train d. Then, X ∈ R d × R N train is the matrix that has the training snapshots {U i } N train i=1 as columns, V ∈ R d × R r are the reduced modes from the SVD of X , and {σ i } d i=1 are the increasingly ordered singular values. It is valid the following relationship of the residual energy (to the left) with the KnW (to the right), d i=r +1 where · F is the Frobenious norm.
Even though some problems have a slow decaying KnW, this affects only the asymptotic convergence of the ROMs w.r.t. the reduced dimension, while for some applications a satisfying accuracy of the projection error is reached with less than 100 modes, as shown in our benchmarks. So what is actually lost in MOR for problems with a slow decaying KnW is the fast convergence of the projection error w.r.t. the reduced dimension, not the possibility to perform a MOR with enough modes. Moreover, the discrete solution manifold's KnW depends on the time step and spatial discretization size, so that, especially when a coarse mesh is employed, the Knw decays faster w.r.t. the KnW of the continuous solution manifold.

Convolutional Autoencoders
We have chosen to overcome the slowly decaying KnW problem employing autoencoders [3] as non-linear dimension reduction method substituting POD. Some ROMs predict the latent dynamics on a linear trial manifold with artificial neural networks [33], so are still classified as linear ROMs and, in fact, they are still affected by the slow KnW decay. We remark that while some non-linear approaches to model order reduction are specifically tailored for advection-dominated problems [34,35], autoencoders are a more general approach. On the other hand, they are also particularly suited to advection-dominated problems with respect to local and/or partitioned ROMs that implement domain decomposition, even when nonlinear dimension reduction techniques are employed locally [16]: this is because considering a non-discrete parametric space, like the time interval of a simple linear advection problem, an infinite number of local linear and/or non-linear ROMs would be needed to counter the slow decaying KnW. As anticipated, we implement convolutional autoencoders [3] in libtorch the C++ frontend of PyTorch [36]. We remark that the procedure we developed, considering also teacher-student training of a reduced decoder in Sect. 3.4, can be generally extended to any architecture that can approximate with a sufficiently good accuracy the solution manifold through a low-dimensional latent representation. The choice of a CAE is particularly beneficial when the solution snapshots are associated to a structured mesh and when the components of the vectorial solution fields to be approximated have similar features so that the convolutional filters can be shared among the channels of the CAE.
Let us define X h ⊂ R d as the state discretization space with d the number of degrees of freedom and h the discretization step. The snapshots are divided in a training set If the problem has different states and/or vectorial states the training and test set are split in channels for each state and/or component. For example in the 2d non-linear conservation law, we consider two channels, one for each velocity component. In the shallow water test case, we consider three channels, one for each velocity component and one for the free surface height. So, in general, we reshape the snapshots such that U train , U test ⊂ (R d/c ) c where c is the number of channels.
As preprocessing step the snapshots are centered and normalized to assume values in the interval [−1, 1] where U mean , U max , U min ∈ (R d/c ) c are evaluated channel wise. The same values obtained from the training snapshots U train , are employed to center and normalize the test set U test . We define the encoder ψ : where r d is the reduced or latent dimension, as neural networks made by subsequent convolutional layers and linear layers at the end and at the beginning, respectively. For the particular architecture used in the applications we defer it to the "Appendix A". In Fig. 1 is represented the convolutional autoencoder applied for the 2d non-linear conservation law test case, with an approximate size of the filters and the actual number of layers. 1

Remark 2 (Regularity)
Regarding the regularity of the CAE, related to the choice of activation functions, it is proved in Theorem 4.2 from [4] that NM-LSTM and non-linear manifold Galerkin methods are asymptotically equivalent provided that the decoder is twice differentiable. Since we are only employing the NM-LSTM method, our only concern in the choice of activation functions is that the reconstruction error is sufficiently low, so that the accuracy of the whole procedure is not undermined.
For each batch {U i } b i=1 ⊂ U train , the loss employed is the sum of the reconstruction error, and a regularizing term for the weights: where represents the weights of the convolutional autoencoder. The choice of the relative mean squared reconstruction error is important when, varying the parameter have different orders of magnitude: for example this is the case of flows propagating from a local source on the whole domain with a constant zero state as initial condition.
The training is performed with Adam stochastic optimization method [37]. After the training the whole evolution of the dynamics is carried out with a non-linear optimization algorithm minimizing the residual on the latent domain, as described in Sect. 3. For each new parametric instance and associated initial state U 0 ∈ X h , the latent initial condition is found with a single forward of the encoder z 0 = ψ(U 0 ) after centering and normalizing U 0 . Then, for each time instant t, the decoder is used as parametrization of an approximate solution manifold as will be explained in Sect. 3, including in φ the renormalization of the output.

Remark 3 (Initial condition)
With respect to the initial implementation of Carlberg et al. [4] our approach for the definition of the latent parametrization of the solution manifold is different in the management of the initial condition. Instead of using directly the decoder φ, they define the map from the latent to the state space f : R r → R d , including the renormalization in the φ map for brevity, as so that the reconstruction is exact at the initial parametric time instances. So, supposing that the initial condition is parametrically dependent, all initial conditions coincide to ψ(0) in the latent space, and the decoder has to learn variations from the initial latent condition. We prefer instead to split the initial conditions in the latent space in order to aid the non-linear optimization algorithm, and leave to the training of the CAE the accurate approximation of the initial condition. This splitting of the initial conditions can be seen in Figs. 6 and 13. The additional cost of our implementation is the forward of the initial parametric condition through the encoder as first step.

Remark 4 (Inductive biases)
Imposing inductive biases to increase the convergence speed of a deep learning model to the desired solution has always been a winning strategy in machine learning. This is translated in the context of physical models with the possibility to include, among others, the following inductive biases: first principles (conservation laws [38], equations governing the physical phenomenon), geometrical simmetries (group invariant filters [39,40]), numerical schemes/residuals (discrete residuals, latent time advancement with Runge-Kutta schemes), latent regularity (minimize the curvature of latent trajectories), latent dynamics (linear or quadratic latent dynamics [41]). As inductive bias, we will impose the positivity of the state variables that are known to be positive throughout their trajectory with a final ReLU activation.

Evolution of the Latent Dynamics with NM-LSPG-ROC
The non-linear manifold method introduced by Carlberg et al. [4] does not perform a complete dimension reduction since at each time step the decoder reconstructs the state from the latent coordinates to the whole domain, still depending on the number of degrees of freedom of the FOM. We revisit the non-linear manifold least-squares Petrov-Galerkin method (NM-LSPG) with small modifications and introduce two novel hyper-reduction procedures: one combines teacher-student training of a compressed decoder with the reduced over-collocation method (NM-LSPG-ROC-TS), the other implements only the hyper-reduction of the residual with reduced over-collocation (NM-LSPG-ROC).

Non-linear Manifold Least-Squares Petrov-Galerkin
We assume that the numerical method of preference discretizes the system (1) in space and in time with an implicit scheme, G h,δt : where h, δt are the spatial and temporal discretization steps chosen, X h ⊂ R d is the state discretization space (d is the number of degrees of freedom), and I t is the set of past state indexes employed in the temporal numerical scheme to solve for U t h . We remark that the numerical discretization employed can differ from the one used to solve for the full-order training snapshots. The method is thus equations-based rather than fully intrusive. This will be the case for the 2d shallow water equations model in Sect. 4.2.
For each discrete time instant t the following non-linear least-squares problem is solved for the latent state z t ∈ Z , with the Levenberg-Marquardt algorithm [42] That is for each time instant the following intermediate solutions where λ is a factor that evolves during the non-linear optimization and balances between a Gauss-Newton and a steepest descent method, and finally α k is a parameter found with a trust-region method. In the implementation in Eigen [43], λ I d is scaled with respect to the diagonal elements of (dG t,k−1 dφ t,k−1 ) T dG t,k−1 dφ t,k−1 . All the tolerances for convergence are set to machine precision, and the maximum number of residual evaluations is set to 7 unless explicitly stated differently in the numerical results Sect. 4.

Remark 5 (Least-squares Petrov-Galerkin)
The method is called manifold LSPG because it refers to the LSPG method usually applied when the manifold is linear. It consists in multiplying the residual to the left with a different matrix with respect to the linear embedding of the reduced coordinates into the state space, where, ∈ R d×r is the basis of the linear reduced manifold contained in X h ⊂ R d , z ∈ R r , and is to be defined: the left subspace ∈ R d×r is used to enforce the orthogonality of the non-linear residual to a left subspace L ⊂ R d . Applying Newton's method because of the nonlinearities, the problem is translated into the iterations for k = 1, . . . , K : The step length α k is computed after a line search along the direction p k . Usually is chosen from a POD basis of the state variable z ∈ X h . For the left subspace, that imposes orthogonality constraints, different choices can be applied. In general, given the system the least squares solution is the one orthogonal to the range of dG . In the case of dG symmetric positive definite we have that ⊂< dG >=< > i.e. = and the method is called Galerkin projection, but in general if this is not true then the optimal left subspace remains = dG . For examples where the Galerkin projection is not optimal see [44,Sect. 3.4], Numerical comparison of left subspaces. This is often the case for advection-dominated discretized systems of PDEs: in these cases LSPG is preferred to Galerkin projection.

Remark 6 (Manifold Galerkin)
The manifold Galerkin method proposed in [4] assumes that the columns of the Jacobian of the decoder are good approximations of the state velocity space: if a spatial discretization is applied and the residual has the form G h (µ, U) =U − f(µ, U), where f is a generic, possibly non-linear, vector field, theṅ is solved for the latent state velocity, under the hypothesis that dφ(z) has full-rank and dφ is a good approximation of the full-order state velocity even if the autoencoder is trained only on the values of the state for different times and parameters, without considering its velocity. If φ is linear, we obtain the Galerkin method presented in the Remark 5, after having applied a temporal discretization scheme and multiplied the resulting equation to the left with dφ(z) = as is the case for linear Galerkin projection: the discretize-then-project and project-then-discretize approaches are equivalent in this case [4]. The LSPG performs better as shown in [4], even if they are asymptotically equivalent also in the case of a non-linear manifold, provided φ is twice differentiable. So we have chosen to employ only the NM-LSPG method and do not compare it with the non-linear manifold Galerkin (NM-G) method.
For the numerical tests we have performed, the numerical approximation of the Jacobian of the residual G h,δt (µ, φ(z), {φ(z s )} s∈I t ) is accurate enough. So in the implementation, at each iteration step the Jacobian of the residual with respect to the latent variable, that is dG t,k−1 dφ t,k−1 of Eq. (17), is approximated with finite differences. The step size is taken sufficiently lower than the distance between consecutive latent states.

Reduced Over-Collocation Method
At the point of Eq. (17), the model still depends on the number of degrees of freedom of the full-order model d, since at each time step and optimization step the latent reduced variable z ∈ R r is forwarded to the reconstructed state U h = φ(z) ∈ R d . A possible solution is represented by the reduced over-collocation method [7], for which the least squares problem (16) is solved only on a limited number of points r < r h << d, where P r h is the projection onto r h standard basis elements in R d associated to the overcollocation nodes or magic points and selected as described later. Afterwards the Levenberg-Marquardt algorithm is applied as described in the previous section, to solve the least squares problem (26). At this point we make the assumption that the method used to discretize the model has a local formulation so that each discrete differential operator can be restricted to the nodes/magic points of the hyper-reduction and consequently, the least-squares problem (26) reduces to = argmin where the projected residualG = P r h • G is introduced and the compressed decoderφ is defined to substitute P r h • φ with another embedding from the latent space to the hyper-reduced space in R r h , such that the whole structure of the decoder is reduced as described in Sect. 3.4 to further decrease the computational cost. The Levenberg-Marquardt method is applied also to the hyper-reduced system and as for the manifold LSPG method, the Jacobian matrix dG t,k−1 dφ t,k−1 is numerically approximated at each optimization step in the implementations.
Remark 7 (Submesh needed to define the hyper-reduced differential operators) To computẽ G h,δt in the nodes/magic points of the reduced over-collocation method, some adjacent degrees of freedom are needed by the discrete differential operators involved. So, actually, at each time step not only the values of the state variables at the magic points are needed, but also at the adjacent degrees of freedom in the mesh with possible overlappings. We represent the restriction to this submesh of magic points and adjacent degrees of freedom with the projector P s r h ∈ R s h ×d , where s h is the number of degrees of freedom of the submesh. Equation (28) becomes The stencil around each magic point to consider depends on the type of numerical scheme.
Since we are using the finite volume method we have to consider the degrees of freedom of the adjacent cells. For example, for Cartesian grids, the schemes chosen for the 2d non-linear conservation law have a stencil of 1 layer of adjacent cells, 4 additional nodes in total for a cell of the interior of the mesh. The 2d shallow water equations case requires a stencil of 2 layers instead, 12 additional nodes in total for a cell of the interior of the mesh. The two cases are shown in Fig. 2.

Over-Collocation Nodes Selection
The nodes/magic points of the over-collocation hyper-reduction method should be defined such that ..,d} of R d and T is the space of discrete solution trajectories varying with respect to µ and the intermediate optimization steps where V h,µ is the discrete space of time instants, possibly depending on h and the parameter µ, and V h,µ,t is the discrete space of optimization steps at time t. Essentially we want that the nodes/magic points approximate the residuals among all time steps, optimization steps and parameter instances. There are many possible algorithms to solve Eq. (32) for P r h . Usually they are not optimal and compromise between computational cost and accuracy, depending on the problem at hand. Among others, these algorithms are part of the hyper-reduction methods such as empirical interpolation method [45], discrete empirical interpolation method [46], Gauss-Newton tensor approximation (GNAT) [47], space-time GNAT [48] and solution-based non-linear subspace GNAT [49] (SNS-GNAT).
In particular, if G h (µ, U) =U − f(µ, U), then, following some considerations that justify SNS-GNAT [49], the training modes employed to find the nodes/magic points of the overcollocation method are represented by the state snapshots instead of the residual fields. We could in principle use the reduced fields φ(z) but in practice, for the test case we considered, the full-order state snapshots were enough, without even saving the intermediate optimization states.
The procedure is applied at the same time for all the components of the state field U ∈ X h and follows a greedy approach. We remark that the spatial discretization must not vary, so that the degrees of freedom correspond to the same spatial and physical quantity over time and for every parameter instance. Some approaches tackle also geometry deformations, but keeping the same number of degrees of freedom in a reference system [50].
The Algorithm 1 is an adaptation of GNAT from Algorithm 3 in [6] to the simpler case in which the Jacobian matrix is not considered in the hyper-reduction (since in the LM method the Jacobian matrix is approximated with finite differences from the residual, see Eq. (30)). Also, with respect to Algorithm 3 in [6], the new node/magic point at line 21 in Algorithm 1 is found without computing also the reconstruction error of the degrees of freedom associated to its stencil.

Remark 8 (Comparison between GNAT and reduced over-collocation)
We must say that the GNAT method, employed for the implementation of NM-LSPG with shallow masked autoencoders in [5], is a generalization of the reduced over-collocation method. In some Algorithm 1 Greedy nodes/magic points evaluation for ROC method.

Input:
U train := {U µ,t } μ∈P train , t∈V µ,h , training state fields, n r init nodes/magic points at the boundaries, n r h , number of nodes/magic points, N modes number of training modes. Output: r h nodes/magic points used to define P r h .
1: Extract N modes modes {φ i U } i∈{1,...,N modes } from the SVD of U train . 2: Compute the additional number of nodes to sample n a = n r h − n r init . 3: Initialize the counter for the number of working basis vectors used: n b = 0. 4: Set the number of greedy iterations to perform: n it = min (n c , n a ). 5: Compute the minimum number of working basis vectors per iteration: n ci,min = floor(n c /n it ). 6: Compute the minimum number of sample nodes to add per iteration: n ai,min = floor(n a /n c ). 7: for i = 1, . . . , n it (greedy iteration loop) 8: Compute the number of basis vectors for this iteration: n ci = n ci,min ; 9: if i ≤ n c mod n it 10: then n ci = n ci + 1.

11:
Compute the number of samples nodes to add: n ai = n ci,min ; 12: if i ≤ n a mod n c 13: then n ai = n ai + 1 14: if i = 1 then 15: Build the intermediate P from the n r init initial magic points. else 17: for q = 1, . . . , n ci (basis vector loop) 18:

19
: for j = 1, . . . , n ai (sample node loop) 21: Find the node/magic point n that maximizes the state reconstruction error among the nodes not selected yet: n = argmax n ci q=1 (U q ) 2 .

22:
Update P with the newly found node/magic point. 23: cases though, they may perform similarly. For P ∈ S, let us define and the GNAT projection operator P = (P ) † P, where is the matrix where the columns correspond to a chosen basis (it could be the FOM residual snapshots, ROM residual snapshots, ROM Jacobian and residual snapshots, see [47]). In particular, if = P T r h = P T we have that that is the reduced over-collocation projection in the chosen nodes/magic points and extended to 0 in the remaining degrees of freedom. In this sense, the GNAT method includes the reduced over-collocation one. However, for the class of problems we are considering, the GNAT method suffers from the slow decaying KnW. We have the inequalities where (t, k, µ) ∈ V h,µ × V h,µ,t × P whenever the maximum is taken. The term r − Pr is the GNAT approximation error. The rightmost term is the usual bound on the hyper-reduction error [46], where it was used the fact that P = I − P. The second equality is valid because r − V n V T n r and V n V T n r − Pr are orthogonal. The third equality is obtained from the relations where the last equality follows from Pr * = r * . Here we have supposed the nodes/magic points to be independent of V n . So in the case of slow decaying Kolmogorov n-width, minimizing the hyper-reduced residual Pr ∈< V n >⊂ R d is less efficient due to the slow convergence in n of the best approximation error r − V n V T n r 2 2 . This is one of the reasons why we employed ROC for the SWE test case; for the NCL test case GNAT and ROC performed similarly.

Compressed Decoder Teacher-Student Training
In order to make the whole methodology independent of the number of degrees of freedom, the decoder has to be substituted with a mapφ : R R → P r h (X h ) ⊂ R r h from the latent space to the space of discrete full-order solutions evaluated only at the submesh containing the magic points and the needed adjacent degrees of freedom. As architecture, we choose a feedforward neural network (FNN) with one hidden layer, but actually the only requirement is that the computational cost is low enough such that not only a theoretical dimension reduction is achieved, but also a speed-up is reached.
In the literature, the procedure for the training of the compressed decoderφ from the decoder φ is called teacher-student training [51]. In principle, the compressed decoder can be composed of layers inherited by the original decoder, such that the learning process involves only the final new additional layers. In our case, we preferred to train the compressed decoder anew: the latent projections of the training snapshots with the encoder ψ( are the inputs and the restriction of the snapshots to the submesh P s are the targets, see Eq. (31). A schematic representation of the teacher-student training is represented in Fig. 3. Moreover, to speed up the offline stage, we use the training FOM snapshots restricted to the magic points as training outputs for the teacher-student training, while usually the reconstructed snapshots from the CAE -that would additionally need to be computed-are employed. Again, for the training, we use a relative mean square loss with an additional regularizing term where˜ are the weights of the compressed decoder.

Remark 9 (Jacobian evaluation in Levenberg-Marquardt algorithm)
The main reason why finite differences approximations of Jacobians are implemented in the NM-LSPG case, as explained at the end of Sect. 3.1, is that the computational cost of evaluating the Jacobian of the full decoder is too high. In principle Jacobian evaluations of the compressed decoder are cheaper and could be employed, instead of relying again on finite differences approximations.

Remark 10 (Shallow masked autoencoders)
We are motivated to write this article to extend the results in [5] to a generic architecture composed by neural networks. They performed the hyper-reduction of the non-linear manifold method [4] with a shallow masked autoencoder, so that correctly masking the weights matrices of the decoder, its outputs correspond only to the submesh needed by the GNAT method, thus eliminating the dependence on the FOM's degrees of freedom. We want to reproduce, in some sense, this approach for an arbitrary autoencoder architecture, in this case a CAE, in order to tackle with the latest architectures developed in the literature the problem of solution manifold approximability: we think this is a major concern when trying to apply non-linear MOR to real applications. In fact, as will be clear in the numerical results Sect. 4, the reconstruction error of the autoencoder bounds from below the prediction error of our newly developed ROMs.
It can be seen that the new model order reduction is composed of two distinct procedures to achieve the independence on the number of degrees of freedom: first the residual from NM-LSPG in Eq. (16) is hyper-reduced with ROC in Eq. (26) and secondly the CAE's decoder is compressed with teacher-student training. In principle, we could substitute the use of the compressed decoder with the restriction of the final layer of the CAE's decoder into the magic points, while keeping the hyper-reduction with ROC of the residual. In this case, the whole methodology would still be dependent on the total number of degrees of freedom, but in practice a CAE's decoder forward is relatively cheap compared to the evaluation of the full residual. So, the hyper-reduction performed with ROC or GNAT only at the equations/residuals level, is already beneficial to reduce the computational cost. We will compare this variant of the NM-LSPG-ROC method with the one that employs the compressed decoder, also to verify the consistency of the teacher-student training that is omitted in the first case.
In the numerical results Sect. 4 we will adopt the acronym NM-LSPG-ROC-TS or NM-LSPG-GNAT-TS for the method that employs the compressed decoder and NM-LSPG-ROC or NM-LSPG-GNAT for the method that performs the hyper-reduction only at the equations/residuals level.

Numerical Results
We test the new methodology on two benchmarks with a relatively slow KnW: the first model is governed by a non-linear conservation law (Sect. 4.1) the second by the shallow water equations (Sect. 4.2). Both are parametric, non-linear and time-dependent, and the only other (non-temporal) parameter is a multiplicative constant of the initial condition. The mesh employed is the same: a 60 × 60 structured orthogonal grid.
All the CFD simulations are obtained by the use of an in-house open source library ITHACA-FV (In real Time Highly Advanced Computational Applications for Finite Volumes) [52], developed in a finite volume environment based on the open-source library OpenFOAM [53]. Regarding the implementation of the convolutional autoencoders and compressed decoders (CAE) we used libtorch, PyTorch C++ frontend, while for the training of the long-short term memory network (LSTM) we used PyTorch [36]. All the CFD simulations were performed on a Intel(R) Core(TM) i7-8750 H CPU with 2.20GHz and all the neural networks trainings on a GeForce GTX 1060 GPU. Further reductions in the computational costs could be achieved exploiting the parallel implementation of the training procedures in PyTorch. The details of the architectures of the neural networks that will be employed are reported in the Appendix A.
The following notations are introduced: N μ train , N μ test are the numbers of train and test parameters, respectively; N t train , N t test are the number of time instances associated to the train and test parameters, respectively. The total number of training and test snapshots is thus respectively. The accuracy of the reduced-order models devised is measured with the mean realtive L 2 -error and the maximum relative L 2 -error, where the mean and max are taken with respect to the time scale: since the test cases depend on a non-temporal parameter, for each instance of these parameters a time-series corresponding to the discrete dynamics is associated; the mean and maximum are evaluated w.r.t the elements of these time-series. Let ..N t be the predicted and true time-series N t elements long, associated to the train or test parameter μ, the mean relative L 2 -errors and maximum relative L 2 -errors are then defined as

Remark 11 (Levenberg-Marquardt parameters)
Regarding the Levenberg-Marqurdt nonlinear optimization algorithm, we remark that we approximate the Jacobians with forward finite differences, and the optimization process, for each time step, is stopped when the maximum number of residual evaluations is reached. This number is set to 7, including the evaluations related to the Jacobian computations. When 7 residual evaluations are not enough for the method to converge, it is explicitly reported.

Non-linear Conservation Law (NCL)
We test our procedure for non-linear model order reduction on a 2d non-linear conservation law model (NCL). Two main reasons are behind this choice: the slow Kolmogorov n-width decay of the continuous solution manifold, and the possibility to compare our results with a similar test case realized with an implementation of non-linear manifold based on shallow masked autoencoders and GNAT [5]. The parametrization affects the initial velocity as a scalar multiplicative constant μ ∈ [0.8, 2]: where the viscosity ν = 0.0001. We will collect N μ train = 12 equispaced training parameters from the range μ ∈ [0. 8,2] and N μ test = 16 equispaced test parameters from the range μ ∈ [0.6, 2.2]. The first two and the last two parameters will account for the extrapolation error. The time step is equal to t = 1e−3 s, but the training snapshots are collected every 4 time steps and the test snapshots every 20, thus N t train = 501, and N t test = 101. In the predictive online phase, the dynamics will be evolved with the same time step t = 1e−3. For easiness of representation, the train parameters are labelled from 1 to 12, and the test parameters are labelled from 1 to 16.
To have a qualitative view on the range of the solution manifold, we report the initial and final time snapshots for the extremal training parameters of the range μ ∈ [0.8, 2], in Fig. 4.
In this test case the GNAT method performed slightly better than the ROC method for hyper-reduction, so we employed the former to obtain the results shown.

Full-Order Model
We solve the 2d non-linear conservation law for different values of the parameter μ with OpenFoam [53] open-source software for CFD. We employ the finite volumes method (FVM) in a structured orthogonal grid of 60 × 60 cells. If we represent with M the mass matrix, with D the diffusive matrix term, and with C(U t−1 ) the advection matrix, then, at every time instant t, the discrete equation is solved for the state U t with a semi-implicit Euler method. The time step is 1e−3, the initial and final time instants are 0 and 2 s. The linear system is solved with the iterative method BiCGStab preconditioned with DILU, until a tolerance of 1e−17 on the FVM residual is reached.
The stencil of the numerical scheme at each cell involves the adjacent cells that share an interface (4 for an interior cell, 3 for a boundary cell and 2 for a corner cell): the value of the state at the interfaces is obtained with the bounded upwind method for the advection term and the surface normal gradient is obtained with central finite differences of two adjacent cell centers. So, in order to implement the reduced over-collocation method, for each node/magic point we have to consider an additional number of maximum 4 cells, that is 8 degrees of freedom to keep track of during the evolution of the latent dynamics; of course in practice they may overlap reducing the computational cost further.
The residual of the NM-LPSG methods is evaluated with the same numerical scheme of the FOM. In the SWE test case the FOM and the ROMs employ different numerical schemes (Sect. 4.2).

Manifold Learning
As first step of the procedure the discrete solution manifold is learned through the training of a convolutional autoencoder (CAE) whose specific architecture is reported in Table 7. The CAE is trained with the ADAM [37] stochastic optimization algoritm for 2000 epochs, halving the learning rate by a factor of 2 if after 200 epochs the loss does not decrease. The initial learning rate is 1e−3, its lower bound is 1e−6. The number of training snapshots is N train = 12 × 501 = 6012, the batch size 20. It could be further refined in order the increase the efficiency of the whole procedure.
We choose as latent dimension 4, 2 dimensions greater than the number of parameters (the scalar multiplying the initial condition and time). We don't perform a convergence study of the accuracy with respect to the latent dimension since our focus is on the implementation of the NM-LSPG-ROC and NM-LSPG-ROC-TS model-order reduction methods: we are satisfied as long as the accuracy is relatively high, while the reduced dimension corresponds to an inaccurate linear approximating manifold spanned by the same number of POD modes.
In Fig. 5 is shown the reconstruction error of the CAE and its decay with respect to the number of POD modes chosen [4,10,25,50,100]. To reach the same accuracy of the CAE with latent dimension 4, around 50 POD modes are needed. In order to state that the slow KnW decay problem is overcome by the CAE, the asymptotic convergence of the reconstruction error w.r.t. the latent dimension should be studied as was done for similar problems in [4,5]. Instead, we will empirically prove that we can devise an hyper-reduced ROM with latent dimension 4 and accuracy lower than the 2% for the mean relative L2-error, a task that would be impossible for a POD based ROM with the same reduced dimension, since the reconstruction error is near 20% for all the test parameters.
Without imposing any additional inductive bias a part from the regularization term in the loss from Eq. (12) and the positiveness of the velocity components, the latent trajectories reported in Fig. 6 for the odd parameters of the test set, are qualitatively smooth. An important detail to observe is that the initial conditions are well separated one from another in the latent space, see Remark 3, and that the dynamics is non-linear. It also can be noticed that the 2 extremal parameters, corresponding to the extrapolation regime and represented in the plot by the two most outer trajectories that enclose the other 6, have smooth latent dynamics analogously to the others even though the reconstruction error starts degrading, as can be seen from Fig. 5.

Hyper-Reduction and Teacher-Student Training
The selection of the magic points is carried out with the greedy Algorithm 1. The FOM snapshots employed correspond to the training parameters 1 and 12, but are sampled every 10 time step instead of every 4, as for the training snapshots of the CAE.
We perform a convergence study increasing the number of magic points from 50 to 100 and 150. The corresponding submesh sizes, i.e. the number of cells involved in the discretization of the residuals, are reported in the Table 1. The submesh size is bounded above with the total number of the cells in the mesh, that is 3600.
After the computation of the magic points, the FOM snapshots are restricted to those cells and employed as training outputs of the compressed decoder, as described in Sect. 3.4. The actual dimension of the outputs is twice the submesh size, since for each cell there are 2 degrees of freedom corresponding to the velocity components. The inputs are the 4dimensional latent coordinates of the encoded FOM training snapshots, for a total of 6012 training input-output pairs. The architecture of the compressed decoder is a feedforward neural network (FNN) with one hidden layer, whose number of nodes is reported in Table 1,  under 'HL size'. The compressed decoders architecture's specifics are also summarized in Table 7. Each compressed decoder is trained for 3000 epochs, with an initial learning rate of 1e−4, that halves if the loss from Eq. (34) does not decrease after 200 epochs. The batch size is 20. The duration of the training is reported in Table 1 under 'TS total epochs', that stands for Teacher-Student training total epochs, along with the average cost for an epoch, under 'TS avg epoch'. The accuracy of the predictions on the test snapshots restricted to the magic points is assessed in Fig. 7. The convergence with respect to the number of magic points is shown in Fig. 8. Since, especially for the NM-LSPG-GNAT-TS reduced-order model, the relative L 2 -error is not uniform along the time scale, we report both the mean and max relative L 2 -errors over the time series associated to each one of the 16 test parameters.
From Fig. 8 and Table 1 it can be seen that NM-LPSG-GNAT is more accurate than NM-LSPG-GNAT-TS even tough computationally more costly in the online stage. We underline that in the offline stage NM-LSPG-GNAT-TS requires the training of the compressed decoder. However, this could be performed at the same time of the CAE training, see the discussion Sect. 5. The NM-LSPG-GNAT reduced-order model achieves better results also in the extrapolation error, sometimes even lower than the NM-LSPG method: this remains true even when increasing the maximum residual evaluations of the LM algorithm, and it may be related to the nonlinearity of the decoder that introduces difficult to interpret correlations of the latent dynamics with the output solutions restricted to the magic points.
All the simulations of NM-LSPG, NM-LSPG-GNAT-TS and NM-LSPG-GNAT methods converge with a maximum of 7 residual evaluations of the LM algorithm, for all magic points reported, that is 50, 100, and 150 and for all the 16 test parameters. However, 3 test parameters could not converge for the method NM-LSPG-GNAT-TS with 100 magic points, so we increased the maximum function evaluations to 13 for all the test points. The higher computational cost per time step is shown in Table 1. Apart from those 3 test points not converging, the accuracy remains the same for the other 13 test parameters, so we have chosen to report the results in the case of 13 residual evaluations for all the 16 test parameters in Fig. 8.

Comparison with Data-Driven Predictions Based on a LSTM
To assess the quality of the reduced-order models devised, we compare the accuracy in the training and extrapolation regimes, and the computational cost of the offline and online stages with a purely data-driven ROM in which the solutions manifold is approximated by the same CAE, but the dynamics is evolved in time with a LSTM neural network. The architecture of the LSTM employed is reported in Table 9. The results are summarized in Fig. 9, the computational costs in Table 2.
The LSTM is trained for 10,000 epochs with the ADAM stochastic optimization algorithm and an initial learning rate of 0.001, halved if after 500 epochs the loss does not decrease. The time series used for the training are the same N train = 6012 training snapshots employed for the CAE. We remark that the LSTM cannot approximate the dynamics for an arbitrary time step, but it is fixed, depending on the training time step used, in this case 0.004 s.
The offline stage's computational cost is determined by the heavy CAE training for both the procedures, see the Discussion Sect. 5 for possible remedies. The LSTM-NN achives a speed-up close to 3 with respect to the FOM, differently from the NM-LSPG-GNAT and NM-LSPG-GNAT-TS methods. However, since the models are hyper-reduced, increasing the degrees of freedom refining the mesh should increase the computational cost of the FOM and NM-LSPG methods only, with due precautions. The average cost of the evaluation of the dynamics for the LSTM model with a time step of 0.004 s is associated to the label 'avg LSTM-NN full-dynamics'; thanks to vectorization, the dynamics for all the N μ test = 16 parameters is evaluated with a single forward of the LSTM, thus the low computational cost reported.
The CAE reconstruction error in blue in Fig. 9, lower bounds all the other models' errors. This is the reason why having a good accuracy of the CAE's solution manifold approximation is mandatory to build up non-linear manifold methods. In this sense NM-LSPG-ROC-TS and  NM-LSPG-GNAT-TS with respect to NM-LSPG-GNAT with shallow autoencoders [5] offer the possibility to choose an arbitrary architecture for the autoencoder, thus allowing a more accurate solution manifold approximation.
While in the training range from test parameter 3 to 14, the accuracy of the LSTM-NN is significantly better than NM-LSPG-GNAT and NM-LSPG-GNAT-TS models', in the extrapolation regime we observe that the predictions of the fully data-driven model degrades. The extrapolation error of the LSTM-NN model depends on the architecture chosen, regularization applied, training procedure, and hyperparameters tuning. What can be assessed from the results is that, outside the training range, the LSTM-NN's accuracy is dependent on all these factors, with sometimes a difficult interpretation of the results, while NM-LSPG- GNAT relies only on the number of magic points employed and the dynamics is evolved in time minimizing a physical residual directly related to the NCL model's equations.

Shallow Water Equations (SWE)
The second test case we present is a 2d non-linear, time-dependent, parametric model based on the shallow water equations (SWE). Also in this case, the non-temporal parameter affects the initial conditions, μ where h is the water depth, u is the velocity vector, g is the gravitational acceleration, and O is the point (0.5, 0.5) ∈ . We consider a constant bathymetry h 0 = 0, so that the free surface height h total = h + h 0 is equal to the water depth h. The time step that will be employed for the evolution of the dynamics of the FOM is 1e−4 s. The training and test snapshots are sampled every 4 time steps. The training non-temporal parameters are N  In this test case the ROC hyper-reduction is more accurate with respect to the GNAT one, so the results are reported w.r.t. this hyper-reduction method.

Full-Order Model
One detail that we didn't stress in the previous test case is that the FOM and the NM-LSPG ROM can discretize the residuals of the SWE differently: only the consistency of the discretization is required, characterizing the NM-LPSG family as equations-based rather than fully intrusive.
The FOM solutions are computed with the OpenFoam solver shallowWaterFoam [53], while the ROMs discretize the residual with a much simpler numerical scheme.
The FOM numerical scheme is the PIMPLE algorithm, a combination of PISO [54] (Pressure Implicit with Splitting of Operator) and SIMPLE [55] (Semi-Implicit Method for Pressure-Linked Equations). For the shallow water equations the free surface height h plays the role of the pressure in the Navier-Stokes equations, regarding the PIMPLE algorithm implementation. The number of outer PISO corrections is 3.
The time discretization is performed with the semi-implicit Euler method. The non-linear advection terms are discretized with the Linear-Upwind Stabilised Transport (LUST) scheme that requires a stencil with 2 layers of adjacent cells for the hyper-reduction. The gradients are linearly interpolated through Gauss formula. The solutions for hU are obtained with Gauss-Seidel iterative method, and for h with the conjugate gradient method preconditioned by the Diagonal-based Incomplete Cholesky (DIC) preconditioner. For both of them the absolute tolerance on the residual is 1e−6 and the relative tolerance of the residual w.r.t. the initial condition is 0.1.
The residual of ROMs is instead discretized as follows. If we represent with M hU , M h the mass matrices, with G(h) the discrete gradient vector of h, and with C hU ((hU ) t−1 ), C h (U t−1 ) the advection matrices, then, at every time instant t, the discrete equations are solved for the state ((hU ) t , h t ) with a semi-implicit Euler method. The same numerical schemes and linear systems iterative solvers of the FOM are employed. In principle, they could be changed.
Since now the stencil of a single cell needs two layers of adjacent cells for the discretizations, for each internal magic point, 12 additional cells need to be considered for the hyper-reduction.

Manifold Learning
The CAE architecture for the SWE model is reported in Table 6. This time one encoder and two decoders, one for the velocity U and one for the height h are trained. Moreover, to increase the generalization capabilities we converted 2 layers of the decoder for U in recurrent convolutional layers as shown in the Appendix. Even with this modification the initial time steps, from 0 to 0.01 are associated to a high reconstruction error: the relative L 2 -error is around 0.1 for every test parameter at the initial time instants, slowly decreasing towards the accuracy shown in Fig. 12 after t = 0.01, chosen as initial instant from here onward.
The CAE is trained for 500 epochs with a batch size of 20 and an initial learning rate of 1e−4, that halves if the loss from Eq. (12) does not decrease after 50 epochs. In this case the state is (U , h) so the encoder has three channels, two for the velocity components, U 1 , U 2 , and one for the height, h. The high computational cost is shown in Table 4. We have to observe that nor the architecture is parsimonious for a good approximation of the discrete solution manifold in the time interval [0.01, 0.2], neither the number of training snapshots 5010 is optimized to reach the highest efficiency with the lowest computational cost. Our focus is obtaining a satisfactory reconstruction error in order to build up our ROMs.
The solution manifold parametrized by the decoder achieves the reconstruction error of a linear manifold spanned by around 20 POD modes. In fact, the decay of the reconstruction error associated to the POD approximations is faster than the previous test case in the time interval [0.01, 0.2].
It can be seen from the representation of the latent dynamics associated to the train parameters in Fig. 13, that the initial solutions overlap. This and the low accuracy could be explained by the fact that the FOM dynamics has different scales, especially for the velocity U that from the initial constant zero solution reaches a magnitude of 10 −1 m per seconds. Further observations and possible solutions are presented in the Discussion Sect. 5.

Hyper-Reduction and Teacher-Student Training
Mimicking the structure of the CAE, the compressed decoder is split in two, one for the velocity U and one for the free surface height h. The architecture for both the decoders is a feed-forward NN with a single hidden layer; they are reported in the Table 8. The compressed decoders are trained for 1500 epochs, with a batch size of 20, and an initial learning rate of 1e−4 that halves after 100 epochs if the loss does not decrease, with a minimum value of 1e−6. As for the previous test case, the number of magic points, hidden layer sizes, training times, average training epoch computational cost are reported in Table 3. This time for each magic point correspond 3 degrees of freedom, so the actual output dimension of the compressed decoders is three times the submesh sizes.
The relative L ∞ -error of the compressed decoder is shown in Fig. 14. This time the extrapolation error is sensibly higher as already seen in the reconstruction error of the CAE.
As in the previous case, both the NM-LSPG-ROC and NM-LSPG-ROC-TS ROMs are considered. This time it's the NM-LSPG-ROC's dynamics to be less stable with 7 maximum residual evaluations of the LM algorithm. In this respect, for parameter test 4 and mp = 100  In bold the average computational cost with 13 maximum residual evaluations due to the instability in the evolution of the dynamics of the test parameter 4

Comparison with Data-Driven Predictions Based on LSTM
The same LSTM architecture of the NCL test case is trained with the same training procedure, only that now there are 5010 training parameters-latent coordinates pairs. For the architecture's specifics see Table 9. The computational costs introduced in the previous test case are reported also for the SWE model in Table 4.
With the architecture and training procedure employed, the LSTM could not achieve a good accuracy at the initial time instants after t 0 = 0.01 s: for this reason in the plot of the errors in Fig. 17  A part from this, the LSTM predictions are for almost every test parameter above only the NM-LSPG and CAE's reconstruction errors. Even in the extrapolation regimes, the predictions are more accurate than the NM-LSPG-ROC and NM-LSPG-ROC-TS reduced-order models. Regarding the computational costs, this time not only the LSTM model but also the NM-LSPG-ROC-TS ROMs achieve a little speed-up w.r.t. the FOM. The choice of doubling the decoders has repercussions in the online costs, but it was made only in an effort to reach a good reconstruction error of the CAE; maybe more light architectures could be employed for the restricted time interval [0.01, 0.2].

Discussion
We comment the numerical results obtained: • Computational cost of the CAEs training. It is evident from Tables 2 and 7 that the offline stage's computational cost is dominated by the CAEs trainings. We have to remark that the architectures were not optimized to be the most parsimonious ones in order to achieve the desired reconstruction error. Moreover, libtorch training took almost twice more time than the same architecture's training in PyTorch, due to implementation inconsistencies. The cost of the forward evaluations of the decoder are comparable instead, not changing much the online costs. The training of the CAEs could be further reduced with transfer learning [3] or preprocessing steps that enlight some features of the dynamics that are more easily learnable, as was done in [20]. Also, the number of training snapshots could be optimized further for the test cases presented. The same observations apply also for the compressed decoders. Moreover, a parallel implementation of the training procedures on more than one GPU is mandatory to achieve competitive computational costs. • Simultaneous CAE and compressed decoder/LSTM training. The additional costs of the LSTM and compressed decoder training could be cut with a unified training of the CAE and compressed decoder: after some epochs, the training of the LSTM or compressed decoder could be switched on and performed at the same time of the CAE's since the only  In a weaker sense also the NM-LSPG-ROC ROM is also independent on the number of degrees of freedom of the FOM, apart from the weights of the CAE's decoder that concur in increasing the cost of a single forward in the online stage. • Generalization error of LSTM and CAE and additional inductive biases. The generalization error of the NN employed depends on a lot of factors, from the number of training samples to the regularization term in the loss, the architectures, etc. It is thus difficult to predict how much the accuracy of the predictions will decay outside the training range. Adding a physics-informed term in the loss of the CAE does not provide an improvement of the reconstruction error if enough training data are employed as in our test cases. Some regularization properties could be imposed in the latent dynamics to facilitate the evolution of the NM-LPSG-ROC ROMs, for example imposing a linear latent dynamics could be beneficial • Purely data-driven LSTM ROM vs NM-LSPG-ROC and NM-LSPG-ROC-TS ROMs.
One crucial difference is interpretability: in the first case, the latent dynamics is obtained training the LSTM to approximate the latent coordinates, in the second case, the latent dynamics is evolved in time minimizing the hyper-reduced residual based on the physical model's equations. Regarding the LSTM predictions, we have seen in the test cases that despite the higher accuracy in the training range and the low computational online cost, there might be other issues in the extrapolation regime: in the NLC test case, the accuracy was sensitively lower in the extrapolation regime, even if it could be improved in theory increasing the layers and nodes of the LSTM in exchange for a higher training cost; in the SWE test case the initial time step from 0.01 to 0.017 s could not be well-approximated due to different scales and overlappings in the latent dynamics. The NM-LSPG-ROC and NM-LSPG-ROC-TS mitigated in some sense these issues, delegating less effort in the tuning of the hyperparameters of the LSTM and increasing the interpretability of the results. Moreover, the LSTM-NN, approximate the dynamics only every time step imposed by the training input-output pairs, while the non-linear manifold ROMs, can in principle approximate the latent dynamics with an arbitrary small time step, since the decoder provides a continuous approximation of the solution manifold and the dynamics is evolved based on a numerical scheme with changeable time step. With respect to purely data-driven methods though, non-linear manifold methods require a lower reconstruction error of the CAE since every successive ROMs' dynamics evolution depends on how well the intermediate solutions are reconstructed through the decoder. • Learn the solution manifolds with autoencoders in unstructured meshes. The natural question of how to extend the CAE architecture to 3D or unstructured meshes, is being currently studied. In the literature there are already interesting results that employ graph neural networks and their variants to find latent representations of 3d simulations [23].

Conclusions and Perspectives
We have developed two new hyper-reduced non-linear manifold ROMs: NM-LSPG-ROC and NM-LSPG-ROC-TS, that can be converted in NM-LSPG-GNAT and NM-LSPG-GNAT-TS. In NM-LPSG-ROC-TS the residuals of the NM-LSPG ROM are hyper-reduced with over-collocation, while the decoder of the CAE is approximated with teacher-student training into a compressed decoder. In NM-LSPG-ROC only the residuals' hyper-reduction is carried out. The methods perform similarly in accuracy and computational cost w.r.t. the NM-LSPG-GNAT with shallow autoencoders ROM introduced in [5], for a similar test case. The flexibility of our method permits to change the CAE architecture depending on the problem at hand, without imposing too many constraints on its structure, in order to reach the convergence faster and achieve a lower reconstruction error. With respect to purely data-driven ROMs built on the CAE's solution manifold, NM-LSPG-ROC and NM-LSPG-ROC-TS provide more interpretable and, in the extrapolation regimes, sometimes more accurate predictions, and they need less hyperparameters tuning, once the CAE is trained. Moreover, the methods developed are equations-based rather than fully-intrusive, and exploit the physics of the model to evolve the latent dynamics. It is crucial, though, that the CAE's reconstruction error is sufficiently low, since the latent dynamics needs to be computed with numerical schemes that rely on the accuracy of the reconstructed solutions through the decoder of the CAE or the compressed decoder. We base these observations on the results obtained for two parametric non-linear time-dependent benchmarks presented in the numerical results Sect. 4, that is a 2d non-linear conservation law model (NLC) and a 2d shallow water equations model. Despite the speed-up is not achieved or not significant with respect to the FOMs, we reached satisfactory results in terms of the accuracy and the latent or reduced dimension of the ROMs.
Future directions of research involve the implementation of the developed ROMs in more complex applications with higher computational costs and degrees of freedom, such that a more evident speed-up is reached. This may involve the development of more parsimonious architectures and training procedures to reduce the offline cost. The research in geometric deep learning will be crucial for the possible extensions of the present methodology to meshbased 2d and 3d simulations. More has to be done also to further improve the interpretability of the results possibly taking into account a probabilistic approach, for example using Bayesian neural networks, and adhering more tightly to the physics of the model with additional inductive biases.
Innovation-Horizon 2020 Program-in the framework of the European Research Council Executive Agency: H2020 ERC CoG 2015 AROMA-CFD project 681447 "Advanced Reduced Order Methods with Applications in Computational Fluid Dynamics" P.I. Professor Gianluigi Rozza.

Data availability
The datasets generated during and analysed during the current study are available from the corresponding author on reasonable request.

Declarations
Competing interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Neural Networks' Architectures
The architectures of the CAEs are shown in Tables 5, 6, of the compressed decoders in Tables 7 and 8, and of the LSTMs in Table 9. We mainly employ the Exponential Linear Unit (ELU) and Rectified Linear Unit (ReLU) activation functions. The padding is symmetric. The labels Conv2d, ConvTr2d and ConvTr2dRec, stand for 2d convolutions, transposed 2d convolutions [36], and 2d transposed convolution associated to a recurrent layer, i.e. they are summed to the previous layer and then passed to an ELU activation function.  The decoder for U has two recurrent layers The decoder for U has two recurrent layers