1 Introduction

Modern sensor technology and data acquisition capabilities generate an ever-increasing wealth of data about virtually every branch of science and social life. Machine learning offers novel techniques for extracting quantifiable information from such large data sets. While machine learning has already had a transformative impact on a diversity of application areas in the “big-data” regime, particularly in image classification and artificial intelligence, it has yet to have a similar impact in many other areas of science.

Utilizing data observations in the analysis of scientific processes differs from traditional learning in that one has the additional information that these processes are described by mathematical models—systems of partial differential equations (PDE) or integral equations—that encode the physical laws that govern the process. Such models, however, are often deficient, inaccurate, incomplete or need to be further calibrated by determining a large number of parameters in order to accurately represent an observed process. Typical guiding examples are Darcy’s equation for the pressure in ground-water flow or electrical impedance tomography. Both are based on second order elliptic equations as core models. The diffusion coefficients in these examples describe permeability or conductivity, respectively. The parametric representations of the coefficients could arise, for instance, from Karhunen-Loève expansions of a random field that represent “unresolvable” features to be captured by the model. In this case the number of parameters could actually be infinite.

The use of machine learning to describe complex states of interest, or even the underlying laws, solely through data offers little hope. In fact, data acquisition is often expensive or even harmful, as in applications involving radiation. Thus, severe undersampling poses a principal obstruction to state or parameter estimation by solely processing observational data through standard machine learning techniques. It is therefore more natural to try to effectively combine the data with the knowledge of the underlying physical laws represented by parameter-dependent families of PDEs.

Methods that fuse together data-driven and model-based approaches fall roughly into two categories. One prototype of a data assimilation scenario arises in meteorology where data are used to stabilize otherwise chaotic dynamical systems, typically with the aid of (stochastic) filtering techniques. A second setting, in line with the above examples, uses an underlying stable continuous model to regularize otherwise ill-posed estimation tasks in a “small-data” scenario. Bayesian inversion is a prominent way of regularizing such problems. It relaxes the estimation task to asking only for posterior probabilities of states or parameters to explain given observations.

The present article reviews some recent developments on data-driven state and parameter estimation that can be viewed as seeking alternatives to Bayesian inversion by placing a stronger focus on deterministic uncertainty quantification and its relation to computational complexity. The emphasis is on foundational aspects such as the optimality of algorithms (formulated in an appropriate sense) when treating estimation tasks for “small-data” problems in high-dimensional parameter regimes. Central issues concern the role of reduced modeling and the exploitation of intrinsic problem metrics provided by the variational formulation of the underlying continuous family of PDEs. This is used by the so-called Parametrized Background Data-Weak (PBDW) framework, introduced in [20] and further analyzed in [4], to identify a suitable trial (Hilbert) space \(\mathbb {U}\) that accommodates the states and eventually also the data. An important point is to distinguish the data from the corresponding sensors—here linear functionals in the dual \(\mathbb {U}^{\prime }\) of \(\mathbb {U}\)—through which the data are generated. This will be seen to open a geometric perspective that sheds light on intrinsic estimation limits. Moreover, in the deterministic setting, a pivotal role is played by the so-called solution manifold, which is the set of all states that can be attained when the parameters in the PDE traverse the whole parameter domain.

Even with full knowledge of a state in the solution manifold, inferring from it a corresponding parameter is a nonlinear, severely ill-posed problem, typically formulated as a non-convex optimization problem. State estimation from data, on the other hand, is a linear and hence more benign inversion task which, under the current premises, suffers mainly from severe undersampling. We will, however, indicate how to reduce, under certain circumstances, the former to the latter problem so as to end up with a convex optimization problem. This motivates focusing in what follows mainly on state estimation. A central question then becomes how to best invoke knowledge on the solution manifold to regularize the estimation problem without introducing unnecessary bias. Our principal viewpoint is to recast state estimation as an optimal recovery problem, which then naturally leads one to explore the role and potential of reduced modeling.

The layout of the paper is as follows. Section 2 describes the conceptual framework for state estimation as an optimal recovery task. This formulation allows the identification of lower bounds for the best achievable recovery accuracy.

Section 3 reviews recent developments concerning a certain affine recovery scheme and highlights the role of reduced models adapted to the recovery task. The overarching theme is to establish certified recovery bounds. When striving for optimality of such affine recovery maps, high parameter dimensionality is identified as a major challenge. We outline a recent remedy that avoids the Curse of Dimensionality by trading deterministic accuracy guarantees against analogs that hold with quantifiable high probability.

Even optimal affine reduced models can, in general, not be expected to realize the benchmarks identified in Sect. 2. To put the results in Sect. 3 in proper perspective, we comment in Sect. 4 on ongoing work that uses the results on affine reduced models and corresponding estimators as a central building block for nonlinear estimators. We also briefly indicate some ramifications for parameter estimation.

2 Models and Data

2.1 The Model

Technological design and the simulation of physical processes are often based on continuum models given by a family

$$\begin{aligned} \mathcal{R}(u,y)= 0, \quad y\in \mathcal {Y}, \end{aligned}$$
(2.1)

of partial differential equations (PDEs) that depend on parameters y ranging over a parameter domain \( \mathcal {Y}\subset \mathbb {R}^{d_y}\). We will always assume uniform well-posedness of (2.1): for each \(y\in \mathcal {Y}\), there exists a unique solution \(u=u(y)\) in some trial Hilbert space \(\mathbb {U}\) which satisfies \(\mathcal{R}(u(y),y)= 0\).

Specifically, we consider only linear problems of the form \( \mathcal{B}_y u=f, \) that is,

$$\begin{aligned} \mathcal{R}(u,y)= f- \mathcal{B}_y u. \end{aligned}$$
(2.2)

Here f belongs to the dual \(\mathbb {V}^{\prime }\) of a suitable test space \(\mathbb {V}\), and \(\mathcal{B}_y\) is a linear operator acting from \(\mathbb {U}\) to \(\mathbb {V}^{\prime }\) that depends on \(y\in \mathcal {Y}\). Uniform well-posedness then means that \(\mathcal{B}_y\) is boundedly invertible with bounds independent of y. By the Babuška-Banach-Nečas Theorem, this is equivalent to saying that the bilinear form

$$\begin{aligned} (u,v)\mapsto b_y(u,v):= (\mathcal{B}_y u)(v) \end{aligned}$$
(2.3)

satisfies the following continuity and inf-sup conditions

$$\begin{aligned} \sup _{u\in \mathbb {U}}\sup _{v\in \mathbb {V}} \frac{b_y(u,v)}{\Vert u\Vert _\mathbb {U}\Vert v\Vert _\mathbb {V}} \le C_b\quad \mathrm{and} \quad \inf _{u\in \mathbb {U}}\sup _{v\in \mathbb {V}} \frac{b_y(u,v)}{\Vert u\Vert _\mathbb {U}\Vert v\Vert _\mathbb {V}}\ge c_b>0,\quad y\in \mathcal {Y}, \end{aligned}$$
(2.4)

together with the property that \(b_y(u,v)=0\) for all \(u\in \mathbb {U}\) implies \(v=0\) (injectivity of \(\mathcal{B}_y^*\)). The relevance of this stability notion lies in the entailed validity of the error-residual relation

$$\begin{aligned} C_b^{-1}\Vert f- \mathcal{B}_y v\Vert _{\mathbb {V}'} \le \Vert u(y)-v\Vert _\mathbb {U}\le c_b^{-1}\Vert f-\mathcal{B}_y v\Vert _{\mathbb {V}'},\quad v\in \mathbb {U},\,y\in \mathcal {Y}, \end{aligned}$$
(2.5)

where \(\Vert g\Vert _{\mathbb {V}^{\prime }}:= \sup \{g(v)\, : \, \Vert v\Vert _\mathbb {V}= 1\}\). Thus, errors in the trial norm are uniformly equivalent to residuals in the dual test norm, a fact that will be exploited in what follows.
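Indeed, applying the inf-sup condition in (2.4) to \(u(y)-v\) gives

$$ c_b\Vert u(y)-v\Vert _\mathbb {U}\le \sup _{\varphi \in \mathbb {V}} \frac{b_y(u(y)-v,\varphi )}{\Vert \varphi \Vert _\mathbb {V}} = \Vert \mathcal{B}_y(u(y)-v)\Vert _{\mathbb {V}'} = \Vert f-\mathcal{B}_y v\Vert _{\mathbb {V}'}, $$

while the continuity bound in (2.4) yields \(\Vert f-\mathcal{B}_y v\Vert _{\mathbb {V}'}=\Vert \mathcal{B}_y(u(y)-v)\Vert _{\mathbb {V}'}\le C_b\Vert u(y)-v\Vert _\mathbb {U}\); together these two estimates give (2.5).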

For a wide range of problems, such as space-time variational formulations of parabolic or convection-diffusion problems, or indefinite and singularly perturbed problems, the identification of a suitable pair \(\mathbb {U},\mathbb {V}\) that guarantees stability in the above sense is not entirely straightforward. In particular, trial and test space may have to differ from each other; see e.g. [6, 11, 17, 23] for examples as well as some general principles.

The simplest example, used for illustration purposes, is the elliptic family

$$\begin{aligned} \mathcal{R}(u,y) = f + \text{ div } (a(y)\nabla u), \end{aligned}$$
(2.6)

set in \(\Omega \subset \mathbb {R}^{d_x}\) where \(d_x\in \{1,2,3\}\), with boundary conditions \(u|_{\partial \Omega }=0\). Uniform well-posedness then follows for \(\mathbb {U}=\mathbb {V}= H^1_0(\Omega )\) if, for some fixed constants \(0<r\le R<\infty \), we have the bounds

$$\begin{aligned} r\le a(x,y)\le R,\quad (x,y)\in \Omega \times \mathcal {Y}, \end{aligned}$$
(2.7)

readily implying (2.4).

Aside from well-posedness, a second important structural property of the model (2.1) is affine parameter dependence. By this we mean that

$$\begin{aligned} \mathcal{B}_y u= \mathcal{B}_0 u + \sum _{j=1}^{d_y} y_j \mathcal{B}_j u,\quad y= (y_j)_{j=1,\dots ,d_y}\in \mathcal {Y}, \end{aligned}$$
(2.8)

where the operators \(\mathcal{B}_j:\mathbb {U}\rightarrow \mathbb {V}^{\prime }\) are independent of y. In turn, the residual has a similar affine dependence structure

$$\begin{aligned} \mathcal{R}(u,y)= \mathcal{R}_0 (u) + \sum _{j=1}^{d_y} y_j \mathcal{R}_j u,\quad \mathcal{R}_0(u):=f-\mathcal{B}_0 u,\quad \mathcal{R}_j=-\mathcal{B}_j. \end{aligned}$$
(2.9)

For the example (2.6) such a structure is encountered for affine parametric representations of the diffusion coefficients

$$\begin{aligned} a(x,y)= a_0(x) +{ \sum _{j=1}^{d_y}} y_j\theta _j(x), \quad (x,y)\in \Omega \times \mathcal {Y}, \end{aligned}$$
(2.10)

i.e., the field a is expanded in terms of some given spatial basis functions \(\theta _j\). As indicated earlier, the pressure equation in Darcy’s law for porous media flow is an example of (2.6) where the diffusion coefficient a(y) of the form (2.10) may arise from a stochastic model for permeability via a Karhunen-Loève expansion. In this case (upon proper normalization) \(y\in [-1,1]^\mathbb {N}\) has, in principle, infinitely many entries, that is \(d_y=\infty \). However, due to (2.7), the \(\theta _j\) should then have some decay as \(j\rightarrow \infty \), which means that the parameters become less and less important as j increases. Another example is electrical impedance tomography, involving the same type of elliptic operator, where parametric expansions represent possible variations of conductivity often modeled as piecewise constants, i.e., the \(\theta _j\) could be characteristic functions subordinate to a partition of \(\Omega \). In this case data are acquired through sensors that act through trace functionals, which greatly adds to the ill-posedness.
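To make the affine structure (2.8)-(2.10) concrete, the following minimal sketch (our own illustration, not taken from the cited works) assembles a one-dimensional finite-difference analogue of (2.6) with piecewise constant \(\theta _j\) supported on \(d_y\) equal subintervals; the scaling of the \(\theta _j\) is chosen so that (2.7) holds. All function names and discretization choices are purely illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): assemble a finite-difference
# discretization of -div(a(x,y) grad u) = f on (0,1) with u(0)=u(1)=0, where
# a(x,y) = a0(x) + sum_j y_j * theta_j(x) and the theta_j are indicators of d_y
# equal subintervals, scaled so that a(x,y) stays in [0.5, 1.5] for y in [-1,1]^d_y.

def assemble_stiffness(coeff_on_cells, h):
    """Tridiagonal stiffness matrix for -d/dx(a du/dx) with a given per cell."""
    n_cells = len(coeff_on_cells)
    n = n_cells - 1                          # number of interior nodes
    A = np.zeros((n, n))
    for k in range(n_cells):                 # cell k couples nodes k-1 and k
        a = coeff_on_cells[k] / h**2
        for i in (k - 1, k):
            for j in (k - 1, k):
                if 0 <= i < n and 0 <= j < n:
                    A[i, j] += a if i == j else -a
    return A

n_cells, d_y = 64, 8
h = 1.0 / n_cells
x_cells = (np.arange(n_cells) + 0.5) * h
a0 = np.ones(n_cells)
thetas = [0.5 * ((x_cells >= j / d_y) & (x_cells < (j + 1) / d_y)) for j in range(d_y)]

B0 = assemble_stiffness(a0, h)
Bj = [assemble_stiffness(th, h) for th in thetas]

def solve_state(y, f=None):
    """Affine parameter dependence: B_y = B_0 + sum_j y_j B_j, cf. (2.8)."""
    if f is None:
        f = np.ones(B0.shape[0])
    By = B0 + sum(yj * Bjk for yj, Bjk in zip(y, Bj))
    return np.linalg.solve(By, f)

u = solve_state(np.random.uniform(-1, 1, d_y))
```

Once the components \(\mathcal{B}_0,\mathcal{B}_j\) have been assembled once, evaluating a state u(y) for any parameter y only requires forming the affine combination and one linear solve.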

A central role in the subsequent discussion is played by the solution manifold

$$\begin{aligned} \mathcal{M}=u(\mathcal {Y}) := \{u(y)\,: \, y\in \mathcal {Y}\} \end{aligned}$$
(2.11)

which is then the range of the parameter-to-solution map \(u: y\mapsto u(y)\), consisting of all states that can be attained as y traverses \(\mathcal {Y}\). Without further mention, \(\mathcal{M}\) will be assumed to be compact, which actually follows under standard assumptions met in all of the above-mentioned examples.

Estimating states in \(\mathcal{M}\) or corresponding parameters from measurements requires the efficient approximation of elements in \(\mathcal{M}\). A common challenge encountered in all such models lies in the inherent high dimensionality of the states \(u=u(\cdot ,y)\) as functions of \(d_x\) spatial variables \(x\in \Omega \) and \(d_y\gg 1\) parametric variables \(y\in \mathcal {Y}\). In particular, when \(d_y=\infty \), any calculation has, of course, to work with finitely many “activated” parameters whose number, however, has to be coordinated with the spatial resolution of a numerical scheme to retain model consistency. It is especially this issue that hinders standard approaches based on first discretizing the parametric model, because rigorously balancing spatial and parametric uncertainties then becomes difficult.

What renders such problem scenarios nevertheless numerically tractable is a further property that will be implicitly assumed in what follows, namely that the Kolmogorov n-widths of the solution manifold

$$\begin{aligned} d_n(\mathcal{M})_\mathbb {U}:= \inf _{\mathrm{dim}\, \mathbb {U}_n =n}\sup _{u\in \mathcal{M}}\inf _{v\in \mathbb {U}_n}\Vert u-v\Vert _\mathbb {U}\end{aligned}$$
(2.12)

exhibit at least some algebraic decay

$$\begin{aligned} d_n(\mathcal{M})_\mathbb {U}\lesssim n^{-s} \end{aligned}$$
(2.13)

for some \(s>0\), see [13] for a comprehensive account.

For instance, this is known to be the case for elliptic models (2.6) with (2.7), as a consequence of results on sparse polynomial approximation of the parameter-to-solution map \(y\mapsto u(y)\) established e.g. in [15]. More generally, (2.13) can be established under a general holomorphy property of the parameter-to-solution map, as a consequence of a similar algebraic decay assumed on the n-widths of the parameter set, see [14]. For a fixed finite number \(d_y<\infty \) of parameters, under certain structural assumptions on the parameter representations (e.g. piecewise constants on checkerboard partitions) one can even establish (sub-) exponential decay rates, see [2] for more details. It is therefore justified to assume that s in (2.13) has a “substantial” size for any range of \(d_y\).
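As a rough empirical check of such approximability, one can sample snapshots and monitor how fast their projection errors onto leading POD (SVD) subspaces decay; fast decay is consistent with, though of course not a proof of, (2.13). The sketch below reuses the hypothetical solve_state and d_y from the previous sketch and uses the Euclidean norm as a stand-in for the \(\mathbb {U}\)-norm.

```python
import numpy as np

# Empirical probe of manifold approximability (assumes solve_state and d_y from the
# previous sketch are in scope): sample snapshots u(y^i) and inspect how quickly the
# worst projection error onto the leading POD subspaces decays with the dimension n.

rng = np.random.default_rng(0)
N = 200
Y = rng.uniform(-1.0, 1.0, size=(N, d_y))
S = np.column_stack([solve_state(y) for y in Y])       # snapshot matrix

# POD basis from the SVD of the snapshot matrix (Euclidean norm as a stand-in for
# the U-norm; a Cholesky factor of the Gram matrix would give the exact norm).
U_svd, sigma, _ = np.linalg.svd(S, full_matrices=False)

for n in (1, 2, 4, 8, 16):
    Un = U_svd[:, :n]
    errs = np.linalg.norm(S - Un @ (Un.T @ S), axis=0)
    print(n, errs.max())      # worst projection error over the snapshot set
```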

In summary, the results discussed below are valid and practically feasible for well-posed linear models (2.4) with affine parameter dependence (2.9) whose solution manifolds have rapidly decaying n-widths (2.13).

2.2 The Data

Suppose we are given data \(\mathbf{w}= (w_1,\ldots ,w_m)^\top \in \mathbb {R}^m\) representing observations of an unknown state \(u\in \mathbb {U}\) obtained through m linearly independent linear functionals \(\ell _i\in \mathbb {U}'\), i.e.,

$$\begin{aligned} w_i = \ell _i(u),\quad i=1,\ldots ,m. \end{aligned}$$
(2.14)

Since in real applications data acquisition may be costly or harmful, we assume that m is fixed. The central task to be discussed in what follows is to recover from this information an estimate for the observed unknown state u, based on the prior assumption that u belongs to \(\mathcal{M}\) or is close to \(\mathcal{M}\). Moreover, to bring out the essence of this estimation task we assume for the moment that the data are noise-free.

Following [4, 20], we first recast the data in a “compliant” metric, by introducing the Riesz representers \(\psi _i\in \mathbb {U}\), defined by

$$ ( \psi _i,v)_\mathbb {U}= \ell _i(v),\quad v\in \mathbb {U}, \quad i=1,\ldots , m. $$

The \(\psi _i\) now span the m-dimensional subspace \(\mathbb {W}\subset \mathbb {U}\), which we refer to as the measurement space, and the information carried by the \(\ell _i(u)\) is equivalent to that of the orthogonal projection \(P_\mathbb {W}u\) of u onto \(\mathbb {W}\). The decomposition

$$\begin{aligned} u = P_\mathbb {W}u + P_{\mathbb {W}^\perp } u,\quad u\in \mathbb {U}, \end{aligned}$$
(2.15)

thus contains a first term that is “seen” by the sensors and a second (infinite-dimensional) term which cannot be detected. The decomposition (2.15) may be seen as a sensor-induced “coordinate system” thereby opening up a geometric perspective that will prove very useful in what follows. State estimation can then be viewed as learning from samples \(w:=P_\mathbb {W}u\) the unknown “labels” \(P_{\mathbb {W}^\perp } u\in \mathbb {W}^\perp \).
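In a discretized setting, where the \(\mathbb {U}\)-inner product is represented by a symmetric positive definite Gram matrix G and each functional \(\ell _i\) by a coefficient vector, the Riesz representers and the projection \(P_\mathbb {W}\) can be computed as in the following sketch; the Gram matrix, the local-average sensors and all names are illustrative assumptions, not part of the cited framework.

```python
import numpy as np

# Sketch in a discretized setting: the U-inner product is (v, w)_U = v.T @ G @ w for an
# SPD Gram matrix G; a sensor acts as ell_i(v) = l_i @ v. Then the Riesz representer is
# psi_i = G^{-1} l_i, and P_W is the G-orthogonal projection onto span{psi_i}.

def riesz_representers(G, L):
    """Columns of the output are the representers psi_i of the rows of L."""
    return np.linalg.solve(G, L.T)

def g_orthonormalize(Psi, G):
    """Gram-Schmidt in the G-inner product."""
    Q = []
    for p in Psi.T:
        for q in Q:
            p = p - (q @ G @ p) * q
        Q.append(p / np.sqrt(p @ G @ p))
    return np.column_stack(Q)

def project_W(u, W_basis, G):
    """G-orthogonal projection P_W u for a G-orthonormal basis of W."""
    return W_basis @ (W_basis.T @ G @ u)

# toy data: G = 1D stiffness matrix (an H^1_0-type inner product), local-average sensors
n, m = 200, 10
h = 1.0 / (n + 1)
G = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / h
L = np.zeros((m, n))
for i in range(m):
    L[i, i * (n // m):(i + 1) * (n // m)] = h            # ell_i(v) ~ local average of v
Psi = riesz_representers(G, L)
W_basis = g_orthonormalize(Psi, G)
u = np.sin(np.pi * np.linspace(h, 1 - h, n))             # some state
w = project_W(u, W_basis, G)                              # the part of u "seen" by the sensors
```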

In this article, we are interested in how well we can approximate u from the information that \(u\in \mathcal{M}\) and \(P_\mathbb {W}u=w\) with w given to us. Any such approximation is given by a mapping \(A:w\rightarrow A(w)\in \mathbb {U}\). The overall performance of recovery on all of \(\mathcal{M}\) by the mapping A is typically measured in the worst case setting, that is,

$$\begin{aligned} E_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W})= \sup _{u\in \mathcal{M}} \Vert u-A(P_\mathbb {W}u)\Vert _\mathbb {U}. \end{aligned}$$
(2.16)

The optimal recovery error on \(\mathcal{M}\) is then defined as

$$\begin{aligned} E_{\mathrm{wc}}(\mathcal{M},\mathbb {W}) := \inf _AE_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W}), \end{aligned}$$
(2.17)

where the infimum is over all possible recovery maps. Let us observe that the construction of recovery maps can be restricted to be of the form

$$\begin{aligned} A :w\rightarrow A(w),\quad A(w)= w+ B(w),\quad \text {with}\,\, B:\mathbb {W}\rightarrow \mathbb {W}^\perp . \end{aligned}$$
(2.18)

Indeed, given any recovery mapping A, we can write \(A(w)= P_\mathbb {W}A(w)+P_{\mathbb {W}^\perp } A(w)\) and the performance of the recovery can only be improved if we replace the first term by w. In other words, A(w) should belong to the affine space

$$\begin{aligned} \mathbb {U}_w:=w+\mathbb {W}^\perp , \end{aligned}$$
(2.19)

that contains u. The mappings B are commonly referred to as liftings into \(\mathbb {W}^\perp \).

2.3 Optimality Criteria and Numerical Recovery

Finding a best recovery map A attaining (2.17) is known as optimal recovery. The best mapping has a well-known, simple theoretical description, see e.g. [21], which we now recall. Note first that a precise recovery of the unknown state u from the given information is generally impossible. Indeed, the best we can say about u is that it lies in the manifold slice

$$\begin{aligned} \mathcal{M}_w:= \{u\in \mathcal{M}: P_\mathbb {W}u =w\} = \mathcal{M}\cap \mathbb {U}_w, \end{aligned}$$
(2.20)

which is comprised of all elements in \(\mathcal{M}\) sharing the same measurement \(w\in \mathbb {W}\). The Chebyshev ball \(B(\mathcal{M}_w)\) is the smallest ball in \(\mathbb {U}\) that contains \(\mathcal{M}_w\). The best recovery algorithm is then given by the mapping

$$\begin{aligned} A^*(w): =\mathrm{cen}(\mathcal{M}_w), \end{aligned}$$
(2.21)

that assigns to each \(w\in \mathbb {W}\) the center \(\mathrm{cen}(\mathcal{M}_w)\) of \(B(\mathcal{M}_w)\), called the Chebyshev center of \(\mathcal{M}_w\). Then, the radius \(\mathrm{rad}(\mathcal{M}_w)\) of \(B(\mathcal{M}_w)\) is the best worst case error over the class \(\mathcal{M}_w\). The best worst case error over \(\mathcal{M}\), which is achieved by \(A^*\), is thus given by

$$\begin{aligned} E_{\mathrm{wc}}(\mathcal{M},\mathbb {W}) = E_\mathrm{wc}(A^*,\mathcal{M},\mathbb {W}) = \max _{w\in \mathbb {W}} \mathrm{rad}(\mathcal{M}_w). \end{aligned}$$
(2.22)

While the above mapping \(A^*\) gives a nice theoretical description of the optimal recovery algorithm, it is typically not numerically implementable since the Chebyshev center \(\mathrm{cen}(\mathcal{M}_w)\) is not easily found. Moreover, such an optimal algorithm is highly nonlinear and possibly discontinuous. The purpose of this section is to formulate a more modest goal for the performance of a recovery algorithm with the hope that this more modest goal can be met with a numerically realizable algorithm. The remaining sections of the paper introduce numerically implementable recovery mappings, analyze their performance, and evaluate the numerical cost in constructing these mappings.

The search for a numerically realizable algorithm must, out of necessity, lessen the performance criteria. A first possibility is to weaken the performance criterion to near-best algorithms. This means that we search for an algorithm A such that

$$\begin{aligned} E_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W}) \le C_0E_{\mathrm{wc}}(\mathcal{M},\mathbb {W}), \end{aligned}$$
(2.23)

with a reasonable value of \(C_0>1\). For example, any mapping A which takes w into an element of the Chebyshev ball of \(\mathcal{M}_w\) is near-best with constant \(C_0=2\). However, finding near-best mappings A also seems to be numerically out of reach.

In order to formulate a more attainable performance criterion, we return to our earlier observations about uncertainty in both the model class \(\mathcal{M}\) and in the measurements w. The former is a modeling error while the latter is an inherent measurement error. Both of these uncertainties can be quantified by introducing for each \(\varepsilon >0\), the \(\varepsilon \)-neighborhood of the manifold

$$\begin{aligned} \mathcal{M}^\varepsilon := \{v\in \mathbb {U}: \text {dist}\,(v,\mathcal{M})_\mathbb {U}\le \varepsilon \}. \end{aligned}$$
(2.24)

The uncertainty in the model can be thought of as saying that the sought-after u lies in \(\mathcal{M}^\varepsilon \) rather than in \(\mathcal{M}\). Similarly, we may formulate uncertainty (noise) in the measurements as saying that they are not measurements of a \(u\in \mathcal{M}\) but rather of some \(u\in \mathcal{M}^\varepsilon \). Here the value of \(\varepsilon \) quantifies these uncertainties.

Our new goal is to numerically construct a recovery map A that is near-optimal on \(\mathcal{M}^\varepsilon \), for some given \(\varepsilon >0\). Let us note that \(\mathcal{M}^\varepsilon \) is not compact. An algorithm A is worst-case near optimal for \(\mathcal{M}^\varepsilon \) if and only if its performance is bounded by a constant multiple of the diameter

$$\begin{aligned} \delta _\varepsilon (\mathcal{M},\mathbb {W}) := \max \,\{\Vert u-v\Vert _\mathbb {U}: u,v\in \mathcal{M}^\varepsilon ,\, P_\mathbb {W}(u-v)=0\}. \end{aligned}$$
(2.25)

Notice that \(\varepsilon =0\) gives the performance criterion for near optimal recovery over \(\mathcal{M}\). One can show that the function \(\varepsilon \mapsto \delta _\varepsilon (\mathcal{M},\mathbb {W})\) is monotone non-decreasing in \(\varepsilon \), continuous from the right, and \(\lim _{\varepsilon \rightarrow 0^+} \delta _\varepsilon (\mathcal{M},\mathbb {W})= \delta _0(\mathcal{M},\mathbb {W})\). The speed at which \(\delta _\varepsilon (\mathcal{M},\mathbb {W})\) approaches \(\delta _0(\mathcal{M},\mathbb {W})\) reflects the “condition” of the estimation problem depending on \(\mathcal{M}\) and \(\mathbb {W}\). While the practical realization of worst-case near-optimality for \(\mathcal{M}^\varepsilon \) is already a challenge, quantifying corresponding computational cost would require assumptions on the condition of the problem.

One central theme, guiding subsequent discussions, is therefore to find recovery maps \(A_\varepsilon \) that realize an error bound of the form

$$\begin{aligned} E_{\mathrm{wc}}(A_\varepsilon ,\mathcal{M},\mathbb {W}) \le C_0 \delta _\varepsilon (\mathcal{M},\mathbb {W}). \end{aligned}$$
(2.26)

Any a priori information on measurement accuracy and model bias might be used to choose a viable tolerance \(\varepsilon \).

High parametric dimensionality poses particular challenges to estimation tasks when the targeted error bounds are in the above worst case sense. These challenges can be somewhat mitigated when adopting a Bayesian point of view [24]. The prior information on u is then described by a probability distribution p on \(\mathbb {U}\), which is supported on \(\mathcal{M}\). Such a measure is typically induced by a probability distribution on \(\mathcal {Y}\) that may or may not be known. In the latter case, sampling \(\mathcal{M}\), i.e., computing snapshots \(u(y^i)\), \(i=1,\ldots ,N\), for i.i.d. samples \(y^i\in \mathcal {Y}\), provides labeled data \((w_i,w_i^\perp )= (P_\mathbb {W}u(y^i),P_{\mathbb {W}^\perp } u(y^i))\) according to the sensor-based decomposition (2.15). This puts us into the setting of regression in machine learning, asking for an estimator that predicts, for any new measurement \(w\in \mathbb {W}\), its lifting \(w^\perp =B(w)\). It is then natural to measure the performance of an algorithm in an averaged sense. The estimator A that minimizes the mean-square risk

$$\begin{aligned} E_{\mathrm{ms}}(A,p,\mathbb {W})=\mathbb {E}(\Vert u-A(P_\mathbb {W}u)\Vert ^2)=\intop \limits _\mathbb {U}\Vert u-A(P_\mathbb {W}u)\Vert ^2 dp(u) \end{aligned}$$
(2.27)

is given by the conditional expectation

$$\begin{aligned} A(w) = \mathbb {E}(u| P_\mathbb {W}u=w). \end{aligned}$$
(2.28)

Since one always has \(E_{\mathrm{ms}}(A,p,\mathbb {W})\le E_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W})^2\), the optimality benchmarks are somewhat weaker. In the rest of this paper, we adhere to the worst case error in the deterministic setting that only assumes membership of u in \(\mathcal{M}\) or \(\mathcal{M}^\varepsilon \).
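The regression viewpoint can be illustrated by the following sketch, which fits an affine lifting \(w\mapsto z+Bw\) to labeled snapshot data by ridge-regularized least squares. This is merely an empirical counterpart of (2.27)-(2.28) in coordinates, not the estimator analyzed in this paper; all names and the synthetic data are assumptions.

```python
import numpy as np

# Empirical regression sketch: given coordinates w_i in R^m (measurements) and
# wperp_i in R^L (coefficients of P_{W^perp} u_i in some orthonormal basis), fit an
# affine map w -> z + B w by ridge-regularized least squares.

def fit_affine_lifting(Wc, Wp, lam=1e-8):
    """Wc: (N, m) measurements, Wp: (N, L) liftings. Returns (z, B)."""
    N, m = Wc.shape
    X = np.hstack([np.ones((N, 1)), Wc])                 # affine term
    coef = np.linalg.solve(X.T @ X + lam * np.eye(m + 1), X.T @ Wp)   # (m+1, L)
    z, B = coef[0], coef[1:].T                            # B maps R^m -> R^L
    return z, B

def predict(w, z, B):
    return z + B @ w

# usage with synthetic labeled data
rng = np.random.default_rng(1)
N, m, L = 500, 10, 40
Wc = rng.normal(size=(N, m))
Wp = Wc @ rng.normal(size=(m, L)) * 0.1 + 0.01 * rng.normal(size=(N, L))
z, B = fit_affine_lifting(Wc, Wp)
wperp_hat = predict(rng.normal(size=m), z, B)
```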

The following section is concerned with an important building block on the pathway towards achieving (2.26) at quantifiable computational cost. This building block, referred to as the one-space method, is a linear (affine) scheme which is, in principle, simple and easy to implement numerically. It depends on suitably chosen subspaces. We highlight the regularizing property of these subspaces as well as ways to optimize them. This will reveal certain intrinsic obstructions caused by parameter dimensionality. The one-space method by itself will generally not achieve (2.26) but, as indicated earlier, can be used as a building block in a nonlinear recovery scheme that may indeed meet the goal (2.26).

3 The One-Space Method

3.1 Subspace Regularization

The one-space method can be viewed as a simple regularizer for state estimation. The resulting recovery map is induced by an n-dimensional subspace \(\mathbb {U}_n\) of \(\mathbb {U}\) for \(n\le m\). Assume that, for each \(n\ge 0\), we are given a subspace \(\mathbb {U}_n\subset \mathbb {U}\) of dimension n whose distance from \(\mathcal{M}\) can be assessed:

$$\begin{aligned} \text {dist}(\mathcal{M},\mathbb {U}_n)_\mathbb {U}:= \max _{u\in \mathcal{M}} \text {dist}(u,\mathbb {U}_n)_\mathbb {U}\le \varepsilon _n. \end{aligned}$$
(3.1)

Then the cylinder

$$\begin{aligned} \mathcal{{K}}(\mathbb {U}_n,\varepsilon _n) := \{ u\in \mathbb {U}: \text {dist}(u,\mathbb {U}_n)_\mathbb {U}\le \varepsilon _n\} \end{aligned}$$
(3.2)

contains \(\mathcal{M}\) and likewise the cylinder \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n+\varepsilon )\) contains \(\mathcal{M}^\varepsilon \). Our prior assumption that the observed state belongs to \(\mathcal{M}\) or \(\mathcal{M}^\varepsilon \) can then be relaxed by assuming membership to these larger but simpler sets.

Remarkably, one can now quite easily realize an optimal recovery map that meets the relaxed benchmark \(E_\mathrm{wc}(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n),\mathbb {W})\): in [4] it was shown that the Chebyshev center of the slice

$$\begin{aligned} \mathcal{{K}}_w(\mathbb {U}_n,\varepsilon _n):=\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n)\cap \mathbb {U}_w, \end{aligned}$$
(3.3)

is exactly given by the state in \(\mathbb {U}_w\) that is closest to \(\mathbb {U}_n\), that is

$$\begin{aligned} u^*= u^*(w) := \mathop {\text {argmin}}_{u\in \mathbb {U}_w}\Vert u - P_{\mathbb {U}_n}u\Vert _\mathbb {U}. \end{aligned}$$
(3.4)

This minimizer exists and can be shown to be unique as long as \(\mathbb {U}_n\cap \mathbb {W}^\perp =\{0\}\). The corresponding optimal recovery map

$$\begin{aligned} A_{\mathbb {U}_n} : w \mapsto u^*(w) \end{aligned}$$
(3.5)

was first introduced in [20] as the Parametrized Background Data-Weak (PBDW) algorithm, and is referred to as the one-space method in [4]. Due to its above minimizing property, it is readily checked that this map is linear and can be determined with the aid of the singular value decomposition of the cross-Gramian between any pair of orthonormal bases for \(\mathbb {U}_n\) and \(\mathbb {W}\).

The worst case error \(E_\mathrm{wc}(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n),\mathbb {W})\) can be described more precisely by introducing

$$\begin{aligned} \mu (\mathbb {U}_n,\mathbb {W}) := \sup _{v\in \mathbb {U}_n}\frac{\Vert v\Vert _\mathbb {U}}{\Vert P_\mathbb {W}v\Vert _\mathbb {U}} \end{aligned}$$
(3.6)

which is finite if and only if \(\mathbb {U}_n\cap \mathbb {W}^\perp =\{0\}\). This quantity, also introduced in a related but slightly different context in [1], is therefore related to the angle between the spaces \(\mathbb {U}_n\) and \(\mathbb {W}\). It becomes large when \(\mathbb {U}_n\) contains elements that are nearly perpendicular to \(\mathbb {W}\). It is actually computable: one has \(\mu (\mathbb {U}_n,\mathbb {W})= \beta (\mathbb {U}_n,\mathbb {W})^{-1}\) where

$$\begin{aligned} \beta (\mathbb {U}_n,\mathbb {W}) := \inf _{v\in \mathbb {U}_n}\sup _{w\in \mathbb {W}} \frac{\langle v,w\rangle _\mathbb {U}}{\Vert v\Vert _\mathbb {U}\Vert w\Vert _\mathbb {U}}, \end{aligned}$$
(3.7)

and \(\beta (\mathbb {U}_n,\mathbb {W})\) is the smallest singular value of the cross-Gramian between any pair of orthonormal bases for \(\mathbb {W}\) and \(\mathbb {U}_n\). It has been shown in [4, 20] that the worst case error bound over \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n)\) is given by

$$\begin{aligned} E_\mathrm{wc}(A_{\mathbb {U}_n},\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n),\mathbb {W}) =E_\mathrm{wc}(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n),\mathbb {W})= \mu (\mathbb {U}_n,\mathbb {W})\varepsilon _n. \end{aligned}$$
(3.8)

The quantity \(\mu (\mathbb {U}_n,\mathbb {W})\) also coincides with the norm of the linear recovery map \(A_{\mathbb {U}_n}\). Relaxing the prior \(u\in \mathcal{M}\) by exploiting information on \(\mathcal{M}\) solely through the approximability of \(\mathcal{M}\) by \(\mathbb {U}_n\) thus implicitly regularizes the estimation task: whenever \(\mu (\mathbb {U}_n,\mathbb {W})\) is finite, the optimal recovery map \(A_{\mathbb {U}_n}\) is bounded and hence Lipschitz.
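In coordinates, both \(\mu (\mathbb {U}_n,\mathbb {W})\) and the one-space estimate can be computed from the cross-Gramian, as in the following sketch. This is an illustration under the assumption that G-orthonormal bases of \(\mathbb {U}_n\) and \(\mathbb {W}\) are available as matrix columns; the toy data at the end are arbitrary.

```python
import numpy as np

# Sketch in coordinates: Phi (columns = G-orthonormal basis of U_n) and Psi (columns =
# G-orthonormal basis of W) with respect to an SPD Gram matrix G representing the
# U-inner product. The cross-Gramian is C = Psi^T G Phi; beta(U_n, W) is its smallest
# singular value and mu = 1/beta, cf. (3.6)-(3.7). The one-space estimate (3.4) then
# reduces to an m x n least-squares problem.

def one_space_estimate(Phi, Psi, G, w_coef):
    """One-space/PBDW estimate from measurement coefficients w_coef in the Psi-basis."""
    C = Psi.T @ G @ Phi                                   # cross-Gramian (m x n)
    mu = 1.0 / np.linalg.svd(C, compute_uv=False)[-1]
    c, *_ = np.linalg.lstsq(C, w_coef, rcond=None)        # v* = Phi @ c is closest to the data in W
    u_star = Phi @ c + Psi @ (w_coef - C @ c)             # u* = v* + (w - P_W v*), lies in U_w
    return u_star, mu

# usage with random toy bases (orthonormal in the Euclidean inner product, G = I)
rng = np.random.default_rng(2)
N, n, m = 100, 4, 8
G = np.eye(N)
Phi, _ = np.linalg.qr(rng.normal(size=(N, n)))
Psi, _ = np.linalg.qr(rng.normal(size=(N, m)))
u_true = rng.normal(size=N)
u_star, mu = one_space_estimate(Phi, Psi, G, Psi.T @ G @ u_true)
```

By construction the estimate satisfies \(P_\mathbb {W}u^*=w\), i.e., it lies in the affine space \(\mathbb {U}_w\).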

One important observation is that the map \(A_{\mathbb {U}_n}\) is actually independent of \(\varepsilon _n\). In particular, it achieves optimality for the smallest possible containment cylinder

$$\begin{aligned} \mathcal{{K}}(\mathbb {U}_n): = \mathcal{{K}}(\mathbb {U}_n,\text {dist}(\mathcal{M},\mathbb {U}_n)_\mathbb {U}), \end{aligned}$$
(3.9)

and therefore, since \(E_\mathrm{wc}(A_{\mathbb {U}_n},\mathcal{M},\mathbb {W})\le E_\mathrm{wc}(A_{\mathbb {U}_n},\mathcal{{K}}(\mathbb {U}_n),\mathbb {W}) =E_\mathrm{wc}(\mathcal{{K}}(\mathbb {U}_n),\mathbb {W})\),

$$\begin{aligned} E_\mathrm{wc}(A_{\mathbb {U}_n},\mathcal{M},\mathbb {W})\le \mu (\mathbb {U}_n,\mathbb {W})\text {dist}\,(\mathcal{M},\mathbb {U}_n)_\mathbb {U}. \end{aligned}$$
(3.10)

Likewise, the containment \(\mathcal{M}^\varepsilon \subset \mathcal{{K}}(\mathbb {U}_n,\text {dist}\,(\mathcal{M},\mathbb {U}_n)_\mathbb {U}+\varepsilon )\) implies that

$$\begin{aligned} E_\mathrm{wc}(A_{\mathbb {U}_n},\mathcal{M}^\varepsilon ,\mathbb {W}) \le \mu (\mathbb {U}_n,\mathbb {W})(\text {dist}\,(\mathcal{M},\mathbb {U}_n)_\mathbb {U}+\varepsilon ). \end{aligned}$$
(3.11)

On the other hand, the recovery map \(A_{\mathbb {U}_n}\) may be far from optimal over the sets \(\mathcal{M}\) or \(\mathcal{M}^\varepsilon \). This is due to the fact that the cylinders \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n)\) and \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n+\varepsilon )\) may be much larger than \(\mathcal{M}\) or \(\mathcal{M}^\varepsilon \). In particular, it is quite possible that for a particular observation w, one has \(\mathrm{rad} (\mathcal{M}_w)\ll \mathrm{rad} (\mathcal{{K}}_w(\mathbb {U}_n,\varepsilon _n))\). Therefore, we cannot generally expect the one-space method to achieve our goal (2.26). In particular, the condition \(n\le m\), which is necessary to avoid \(\mu (\mathbb {U}_n,\mathbb {W})=\infty \), limits the dimension of an approximating subspace \(\mathbb {U}_n\), and therefore \(\varepsilon _n\) itself is inherently bounded from below. The “dimension budget” has therefore to be used wisely in order to obtain good performance bounds. This typically rules out “generic” approximation spaces such as finite element spaces, and raises the question of which subspace \(\mathbb {U}_n\) yields the best estimator when applying the above method.

3.2 Optimal Affine Recovery

The results of the previous section bring up the question of what is the best choice of the space \(\mathbb {U}_n\) for the given \(\mathcal{M}\). On the one hand, proximity to \(\mathcal{M}\) is desirable since \(\text {dist}\,(\mathcal{M},\mathbb {U}_n)_\mathbb {U}\) enters the error bound. However, favoring proximity may increase \(\mu (\mathbb {U}_n,\mathbb {W})\). Before addressing this question systematically, it is important to note that the above results carry over verbatim when \(\mathbb {U}_n\) is replaced by an affine space \(\mathbb {U}_n = \bar{u} + \widetilde{\mathbb {U}}_n\), where \(\widetilde{\mathbb {U}}_n\subset \mathbb {U}\) is a linear space. This means the reduced model \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon _n)\) is of the form

$$ \mathcal{{K}}(\mathbb {U}_n,\varepsilon _n):= \bar{u} + \mathcal{{K}}(\widetilde{\mathbb {U}}_n,\varepsilon _n). $$

The best worst-case recovery bound is now given by

$$\begin{aligned} E_\mathrm{wc}( \mathcal{{K}}(\mathbb {U}_n,\varepsilon _n),\mathbb {W})= \mu (\widetilde{\mathbb {U}}_n,\mathbb {W})\varepsilon _n. \end{aligned}$$
(3.12)

Intuitively, this may help to better control the angle between \(\mathbb {W}\) and \(\mathbb {U}_n\) by anchoring the affine space at a suitable location (typically near or on \(\mathcal{M}\)). More importantly, it helps in localizing models via parameter domain decompositions that will be discussed later.

The one-space algorithm discussed in the previous section confines the “dimensionality” budget of the approximation spaces \(\mathbb {U}_n\) to \(n\le m\). In view of (3.10), to obtain an overall good estimation accuracy, this space can clearly not be chosen arbitrarily but should be well adapted both to the solution manifold \(\mathcal{M}\) and to the measurement space \(\mathbb {W}\), that is, to the given observation functionals giving rise to the data.

A simple way of adapting a recovery space to \(\mathbb {W}\) is as follows: suppose for a moment that we were able to construct for \(n=1,\dots ,m\), a hierarchy of spaces \( \mathbb {U}^\mathrm{nb}_1\subset \mathbb {U}^\mathrm{nb}_2\subset \cdots \subset \mathbb {U}^\mathrm{nb}_m, \) that approximate \(\mathcal{M}\) in a near-best way, namely

$$\begin{aligned} \text {dist}\,(\mathcal{M},\mathbb {U}_n^\mathrm{nb})_\mathbb {U}\le C d_n(\mathcal{M})_\mathbb {U}. \end{aligned}$$
(3.13)

We may compute along the way the quantities \(\mu (\mathbb {U}^\mathrm{nb}_j,\mathbb {W})\), then choose

$$\begin{aligned} n^* = \mathop {\text {argmin}}_{n\le m} \mu (\mathbb {U}^\mathrm{nb}_n,\mathbb {W})\text {dist}\,(\mathcal{M},\mathbb {U}^\mathrm{nb}_n)_\mathbb {U}, \end{aligned}$$
(3.14)

and take the map \(A_{\mathbb {U}^\mathrm{nb}_{n^*}}\). We sometimes refer to this choice as the “poor man's algorithm”. It is not clear, though, whether \(\mathbb {U}^\mathrm{nb}_{n^*}\) is indeed a near-best choice for state recovery by the one-space method. In other words, one may question whether

$$\begin{aligned} E_{\mathrm{wc}} (A_{\mathbb {U}^\mathrm{nb}_{n^*}},\mathcal{M},\mathbb {W}) \le C \inf _{\mathrm{dim}\widetilde{\mathbb {U}}\le m}E_{\mathrm{wc}}(A_{\widetilde{\mathbb {U}}},\mathcal{M},\mathbb {W}), \end{aligned}$$
(3.15)

holds with a uniform constant \(C<\infty \). In fact, numerical tests strongly suggest otherwise, which motivated in [12] the following alternative to the poor man’s algorithm.
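For concreteness, the selection rule (3.14) might be realized as in the following sketch, where a snapshot matrix S serves as a computable proxy for \(\mathcal{M}\) and the nested spaces are spanned by the leading columns of a matrix Phi_full; both of these, and all names, are assumptions made only for illustration, in the coordinate setting of the previous sketch.

```python
import numpy as np

# Sketch of the "poor man's" selection (3.14): for each candidate dimension n, combine
# mu(U_n, W) (from the cross-Gramian) with the worst projection error of the snapshots
# onto U_n (a proxy for dist(M, U_n)), and pick the dimension minimizing the product,
# i.e. the a priori bound (3.10).

def poor_mans_dimension(Phi_full, Psi, G, S, m):
    best = (np.inf, None)
    for n in range(1, m + 1):
        Phi = Phi_full[:, :n]
        C = Psi.T @ G @ Phi
        mu = 1.0 / np.linalg.svd(C, compute_uv=False)[-1]
        P = Phi @ (Phi.T @ G @ S)                         # G-orthogonal projection onto U_n
        dist = max(np.sqrt((S[:, i] - P[:, i]) @ G @ (S[:, i] - P[:, i]))
                   for i in range(S.shape[1]))
        if mu * dist < best[0]:
            best = (mu * dist, n)
    return best[1], best[0]                               # (n*, value of the bound)
```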

Recall that a given linear space \(\mathbb {U}_n\) determines the linear recovery map \(A_{\mathbb {U}_n}\). Likewise a given affine space \(\mathbb {U}_n\) determines an affine recovery map \(A_{\mathbb {U}_n}\). Conversely, it can be checked that an affine recovery map A determines an affine space \(\mathbb {U}_n\) that allows one to interpret the recovery scheme as a one-space method in the sense that \(A=A_{\mathbb {U}_n}\). Denoting by \(\mathcal{A}\) the class of all affine mappings of the form

$$\begin{aligned} A (w) = w+ z +Bw, \end{aligned}$$
(3.16)

where \(z\in \mathbb {W}^\perp \) and \(B\in \mathcal{L}(\mathbb {W},\mathbb {W}^\perp )\) is linear, we might thus as well directly look for a mapping that minimizes

$$\begin{aligned} E_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W}):= \sup _{u\in \mathcal{M}}\Vert u- A(P_\mathbb {W}u)\Vert _\mathbb {U}= \sup _{u\in \mathcal{M}} \Vert P_{\mathbb {W}^\perp } u - { z}- { B} P_\mathbb {W}u\Vert _\mathbb {U}=:{\mathcal{E}(z,B)} \end{aligned}$$
(3.17)

over \(\mathcal{A}\), i.e., over all \({ (z,B)}\in \mathbb {W}^\perp \times \mathcal{L}(\mathbb {W},\mathbb {W}^\perp )\). It can be shown that indeed a minimizing pair \((z^*,B^*)\) exists, i.e.,

$$ \mathcal{E}(z^*,B^*) = \min _{A\in \mathcal{A}} E_\mathrm{wc}(A,\mathcal{M},\mathbb {W}) =: E_{\mathrm{wc},\mathcal{A}}(\mathcal{M},\mathbb {W}), $$

see [12]. However, the minimization of \(E_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W})\) over \((z,B)\in \mathbb {W}^\perp {\times } \mathcal{L}(\mathbb {W},\mathbb {W}^\perp )\) is far from practically feasible. In fact, each evaluation of \(E_{\mathrm{wc}}(A,\mathcal{M},\mathbb {W})\) requires exploring \(\mathcal{M}\), and B may have its range in the infinite-dimensional space \(\mathbb {W}^\perp \). In order to arrive at a computationally tractable problem, one needs to

  (i)

    Replace \(\mathcal{M}\) by a finite set \(\widetilde{\mathcal{M}}\subset \mathcal{M}\) that should be sufficiently dense. Denseness can be quantified by requiring that \(\widetilde{\mathcal{M}}=\widetilde{\mathcal{M}}^\delta \) is a \(\delta \)-net for \(\mathcal{M}\) for some \(\delta >0\), i.e., for any \(u\in \mathcal{M}\), there exists \(\tilde{u}\in \widetilde{\mathcal{M}}^\delta \) such that \(\Vert u-\tilde{u}\Vert _\mathbb {U}\le \delta \).

  (ii)

    Choose a finite-dimensional space \(\mathbb {U}_L\subset \mathbb {U}\) that approximates \(\mathcal{M}\) to a desired precision \(\text {dist}\,(\mathcal{M},\mathbb {U}_L)_\mathbb {U}\le \eta \), and replace \(\mathbb {W}^\perp \) by the finite-dimensional complement

    $$\begin{aligned} \widetilde{\mathbb {W}}^\perp := \mathbb {U}_L \ominus \mathbb {W}\end{aligned}$$
    (3.18)

    of \(\mathbb {W}\) in \(\mathbb {U}_L\).

The resulting optimization problem reads

$$\begin{aligned} (\tilde{z},\widetilde{B}) = \mathop {\text {argmin}}_{(z,B)\in \widetilde{\mathbb {W}}^\perp \times \mathcal{L}(\mathbb {W},\widetilde{\mathbb {W}}^\perp )} \sup _{u\in \widetilde{\mathcal{M}}^\delta } \Vert P_{\mathbb {W}^\perp } u - { z}- { B} P_\mathbb {W}u\Vert _\mathbb {U}. \end{aligned}$$
(3.19)

It can be solved by primal-dual splitting methods, providing an O(1/k) convergence rate [12].
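We do not reproduce the primal-dual scheme of [12] here. As a much cruder stand-in, the following sketch runs a plain subgradient iteration on the discretized objective (3.19), written in coordinates in which the \(\mathbb {U}\)-norm is Euclidean (orthonormal bases of \(\mathbb {W}\) and of the finite-dimensional complement \(\widetilde{\mathbb {W}}^\perp \)). The names and the simple step-size rule are our own assumptions.

```python
import numpy as np

# Crude illustration of minimizing the discretized worst-case objective (3.19):
# Wc[i] holds the measurement coordinates of a training snapshot, Wp[i] the coordinates
# of its component in W-tilde-perp. The objective is max_i ||Wp[i] - z - B Wc[i]||_2,
# and we descend along a subgradient of this max of norms.

def minimize_worst_case_affine(Wc, Wp, n_iter=2000, step0=0.1):
    N, m = Wc.shape
    L = Wp.shape[1]
    z, B = np.zeros(L), np.zeros((L, m))
    best = (np.inf, z.copy(), B.copy())
    for k in range(1, n_iter + 1):
        R = Wp - z - Wc @ B.T                     # residuals p_i - z - B w_i, shape (N, L)
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))                 # snapshot realizing the current worst case
        if norms[i] < best[0]:
            best = (norms[i], z.copy(), B.copy())
        g = R[i] / max(norms[i], 1e-14)           # direction of steepest decrease of the max term
        step = step0 / np.sqrt(k)
        z = z + step * g
        B = B + step * np.outer(g, Wc[i])
    return best                                    # (worst-case value, z, B)
```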

Due to the perturbations (i) and (ii) of the ideal minimization problem, the resulting pair \((\tilde{z},\widetilde{B})\) is no longer optimal. However, denoting by \(\widetilde{A}(w)=w+\tilde{z}+\widetilde{B}w\) the corresponding affine recovery map, one can show that

$$\begin{aligned} E_\mathrm{wc}(\widetilde{A},\mathcal{M},\mathbb {W}) \le E_{\mathrm{wc},\mathcal{A}}(\mathcal{M},\mathbb {W}) + \eta +C\delta , \end{aligned}$$
(3.20)

where the constant C is the operator norm of the mapping B minimizing (3.17). On the other hand, since the range of any affine mapping A is an affine space of dimension at most m, and is therefore contained in a linear space of dimension at most \(m+1\), one always has \(E_{\mathrm{wc},\mathcal{A}}(\mathcal{M},\mathbb {W})\ge d_{m+1}(\mathcal{M})_\mathbb {U}\). Therefore \((\tilde{z},\widetilde{B})\) satisfies the near-optimal bound

$$\begin{aligned} E_\mathrm{wc}(\widetilde{A},\mathcal{M},\mathbb {W}) \le 2\, E_{\mathrm{wc},\mathcal{A}}(\mathcal{M},\mathbb {W}), \end{aligned}$$
(3.21)

whenever \(\eta \) and \(\delta \) are picked such that

$$\begin{aligned} \eta + C\delta \le d_{m+1}(\mathcal{M})_\mathbb {U}. \end{aligned}$$
(3.22)

The numerical tests in [12] for a model problem of the type (2.6), with piecewise constant checkerboard diffusion coefficients and parameter dimension up to \(d_y=64\), show that this recovery map exhibits significantly better accuracy than the method based on (3.14). It even yields smaller error bounds than the affine estimator minimizing the mean-square risk (2.27). The following section discusses the numerical cost entailed by conditions like (3.22).

3.3 Rate-Optimal Reduced Bases

To keep the dimension L of the space \(\mathbb {U}_L\) in (3.18) small, a near-best subspace \(\mathbb {U}_L^{\mathrm{nb}}\) in the sense of (3.13) would be highly desirable. Likewise, the poor man's scheme (3.14) would benefit from such subspaces. Unfortunately, such near-best subspaces are not practically accessible. The reduced basis method aims to construct subspaces which come close to near-optimality in a sense that we explain next. The main idea is to generate these subspaces by a sequence of elements picked from the manifold \(\mathcal{M}\) itself, by means of a weak-greedy algorithm introduced and studied in [8]. In an idealized form, this algorithm proceeds as follows: given a current space \(\mathbb {U}^\mathrm{wg}_n=\mathrm{span}\{u_1,\dots ,u_n\}\), one takes \(u_{n+1}= u(y_{n+1})\) such that, for some fixed \(\gamma \in ]0,1]\), \( \Vert u_{n+1}- P_{\mathbb {U}_n}u_{n+1}\Vert _\mathbb {U}\ge \gamma \max _{u\in \mathcal{M}}\Vert u-P_{\mathbb {U}_n}u\Vert _\mathbb {U}, \) or equivalently

$$\begin{aligned} \Vert u(y_{n+1})- P_{\mathbb {U}_n}u(y_{n+1})\Vert _\mathbb {U}\ge \gamma \max _{y\in \mathcal {Y}}\Vert u(y)-P_{\mathbb {U}_n}u(y)\Vert _\mathbb {U}. \end{aligned}$$
(3.23)

Then, one defines \(\mathbb {U}^\mathrm{wg}_{n+1}=\mathrm{span}\{u_1,\dots ,u_{n+1}\}\). While, unfortunately, the weak-greedy algorithm does not in general produce spaces satisfying (3.13), it does come close. Namely, it has been shown in [3, 19] that the spaces \(\mathbb {U}^\mathrm{wg}_n\) are rate-optimal in the following sense:

  (i)

    For any \(s>0\) one has

    $$\begin{aligned} d_n(\mathcal{M})_\mathbb {U}\le C (n+1)^{-s}, \; n\ge 0 \implies \text {dist}\,(\mathcal{M},\mathbb {U}^\mathrm{wg}_n)_\mathbb {U}\le \widetilde{C} (n+1)^{-s}, \; n\ge 0, \end{aligned}$$
    (3.24)

    where \(\widetilde{C}\) depends on \(C,s,\gamma \).

  (ii)

    For any \(\beta >0\), one has

    $$\begin{aligned} d_n(\mathcal{M})_\mathbb {U}\le C e^{-c n^\beta }, \; n\ge 0 \implies \text {dist}\,(\mathcal{M},\mathbb {U}^\mathrm{wg}_n)_\mathbb {U}\le \widetilde{C} e^{-\tilde{c} n^\beta }, \; n\ge 0, \end{aligned}$$
    (3.25)

    where the constants \(\tilde{c}, \widetilde{C}\) depend on \(c,C,\beta ,\gamma \).

In the form described above, the weak-greedy concept seems infeasible since it would, in principle, require computing the exact solution u(y) for all values of \(y\in \mathcal {Y}\), i.e., exploring the whole solution manifold. However, its practical applicability is facilitated when there exists a tight surrogate \(R(y,\mathbb {U}_n)\) satisfying

$$\begin{aligned} c_R R(y,\mathbb {U}_n) \le \Vert u(y)-P_{\mathbb {U}_n}u(y)\Vert _\mathbb {U}=\text {dist}\,(u(y),\mathbb {U}_n)\le C_R R(y,\mathbb {U}_n),\quad y\in \mathcal {Y}, \end{aligned}$$
(3.26)

for uniform constants \(0<c_R\le C_R<\infty \), which can be evaluated at affordable cost. Then, maximization of \(R(y,\mathbb {U}_n)\) over \(\mathcal {Y}\) amounts to the weak-greedy step (3.23) with \(\gamma :=\frac{c_R}{C_R}\). According to [18], the validity of the following two conditions indeed allows one to derive computable surrogates that satisfy (3.26):

  (i)

    The underlying parametric family of PDEs (2.1) permits a uniformly stable variational formulation (2.4), and one has affine parameter dependence (2.9);

  (ii)

    The discrete projection \(\Pi _{\mathbb {U}_n}\) (of Galerkin or Petrov-Galerkin type) is quasi-optimal, i.e., the resulting errors are uniformly comparable to the best approximation error.

Conditions (i) and (ii) ensure, in view of (2.5), that \(\Vert u(y)- P_{\mathbb {U}_n}u(y)\Vert _\mathbb {U}\sim \Vert \mathcal{R}(\Pi _{\mathbb {U}_n}u(y),y)\Vert _{\mathbb {V}'}\) holds uniformly in \(y\in \mathcal {Y}\). Thus,

$$\begin{aligned} R(y,\mathbb {U}_n) := \Vert \mathcal{R}(\Pi _{\mathbb {U}_n}u(y),y)\Vert _{\mathbb {V}'} = \sup _{v\in \mathbb {V}} \frac{\big (\mathcal{R}(\Pi _{\mathbb {U}_n}u(y),y)\big ) (v)}{\Vert v\Vert _\mathbb {V}} \end{aligned}$$
(3.27)

satisfies (3.26) and is therefore a tight surrogate for \(\text {dist}\,(\mathcal{M},\mathbb {U}_n)_\mathbb {U}\). In the elliptic case (2.6) under assumption (2.7), (i) and (ii) hold and the above comments reflect standard practice. For the wider scope of stable but unsymmetric variational formulations [6, 16, 23] the inf-sup conditions (2.4) imply (i), but the Galerkin projection in (ii) needs to be replaced by a stable Petrov-Galerkin projection with respect to suitable test spaces \(\mathbb {V}_n\) accompanying the reduced trial spaces \(\mathbb {U}_n\). It has been shown in [18] how to generate such test spaces with the aid of a double-greedy strategy, see also [16].

The main pay-off of using the surrogate \(R(y,\mathbb {U}_n)\) is that one no longer needs to compute u(y) but only the low-dimensional projection \(\Pi _{\mathbb {U}_n}u(y)\), by solving for each y an \(n\times n\) system which itself can be rapidly assembled thanks to the affine parameter dependence [22]. However, one still faces the problem of its exact maximization over \(y\in \mathcal {Y}\). A standard approach is to maximize instead over a discrete training set \(\widetilde{\mathcal {Y}}_{n}\subset \mathcal {Y}\), which in turn induces a discretization of the solution manifold

$$\begin{aligned} \widetilde{\mathcal{M}}_n=\{u(y)\, :\, y\in \widetilde{\mathcal {Y}}_n\}. \end{aligned}$$
(3.28)

The resulting weak-greedy algorithm can be shown to remain rate optimal in the sense of (3.24) and (3.25) if the discretization is fine enough so that \(\widetilde{\mathcal{M}}_n\) constitutes an \(\varepsilon _n\)-approximation net of \(\mathcal{M}\) where \(\varepsilon _n\) does not exceed \(c\text {dist}\,(\mathcal{M},\mathbb {U}^\mathrm{wg}_n)_\mathbb {U}\) for a suitable constant \(0<c<1\). In the current regime of large or even infinite parameter dimensionality, this becomes prohibitive because \(\#\widetilde{\mathcal {Y}}_n\) would then typically scale like \(O\big (\varepsilon _n^{-cd_y}\big )\), [10].

As a remedy it has been proposed in [10] to use training sets \(\widetilde{\mathcal {Y}}_n\) that are generated by randomly sampling \(\mathcal {Y}\), and ask that the objective of rate optimality is met with high probability. This turns out to be achievable with training sets of much less prohibitive size. In an informal and simplified manner the main result can be stated as follows.

Theorem 1

Given any target accuracy \(\varepsilon >0\) and some \(0<\eta <1\), the weak-greedy reduced basis algorithm based on choosing, at each step, \(N = N(\varepsilon ,\eta )\sim |\ln \eta |+|\ln \varepsilon |\) random training points in \(\mathcal {Y}\) has the following properties with probability at least \(1-\eta \): it terminates with \(\text {dist}\,(\mathcal{M},\mathbb {U}_{n(\varepsilon )})_\mathbb {U}\le { \varepsilon }\) as soon as the maximum of the surrogate over the current training set falls below \(c\varepsilon ^{1+a}\) for some \(c,a>0\). Moreover, if \(d_n(\mathcal{M})_\mathbb {U}\le Cn^{-s}\), then \(n(\varepsilon )\lesssim \varepsilon ^{-(1+b)/s}\) for some \(b>0\). The constants c, a, b depend on the constants in (3.26), as well as on the rate r of polynomial approximability of the parameter-to-solution map \(y\mapsto u(y)\). The larger s and r, the smaller a and b, and the closer the performance becomes to the ideal one.
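A minimal sketch of the weak-greedy loop with randomly drawn training points might look as follows; it reuses the hypothetical solve_state and d_y from the first sketch and, for simplicity, lets the exact projection error play the role of the surrogate (3.27), whereas a practical implementation would rely on residual-based surrogates and cheap reduced (Petrov-)Galerkin solves.

```python
import numpy as np

# Sketch of a weak-greedy loop with random training sets (cf. Sect. 3.3 and Theorem 1).
# Assumes solve_state and d_y from the first sketch are in scope; exact projection errors
# stand in for the residual surrogate only to keep the illustration short.

def weak_greedy_random(eps, n_train=100, n_max=30, seed=0):
    rng = np.random.default_rng(seed)
    basis = []                                           # orthonormal snapshot basis
    for n in range(n_max):
        Y = rng.uniform(-1.0, 1.0, size=(n_train, d_y))  # fresh random training set
        snaps = np.column_stack([solve_state(y) for y in Y])
        Q = np.column_stack(basis) if basis else np.zeros((snaps.shape[0], 0))
        errs = np.linalg.norm(snaps - Q @ (Q.T @ snaps), axis=0)
        i = int(np.argmax(errs))
        if errs[i] <= eps:                               # stopping criterion
            break
        v = snaps[:, i] - Q @ (Q.T @ snaps[:, i])        # Gram-Schmidt step
        basis.append(v / np.linalg.norm(v))
    return np.column_stack(basis) if basis else None

U_wg = weak_greedy_random(eps=1e-3)
```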

4 Nonlinear Models

4.1 Piecewise Affine Reduced Models

As already noted, schemes based on linear or affine reduced models of the form \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon )\) can, in general, not be expected to realize the benchmark (2.26) discussed earlier in Sect. 2. The convexity of the containment set \(\mathcal{{K}}(\mathbb {U}_n,\varepsilon )\) may cause the reconstruction error to be significantly larger than \(\delta _\varepsilon (\mathcal{M},\mathbb {W})\). Another way of understanding this limitation is that, in order to make \(\varepsilon \) small, one is forced to raise the dimension n of \(\mathbb {U}_n\), making the quantity \(\mu (\mathbb {U}_n,\mathbb {W})\) larger and eventually infinite if \(n>m\).

To overcome this principal limitation one needs to resort to nonlinear models that better capture the non-convex geometry of \(\mathcal{M}\). One natural approach consists in replacing the single space \(\mathbb {U}_n\) by a family \((\mathbb {U}^k)_{k=1,\dots ,K}\) of affine spaces

$$\begin{aligned} \mathbb {U}^k=\overline{u}_k+\widetilde{\mathbb {U}}^k, \quad \dim (\widetilde{\mathbb {U}}^k)=n_k \le m, \end{aligned}$$
(4.1)

each of which aims to approximate a portion \(\mathcal{M}_k\) of \(\mathcal{M}\) to a prescribed target accuracy while simultaneously controlling \(\mu (\widetilde{\mathbb {U}}^k,\mathbb {W})\): fixing \(\varepsilon >0\), we assume that we have at hand a partition of \(\mathcal{M}\) into portions

$$\begin{aligned} \mathcal{M}= \bigcup _{k=1}^{K} \mathcal{M}_k \end{aligned}$$
(4.2)

such that

$$\begin{aligned} \text {dist}\,(\mathcal{M}_k,\mathbb {U}^k)_\mathbb {U}\le \varepsilon _k, \quad \mathrm{and}\quad \mu (\widetilde{\mathbb {U}}^k,\mathbb {W})\varepsilon _k \le \varepsilon , \quad k=1,\ldots ,K. \end{aligned}$$
(4.3)

One way of obtaining such a partition is through a greedy splitting procedure of the parameter domain \(\mathcal {Y}= [-1,1]^{d_y}\), which is detailed in [9]. The procedure terminates when for each cell \(\mathcal {Y}_k\) the corresponding portion \(\mathcal{M}_k\) of the manifold can be associated with an affine space \(\mathbb {U}^k\) satisfying these properties. We are ensured that this eventually occurs since, for a sufficiently fine cell \(\mathcal {Y}_k\), one has \(\mathrm{rad}(\mathcal{M}_k)\le \varepsilon \), which means that we could then use a zero-dimensional affine space \(\mathbb {U}^k=\{\bar{u}_k\}\), for which we know that \(\mu (\widetilde{\mathbb {U}}^k,\mathbb {W})=1\). In this piecewise affine model, the containment property is now

$$\begin{aligned} \mathcal{M}\subset \bigcup _{k=1}^K \mathcal{{K}}(\mathbb {U}^k,\varepsilon _k), \end{aligned}$$
(4.4)

and the cardinality K of the partition depends on the prescribed \(\varepsilon \).

For a given measurement \(w\in \mathbb {W}\), we may now compute the state estimates

$$\begin{aligned} u^*_k(w)= A_{\mathbb {U}^k}(w), \quad k=1,\dots ,K, \end{aligned}$$
(4.5)

by the affine variant of the one-space method from (3.4). Since \(u\in \mathcal{M}_{k_0}\) for some value \(k_0\), we are ensured that

$$\begin{aligned} \Vert u-u^*_{k_0}(w)\Vert _\mathbb {U}\le \varepsilon , \end{aligned}$$
(4.6)

for this particular choice. However, \(k_0\) is unknown to us, and one has to rely on the data w in order to decide which of the affine models is most appropriate for the recovery. One natural model selection criterion can be derived if, for any \(\overline{u}\in \mathbb {U}\), we have at our disposal a computable surrogate \(S(\overline{u})\) that is equivalent to the distance from \(\overline{u}\) to \(\mathcal{M}\), that is,

$$\begin{aligned} c S(\bar{u}) \le \text {dist}\,(\bar{u},\mathcal{M})_\mathbb {U}\le C S(\bar{u}), \quad \text {dist}\,(\bar{u},\mathcal{M})_\mathbb {U}=\min _{y\in \mathcal {Y}} \Vert \overline{u}-u(y)\Vert _\mathbb {U}, \end{aligned}$$
(4.7)

for some fixed \(0<c\le C\). We give an instance of such a computable surrogate in Sect. 4.2 below. The selection criterion then consists in picking the index \(k^*\) that minimizes this surrogate among the different available state estimates, that is,

$$\begin{aligned} u^*(w) := u^*_{k^*}(w)=\mathop {\text {argmin}}\,\{S(u_k^*(w)) \,:\, k=1,\dots ,K\}. \end{aligned}$$
(4.8)
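Given the local estimates \(u^*_k(w)\) (computed, e.g., with the one-space routine sketched earlier applied to each local affine space) and a callable surrogate implementing (4.7)/(4.13), the selection (4.8) is straightforward; the sketch below only fixes names, which are our own assumptions.

```python
import numpy as np

# Sketch of the model selection (4.8): among the K local estimates u_k^*(w), pick the
# one with the smallest value of a computable surrogate S(u) that is equivalent to
# dist(u, M) in the sense of (4.7), e.g. the residual surrogate (4.13).

def select_estimate(local_estimates, surrogate):
    """local_estimates: list of state vectors u_k^*(w); surrogate: callable u -> S(u)."""
    values = [surrogate(u) for u in local_estimates]
    k_star = int(np.argmin(values))
    return local_estimates[k_star], k_star
```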

The following result, established in [9], shows that this estimator now realizes the benchmark (2.26) up to a multiplication of \(\varepsilon \) by \(\kappa := C/c\), where c, C are the constants from (4.7).

Theorem 2

Assume that (4.2) and (4.3) hold. For any \(u\in \mathcal{M}\), if \(w= P_\mathbb {W}u\), one has

$$\begin{aligned} \Vert u- u^*(w)\Vert _\mathbb {U}\le \delta _{\kappa \varepsilon }(\mathcal{M},\mathbb {W}), \end{aligned}$$
(4.9)

where \(\delta _\varepsilon (\mathcal{M},\mathbb {W})\) is given by (2.25).

4.2 Approximate Metric Projection and Parameter Estimation

A practically affordable realization of the surrogate \(S(\overline{u})\), providing a near-metric projection distance to \(\mathcal{M}\), is a key ingredient of the above nonlinear recovery scheme. Since it has further useful implications, we add a few comments on this matter.

As already observed in (2.5), whenever (2.1) admits a stable variational formulation with respect to a suitable pair \((\mathbb {U},\mathbb {V})\) of trial and test spaces, the distance of any \(\overline{u}\in \mathbb {U}\) to any \(u(y)\in \mathcal{M}\) is uniformly equivalent to the residual of the PDE in \(\mathbb {V}'\)

$$\begin{aligned} c\Vert \mathcal{R}(\bar{u},y)\Vert _{\mathbb {V}^{\prime }} \le \Vert u(y)-\bar{u}\Vert _\mathbb {U}\le C\Vert \mathcal{R}(\bar{u},y)\Vert _{\mathbb {V}^{\prime }}, \end{aligned}$$
(4.10)

with \(c=C_b^{-1}, C=c_b^{-1}\) from (2.5). Assume in addition that \(\mathcal{R}(u,y)\) depends affinely on \(y\in \mathcal {Y}\), according to (2.9). Then, minimizing \( \Vert \mathcal{R}(\bar{u},y)\Vert _{\mathbb {V}^{\prime }}\) over y is equivalent to solving a constrained least squares problem

$$\begin{aligned} \bar{y} = \mathop {\text {argmin}}_{y\in \mathcal {Y}}\Vert \mathbf{g}- \mathbf{M}y\Vert _2, \end{aligned}$$
(4.11)

where \(\mathbf{M}\) is a matrix of size \(d_y\times d_y\) resulting from Riesz-lifts of the functionals \(\mathcal{R}_j (\bar{u})\).
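As an illustration of how (4.11) might be solved numerically, the following sketch uses a box-constrained least-squares solver under the simplifying assumption (ours, differing in detail from the \(d_y\times d_y\) system mentioned above) that, after Riesz-lifting into an orthonormal basis of a discrete test space, the dual norm becomes Euclidean, \(\mathbf{g}\) is the lifted vector obtained from \(\mathcal{R}_0(\bar{u})\), and the columns of \(\mathbf{M}\) are the lifted components associated with the \(\mathcal{R}_j\bar{u}\).

```python
import numpy as np
from scipy.optimize import lsq_linear

# Sketch of the constrained least-squares problem (4.11): minimize ||g - M y||_2 over
# the box Y = [-1, 1]^{d_y}. The construction of g and M from Riesz lifts is assumed
# to have been done elsewhere; here they are replaced by synthetic data.

def estimate_parameter(g, M):
    res = lsq_linear(M, g, bounds=(-1.0, 1.0))   # y_bar = argmin_{y in [-1,1]^d} ||g - M y||_2
    return res.x

# toy usage
rng = np.random.default_rng(3)
d_y, n_h = 6, 80
M = rng.normal(size=(n_h, d_y))
y_true = rng.uniform(-1, 1, d_y)
g = M @ y_true + 1e-3 * rng.normal(size=n_h)
y_bar = estimate_parameter(g, M)
```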

The solution to this problem therefore satisfies

$$\begin{aligned} \Vert \bar{u} - u(\bar{y})\Vert _\mathbb {U}\le \kappa \inf _{y\in \mathcal {Y}} \Vert \bar{u} - u(y)\Vert _\mathbb {U}=\kappa \text {dist}\,(\bar{u},\mathcal{M})_\mathbb {U}, \end{aligned}$$
(4.12)

where \(\kappa = C/c =C_b/c_b\) is the quotient between the equivalence constants in (4.10). The surrogate

$$\begin{aligned} S(\bar{u}):= \Vert \mathcal{R}(\bar{u},\bar{y})\Vert _{\mathbb {V}^{\prime }} = \min _{y\in \mathcal {Y}} \Vert \mathcal{R}(\bar{u},y)\Vert _{\mathbb {V}^{\prime }} \end{aligned}$$
(4.13)

for the metric projection distance of \(\overline{u}\) onto \(\mathcal{M}\) then obviously satisfies (4.7). It is indeed computable at affordable cost using (an approximation in \(\mathbb {V}_h\subset \mathbb {V}\) to) its Riesz-lifted version \(\Vert e(\bar{u},y)\Vert _\mathbb {V}=\Vert \mathcal{R}(\bar{u},y)\Vert _{\mathbb {V}'}\), assembled from the Riesz lifts of the components \(\mathcal{R}_j (\bar{u})\) in the affine expansion (2.9); see [9] for details.

Since solving the above problem provides an admissible parameter value \(\overline{y}\in \mathcal {Y}\), this also has some immediate bearing on parameter estimation. Suppose we wish to estimate from \(w= P_\mathbb {W}u(y^*)\) the unknown parameter \(y^*\in \mathcal {Y}\). Assume further that A is any given linear or nonlinear recovery map. Computing along the above lines

$$ \bar{y}_w = \mathop {\text {argmin}}_{y\in \mathcal {Y}}\Vert \mathcal{R}(A(w),y)\Vert _{\mathbb {V}^{\prime }} $$

we have

$$\begin{aligned}&\Vert u(y^*)- u(\bar{y}_w)\Vert _\mathbb {U}\le \Vert u(y^*)- A(w)\Vert _\mathbb {U}+ \Vert A(w)- u(\bar{y}_w)\Vert _\mathbb {U}\nonumber \\&\qquad \quad \le E_\mathrm{wc}(A,\mathcal{M},\mathbb {W}) + \kappa \text {dist}\,(A(w),\mathcal{M})_\mathbb {U}\le (1+\kappa ) E_\mathrm{wc}(A,\mathcal{M},\mathbb {W}).\qquad \quad \end{aligned}$$
(4.14)

We consider now the specific elliptic model (2.6) with affine diffusion coefficients a(y) given by (2.10). For this model it was established in [5] that, for strictly positive f and under certain regularity assumptions on a(y) as a function of \(x\in \Omega \), errors in the parameters can be controlled by errors in the corresponding states. Specifically, when \(a(y)\in H^1(\Omega )\) uniformly in \(y\in \mathcal {Y}\), one has an inverse stability estimate of the form

$$\begin{aligned} \Vert a(y)- a(\tilde{y})\Vert _{L_2(\Omega )}\le C\Vert u(y)- u(\tilde{y})\Vert ^{1/6}_\mathbb {U}. \end{aligned}$$
(4.15)

Thus, whenever the recovery map A satisfies (4.9) for some prescribed \(\varepsilon >0\), we obtain a parameter estimation bound of the form

$$ \Vert a(y^*)- a(\bar{y}_w)\Vert _{L_2(\Omega )}\le C \delta _{\kappa \varepsilon }(\mathcal{M},\mathbb {W})^{1/6}. $$

Note that when the basis functions \(\theta _j\) are \(L_2\)-orthogonal, \(\Vert a(y^*)- a(\bar{y}_w)\Vert _{L_2(\Omega )}\) is equivalent to a (weighted) \(\ell _2\) norm of \(y^*-\bar{y}_w\).

4.3 Concluding Remarks

The affine or piecewise affine recovery scheme hinges on the ability to approximate a solution manifold effectively by linear or affine spaces, globally or locally. As explained earlier, this is true for problems of elliptic or parabolic type that may include convective terms, as long as these are dominated by diffusion. This may, however, no longer be the case when dealing with pure transport equations or models involving strongly dominating convection.

An interesting alternative would then be to adopt a stochastic model according to (2.27) and (2.28) that allows one to view the construction of the recovery map as a regression problem. In particular, when dealing with transport models, natural candidates for parametrizing a reduced model are deep neural networks. However, properly adapting the architecture, regularization and training principles poses wide open questions addressed in current work in progress.