Abstract
Given input–output pairs of an elliptic partial differential equation (PDE) in three dimensions, we derive the first theoretically rigorous scheme for learning the associated Green’s function G. By exploiting the hierarchical low-rank structure of G, we show that one can construct an approximant to G that converges almost surely and achieves a relative error of \(\mathcal {O}(\varGamma _\epsilon ^{-1/2}\log ^3(1/\epsilon )\epsilon )\) using at most \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output training pairs with high probability, for any \(0<\epsilon <1\). The quantity \(0<\varGamma _\epsilon \le 1\) characterizes the quality of the training dataset. Along the way, we extend the randomized singular value decomposition algorithm for learning matrices to Hilbert–Schmidt operators and characterize the quality of covariance kernels for PDE learning.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Can one learn a differential operator from pairs of solutions and righthand sides? If so, how many pairs are required? These two questions have received significant research attention [17, 31, 34, 43]. From data, one hopes to eventually learn physical laws of nature or conservation laws that elude scientists in the biological sciences [63], computational fluid dynamics [49], and computational physics [45]. The literature contains many highly successful practical schemes based on deep learning techniques [38, 48]. However, the challenge remains to understand when and why deep learning is effective theoretically. This paper describes the first theoretically justified scheme for discovering scalar-valued elliptic partial differential equations (PDEs) in three variables from input–output data and provides a rigorous learning rate. While our novelties are mainly theoretical, we hope to motivate future practical choices in PDE learning.
We suppose that there is an unknown second-order uniformly elliptic linear PDE operatorFootnote 1\(\mathcal {L}:\mathcal {H}^2(D)\cap \mathcal {H}_0^1(D) \rightarrow L^2(D)\) with a bounded domain \(D\subset \mathbb {R}^3\) with Lipschitz smooth boundary [16], which takes the form
Here, for every \(x\in D\), we have that \(A(x)\in \mathbb {R}^{3\times 3}\) is a symmetric positive definite matrix with bounded coefficient functions so thatFootnote 2\(A_{ij}\in L^{\infty }(D)\), \(c\in L^r(D)\) with \(r\ge 3\), \(d\in L^s(D)\) for \(s\ge 3/2\), and \(d(x)\ge 0\) [28]. We emphasize that the regularity requirements on the variable coefficients are quite weak.
The goal of PDE learning is to discover the operator \(\mathcal {L}\) from \(N\ge 1\) input–output pairs, i.e., \(\{(f_j,u_j)\}_{j=1}^N\), where \(\mathcal {L}u_j = f_j\) and \(u_j|_{\partial D} = 0\) for \(1\le j\le N\). There are two main types of PDE learning tasks: (1) Experimentally determined input–output pairs, where one must do the best one can with the predetermined information and (2) algorithmically determined input–output pairs, where the data-driven learning algorithm can select \(f_1,\ldots ,f_N\) for itself. In this paper, we focus on the PDE learning task where we have algorithmically determined input–output pairs. In particular, we suppose that the functions \(f_1,\ldots ,f_N\) are generated at random and are drawn from a Gaussian process (GP) (see Sect. 2.3). To keep our theoretical statements manageable, we restrict our attention to PDEs of the form:
Lower-order terms in Eq. (1) should cause few theoretical problems [3], though our algorithm and our bounds get far more complicated.
The approach that dominates the PDE learning literature is to directly learn \(\mathcal {L}\) by either (1) learning parameters in the PDE [4, 64], (2) using neural networks to approximate the action of the PDE on functions [45,46,47,48,49], or (3) deriving a model by composing a library of operators with sparsity considerations [9, 35, 52, 53, 59, 60]. Instead of trying to learn the unbounded, closed operator \(\mathcal {L}\) directly, we follow [6, 17, 18] and discover the Green’s function associated with \(\mathcal {L}\). That is, we attempt to learn the function \(G:D\times D\rightarrow \mathbb {R}^+\cup \{\infty \}\) such that [16]
Seeking G, as opposed to \(\mathcal {L}\), has several theoretical benefits:
-
1.
The integral operator in Eq. (3) is compact [15], while \(\mathcal {L}\) is only closed [14]. This allows G to be rigorously learned by input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\), as its range can be approximated by finite-dimensional spaces (see Theorem 3).
-
2.
It is known that G has a hierarchical low-rank structure [3, Theorem 2.8]: for \(0<\epsilon <1\), there exists a function \(G_k(x,y) = \sum _{j=1}^k g_j(x)h_j(y)\) with \(k = \mathcal {O}(\log ^4(1/\epsilon ))\) such that [3, Theorem 2.8]
$$\begin{aligned} \left\| G - G_k\right\| _{L^2(X\times Y)}\le \epsilon \left\| G\right\| _{L^2(X\times {\hat{Y}})}, \end{aligned}$$where \(X,Y\subseteq D\) are sufficiently separated domains, and \(Y\subseteq {\hat{Y}}\subseteq D\) denotes a larger domain than Y (see Theorem 4 for the definition). The further apart X and Y, the faster the singular values of G decay. Moreover, G also has an off-diagonal decay property [19, 25]:
$$\begin{aligned} G(x,y) \le \frac{c}{\Vert x - y\Vert _2}\Vert G\Vert _{L^2(D\times D)}, \qquad x\ne y\in D, \end{aligned}$$where c is a constant independent of x and y. Exploiting these structures of G leads to a rigorous algorithm for constructing a global approximant to G (see Sect. 4).
-
3.
The function G is smooth away from its diagonal, allowing one to efficiently approximate it [19].
Once a global approximation \({\tilde{G}}\) has been constructed for G using input–output pairs, given a new righthand side f one can directly compute the integral in Eq. (3) to obtain the corresponding solution u to Eq. (1). Usually, numerically computing the integral in Eq. (3) must be done with sufficient care as G possesses a singularity when \(x = y\). However, our global approximation \({\tilde{G}}\) has an hierarchical structure and is constructed as 0 near the diagonal. Therefore, for each fixed \(x\in D\), we simply recommend that \(\int _{D} {\tilde{G}}(x,y) f_j(y)d y\) is partitioned into the panels that corresponds to the hierarchical decomposition, and then discretized each panel with a quadrature rule.
1.1 Main Contributions
There are two main contributions in this paper: (1) the generalization of the randomized singular value decomposition (SVD) algorithm for learning matrices from matrix-vector products to Hilbert–Schmidt (HS) operators and (2) a theoretical learning rate for discovering Green’s functions associated with PDEs of the form Eq. (2). These contributions are summarized in Theorem 1 and 3.
Theorem 1 says that, with high probability, one can recover a near-best rank k HS operator using \(k+p\) operator-function products, for a small integer p. In the bound of the theorem, a quantity, denoted by \(0< \gamma _k \le 1\), measures the quality of the input–output training pairs (see Sects. 3.1 and 3.4). We then combine Theorem 1 with the theory of Green’s functions for elliptic PDEs to derive a theoretical learning rate for PDEs.
In Theorem 3, we show that Green’s functions associated with uniformly elliptic PDEs in three dimensions can be recovered using \(N=\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output pairs \((f_j,u_j)_{j=1}^N\) to within an accuracy of \(\mathcal {O}(\varGamma _\epsilon ^{-1/2}\log ^3(1/\epsilon )\epsilon )\) with high probability, for \(0<\epsilon <1\). Our learning rate associated with uniformly elliptic PDEs in three variables is therefore \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\). The quantity \(0< \varGamma _\epsilon \le 1\) (defined in Sect. 4.4.2) measures the quality of the GP used to generate the random functions \(\{f_j\}_{j=1}^N\) for learning G. We emphasize that the number of training pairs is small only if the GP’s quality is high. The probability bound in Theorem 3 implies that the constructed approximation is close to G with high probability and converges almost surely to the Green’s function as \(\epsilon \rightarrow 0\).
1.2 Organization of Paper
The paper is structured as follows. In Sect. 2, we briefly review HS operators and GPs. We then generalize the randomized SVD algorithm to HS operators in Sect. 3. Next, in Sect. 4, we characterize the learning rate for PDEs of the form of Eq. (2) (see Theorem 3). Finally, we conclude and discuss potential further directions in Sect. 5.
2 Background Material
We begin by reviewing quasimatrices (see Sect. 2.1), HS operators (see Sect. 2.2), and GPs (see Sect. 2.3).
2.1 Quasimatrices
Quasimatrices are an infinite-dimensional analogue of tall-skinny matrices [57]. Let \(D_1,D_2\subseteq \mathbb {R}^d\) be two domains with \(d\ge 1\) and denote by \(L^2(D_{1})\) the space of square-integrable functions defined on \(D_{1}\). Many of results in this paper are easier to state using quasimatrices. We say that \(\varvec{\Omega }\) is a \(D_1\times k\) quasimatrix, if \(\varvec{\Omega }\) is a matrix with k columns where each column is a function in \(L^2(D_1)\). That is,
Quasimatrices are useful to define analogues of matrix operations for HS operators [11, 56,57,58]. For example, if \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) is a HS operator, then we write \(\mathscr {F}\varvec{\Omega }\) to denote the quasimatrix obtained by applying \(\mathscr {F}\) to each column of \(\varvec{\Omega }\). Moreover, we write \(\varvec{\Omega }^*\varvec{\Omega }\) and \(\varvec{\Omega }\varvec{\Omega }^*\) to mean the following:
where \(\langle \cdot , \cdot \rangle \) is the \(L^2(D_1)\) inner-product. Many operations for rectangular matrices in linear algebra can be generalized to quasimatrices [57].
2.2 Hilbert–Schmidt Operators
HS operators are an infinite-dimensional analogue of matrices acting on vectors. Since \(L^2(D_{1})\) is a separable Hilbert space, there is a complete orthonormal basis \(\{e_j\}_{j=1}^\infty \) for \(L^2(D_{1})\). We call \(\mathscr {F}: L^2(D_1)\rightarrow L^2(D_2)\) a HS operator [23, Ch. 4] with HS norm \(\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\) if \(\mathscr {F}\) is linear and
The archetypical example of an HS operator is an HS integral operator \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) defined by
where \(G\in L^2(D_2\times D_1)\) is the kernel of \(\mathscr {F}\) and \(\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}=\Vert G\Vert _{L^2(D_2\times D_1)}\). Since HS operators are compact operators, they have an SVD [23, Theorem 4.3.1]. That is, there exists a nonnegative sequence \(\sigma _1\ge \sigma _2\ge \cdots \ge 0\) and an orthonormal basis \(\{q_{j}\}_{j=1}^\infty \) for \(L^2(D_2)\) such that for any \(f\in L^2(D_1)\) we have
where the equality holds in the \(L^2(D_2)\) sense. Note that we use the complete SVD, which includes singular functions associated with the kernel of \(\mathscr {F}\). Moreover, one finds that \(\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^\infty \sigma _j^2\), which shows that the HS norm is an infinite-dimensional analogue of the Frobenius matrix norm \(\Vert \cdot \Vert _{F }\). In the same way that truncating the SVD after k terms gives a best rank k matrix approximation, truncating Eq. (4) gives a best approximation in the HS norm. That is, [23, Theorem 4.4.7]
In this paper, we are interested in constructing an approximation to G in Eq. (3) from input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\) such that \(u_j = \mathscr {F}f_j\).
Throughout this paper, the HS operator denoted by \(\varvec{\Omega }\varvec{\Omega }^*\mathscr {F}: L^2(D_1)\rightarrow L^2(D_2)\) is given by \(\varvec{\Omega }\varvec{\Omega }^*\mathscr {F}f = \sum _{j=1}^k \langle \omega _j,\mathscr {F}f\rangle \omega _j\). If we consider the operator \(\varvec{\Omega }^*\mathscr {F}: L^2(D_1)\rightarrow \mathbb {R}^k\), then \(\Vert \varvec{\Omega }^*\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^\infty \Vert \mathscr {F}e_j\Vert _{2}^2\). Similarly, for \(\mathscr {F}\varvec{\Omega }: \mathbb {R}^k\rightarrow L^2(D_2)\) we have \(\Vert \mathscr {F}\varvec{\Omega }\Vert _{{{\,\mathrm{HS}\,}}}^2 = \sum _{j=1}^k \Vert \mathscr {F}{\tilde{e}}_j\Vert _{L^2(D_2)}^2\), where \(\{{\tilde{e}}_j\}_{j=1}^k\) is an orthonormal basis of \(\mathbb {R}^k\). Moreover, if \(\varvec{\Omega }\) has full column rank then \(\mathbf {P}_{\varvec{\Omega }}\mathscr {F}= \varvec{\Omega }(\varvec{\Omega }^*\varvec{\Omega })^{\dagger }\varvec{\Omega }^*\mathscr {F}\) is the orthogonal projection of the range of \(\mathscr {F}\) onto the column space of \(\varvec{\Omega }\). Here, \((\varvec{\Omega }^*\varvec{\Omega })^{\dagger }\) is the pseudo-inverse of \(\varvec{\Omega }^*\varvec{\Omega }\).
2.3 Gaussian Processes
A GP is an infinite-dimensional analogue of a multivariate Gaussian distribution and a function drawn from a GP is analogous to a randomly generated vector. If \(K: D\times D\rightarrow \mathbb {R}\) is a continuous symmetric positive semidefinite kernel, where \(D\subseteq \mathbb {R}^d\) is a domain, then a GP is a stochastic process \(\{X_t,\,t\ge 0\}\) such that for every finite set of indices \(t_1,\ldots ,t_n\ge 0\) the vector of random variables \((X_{t_1},\ldots ,X_{t_n})\) is a multivariate Gaussian distribution with mean \((0,\ldots ,0)\) and covariance \(K_{ij} = K(t_i,t_j)\) for \(1\le i,j\le n\). We denote a GP with mean \((0,\ldots ,0)\) and covariance K by \(\mathcal {GP}(0,K)\).
Since K is a continuous symmetric positive semidefinite kernel, it has nonnegative eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots \ge 0\) and there is an orthonormal basis of eigenfunctions \(\{\psi _j\}_{j=1}^\infty \) of \(L^2(D)\) such that [23, Theorem 4.6.5]:
where the infinite sum is absolutely and uniformly convergent [39]. In addition, we define the trace of the covariance kernel K by \(\smash {{{\,\mathrm{Tr}\,}}(K):=\sum _{j=1}^\infty \lambda _j}<\infty \). The eigendecomposition of K gives an algorithm for generating functions from \(\smash {\mathcal {GP}(0,K)}\). In particular, if \(\omega \sim \sum _{j=1}^{\infty } \sqrt{\lambda _j} c_j\psi _j\), where the coefficients \(\{c_j\}_{j=1}^\infty \) are independent and identically distributed standard Gaussian random variables, then \(\omega \sim \mathcal {GP}(0,K)\) [26, 33]. We also have
where the last equality is analogous to the fact that the trace of a matrix is equal to the sum of its eigenvalues. In this paper, we restrict our attention to GPs with positive definite covariance kernels so that the eigenvalues of K are strictly positive.
In Fig. 1, we display the squared-exponential kernel defined as \(K_{\text {SE}}(x,y)=\exp (-|x-y|^2/(2\ell ^2))\) for \(x,y\in [-1,1]\) [50, Chapt. 4] with parameters \(\ell = 1,0.1,0.01\) together with sampled functions from \(\mathcal {GP}(0,K_{\text {SE}})\). We observe that the functions become more oscillatory as the length-scale parameter \(\ell \) decreases and hence the numerical rank of the kernel increases or, equivalently, the associated eigenvalues \(\{\lambda _j\}\) decay more slowly to zero.
3 Low-Rank Approximation of Hilbert–Schmidt Operators
In a landmark paper, Halko, Martinsson, and Tropp proved that one could learn the column space of a finite matrix—to high accuracy and with a high probability of success—by using matrix-vector products with standard Gaussian random vectors [22]. We now set out to generalize this from matrices to HS operators. Alternative randomized low-rank approximation techniques such as the generalized Nyström method [42] might also be generalized in a similar manner. Since the proof is relatively long, we state our final generalization now.
Theorem 1
Let \(D_1,D_2\subseteq \mathbb {R}^d\) be domains with \(d\ge 1\) and \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) be a HS operator. Select a target rank \(k\ge 1\), an oversampling parameter \(p\ge 2\), and a \(D_1\times (k+p)\) quasimatrix \(\varvec{\Omega }\) such that each column is drawn from \(\mathcal {GP}(0,K)\), where \(K:D_1\times D_1\rightarrow \mathbb {R}\) is a continuous symmetric positive definite kernel with eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots >0\). If \(\mathbf {Y}=\mathscr {F}\varvec{\Omega }\), then
where \(\gamma _k = k/(\lambda _1{{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1}))\) with \(\mathbf {C}_{ij}=\int _{D_1\times D_1}v_i(x)K(x,y)v_j(y)d xd y\) for \(1\le i,j\le k\). Here, \(\mathbf {P}_\mathbf {Y}\) is the orthogonal projection onto the vector space spanned by the columns of \(\mathbf {Y}\), \(\sigma _j\) is the jth singular value of \(\mathscr {F}\), and \(v_j\) is the jth right singular vector of \(\mathscr {F}\).
Assume further that \(p\ge 4\), then for any \(s,t\ge 1\), we have
with probability \(\ge 1 - t^{-p}-[s e^{-(s^2-1)/2}]^{k+p}\).
We remark that the term \([s e^{-(s^2-1)/2}]^{k+p}\) in the statement of Theorem 1 is bounded by \(e^{-s^2}\) for \(s\ge 2\) and \(k+p\ge 5\). In the rest of the section, we prove this theorem.
3.1 Three Caveats that Make the Generalization Non-Trivial
One might imagine that the generalization of the randomized SVD algorithm from matrices to HS operators is trivial, but this is not the case due to three caveats:
-
1.
The randomized SVD on finite matrices always uses matrix-vector products with standard Gaussian random vectors [22]. However, for GPs, one must always have a continuous kernel K in \(\mathcal {GP}(0,K)\), which discretizes to a non-standard multivariate Gaussian distribution. Therefore, we must extend [22, Theorem 10.5] to allow for non-standard multivariate Gaussian distributions. The discrete version of our extension is the following:
Corollary 1
Let \(\mathbf {A}\) be a real \(n_2\times n_1\) matrix with singular values \(\sigma _1\ge \cdots \ge \sigma _{\min \{n_1,n_2\}}\). Choose a target rank \(k\ge 1\) and an oversampling parameter \(p\ge 2\). Draw an \(n_1\times (k+p)\) Gaussian matrix, \(\varvec{\Omega }\), with independent columns where each column is from a multivariate Gaussian distribution with mean \((0,\ldots ,0)^\top \) and positive definite covariance matrix \(\mathbf {K}\). If \(\mathbf {Y}=\mathbf {A}\varvec{\Omega }\), then the expected approximation error is bounded by
where \(\lambda _1 \ge \cdots \ge \lambda _{n_1} > 0\) are the eigenvalues of \(\mathbf {K}\) and \(\mathbf {P}_\mathbf {Y}\) is the orthogonal projection onto the vector space spanned by the columns of \(\mathbf {Y}\). Assume further that \(p\ge 4\), then for any \(s,t\ge 1\), we have
with probability \(\ge 1 - t^{-p}-[s e^{-(s^2-1)/2}]^{k+p}\).
Choosing a covariance matrix \(\mathbf {K}\) with sufficient eigenvalue decay so that \(\lim _{n_1\rightarrow \infty }\sum _{j=1}^{n_1} \lambda _j <\infty \) allows \(\mathbb {E}[\Vert \varvec{\Omega }\Vert _{F }^2]\) to remain bounded as \(n_1\rightarrow \infty \). This is of interest when applying the randomized SVD algorithm to extremely large matrices and is critical for HS operators. A stronger statement of this result [8, Theorem 2] shows that prior information on \(\mathbf {A}\) can be incorporated into the covariance matrix to achieve lower approximation error than the randomized SVD with standard Gaussian vectors.
-
2.
We need an additional essential assumption. The kernel in \(\mathcal {GP}(0,K)\) is “reasonable” for learning \(\mathscr {F}\), where reasonableness is measured by the quantity \(\gamma _k\) in Theorem 1. If the first k right singular functions of the HS operator \(v_1,\ldots ,v_k\) are spanned by the first \(k+m\) eigenfunctions of K \(\psi _1,\ldots ,\psi _{k+m}\), for some \(m\in \mathbb {N}\), then (see Eq. (11) and Lemma 2)
$$\begin{aligned} \frac{1}{k}\sum _{j=1}^k\frac{\lambda _1}{\lambda _j}\le \frac{1}{\gamma _k} \le \frac{1}{k}\sum _{j=m+1}^{k+m}\frac{\lambda _1}{\lambda _j}. \end{aligned}$$In the matrix setting, this assumption always holds with \(m=n_1-k\) (see Corollary 1) and one can have \(\gamma _k = 1\) when \(\lambda _1=\cdots =\lambda _{n_1}\) [22, Theorem 10.5].
-
3.
Probabilistic error bounds for the randomized SVD in [22] are derived using tail bounds for functions of standard Gaussian matrices [30, Sect. 5.1]. Unfortunately, we are not aware of tail bounds for non-standard Gaussian quasimatrices. This results in a slightly weaker probability bound than [22, Theorem 10.7].
3.2 Deterministic Error Bound
Apart from the three caveats, the proof of Theorem 1 follows the outline of the argument in [22, Theorem 10.5]. We define two quasimatrices \(\mathbf {U}\) and \(\mathbf {V}\) containing the left and right singular functions of \(\mathscr {F}\) so that the jth column of \(\mathbf {V}\) is \(v_j\). We also denote by \(\varvec{\Sigma }\) the infinite diagonal matrix with the singular values of \(\mathscr {F}\), i.e., \(\sigma _1\ge \sigma _2\ge \cdots \ge 0\), on the diagonal. Finally, for a fixed \(k\ge 1\), we define the \(D_1\times k\) quasimatrix as the truncation of \(\mathbf {V}\) after the first k columns and \(\mathbf {V}_2\) as the remainder. Similarly, we split \(\varvec{\Sigma }\) into two parts:
We are ready to prove an infinite-dimensional analogue of [22, Theorem 9.1] for HS operators.
Theorem 2
(Deterministic error bound) Let \(\mathscr {F}:L^2(D_1)\rightarrow L^2(D_2)\) be a HS operator with SVD given in Eq. (4). Let \(\varvec{\Omega }\) be a \(D_1\times \ell \) quasimatrix and \(\mathbf {Y}=\mathscr {F}\varvec{\Omega }\). If \(\varvec{\Omega }_1=\mathbf {V}_1^*\varvec{\Omega }\) and \(\varvec{\Omega }_2=\mathbf {V}_2^*\varvec{\Omega }\), then assuming \(\varvec{\Omega }_1\) has full rank, we have
where \(\mathbf {P}_\mathbf {Y}=\mathbf {Y}(\mathbf {Y}^*\mathbf {Y})^\dagger \mathbf {Y}^*\) is the orthogonal projection onto the space spanned by the columns of \(\mathbf {Y}\) and \(\smash {\varvec{\Omega }_1^{\dagger } = (\varvec{\Omega }_1^*\varvec{\Omega }_1)^{-1}\varvec{\Omega }_1^*}\).
Proof
First, note that because \(\mathbf {U}\mathbf {U}^*\) is the orthonormal projection onto the range of \(\mathscr {F}\) and \(\mathbf {U}\) is a basis for the range, we have
By Parseval’s theorem [51, Theorem 4.18], we have
Moreover, we have the equality \(\Vert \mathscr {F}-\mathbf {P}_{\mathbf {Y}}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}=\Vert (\mathbf {I}-\mathbf {P}_{\mathbf {U}^*\mathbf {Y}})\mathbf {U}^*\mathscr {F}\mathbf {V}\Vert _{{{\,\mathrm{HS}\,}}}\) because the inner product \(\langle \sum _{j=1}^\infty \alpha _j u_j,\sum _{j=1}^\infty \beta u_j\rangle =0\) if and only if \(\sum _{j=1}^\infty \alpha _j \beta _j=0\). We now take \(\mathbf {A}=\mathbf {U}^*\mathscr {F}\mathbf {V}\), which is a bounded infinite matrix such that \(\Vert \mathbf {A}\Vert _{F }=\Vert \mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}<\infty \). The statement of the theorem immediately follows from the proof of [22, Theorem 9.1]. \(\square \)
This theorem shows that the bound on the approximation error \(\Vert \mathscr {F}-\mathbf {P}_\mathbf {Y}\mathscr {F}\Vert _{{{\,\mathrm{HS}\,}}}\) depends on the singular values of the HS operator and the test matrix \(\varvec{\Omega }\).
3.3 Probability Distribution of \(\varvec{\Omega }_1\)
If the columns of \(\varvec{\Omega }\) are independent and identically distributed as \(\mathcal {GP}(0,K)\), then the matrix \(\varvec{\Omega }_1\) in Theorem 2 is of size \(k\times \ell \) with entries that follow a Gaussian distribution. To see this, note that
If \(\omega \sim \mathcal {GP}(0,K)\) with K given in Eq. (5), then we find that \(\langle v,\omega \rangle \sim \mathcal {N}(0,\sum _{j=1}^\infty \lambda _j \langle v,\psi _j\rangle ^2)\) so we conclude that \(\varvec{\Omega }_1\) has Gaussian entries with zero mean. Finding the covariances between the entries is more involved.
Lemma 1
With the same setup as Theorem 2, suppose that the columns of \(\varvec{\Omega }\) are independent and identically distributed as \(\mathcal {GP}(0,K)\). Then, the matrix \(\varvec{\Omega }_1=\mathbf {V}_1^*\varvec{\Omega }\) in Theorem 2 has independent columns and each column is identically distributed as a multivariate Gaussian with positive definite covariance matrix \(\mathbf {C}\) given by
where \(v_i\) is the ith column of \(\mathbf {V}_1\).
Proof
We already know that the entries are Gaussian with mean 0. Moreover, the columns are independent because \(\omega _1,\ldots ,\omega _\ell \) are independent. Therefore, we focus on the covariance matrix. Let \(1\le i,i'\le k\), \(1\le j,j'\le \ell \), then since \(\mathbb {E}\!\left[ \langle v_i,\omega _j\rangle \right] = 0\) we have
where \(X_{ij} = \langle v_i,\omega _j\rangle \). Since \(\langle v_i,\omega _j\rangle \sim \sum _{n=1}^\infty \sqrt{\lambda _n}c_n^{(j)} \langle v_i,\psi _n\rangle \), where \(c_n^{(j)}\sim \mathcal {N}(0,1)\), we have
We first show that \(\lim _{m_1,m_2\rightarrow \infty }\left| \mathbb {E}\!\left[ \!X_{ij}^{m_1}X_{i'j'}^{m_2}\right] -\mathbb {E}\!\left[ X_{ij}X_{i'j'}\right] \right| =0\). For any \(m_1,m_2\ge 1\), we have by the triangle inequality,
where the last inequality follows from the Cauchy–Schwarz inequality. We now set out to show that both terms in the last inequality converge to zero as \(m_1,m_2\rightarrow \infty \). The terms \(\smash {\mathbb {E}[|X_{i'j'}^{m_2}|^2]}\) and \(\smash {\mathbb {E}[|X_{ij}|^2]}\) are bounded by \(\sum _{n=1}^\infty \lambda _n<\infty \), using the Cauchy–Schwarz inequality. Moreover, we have
because \(X_{ij} - X_{ij}^{m_1}\sim \mathcal {N}(0,\sum _{n=m_1+1}^\infty \lambda _n\langle v_i,\psi _n\rangle ^2)\). Therefore, we find that \({{\,\mathrm{cov}\,}}(X_{ij}, X_{i'j'}) = \lim _{m_1,m_2\rightarrow \infty }\mathbb {E}[X_{ij}^{m_1}X_{i'j'}^{m_2}]\) and we obtain
The latter expression is zero if \(n\ne n'\) or \(j\ne j'\) because then \(c_n^{(j)}\) and \(c_{n'}^{(j')}\) are independent random variables with mean 0. Since \(\mathbb {E}[(c_n^{(j)})^2] = 1\), we have
The result follows as the infinite sum is equal to the integral in Eq. (9). To see that \(\mathbf {C}\) is positive definite, let \(a\in \mathbb {R}^k\), then \(a^* \mathbf {C} a=\mathbb {E}[Z_a^2]\ge 0\), where \(Z_a\sim \mathcal {N}(0,\sum _{n=1}^\infty \lambda _n \langle a_1 v_1+\cdots +a_k v_k,\psi _n\rangle ^2)\). Moreover, \(a^* \mathbf {C} a = 0\) implies that \(a=0\) because \(v_1,\ldots ,v_k\) are orthonormal and \(\{\psi _n\}\) is an orthonormal basis of \(L^2(D_1)\). \(\square \)
Lemma 1 gives the distribution of the matrix \(\varvec{\Omega }_1\), which is essential to prove Theorem 1 in Sect. 3.6. In particular, \(\varvec{\Omega }_1\) has independent columns that are each distributed as a multivariate Gaussian with covariance matrix given in Eq. (9).
3.4 Quality of the Covariance Kernel
To investigate the quality of the kernel, we introduce the Wishart distribution, which is a family of probability distributions over symmetric and nonnegative-definite matrices that often appear in the context of covariance matrices [61]. If \(\varvec{\Omega }_1\) is a \(k\times \ell \) random matrix with independent columns, where each column is a multivariate Gaussian distribution with mean \((0,\ldots ,0)^\top \) and covariance \(\mathbf {C}\), then \(\mathbf {A}=\varvec{\Omega }_1 \varvec{\Omega }_1^*\) has a Wishart distribution [61]. We write \(\mathbf {A}\sim W_k(\ell ,\mathbf {C})\). We note that \(\Vert \varvec{\Omega }_1^{\dagger }\Vert _{F }^2={{\,\mathrm{Tr}\,}}[(\varvec{\Omega }_1^{\dagger })^*\varvec{\Omega }_1^{\dagger }]={{\,\mathrm{Tr}\,}}(\mathbf {A}^{-1})\), where the second equality holds with probability one because the matrix \(\mathbf {A}=\varvec{\Omega }_1\varvec{\Omega }_1^*\) is invertible with probability one (see [41, Theorem 3.1.4]). By [41, Theorem 3.2.12] for \(\ell -k\ge 2\), we have \(\mathbb {E}[\mathbf {A}^{-1}]=\frac{1}{\ell -k-1}\mathbf {C}^{-1}\), \(\mathbb {E}[{{\,\mathrm{Tr}\,}}(\mathbf {A}^{-1})]={{\,\mathrm{Tr}\,}}(\mathbf {C}^{-1})/(\ell -k-1)\), and conclude that
The quantity \(\gamma _k\) can be viewed as measuring the quality of the covariance kernel K for learning the HS operator \(\mathscr {F}\) (see Theorem 1). First, \(1\le \gamma _k<\infty \) as \(\mathbf {C}\) is symmetric positive definite. Moreover, for \(1\le j\le k\), the jth largest eigenvalue of \(\mathbf {C}\) is bounded by the jth largest eigenvalue of K as \(\mathbf {C}\) is a principal submatrix of \(\mathbf {V}^*K\mathbf {V}\) [27, Sect. III.5]. Therefore, the following inequality holds,
and the harmonic mean of the first k scaled eigenvalues of K is a lower bound for \(1/\gamma _k\). In the ideal situation, the eigenfunctions of K are the right singular functions of \(\mathscr {F}\), i.e., \(\psi _n=v_n\), \(\mathbf {C}\) is a diagonal matrix with entries \(\lambda _1,\ldots ,\lambda _k\), and \(\gamma _k=k/(\sum _{j=1}^k \lambda _1/\lambda _j)\) is as small as possible.
We now provide a useful upper bound on \(\gamma _k\) in a more general setting.
Lemma 2
Let \(\mathbf {V}_1\) be a \(D_1\times k\) quasimatrix with orthonormal columns and assume that there exists \(m\in \mathbb {N}\) such that the columns of \(\mathbf {V}_1\) are spanned by the first \(k+m\) eigenvectors of the continuous positive definite kernel \(K:D_1\times D_1\rightarrow \mathbb {R}\). Then
where \(\lambda _1\ge \lambda _2\ge \cdots > 0\) are the eigenvalues of K. This bound is tight in the sense that the inequality can be attained as an equality.
Proof
Let \(\mathbf {Q} = \left[ v_1\,|\,\cdots \,|\,v_k\,|\,q_{k+1}\,|\cdots \,|\,q_{k+m}\right] \) be a quasimatrix with orthonormal columns whose columns form an orthonormal basis for \(\mathrm{Span}(\psi _1,\ldots ,\psi _{k+m})\). Then, \(\mathbf {Q}\) is an invariant space of K and \(\mathbf {C}\) is a principal submatrix of \(\mathbf {Q}^*K\mathbf {Q}\), which has eigenvalues \(\lambda _1\ge \cdots \ge \lambda _{k+m}\). By [27, Theorem 6.46] the k eigenvalues of \(\mathbf {C}\), denoted by \(\mu _1,\ldots ,\mu _{k}\), are greater than the first \(k+m\) eigenvalues of K: \(\mu _{j}\ge \lambda _{m+j}\) for \(1\le j\le k\), and the result follows as the trace of a matrix is the sum of its eigenvalues. \(\square \)
3.5 Probabilistic Error Bounds
As discussed in Sect. 3.1, we need to extend the probability bounds of the randomized SVD to allow for non-standard Gaussian random vectors. The following lemma is a generalization of [22, Theorem A.7].
Lemma 3
Let \(k,\ell \ge 1\) such that \(\ell -k\ge 4\) and \(\varvec{\Omega }_1\) be a \(k\times \ell \) random matrix with independent columns such that each column has mean \((0,\ldots ,0)^\top \) and positive definite covariance \(\mathbf {C}\). For all \(t\ge 1\), we have
Proof
Since \(\varvec{\Omega }_1\varvec{\Omega }_1^*\sim W_{k}(\ell ,\mathbf {C})\), the reciprocals of its diagonal elements follow a scaled chi-square distribution [41, Theorem 3.2.12], i.e.,
Let \(Z=\Vert \varvec{\Omega }_1^\dagger \Vert _{F }^2={{\,\mathrm{Tr}\,}}[(\varvec{\Omega }_1\varvec{\Omega }_1^*)^{-1}]\) and \(q=(\ell -k)/2\). Following the proof of [22, Theorem A.7], we have the inequality
Moreover, by the Minkowski inequality, we have
where the last inequality is from [22, Lemma A.9]. The result follows from the argument in the proof of [22, Theorem A.7]. \(\square \)
Under the assumption of Lemma 2, we find that Lemma 3 gives the following bound:
In particular, in the finite-dimensional case when \(\lambda _1=\cdots =\lambda _n=1\), we recover the probabilistic bound found in [22, Theorem A.7].
To obtain the probability statement found in Eq. (13), we require control of the tail of the distribution of a Gaussian quasimatrix with non-standard covariance kernel (see Sect. 3.6). In the theory of the randomized SVD, one relies on the concentration of measure results [22, Prop. 10.3]. However, we need to employ a different strategy and instead directly bound the HS norm of \(\varvec{\Omega }_2\). One difficulty is that the norm of this matrix must be controlled for large dimensions n, which leads to a weaker probability bound than [22]. While it is possible to apply Markov’s inequality to obtain deviation bounds, we highlight that Lemma 4 provides a Chernoff-type bound, i.e., exponential decay of the tail distribution of \(\Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}\), which is crucial to approximate Green’s functions (see Sect. 4.4.3).
Lemma 4
With the same notation as in Theorem 2, let \(\ell \ge k \ge 1\). For all \(s\ge 1\), we have
Proof
We first remark that
where the \(Z_j\) are independent and identically distributed (i.i.d) because \(\omega _j\sim \mathcal {GP}(0,K)\) are i.i.d. For \(1\le j\le \ell \), we have (c.f. Sect. 2.3),
where \(c_m^{(j)}\sim \mathcal {N}(0,1)\) are i.i.d for \(m\ge 1\) and \(1\le j\le \ell \). First, since the series in Eq. (12) converges absolutely, we have
where the \(X_m\) are independent random variables and \(X_m \sim \lambda _m\chi ^2\) for \(1\le m\le N\). Here, \(\chi ^2\) denotes the chi-squared distribution [40, Chapt. 4.3].
Let \(N\ge 1\) and \(0<\theta <1/(2{{\,\mathrm{Tr}\,}}(K))\), we can bound the moment generating function of \(\sum _{m=1}^N X_m\) as
because \(X_m/\lambda _m\) are independent random variables that follow a chi-squared distribution. Using the monotone convergence theorem, we have
Let \({\tilde{s}}\ge 0\) and \(0<\theta <1/(2{{\,\mathrm{Tr}\,}}(K))\), by the Chernoff bound [10, Theorem 1], we obtain
We can minimize this upper bound over \(0<\theta <1/(2{{\,\mathrm{Tr}\,}}(K))\) by choosing \(\theta = {\tilde{s}}/(2(1+{\tilde{s}}){{\,\mathrm{Tr}\,}}(K))\), which gives
Choosing \(s=\sqrt{1+{\tilde{s}}}\ge 1\) concludes the proof. \(\square \)
Lemma 4 can be refined further to take into account the interaction between the Hilbert–Schmidt operator \(\mathscr {F}\) and the covariance kernel K (see [8, Lemma 7]).
3.6 Randomized SVD Algorithm for HS Operators
We first prove an intermediary result, which generalizes [22, Prop. 10.1] to HS operators. Note that one may obtain sharper bounds using a suitably chosen covariance kernels that yields a lower approximation error [8].
Lemma 5
Let \(\varvec{\Sigma }_2\), \(\mathbf {V}_2\), and \(\varvec{\Omega }\) be defined as in Theorem 2, and \(\mathbf {T}\) be an \(\ell \times k\) matrix, where \(\ell \ge k\ge 1\). Then,
where \(\lambda _1\) is the first eigenvalue of K.
Proof
Let \(\mathbf {T} = \mathbf {U}_\mathbf {T} \mathbf {D}_\mathbf {T} \mathbf {V}_\mathbf {T}^*\) be the SVD of \(\mathbf {T}\). If \(\{v_{\mathbf {T},i}\}_{i=1}^k\) are the columns of \(\mathbf {V}_\mathbf {T}\), then
where \(\varvec{\Omega }_2=\mathbf {V}_2^*\varvec{\Omega }\). Therefore, we have
Moreover, using the monotone convergence theorem for non-negative random variables, we have
where \(\sigma _{k+1},\sigma _{k+2},\ldots \) are the diagonal elements of \(\varvec{\Sigma }_2\). Then, the quasimatrix \(\varvec{\Omega }_2\) has independent columns and, using Lemma 1, we have
where \(v_{k+n}\) is the nth column of \(\mathbf {V}_2\). Then, \(\mathbb {E}\left[ |\varvec{\Omega }_2(n,j)|^2\right] \le \lambda _1\), as \(\mathbb {E}\left[ |\varvec{\Omega }_2(n,j)|^2\right] \) is written as a Rayleigh quotient. Finally, we have
by orthonormality of the columns on \(\mathbf {U}_{\mathbf {T}}\). \(\square \)
We are now ready to prove Theorem 1, which shows that the randomized SVD can be generalized to HS operators.
Proof of Theorem 1
Let \(\varvec{\Omega }_1, \varvec{\Omega }_2\) be the quasimatrices defined in Theorem 2. The \(k\times (k+p)\) matrix \(\varvec{\Omega }_1\) has full rank with probability one and by Theorem 2, we have
where the last inequality follows from Cauchy–Schwarz inequality. Then, using Lemma 5 and Eq. (10), we have
where \(\gamma _k\) is defined in Sect. 3.4. The observation that \(\Vert \varvec{\Sigma }_2\Vert _{{{\,\mathrm{HS}\,}}}^2=\sum _{j=k+1}^\infty \sigma _j^2\) concludes the proof of Eq. (6).
For the probabilistic bound in Eq. (7), we note that by Theorem 2 we have,
where the second inequality uses the submultiplicativity of the HS norm. The bound follows from bounding \(\Vert \varvec{\Omega }_1^\dagger \Vert _{F }^2\) and \(\Vert \varvec{\Omega }_2\Vert _{{{\,\mathrm{HS}\,}}}^2\) using Lemma 3 and 4, respectively. \(\square \)
4 Recovering the Green’s Function from Input–Output Pairs
It is known that the Green’s function associated with Eq. (2) always exists, is unique, is a nonnegative function \(G : D \times D \rightarrow \mathbb {R}^+ \cup \{\infty \}\) such that
and for each \(y\in \Omega \) and any \(r>0\), we have \(G(\cdot , y) \in \mathcal {H}^1(D \setminus B_r(y))\cap \mathcal {W}_0^{1,1}(D)\) [19].Footnote 3 Since the PDE in Eq. (2) is self-adjoint, we also know that for almost every \(x,y\in D\), we have \(G(x,y) = G(y,x)\) [19].
We now state Theorem 3, which shows that if \(N = \mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) and one has N input–output pairs \(\{(f_j,u_j)\}_{j=1}^N\) with algorithmically selected \(f_j\), then the Green’s function associated with \(\mathcal {L}\) in Eq. (2) can be recovered to within an accuracy of \(\mathcal {O}(\varGamma _\epsilon ^{-1/2}\log ^3(1/\epsilon )\epsilon )\) with high probability. Here, the quantity \(0<\varGamma _\epsilon \le 1\) measures the quality of the random input functions \(\{f_j\}_{j=1}^N\) (see Sect. 4.4.2).
Theorem 3
Let \(0<\epsilon <1\), \(D\subset \mathbb {R}^3\) be a bounded Lipschitz domain, and \(\mathcal {L}\) given in Eq. (2). If G is the Green’s function associated with \(\mathcal {L}\), then there is a randomized algorithm that constructs an approximation \({\tilde{G}}\) of G using \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output pairs such that, as \(\epsilon \rightarrow 0\), we have
with probability \(\ge 1 - \mathcal {O}(\epsilon ^{\log (1/\epsilon )-6})\). The term \(\varGamma _\epsilon \) is defined by Eq. (25).
Our algorithm that leads to the proof of Theorem 3 relies on the extension of the randomized SVD to HS operator (see Sect. 3) and a hierarchical partition of the domain of G into “well-separated” domains.
4.1 Recovering the Green’s Function on Admissible Domains
Roughly speaking, as \(\Vert x-y\Vert _2\) increases G becomes smoother about (x, y), which can be made precise using so-called admissible domains [1, 2, 21]. Let \({{\,\mathrm{diam}\,}}X:=\sup _{x,y\in X}\Vert x-y\Vert _2\) be the diameter of X, \({{\,\mathrm{dist}\,}}(X,Y):=\inf _{x\in X,y\in Y}\Vert x-y\Vert _2\) be the shortest distance between X and Y, and \(\rho >0\) be a fixed constant. If \(X,Y\subset \mathbb {R}^3\) are bounded domains, then we say that \(X\times Y\) is an admissible domain if \({{\,\mathrm{dist}\,}}(X,Y)\ge \rho \max \{{{\,\mathrm{diam}\,}}X, {{\,\mathrm{diam}\,}}Y\}\); otherwise, we say that \(X\times Y\) is non-admissible. There is a weaker definition of admissible domains as \({{\,\mathrm{dist}\,}}(X,Y)\ge \rho \min \{{{\,\mathrm{diam}\,}}X, {{\,\mathrm{diam}\,}}Y\}\) [21, p. 59], but we do not consider it.
4.1.1 Approximation Theory on Admissible Domains
It turns out that the Green’s function associated with Eq. (2) has rapidly decaying singular values when restricted to admissible domains. Roughly speaking, if \(X,Y\subset D\) are such that \(X\times Y\) is an admissible domain, then G is well-approximated by a function of the form [3]
for some functions \(g_1,\ldots ,g_k\in L^2(X)\) and \(h_1,\ldots ,h_k\in L^2(Y)\). This is summarized in Theorem 4, which is a corollary of [3, Theorem 2.8].
Theorem 4
Let G be the Green’s function associated with Eq. (2) and \(\rho >0\). Let \(X,Y\subset D\) such that \({{\,\mathrm{dist}\,}}(X,Y)\ge \rho \max \{{{\,\mathrm{diam}\,}}X, {{\,\mathrm{diam}\,}}Y\}\). Then, for any \(0<\epsilon <1\), there exists \(k\le k_\epsilon :=\lceil c(\rho ,{{\,\mathrm{diam}\,}}D,\kappa _C)\rceil \lceil \log (1/\epsilon )\rceil ^{4}+\lceil \log (1/\epsilon )\rceil \) and an approximant, \(G_k\), of G in the form given in Eq. (14) such that
where \(\kappa _C = \lambda _{\max }/\lambda _{\min }\) is the spectral condition number of the coefficient matrix A(x) in Eq. (2)Footnote 4 and c is a constant that only depends on \(\rho \), \({{\,\mathrm{diam}\,}}D\), \(\kappa _C\).
Proof
In [3, Theorem 2.8], it is shown that if \(Y={\tilde{Y}}\cap D\) and \({\tilde{Y}}\) is convex, then there exists \(k\le c_{\rho /2}^3\lceil \log (1/\epsilon )\rceil ^{4}+\lceil \log (1/\epsilon )\rceil \) and an approximant, \(G_k\), of G such that
where \({\hat{Y}}:=\{y\in D,\, {{\,\mathrm{dist}\,}}(y,Y)\le \frac{\rho }{2}{{\,\mathrm{diam}\,}}Y\}\) and \(c_{\rho /2}\) is a constant that only depends on \(\rho \), \({{\,\mathrm{diam}\,}}Y\), and \(\kappa _C\). As remarked by [3], \({\tilde{Y}}\) can be included in a convex of diameter \({{\,\mathrm{diam}\,}}D\) that includes D to obtain the constant \(c(\rho ,{{\,\mathrm{diam}\,}}D,\kappa _C)\). The statement follows by integrating the error bound in Eq. (15) over X. \(\square \)
Since the truncated SVD of G on \(X\times Y\) gives the best rank \(k_\epsilon \ge k\) approximation to G, Theorem 4 also gives bounds on singular values:
where \(\sigma _{j,X\times Y}\) is the jth singular value of G restricted to \(X\times Y\). Since \(k_\epsilon = \mathcal {O}(\log ^4(1/\epsilon ))\), we conclude that the singular values of G restricted to admissible domains \(X\times Y\) rapidly decay to zero.
4.1.2 Randomized SVD for Admissible Domains
Since G has rapidly decaying singular values on admissible domains \(X\times Y\), we use the randomized SVD for HS operators to learn G on \(X\times Y\) with high probability (see Sect. 3).
We start by defining a GP on the domain Y. Let \(\mathcal {R}_{Y\times Y}K\) be the restrictionFootnote 5 of the covariance kernel K to the domain \(Y\times Y\), which is a continuous symmetric positive definite kernel so that \(\mathcal {GP}(0,\mathcal {R}_{Y\times Y}K)\) defines a GP on Y. We choose a target rank \(k\ge 1\), an oversampling parameter \(p\ge 2\), and form a quasimatrix \(\varvec{\Omega } = \begin{bmatrix}f_1\,|\,\cdots \,|\, f_{k+p}\end{bmatrix}\) such that \(f_j\in L^2(Y)\) and \(f_j\sim \mathcal {GP}(0,\mathcal {R}_{Y\times Y} K)\) are identically distributed and independent. We then extend by zero each column of \(\varvec{\Omega }\) from \(L^2(Y)\) to \(L^2(D)\) by \(\mathcal {R}_Y^*\varvec{\Omega }=\begin{bmatrix}\mathcal {R}_Y^* f_1\,|\,\cdots \,|\, \mathcal {R}_Y^* f_{k+p}\end{bmatrix}\), where \(\mathcal {R}_Y^* f_j\sim \mathcal {GP}(0,\mathcal {R}_{Y\times Y}^*\mathcal {R}_{Y\times Y}K)\). The zero extension operator \(\mathcal {R}_Y^*:L^2(Y)\rightarrow L^2(D)\) is the adjoint of \(\mathcal {R}_Y:L^2(D)\rightarrow L^2(Y)\).
Given the training data, \(\mathbf {Y} = \begin{bmatrix}u_1\,|\,\cdots \,|\, u_{k+p} \end{bmatrix}\) such that \(\mathcal {L}u_j = \mathcal {R}_Y^* f_j\) and \(u_j|_{\partial D} = 0\), we now construct an approximation to G on \(X\times Y\) using the randomized SVD (see Sect. 3). Following Theorem 1, we have the following approximation error for \(t\ge 1\) and \(s\ge 2\):
with probability greater than \(1-t^{-p}-e^{-s^2(k+p)}\). Here, \(\lambda _1\ge \lambda _2\ge \cdots >0\) are the eigenvalues of K, \({\tilde{G}}_{X\times Y} = \mathbf {P}_{\mathcal {R}_{X}\mathbf {Y}}\mathcal {R}_{X}\mathscr {F}\mathcal {R}_{Y}^*\) and \(\mathbf {P}_{\mathcal {R}_{X}\mathbf {Y}} = \mathcal {R}_{X}\mathbf {Y}((\mathcal {R}_{X}\mathbf {Y})^*\mathcal {R}_{X}\mathbf {Y})^{\dagger }(\mathcal {R}_{X}\mathbf {Y})^*\) is the orthogonal projection onto the space spanned by the columns of \(\mathcal {R}_{X}\mathbf {Y}\). Moreover, \(\gamma _{k,X\times Y}\) is a measure of the quality of the covariance kernel of \(\mathcal {GP}(0,\mathcal {R}_{Y\times Y}^*\mathcal {R}_{Y\times Y}K)\) (see Sect. 3.4) and, for \(1\le i,j\le k\), defined as \( \gamma _{k,X\times Y} = k/(\lambda _1{{\,\mathrm{Tr}\,}}(\mathbf {C}_{X\times Y}^{-1}))\), where
and \(v_{1,X\times Y},\ldots , v_{k,X\times Y}\in L^2(Y)\) are the first k right singular functions of G restricted to \(X\times Y\).
Unfortunately, there is a big problem with the formula \({\tilde{G}}_{X\times Y} = \mathbf {P}_{\mathcal {R}_X\mathbf {Y}}\mathcal {R}_{X}\mathscr {F}\mathcal {R}_{Y}^*\). It cannot be formed because we only have access to input–output data, so we have no mechanism for composing \(\mathbf {P}_{\mathcal {R}_X\mathbf {Y}}\) on the left of \(\mathcal {R}_{X}\mathscr {F}\mathcal {R}_{Y}^*\). Instead, we note that since the partial differential operator in Eq. (2) is self-adjoint, \(\mathscr {F}\) is self-adjoint, and G is itself symmetric. That means we can use this to write down a formula for \({\tilde{G}}_{Y\times X}\) instead. That is,
where we used the fact that \(\mathbf {P}_{\mathcal {R}_X\mathbf {Y}}\) is also self-adjoint. This means we can construct \({\tilde{G}}_{Y\times X}\) by asking for more input–output data to assess the quasimatrix \(\mathscr {F}(\mathcal {R}_X^*\mathcal {R}_X\mathbf {Y})\). Of course, to compute \({\tilde{G}}_{X\times Y}\), we can swap the roles of X and Y in the above argument.
With a target rank of \(k=k_\epsilon = \lceil c(\rho ,{{\,\mathrm{diam}\,}}D,\kappa _C)\rceil \lceil \log (1/\epsilon )\rceil ^{4}+\lceil \log (1/\epsilon )\rceil \) and an oversampling parameter of \(p = k_\epsilon \), we can combine Theorem 4 and Eq. (16) and (17) to obtain the bound
with probability greater than \(1-t^{-k_{\epsilon }}-e^{-2s^2 k_\epsilon }\). A similar approximation error holds for \({\tilde{G}}_{Y\times X}\) without additional evaluations of \(\mathscr {F}\). We conclude that our algorithm requires \(N_{\epsilon , X\times Y} \!= 2(k_\epsilon +p) = \mathcal {O}\!\left( \log ^4(1/\epsilon )\right) \) input–output pairs to learn an approximant to G on \(X\times Y\) and \(Y\times X\).
4.2 Ignoring the Green’s Function on Non-Admissible Domains
When the Green’s function is restricted to non-admissible domains, its singular values may not decay. Instead, to learn G we take advantage of the off-diagonal decay property of G. It is known that for almost every \(x\ne y\in D\) then
where \(c_{\kappa _C}\) is an implicit constant that only depends on \(\kappa _C\) (see [19, Theorem 1.1]).Footnote 6
If \(X\times Y\) is a non-admissible domain, then for any \((x,y)\in X\times Y\), we find that
because \({{\,\mathrm{dist}\,}}(X,Y)<\rho \max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}\). This means that \(x\in B_r(y)\cap D\), where \(r=(2+\rho )\max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}\). Using Eq. (18), we have
Noting that \({{\,\mathrm{diam}\,}}(Y)\le r/(2+\rho )\) and \(\int _Y 1 d y\le 4\pi (\mathrm{diam}(Y)/2)^3/3\), we have the following inequality for non-admissible domains \(X\times Y\):
where \(r=(2+\rho )\max \{{{\,\mathrm{diam}\,}}X,{{\,\mathrm{diam}\,}}Y\}\). We conclude that the Green’s function restricted to a non-admissible domain has a relatively small norm when the domain itself is small. Therefore, in our approximant \({\tilde{G}}\) for G, we ignore G on non-admissible domains by setting \({\tilde{G}}\) to be zero.
4.3 Hierarchical Admissible Partition of Domain
We now describe a hierarchical partitioning of \(D\times D\) so that many subdomains are admissible domains, and the non-admissible domains are all small. For ease of notion, we may assume—without loss of generality—that \({{\,\mathrm{diam}\,}}D = 1\) and \(D\subset [0,1]^3\); otherwise, one should shift and scale D. Moreover, partitioning \([0,1]^3\) and restricting the partition to D is easier than partitioning D directly (Fig. 2). For the definition of admissible domains, we find it convenient to select \(\rho = 1/\sqrt{3}\).
Let \(I = [0,1]^3\). The hierarchical partitioning for n levels is defined recursively as:
-
\(I_{1\times 1\times 1}:=I_1\times I_1\times I_1=[0,1]^3\) is the root for level \(L = 0\).
-
At a given level \(0\le L\le n-1\), if \(I_{j_1\times j_2\times j_3}:=I_{j_1}\times I_{j_2}\times I_{j_3}\) is a node of the tree, then it has 8 children defined as
$$\begin{aligned} \{I_{2 j_1+n_j(1)}\times I_{2 j_2+n_j(2)} \times I_{2j_3+n_j(3)}\mid n_j\in \{0,1\}^3\}. \end{aligned}$$Here, if \(I_j=[a,b]\), \(0\le a<b\le 1\), then \(I_{2j} = \left[ a,\frac{a+b}{2}\right] \) and \(I_{2j+1} = \left[ \frac{a+b}{2},b\right] \).
The set of non-admissible domains can be given by this unwieldy expression
where \(\wedge \) is the logical “and” operator. The set of admissible domains is given by
where \(P_{\text {non-adm}}(L)\) is the set of non-admissible domain for a hierarchical level of L and
Using Eqs. (20)–(21), the number of admissible and non-admissible domains are precisely \(|P_{\text {non-adm}}| = (3\times 2^n-2)^3\) and \(|P_{\text {adm}}| = \sum _{\ell =1}^n 2^{6}(3\times 2^{L-1}-2)^3-(3\times 2^{L}-2)^3\). In particular, the size of the partition at the hierarchical level \(0\le L\le n\) is equal to \(8^L\) and the tree has a total of \((8^{n+1}-1)/7\) nodes (see Fig. 3).
Finally, the hierarchical partition of \(D\times D\) can be defined via the partition \(P=P_{\text {adm}}\cup P_{\text {non-adm}}\) of \([0,1]^3\) by doing the following:
The sets of admissible and non-admissible domains of \(D\times D\) are denoted by \(P_{\text {adm}}\) and \(P_{\text {non-adm}}\) in the next sections.
4.4 Recovering the Green’s Function on the Entire Domain
We now show that we can recover G on the entire domain \(D\times D\).
4.4.1 Global Approximation on the Non-Admissible Set
Let \(n_\epsilon \) be the number of levels in the hierarchical partition \(D\times D\) (see Sect. 4.3). We want to make sure that the norm of the Green’s function on all non-admissible domains is small so that we can safely ignore that part of G (see Sect. 4.2). As one increases the hierarchical partitioning levels, the volume of the non-admissible domains get smaller (see Fig. 4).
Let \(X\times Y\in P_{\text {non-adm}}\) be a non-admissible domain, the two domains X and Y have diameter bounded by \(\sqrt{3}/2^{n_\epsilon }\) because they are included in cubes of side length \(1/2^{n_\epsilon }\) (see Sect. 4.3). Combining this with Eq. (19) yields
Therefore, the \(L^2\)-norm of G on the non-admissible domain \(P_{\text {non-adm}}\) satisfies
where we used \(|P_{\text {non-adm}}| = (3\times 2^{n_\epsilon }-2)^3\le 27(2^{3n_\epsilon })\). This means that if we select \(n_\epsilon \) to be
then we guarantee that \(\Vert G\Vert _{L^2(P_{\text {non-adm}})}\le \epsilon \Vert G\Vert _{L^2(D\times D)}\). We can safely ignore G on non-admissible domains—by taking the zero approximant—while approximating G to within \(\epsilon \).
4.4.2 Learning Rate of the Green’s Function
Following Sect. 4.1.2, we can construct an approximant \({\tilde{G}}_{X\times Y}\) to the Green’s function on an admissible domain \(X\times Y\) of the hierarchical partitioning using the HS randomized SVD algorithm, which requires \(N_{\epsilon ,X\times Y}=\smash {\mathcal {O}(\log ^4(1/\epsilon ))}\) input–output training pairs (see Sect. 4.1.2). Therefore, the number of training input–output pairs needed to construct an approximant to G on all admissible domains is given by
where \(|P_{\text {adm}}|\) denotes the total number of admissible domains at the hierarchical level \(n_\epsilon \), which is given by Eq. (22). Then, we have (see Sect. 4.3):
and, using Eq. (22), we obtain \(|P_{\text {adm}}| = \mathcal {O}(1/\epsilon ^6)\). This means that the total number of required input–output training pairs to learn G with high probability is bounded by
4.4.3 Global Approximation Error
We know that with \(N_\epsilon = \mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) input–output training pairs, we can construct an accurate approximant to G on each admissible and non-admissible domain. Since the number of admissible and non-admissible domains depends on \(\epsilon \), we now check that this implies a globally accurate approximant that we denote by \({\tilde{G}}\).
Since \({\tilde{G}}\) is zero on non-admissible domains and \(P_{\text {adm}}\cap P_{\text {non-adm}}\) has measure zero, we have
Following Sect. 4.4.2, if \(X\times Y\) is admissible then the approximation error satisfies
with probability greater than \(1-t^{-k_{\epsilon }}-e^{-2s^2 k_\epsilon }\). Here, \({\hat{Y}}=\{y\in D,\, {{\,\mathrm{dist}\,}}(y,Y)\le {{\,\mathrm{diam}\,}}Y/2\sqrt{3}\}\) (see Theorem 4 with \(\rho =1/\sqrt{3}\)). To measure the worst \(\gamma _{k_\epsilon , X\times Y}\), we define
From Eq. (11), we know that \(0<\varGamma _\epsilon \le 1\) and that \(1/\varGamma _\epsilon \) is greater than the harmonic mean of the first \(k_\epsilon \) scaled eigenvalues of the covariance kernel K, i.e.,
Now, one can see that \(X\times {\hat{Y}}\) is included in at most \(5^3=125\) neighbors including itself. Assuming that all the probability bounds hold on the admissible domains, this implies that
We then choose \(t=e\) and \(s=k_\epsilon ^{1/4}\) so that the approximation bound on each admissible domain holds with probability of failure less than \(2e^{-\sqrt{k_\epsilon }}\). Finally, using Eq. (24) we conclude that as \(\epsilon \rightarrow 0\), the approximation error on \(D\times D\) satisfies
with probability \(\ge (1-2e^{-\sqrt{k_\epsilon }})^{6^3 2^{3n_\epsilon }}=1-\mathcal {O}(\epsilon ^{\log (1/\epsilon )-6})\), where \(n_\epsilon \) is given by Eq. (22). We conclude that the approximant \({\tilde{G}}\) is a good approximation to G with very high probability.
5 Conclusions and Discussion
This paper rigorously learns the Green’s function associated with a PDE rather than the partial differential operator (PDO). By extending the randomized SVD to HS operators, we can identify a learning rate associated with elliptic PDOs in three dimensions and bound the number of input–output training pairs required to recover a Green’s function approximately. One practical outcome of this work is a measure for the quality of covariance kernels, which may be used to design efficient kernels for PDE learning tasks.
There are several possible future extensions of these results related to the recovery of hierarchical matrices, the study of other partial differential operators, and practical deep learning applications, which we discuss further in this section.
5.1 Fast and Stable Reconstruction of Hierarchical Matrices
We described an algorithm for reconstructing Green’s function on admissible domains of a hierarchical partition of \(D\times D\) that requires performing the HS randomized SVD \(\mathcal {O}(\epsilon ^{-6})\) times. We want to reduce it to a factor that is \(\mathcal {O}(\text {polylog}(1/\epsilon ))\).
For \(n\times n\) hierarchical matrices, there are several existing algorithms for recovering the matrix based on matrix-vector products [5, 32, 36, 37]. There are two main approaches: (1) The “bottom-up” approach: one begins at the lowest level of the hierarchy and moves up and (2) The “top-down” approach: one updates the approximant by peeling off the off-diagonal blocks and going down the hierarchy. The bottom-up approach requires \(\mathcal {O}(n)\) applications of the randomized SVD algorithm [36]. There are lower complexity alternatives that only require \(\mathcal {O}(\log (n))\) matrix-vector products with random vectors [32]. However, the algorithm in [32] is not yet proven to be theoretically stable as errors from low-rank approximations potentially accumulate exponentially, though this is not observed in practice. For symmetric positive semi-definite matrices, it may be possible to employ a sparse Cholesky factorization [54, 55]. This leads us to formulate the following challenge:
If one can design such an algorithm and it can be extended to HS operators, then the \(\mathcal {O}(\epsilon ^{-6}\log ^4(1/\epsilon ))\) term in Theorem 3 may improve to \(\mathcal {O}(\text {polylog}(1/\epsilon ))\). This means that the learning rate of partial differential operators of the form of Eq. (2) will be a polynomial in \(\log (1/\epsilon )\) and grow sublinearly with respect to \(1/\epsilon \).
5.2 Extension to Other Partial Differential Operators
Our learning rate for elliptic PDOs in three variables (see Sect. 4) depends on the decay of the singular values of the Green’s function on admissible domains [3]. We expect that one can also find the learning rate for other PDOs.
It is known that the Green’s functions associated to elliptic PDOs in two dimensions exist and satisfy the following pointwise estimate [12]:
where \(d_x={{\,\mathrm{dist}\,}}(x,\partial D)\), \(\gamma \) is a constant depending on the size of the domain D, and C is an implicit constant. One can conclude that \(G(x,\cdot )\) is locally integrable for all \(x\in D\) with \(\Vert G(x,\cdot )\Vert _{L^p(B_r(x)\cap D)}<\infty \) for \(r>0\) and \(1\le p<\infty \). We believe that the pointwise estimate in Eq. (27) implies the off-diagonal low-rank structure of G here, as suggested in [3]. Therefore, we expect that the results in this paper can be extended to elliptic PDOs in two variables.
PDOs in four or more variables are far more challenging since we rely on the following bound on the Green’s function on non-admissible domains [19]:
where \(D\subset \mathbb {R}^d\), \(d\ge 3\) is the dimension, and c is a constant depending only on d and \(\kappa _C\). This inequality implies that the \(L^p\)-norm of G on non-admissible domains is finite when \(0\le p < d/(d-2)\). However, for a dimension \(d\ge 4\), we have \(p<2\) and one cannot ensure that the \(L^2\) norm of G is finite. Therefore, the Green’s function may not be compatible with the HS randomized SVD.
It should also be possible to characterize the learning rate for elliptic PDOs with lower order terms (under reasonable conditions) [13, 24, 28] and many parabolic operators [29] as the associated Green’s functions have similar regularity and pointwise estimates. The main task is to extend [3, Theorem 2.8] to construct separable approximations of the Green’s functions on admissible domains. In contrast, we believe that deriving a theoretical learning rate for hyperbolic PDOs remains a significant research challenge for many reasons. The first roadblock is that the Green’s function associated with hyperbolic PDOs do not necessarily lie in \(L^2(D\times D)\). For example, the Green’s function associated with the wave equation in three variables, i.e., \(\L =\partial _t^2-\nabla ^2\), is not square-integrable as
where \(\delta (\cdot )\) is the Dirac delta function.
5.3 Connection with Neural Networks
There are many possible connections between this work and neural networks (NNs) from practical and theoretical viewpoints. The proof of Theorem 3 relies on the construction of a hierarchical partition of the domain \(D\times D\) and the HS randomized SVD algorithm applied on each admissible domain. This gives an algorithm for approximating Green’s functions with high probability. However, there are more practical approaches that currently do not have theoretical guarantees [17, 18].
A promising opportunity is to design a NN that can learn and approximate Green’s functions using input–output training pairs \(\{(f_j,u_j)\}_{j=1}^N\) [6]. Once a neural network \(\mathcal {N}\) has been trained such that \(\Vert \mathcal {N}-G\Vert _{L^2}\le \epsilon \Vert G\Vert _{L^2}\), the solution to \(\L u = f\) can be obtained by computing the following integral:
Therefore, this may give an efficient computational approach for discovering operators since a NN is only trained once. Incorporating a priori knowledge of the Green’s function into the network architecture design could be particularly beneficial. One could also wrap the selection of the kernel in the GP for generating random functions and training data into a Bayesian framework.
Finally, we wonder how many parameters in a NN are needed to approximate a Green’s function associated with elliptic PDOs within a tolerance of \(0<\epsilon <1\). Can one exploit the off-diagonal low-rank structure of Green’s functions to reduce the number of parameters? We expect the recent work on the characterization of ReLU NNs’ approximation power is useful [20, 44, 62]. The use of NNs with high approximation power such as rational NNs might also be of interest to approximate the singularities of the Green’s function near the diagonal [7].
Notes
Here, \(L^2(D)\) is the space of square-integrable functions defined on D, \(\mathcal {H}^k(D)\) is the space of k times weakly differentiable functions in the \(L^2\)-sense, and \(\mathcal {H}^1_0(D)\) is the closure of \(\mathcal {C}_c^\infty (D)\) in \(\mathcal {H}^1(D)\). Here, \(\mathcal {C}_c^\infty (D)\) is the space of infinitely differentiable compactly supported functions on D. Roughly speaking, \(\mathcal {H}^1_0(D)\) are the functions in \(\mathcal {H}^1(D)\) that are zero on the boundary of D.
For \(1\le r\le \infty \), we denote by \(L^r(D)\) the space of functions defined on the domain D with finite \(L^r\) norm, where \(\Vert f\Vert _{r}\) = \((\int _D|f|^rd x)^{1/r}\) if \(r<\infty \), and \(\Vert f\Vert _{\infty }=\inf \{C>0:|f(x)|\le C \text { for almost every }x\in D\}\).
Here, \(B_r(y) = \{z\in \mathbb {R}^3 : \Vert z - y\Vert _2 < r\}\), \(\mathcal {W}^{1,1}(D)\) is the space of weakly differentiable functions in the \(L^1\)-sense, and \(\mathcal {W}^{1,1}_0(D)\) is the closure of \(\mathcal {C}_c^\infty (D)\) in \(\mathcal {W}^{1,1}(D)\).
Here, \(\lambda _{\max }\) is defined as \(\sup _{x\in D}\lambda _{\max }(A(x))\) and \(\lambda _{\min } = \inf _{x\in D}\lambda _{\min }(A(x))>0\).
We denote the restriction operator by \(\mathcal {R}_{Y\times Y}:L^2(D\times D)\rightarrow L^2(Y\times Y)\).
Note that we have normalized [19, Eq. 1.8] to highlight the dependence on \(\Vert G\Vert _{L^2(D\times D)}\).
References
Ballani, J., Kressner, D.: Matrices with hierarchical low-rank structures. In: Exploiting Hidden Structure in Matrix Computations: Algorithms and Applications, pp. 161–209. Springer (2016)
Bebendorf, M.: Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value Problems. Lecture Notes in Computational Science and Engineering. Springer-Verlag (2008)
Bebendorf, M., Hackbusch, W.: Existence of \(\mathscr {H}\)-matrix approximants to the inverse FE-matrix of elliptic operators with \(L^\infty \)-coefficients. Numer. Math. 95(1), 1–28 (2003)
Bonito, A., Cohen, A., DeVore, R., Petrova, G., Welper, G.: Diffusion coefficients estimation for elliptic partial differential equations. SIAM J. Math. Anal. 49(2), 1570–1592 (2017)
Boukaram, W., Turkiyyah, G., Keyes, D.: Randomized GPU algorithms for the construction of hierarchical matrices from matrix-vector operations. SIAM J. Sci. Comput. 41(4), C339–C366 (2019)
Boullé, N., Earls, C.J., Townsend, A.: Data-driven discovery of physical laws with human-understandable deep learning. arXiv preprint arXiv:2105.00266 (2021)
Boullé, N., Nakatsukasa, Y., Townsend, A.: Rational neural networks. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 14243–14253 (2020)
Boullé, N., Townsend, A.: A generalization of the randomized singular value decomposition. arXiv preprint arXiv:2105.13052 (2021)
Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 113(15) (2016)
Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. pp. 493–507 (1952)
de Boor, C.: An alternative approach to (the teaching of) rank, basis, and dimension. Lin. Alg. Appl. 146, 221–229 (1991)
Dong, H., Kim, S.: Green’s matrices of second order elliptic systems with measurable coefficients in two dimensional domains. Trans. Am. Math. Soc. 361(6), 3303–3323 (2009)
Dong, H., Kim, S.: Green’s function for nondivergence elliptic operators in two dimensions. SIAM J. Math. Anal. 53(4), 4637–4656 (2021). https://doi.org/10.1137/20M1323618
Edmunds, D.E., Evans, W.D.: Spectral theory and differential operators. Oxford University Press (2018)
Edmunds, D.E., Kokilashvili, V.M., Meskhi, A.: Bounded and compact integral operators. Springer Science & Business Media (2013)
Evans, L.C.: Partial Differential Equations. American Mathematical Society, Providence, R.I. (2010)
Feliu-Faba, J., Fan, Y., Ying, L.: Meta-learning pseudo-differential operators with deep neural networks. J. Comput. Phys. 408, 109309 (2020)
Gin, C.R., Shea, D.E., Brunton, S.L., Kutz, J.N.: DeepGreen: Deep Learning of Green’s Functions for Nonlinear Boundary Value Problems. arXiv preprint arXiv:2101.07206 (2020). https://doi.org/10.1038/s41598-021-00773-xl
Grüter, M., Widman, K.O.: The Green function for uniformly elliptic equations. Manuscripta Math. 37(3), 303–342 (1982)
Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in \({W}^{s,p}\) norms. Anal. Appl. 18(05), 803–859 (2020)
Hackbusch, W.: Hierarchical Matrices: Algorithms and Analysis. Springer (2015)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Hsing, T., Eubank, R.: Theoretical foundations of functional data analysis, with an introduction to linear operators. John Wiley & Sons (2015)
Hwang, S., Kim, S.: Green’s function for second order elliptic equations in non-divergence form. Potential Anal. 52(1), 27–39 (2020)
Kang, K., Kim, S.: Global pointwise estimates for Green’s matrix of second order elliptic systems. J. Differ. Equ. 249(11), 2643–2662 (2010)
Karhunen, K.: Über lineare methoden in der wahrscheinlichkeitsrechnung. Ann. Acad. Science Fenn., Ser. A. I. 37, 3–79 (1946)
Kato, T.: Perturbation Theory for Linear Operators. Springer Science & Business Media (2013)
Kim, S., Sakellaris, G.: Green’s function for second order elliptic equations with singular lower order coefficients. Commun. Partial. Differ. Equ. 44(3), 228–270 (2019)
Kim, S., Xu, L.: Green’s function for second order parabolic equations with singular lower order coefficients. Commun. Pure Appl. Anal. 21(1), 1–21 (2022). https://doi.org/10.3934/cpaa.2021164l
Ledoux, M.: The concentration of measure phenomenon. Math. Surveys. Monog. 89. AMS, Providence, RI (2001)
Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., Anandkumar, A.: Fourier neural operator for parametric partial differential equations. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=c8P9NQVtmnO
Lin, L., Lu, J., Ying, L.: Fast construction of hierarchical matrix representation from matrix–vector multiplication. J. Comput. Phys. 230(10), 4071–4087 (2011)
Loève, M.: Fonctions aleatoire de second ordre. Rev. Sci. 84, 195–206 (1946)
Long, Z., Lu, Y., Ma, X., Dong, B.: PDE-NET: Learning PDEs from data. In: International Conference on Machine Learning, pp. 3208–3216. PMLR (2018)
Maddu, S., Cheeseman, B.L., Sbalzarini, I.F., Müller, C.L.: Stability selection enables robust learning of partial differential equations from limited noisy data. arXiv preprint arXiv:1907.07810 (2019)
Martinsson, P.G.: A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM J. Matrix Anal. Appl. 32(4), 1251–1274 (2011)
Martinsson, P.G.: Compressing rank-structured matrices via randomized sampling. SIAM J. Sci. Comput. 38(4), A1959–A1986 (2016)
Meng, X., Li, Z., Zhang, D., Karniadakis, G.E.: PPINN: Parareal physics-informed neural network for time-dependent PDEs. Comput. Methods Appl. Mech. Eng. 370, 113250 (2020)
Mercer, J.: Functions of positive and negative type, and their connection the theory of integral equations. Philos. T. R. Soc. A 209(441-458), 415–446 (1909)
Mood, A.M., Graybill, F.A., Boes, D.C.: Introduction to the Theory of Statistics, 3rd edn. McGraw-Hill (1974)
Muirhead, R.J.: Aspects of multivariate statistical theory. John Wiley & Sons (2009)
Nakatsukasa, Y.: Fast and stable randomized low-rank matrix approximation. arXiv preprint arXiv:2009.11392 (2020)
Pang, G., Yang, L., Karniadakis, G.E.: Neural-net-induced Gaussian process regression for function approximation and PDE solution. J. Comput. Phys. 384, 270–288 (2019)
Petersen, P., Voigtlaender, F.: Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108, 296–330 (2018)
Raissi, M.: Deep hidden physics models: Deep learning of nonlinear partial differential equations. J. Mach. Learn. Res. 19(1), 932–955 (2018)
Raissi, M., Karniadakis, G.E.: Hidden physics models: Machine learning of nonlinear partial differential equations. J. Comput. Phys. 357, 125–141 (2018)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Multistep neural networks for data-driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236 (2018)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019)
Raissi, M., Yazdani, A., Karniadakis, G.E.: Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 367(6481), 1026–1030 (2020)
Rasmussen, C.E., Williams, C.: Gaussian processes for machine learning. MIT Press (2006)
Rudin, W.: Real and complex analysis, 3rd edn. McGraw-Hill (1986)
Rudy, S.H., Brunton, S.L., Proctor, J.L., Kutz, J.N.: Data-driven discovery of partial differential equations. Sci. Adv. 3(4), e1602614 (2017)
Schaeffer, H.: Learning partial differential equations via data discovery and sparse optimization. Proc. Math. Phys. Eng. Sci. 473(2197), 20160446 (2017)
Schäfer, F., Owhadi, H.: Sparse recovery of elliptic solvers from matrix-vector products. arXiv preprint arXiv:2110.05351 (2021)
Schäfer, F., Sullivan, T.J., Owhadi, H.: Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. Multiscale Model. Sim. 19(2), 688–730 (2021)
Stewart, G.W.: Matrix Algorithms: Volume 1: Basic Decompositions. SIAM (1998)
Townsend, A., Trefethen, L.N.: Continuous analogues of matrix factorizations. P. Roy. Soc. A 471(2173), 20140585 (2015)
Trefethen, L.N., Bau III, D.: Numerical linear algebra. SIAM (1997)
Voss, H.U., Timmer, J., Kurths, J.: Nonlinear dynamical system identification from uncertain and indirect measurements. Int. J. Bifurc. Chaos Appl. Sci. Eng. 14(06), 1905–1933 (2004)
Wang, Z., Huan, X., Garikipati, K.: Variational system identification of the partial differential equations governing the physics of pattern-formation: inference under varying fidelity and noise. Comput. Methods Appl. Mech. Eng. 356, 44–74 (2019)
Wishart, J.: The generalised product moment distribution in samples from a normal multivariate population. Biometrika pp. 32–52 (1928)
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)
Yazdani, A., Lu, L., Raissi, M., Karniadakis, G.E.: Systems biology informed deep learning for inferring parameters and hidden dynamics. PLoS Comput. Biol. 16(11), e1007575 (2020)
Zhao, H., Storey, B.D., Braatz, R.D., Bazant, M.Z.: Learning the physics of pattern formation from images. Phys. Rev. Lett. 124(6), 060201 (2020)
Acknowledgements
We want to thank Max Jenquin and Tianyi Shi for discussions. We also thank Matthew Colbrook, Abinand Gopal, Daniel Kressner, and Yuji Nakatsukasa for their feedback and suggestions on the paper. We are indebted to Christopher Earls for telling us about the idea of using Green’s functions and Gaussian processes for PDE learning. We are grateful to Joel Tropp, whose suggestions led to sharper bounds for the randomized SVD, and the anonymous referees for their comments which improved the quality of the paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Arieh Iserles.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work is supported by the EPSRC Centre for Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) in collaboration with Simula Research Laboratory and by the National Science Foundation Grants DMS-1818757, DMS-1952757, and DMS-2045646.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Boullé, N., Townsend, A. Learning Elliptic Partial Differential Equations with Randomized Linear Algebra. Found Comput Math 23, 709–739 (2023). https://doi.org/10.1007/s10208-022-09556-w
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10208-022-09556-w
Keywords
- Data-driven discovery of PDEs
- Randomized SVD
- Green’s function
- Hilbert–Schmidt operators
- Low-rank approximation