Learning Elliptic Partial Differential Equations with Randomized Linear Algebra

Given input–output pairs of an elliptic partial differential equation (PDE) in three dimensions, we derive the first theoretically rigorous scheme for learning the associated Green's function $G$. By exploiting the hierarchical low-rank structure of $G$, we show that one can construct an approximant to $G$ that converges almost surely and achieves a relative error of $\mathcal{O}(\Gamma_\epsilon^{-1/2}\log^3(1/\epsilon)\,\epsilon)$ using at most $\mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$ input–output training pairs with high probability, for any $0<\epsilon<1$. The quantity $0<\Gamma_\epsilon\le 1$ characterizes the quality of the training dataset. Along the way, we extend the randomized singular value decomposition algorithm from learning matrices to Hilbert–Schmidt operators and characterize the quality of covariance kernels for PDE learning.


Introduction
Can one learn a differential operator from pairs of solutions and right-hand sides? If so, how many pairs are required? These two questions have received significant research attention [17,31,34,43]. From data, one hopes to eventually learn physical laws of nature or conservation laws that elude scientists in the biological sciences [63], computational fluid dynamics [49], and computational physics [45]. The literature contains many highly successful practical schemes based on deep learning techniques [38,48]. However, it remains a challenge to understand theoretically when and why deep learning is effective. This paper describes the first theoretically justified scheme for discovering scalar-valued elliptic partial differential equations (PDEs) in three variables from input–output data and provides a rigorous learning rate. While our novelties are mainly theoretical, we hope to motivate future practical choices in PDE learning.
We suppose that there is an unknown second-order uniformly elliptic linear PDE operator $\mathcal{L}: \mathcal{H}^2(D)\cap\mathcal{H}_0^1(D)\to L^2(D)$ on a bounded domain $D\subset\mathbb{R}^3$ with Lipschitz smooth boundary [16], which takes the form
$$\mathcal{L}u(x) = -\nabla\cdot\big(A(x)\nabla u\big) + c(x)\cdot\nabla u + d(x)u, \quad x\in D, \qquad u|_{\partial D} = 0. \qquad (1)$$
Here, for every $x\in D$, $A(x)\in\mathbb{R}^{3\times 3}$ is a symmetric positive definite matrix with bounded coefficient functions so that $A_{ij}\in L^\infty(D)$, $c\in L^r(D)$ with $r\ge 3$, $d\in L^s(D)$ for $s\ge 3/2$, and $d(x)\ge 0$ [28]. We emphasize that the regularity requirements on the variable coefficients are quite weak. The goal of PDE learning is to discover the operator $\mathcal{L}$ from $N\ge 1$ input–output pairs, i.e., $\{(f_j,u_j)\}_{j=1}^N$, where $\mathcal{L}u_j = f_j$ and $u_j|_{\partial D}=0$ for $1\le j\le N$. There are two main types of PDE learning tasks: (1) experimentally determined input–output pairs, where one must do the best one can with the predetermined information, and (2) algorithmically determined input–output pairs, where the data-driven learning algorithm can select $f_1,\dots,f_N$ for itself. In this paper, we focus on the PDE learning task where we have algorithmically determined input–output pairs. In particular, we suppose that the functions $f_1,\dots,f_N$ are generated at random and are drawn from a Gaussian process (GP) (see Section 2.3). To keep our theoretical statements manageable, we restrict our attention to PDEs of the form
$$\mathcal{L}u = -\nabla\cdot\big(A(x)\nabla u\big), \quad x\in D, \qquad u|_{\partial D} = 0. \qquad (2)$$
Lower-order terms in Eq. (1) should cause few theoretical problems [3], though they make our algorithm and bounds far more complicated. The approach that dominates the PDE learning literature is to directly learn $\mathcal{L}$ by either (1) learning parameters in the PDE [4,64], (2) using neural networks to approximate the action of the PDE on functions [45,46,47,48,49], or (3) deriving a model by composing a library of operators with sparsity considerations [10,35,52,53,59,60]. Instead of trying to learn the unbounded, closed operator $\mathcal{L}$ directly, we follow [7,17,18] and discover the Green's function associated with $\mathcal{L}$. That is, we attempt to learn the function $G: D\times D\to\mathbb{R}^+\cup\{\infty\}$ such that [16]
$$u_j(x) = \int_D G(x,y) f_j(y)\,dy, \quad x\in D. \qquad (3)$$
Seeking $G$, as opposed to $\mathcal{L}$, has several theoretical benefits:
1. The integral operator in Eq. (3) is compact [15], while $\mathcal{L}$ is only closed [14]. This allows $G$ to be rigorously learned from input–output pairs $\{(f_j,u_j)\}_{j=1}^N$, as its range can be approximated by finite-dimensional spaces (see Theorem 3).
2. It is known that $G$ has a hierarchical low-rank structure [3, Thm. 2.8]: for $0<\epsilon<1$, there exists a function $G_k(x,y) = \sum_{j=1}^k g_j(x)h_j(y)$ with $k = \mathcal{O}(\log^4(1/\epsilon))$ such that [3, Thm. 2.8]
$$\|G - G_k\|_{L^2(X\times Y)} \le \epsilon\,\|G\|_{L^2(X\times\hat{Y})},$$
where $X, Y\subseteq D$ are sufficiently separated domains and $Y\subseteq\hat{Y}\subseteq D$ denotes a domain slightly larger than $Y$ (see Theorem 4 for the definition). The further apart $X$ and $Y$, the faster the singular values of $G$ decay. Moreover, $G$ also has an off-diagonal decay property [19,25]:
$$G(x,y) \le \frac{c}{\|x-y\|_2}, \quad x\ne y,$$
where $c$ is a constant independent of $x$ and $y$. Exploiting these structures of $G$ leads to a rigorous algorithm for constructing a global approximant to $G$ (see Section 4).
3. The function $G$ is smooth away from its diagonal, allowing one to approximate it efficiently [19].
Once a global approximant $\tilde{G}$ has been constructed for $G$ using input–output pairs, given a new right-hand side $f$ one can directly compute the integral in Eq. (3) to obtain the corresponding solution $u$ to Eq. (1). Usually, numerically computing the integral in Eq. (3) must be done with care, as $G$ possesses a singularity when $x=y$. However, our global approximant $\tilde{G}$ has a hierarchical structure and is constructed to be zero near the diagonal. Therefore, for each fixed $x\in D$, we simply recommend partitioning $\int_D \tilde{G}(x,y)f(y)\,dy$ into the panels that correspond to the hierarchical decomposition and then discretizing each panel with a quadrature rule, as in the sketch below.
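To make this concrete, the following is a minimal sketch, in one spatial dimension for readability, of evaluating $u(x)\approx\int_D \tilde{G}(x,y)f(y)\,dy$ panel by panel with Gauss–Legendre quadrature. The particular approximant `G_approx`, the right-hand side `f`, and the uniform panels are illustrative placeholders rather than quantities produced by our algorithm.

```python
import numpy as np

def integrate_over_panels(G_approx, f, panels, x, n_quad=10):
    """Approximate u(x) = int G_approx(x, y) f(y) dy by summing Gauss-Legendre
    quadrature contributions over a list of 1D panels (a, b)."""
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)   # nodes/weights on [-1, 1]
    u = 0.0
    for (a, b) in panels:
        y = 0.5 * (b - a) * nodes + 0.5 * (a + b)              # map nodes to [a, b]
        w = 0.5 * (b - a) * weights
        u += np.sum(w * G_approx(x, y) * f(y))
    return u

# Illustrative approximant that is set to zero near the diagonal, as in our construction.
G_approx = lambda x, y: np.where(np.abs(x - y) > 0.1, 1.0 / (1e-3 + np.abs(x - y)), 0.0)
f = lambda y: np.sin(np.pi * y)
panels = [(j / 8, (j + 1) / 8) for j in range(8)]              # stand-in for hierarchical panels
print(integrate_over_panels(G_approx, f, panels, x=0.3))
```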

Main contributions
There are two main contributions in this paper: (1) the generalization of the randomized singular value decomposition (SVD) algorithm for learning matrices from matrix–vector products to Hilbert–Schmidt (HS) operators, and (2) a theoretical learning rate for discovering Green's functions associated with PDEs of the form of Eq. (2). These contributions are summarized in Theorems 1 and 3.
Theorem 1 says that, with high probability, one can recover a near-best rank $k$ HS operator using $k+p$ operator–function products, for a small integer $p$. In the bound of the theorem, a quantity denoted by $0<\gamma_k\le 1$ measures the quality of the input–output training pairs (see Sections 3.1 and 3.4). We then combine Theorem 1 with the theory of Green's functions for elliptic PDEs to derive a theoretical learning rate for PDEs.
In Theorem 3, we show that Green's functions associated with uniformly elliptic PDEs in three dimensions can be recovered using $N = \mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$ input–output pairs $\{(f_j,u_j)\}_{j=1}^N$ to within an accuracy of $\mathcal{O}(\Gamma_\epsilon^{-1/2}\log^3(1/\epsilon)\,\epsilon)$ with high probability, for $0<\epsilon<1$. Our learning rate associated with uniformly elliptic PDEs in three variables is therefore $\mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$. The quantity $0<\Gamma_\epsilon\le 1$ (defined in Section 4.4.2) measures the quality of the GP used to generate the random functions $\{f_j\}_{j=1}^N$ for learning $G$. We emphasize that the number of training pairs is small only if the GP's quality is high. The probability bound in Theorem 3 implies that the constructed approximation is close to $G$ with high probability and converges almost surely to the Green's function as $\epsilon\to 0$.

Organization of paper
The paper is structured as follows. In Section 2, we briefly review HS operators and GPs. We then generalize the randomized SVD algorithm to HS operators in Section 3. Next, in Section 4, we characterize the learning rate for PDEs of the form of Eq. (2) (see Theorem 3). Finally, we conclude and discuss potential further directions in Section 5.

Background material
We begin by reviewing quasimatrices (see Section 2.1), HS operators (see Section 2.2), and GPs (see Section 2.3).

Quasimatrices
Quasimatrices are an infinite-dimensional analogue of tall-skinny matrices [57]. Let $D_1, D_2\subseteq\mathbb{R}^d$ be two domains with $d\ge 1$ and denote by $L^2(D_1)$ the space of square-integrable functions defined on $D_1$. Many of the results in this paper are easier to state using quasimatrices. We say that $\Omega$ is a $D_1\times k$ quasimatrix if $\Omega$ is a matrix with $k$ columns where each column is a function in $L^2(D_1)$. That is,
$$\Omega = \big[\,\omega_1\,|\,\cdots\,|\,\omega_k\,\big], \qquad \omega_j\in L^2(D_1).$$
Quasimatrices are useful for defining analogues of matrix operations for HS operators [5,56,57,58]. For example, if $F: L^2(D_1)\to L^2(D_2)$ is an HS operator, then we write $F\Omega$ to denote the quasimatrix obtained by applying $F$ to each column of $\Omega$. Moreover, we write $\Omega^*\Omega$ for the $k\times k$ matrix with entries $(\Omega^*\Omega)_{ij} = \langle\omega_i,\omega_j\rangle$ and $\Omega\Omega^*$ for the operator $u\mapsto\sum_{j=1}^k\langle\omega_j,u\rangle\,\omega_j$, where $\langle\cdot,\cdot\rangle$ is the $L^2(D_1)$ inner product. Many operations for rectangular matrices in linear algebra can be generalized to quasimatrices [57].
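In computations, a quasimatrix can be represented as a list of functions and its Gram matrix $\Omega^*\Omega$ evaluated by numerical integration. The short Python sketch below illustrates this on the interval $[0,1]$ with three sine modes; the specific columns are illustrative and not part of the paper's algorithm.

```python
import numpy as np
from scipy.integrate import quad

# A quasimatrix: k "columns", each a function in L^2([0, 1]).
omega = [lambda x, j=j: np.sin((j + 1) * np.pi * x) for j in range(3)]

def quasi_gram(cols):
    """Omega^* Omega: the k x k matrix of L^2 inner products <omega_i, omega_j>."""
    k = len(cols)
    M = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            M[i, j] = quad(lambda x: cols[i](x) * cols[j](x), 0.0, 1.0)[0]
    return M

print(np.round(quasi_gram(omega), 3))   # ~0.5 * identity for these sine modes
```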

Hilbert-Schmidt operators
HS operators are an infinite-dimensional analogue of matrices acting on vectors. Since $L^2(D_1)$ is a separable Hilbert space, there is a complete orthonormal basis $\{e_j\}_{j=1}^\infty$ of $L^2(D_1)$. We call $F: L^2(D_1)\to L^2(D_2)$ an HS operator [23, Ch. 4] with HS norm $\|F\|_{\mathrm{HS}}$ if $F$ is linear and
$$\|F\|_{\mathrm{HS}}^2 := \sum_{j=1}^\infty \|Fe_j\|_{L^2(D_2)}^2 < \infty.$$
The archetypical example of an HS operator is an HS integral operator $F: L^2(D_1)\to L^2(D_2)$ defined by
$$(Ff)(x) = \int_{D_1} G(x,y)f(y)\,dy, \quad x\in D_2,$$
where $G\in L^2(D_2\times D_1)$ is the kernel of $F$ and $\|F\|_{\mathrm{HS}} = \|G\|_{L^2(D_2\times D_1)}$. Since HS operators are compact operators, they have an SVD [23, Thm. 4.3.1]. That is, there exist a nonnegative sequence $\sigma_1\ge\sigma_2\ge\cdots\ge 0$ and orthonormal bases $\{q_j\}_{j=1}^\infty$ of $L^2(D_2)$ and $\{v_j\}_{j=1}^\infty$ of $L^2(D_1)$ such that, for any $f\in L^2(D_1)$, we have
$$Ff = \sum_{j=1}^\infty \sigma_j\,\langle v_j, f\rangle\, q_j, \qquad (4)$$
where the equality holds in the $L^2(D_2)$ sense. Note that we use the complete SVD, which includes singular functions associated with the kernel (nullspace) of $F$. Moreover, one finds that $\|F\|_{\mathrm{HS}}^2 = \sum_{j=1}^\infty\sigma_j^2$, which shows that the HS norm is an infinite-dimensional analogue of the Frobenius matrix norm $\|\cdot\|_F$. In the same way that truncating the SVD after $k$ terms gives a best rank $k$ matrix approximation, truncating Eq. (4) after $k$ terms gives a best rank $k$ approximation in the HS norm [23, Thm. 4.4.7]. In this paper, we are interested in constructing an approximation to $G$ in Eq. (3) from input–output pairs $\{(f_j,u_j)\}_{j=1}^N$ such that $u_j = Ff_j$. Throughout this paper, for a quasimatrix $\Omega$ with $k$ columns $\omega_1,\dots,\omega_k$ in $L^2(D_2)$, the HS operator $\Omega\Omega^*F: L^2(D_1)\to L^2(D_2)$ is given by $\Omega\Omega^*Ff = \sum_{j=1}^k\langle\omega_j, Ff\rangle\,\omega_j$. Similarly, for $F\Omega: \mathbb{R}^k\to L^2(D_2)$ with $\Omega$ a $D_1\times k$ quasimatrix, we have $\|F\Omega\|_{\mathrm{HS}}^2 = \sum_{j=1}^k\|F\Omega\tilde{e}_j\|_{L^2(D_2)}^2$, where $\{\tilde{e}_j\}_{j=1}^k$ is an orthonormal basis of $\mathbb{R}^k$. Moreover, if $\Omega$ has full column rank, then $P_\Omega = \Omega(\Omega^*\Omega)^\dagger\Omega^*$ is the orthogonal projection onto the column space of $\Omega$, so that $P_\Omega F$ projects the range of $F$ onto that space. Here, $(\Omega^*\Omega)^\dagger$ is the pseudoinverse of $\Omega^*\Omega$.
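The best rank $k$ truncation property carries over directly to discretizations. The following Python sketch, given here only as an illustration, discretizes the integral operator with kernel $G(x,y)=\min(x,y)(1-\max(x,y))$ (the Green's function of $-u''$ on $[0,1]$) and checks that the SVD truncation error equals the tail $(\sum_{j>k}\sigma_j^2)^{1/2}$.

```python
import numpy as np

# Discretized HS integral operator on [0, 1]: (F f)(x) = int G(x, y) f(y) dy.
n = 200
x = np.linspace(0, 1, n)
h = x[1] - x[0]
G = np.minimum(x[:, None], x[None, :]) * (1 - np.maximum(x[:, None], x[None, :]))
F = h * G                                       # quadrature-weighted operator matrix

U, s, Vt = np.linalg.svd(F)
k = 5
F_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # truncation of the SVD after k terms
print(np.linalg.norm(F - F_k), np.sqrt(np.sum(s[k:] ** 2)))   # identical: best rank-k error
```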

Gaussian processes
A GP is an infinite-dimensional analogue of a multivariate Gaussian distribution, and a function drawn from a GP is analogous to a randomly generated vector. If $K: D\times D\to\mathbb{R}$ is a continuous symmetric positive semidefinite kernel, where $D\subseteq\mathbb{R}^d$ is a domain, then a GP is a stochastic process $\{X_t,\ t\in D\}$ such that, for every finite set of indices $t_1,\dots,t_n\in D$, the vector of random variables $(X_{t_1},\dots,X_{t_n})$ follows a multivariate Gaussian distribution with mean $(0,\dots,0)$ and covariance $K_{ij} = K(t_i,t_j)$ for $1\le i,j\le n$. We denote a GP with mean $(0,\dots,0)$ and covariance $K$ by $\mathcal{GP}(0,K)$.
Since $K$ is a continuous symmetric positive semidefinite kernel, it has nonnegative eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge 0$ and there is an orthonormal basis of eigenfunctions $\{\psi_j\}_{j=1}^\infty$ of $L^2(D)$ such that [23, Thm. 4.6.5]
$$K(x,y) = \sum_{j=1}^\infty \lambda_j\,\psi_j(x)\psi_j(y), \quad (x,y)\in D\times D, \qquad (5)$$
where the infinite sum is absolutely and uniformly convergent [39]. In addition, we define the trace of the covariance kernel $K$ by $\operatorname{Tr}(K) := \sum_{j=1}^\infty\lambda_j < \infty$. The eigendecomposition of $K$ gives an algorithm for generating functions from $\mathcal{GP}(0,K)$. In particular, if $\omega = \sum_{j=1}^\infty\sqrt{\lambda_j}\,c_j\psi_j$, where the coefficients $\{c_j\}_{j=1}^\infty$ are independent and identically distributed standard Gaussian random variables, then $\omega\sim\mathcal{GP}(0,K)$ [26,33]. We also have
$$\mathbb{E}\big[\|\omega\|_{L^2(D)}^2\big] = \sum_{j=1}^\infty\lambda_j = \operatorname{Tr}(K),$$
where the last equality is analogous to the fact that the trace of a matrix is equal to the sum of its eigenvalues. In this paper, we restrict our attention to GPs with positive definite covariance kernels so that the eigenvalues of $K$ are strictly positive. In Fig. 1, we display the squared-exponential kernel $K_{\mathrm{SE}}(x,y) = \exp(-|x-y|^2/(2\ell^2))$ for $x,y\in[-1,1]$ [50, Chapt. 4] with parameters $\ell = 1, 0.1, 0.01$, together with functions sampled from $\mathcal{GP}(0,K_{\mathrm{SE}})$. We observe that the functions become more oscillatory as the length-scale parameter $\ell$ decreases; hence the numerical rank of the kernel increases or, equivalently, the associated eigenvalues $\{\lambda_j\}$ decay more slowly to zero.
Fig. 1 Squared-exponential covariance kernel $K_{\mathrm{SE}}$ with parameter $\ell = 1, 0.1, 0.01$ (top row) and five functions sampled from $\mathcal{GP}(0,K_{\mathrm{SE}})$ (bottom row).
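A minimal Python sketch of this sampling procedure, using a discretization of $K_{\mathrm{SE}}$ on a grid and the truncated expansion $\omega\approx\sum_j\sqrt{\lambda_j}\,c_j\psi_j$, is given below; the grid size and length scale are arbitrary choices for illustration.

```python
import numpy as np

# Sample from GP(0, K_SE) on [-1, 1] by discretizing the kernel and using its
# eigendecomposition: omega ~ sum_j sqrt(lambda_j) c_j psi_j with c_j ~ N(0, 1).
n, ell = 400, 0.1
x = np.linspace(-1.0, 1.0, n)
h = x[1] - x[0]                                    # quadrature weight of the grid
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
lam, Psi = np.linalg.eigh(h * K)                   # eigenpairs of the discretized kernel
lam = np.clip(lam, 0.0, None)                      # clip tiny negative round-off
c = np.random.default_rng(0).standard_normal(n)
omega = Psi @ (np.sqrt(lam) * c) / np.sqrt(h)      # sampled function on the grid x
print(omega[:5])
```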

Low-rank approximation of Hilbert-Schmidt operators
In a landmark paper, Halko, Martinsson, and Tropp proved that one can learn the column space of a finite matrix, to high accuracy and with a high probability of success, by using matrix–vector products with standard Gaussian random vectors [22]. We now set out to generalize this from matrices to HS operators. Alternative randomized low-rank approximation techniques, such as the generalized Nyström method [42], might also be generalized in a similar manner. Since the proof is relatively long, we state our final generalization now.
Theorem 1 Let $D_1, D_2\subseteq\mathbb{R}^d$ be domains with $d\ge 1$ and $F: L^2(D_1)\to L^2(D_2)$ be an HS operator. Select a target rank $k\ge 1$, an oversampling parameter $p\ge 2$, and a $D_1\times(k+p)$ quasimatrix $\Omega$ such that each column is drawn independently from $\mathcal{GP}(0,K)$, where $K: D_1\times D_1\to\mathbb{R}$ is a continuous symmetric positive definite covariance kernel with eigenvalues $\lambda_1\ge\lambda_2\ge\cdots>0$. If $\mathbf{Y} = F\Omega$, then
$$\mathbb{E}\,\|F - P_{\mathbf{Y}}F\|_{\mathrm{HS}} \le \sqrt{1 + \frac{k}{\gamma_k(p-1)}}\,\Bigg(\sum_{j=k+1}^\infty\sigma_j^2\Bigg)^{1/2}, \qquad (6)$$
where $\gamma_k = k/(\lambda_1\operatorname{Tr}(C^{-1}))$ with $C_{ij} = \int_{D_1\times D_1} v_i(x)K(x,y)v_j(y)\,dx\,dy$ for $1\le i,j\le k$. Here, $P_{\mathbf{Y}}$ is the orthogonal projection onto the vector space spanned by the columns of $\mathbf{Y}$, $\sigma_j$ is the $j$th singular value of $F$, and $v_j$ is the $j$th right singular function of $F$.
Assume further that $p\ge 4$. Then, for any $s,t\ge 1$, a corresponding deviation bound, Eq. (7), holds with probability $\ge 1 - t^{-p} - [se^{-(s^2-1)/2}]^{k+p}$.
We remark that the term $[se^{-(s^2-1)/2}]^{k+p}$ in the statement of Theorem 1 is bounded by $e^{-s^2}$ for $s\ge 2$ and $k+p\ge 5$. In the rest of the section, we prove this theorem.
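Before turning to the proof, the following finite-dimensional Python sketch illustrates the procedure that Theorem 1 analyzes: test vectors are drawn from a correlated Gaussian distribution $\mathcal{N}(0,K)$, the operator is queried once per test vector, and the range is captured by the resulting sketch. The matrix $A$, the covariance $K$, and the parameter values are placeholders chosen only for illustration.

```python
import numpy as np

def randomized_range(A, K, k, p, rng):
    """Range finder with correlated Gaussian test vectors: columns of Omega ~ N(0, K)."""
    n = A.shape[1]
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))       # K = L L^T (jitter for stability)
    Omega = L @ rng.standard_normal((n, k + p))        # test vectors drawn from N(0, K)
    Y = A @ Omega                                      # "input-output" products
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis for range(Y)
    return Q

rng = np.random.default_rng(0)
n = 200
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(2.0 ** -np.arange(n)) @ U.T            # matrix with decaying singular values
x = np.linspace(0, 1, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))   # squared-exponential covariance
Q = randomized_range(A, K, k=20, p=5, rng=rng)
print(np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A))   # relative error of P_Y A
```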

Three caveats that make the generalization non-trivial
One might imagine that the generalization of the randomized SVD algorithm from matrices to HS operators is trivial, but this is not the case due to three caveats:
1. The randomized SVD on finite matrices always uses matrix–vector products with standard Gaussian random vectors [22]. However, for GPs, one must have a continuous kernel $K$ in $\mathcal{GP}(0,K)$, which discretizes to a non-standard multivariate Gaussian distribution. Therefore, we must extend [22, Thm. 10.5] to allow for non-standard multivariate Gaussian distributions. The discrete version of our extension is the following:
Corollary 1 Let $A$ be a real $n_2\times n_1$ matrix with singular values $\sigma_1\ge\cdots\ge\sigma_{\min\{n_1,n_2\}}$. Choose a target rank $k\ge 1$ and an oversampling parameter $p\ge 2$. Draw an $n_1\times(k+p)$ Gaussian matrix $\Omega$ with independent columns, where each column is drawn from a multivariate Gaussian distribution with mean $(0,\dots,0)^\top$ and positive definite covariance matrix $K$. If $Y = A\Omega$, then the expected approximation error is bounded by
$$\mathbb{E}\,\|A - P_Y A\|_F \le \sqrt{1 + \frac{k}{\gamma_k(p-1)}}\,\Bigg(\sum_{j=k+1}^{\min\{n_1,n_2\}}\sigma_j^2\Bigg)^{1/2},$$
where $\gamma_k = k/(\lambda_1\operatorname{Tr}(C^{-1}))$ with $C = V_1^*KV_1$ and $V_1$ the matrix containing the first $k$ right singular vectors of $A$, $\lambda_1\ge\cdots\ge\lambda_{n_1}>0$ are the eigenvalues of $K$, and $P_Y$ is the orthogonal projection onto the vector space spanned by the columns of $Y$. Assume further that $p\ge 4$; then, for any $s,t\ge 1$, a corresponding deviation bound holds with probability $\ge 1 - t^{-p} - [se^{-(s^2-1)/2}]^{k+p}$.
Choosing a covariance matrix $K$ with sufficient eigenvalue decay so that $\lim_{n_1\to\infty}\sum_{j=1}^{n_1}\lambda_j < \infty$ allows $\mathbb{E}\,\|\Omega\|_F^2$ to remain bounded as $n_1\to\infty$. This is of interest when applying the randomized SVD algorithm to extremely large matrices and is critical for HS operators. A stronger statement of this result [9, Thm. 2] shows that prior information on $A$ can be incorporated into the covariance matrix to achieve a lower approximation error than the randomized SVD with standard Gaussian vectors.
2. We need an additional essential assumption: the kernel in $\mathcal{GP}(0,K)$ is "reasonable" for learning $F$, where reasonableness is measured by the quantity $\gamma_k$ in Theorem 1. If the first $k$ right singular functions $v_1,\dots,v_k$ of the HS operator are spanned by the first $k+m$ eigenfunctions $\psi_1,\dots,\psi_{k+m}$ of $K$, for some $m\in\mathbb{N}$, then (see Eq. (11) and Lemma 2)
$$\frac{k}{\sum_{j=1}^k\lambda_1/\lambda_{m+j}} \le \gamma_k \le \frac{k}{\sum_{j=1}^k\lambda_1/\lambda_j}.$$
In the matrix setting, this assumption always holds with $m = n_1 - k$ (see Corollary 1), and one can have $\gamma_k = 1$ when $\lambda_1 = \cdots = \lambda_{n_1}$ [22, Thm. 10.5].
3. Probabilistic error bounds for the randomized SVD in [22] are derived using tail bounds for functions of standard Gaussian matrices [30, Sec. 5.1]. Unfortunately, we are not aware of tail bounds for non-standard Gaussian quasimatrices. This results in a slightly weaker probability bound than [22, Thm. 10.7].

Deterministic error bound
Apart from the three caveats, the proof of Theorem 1 follows the outline of the argument in [22, Thm. 10.5]. We define two quasimatrices $U$ and $V$ containing the left and right singular functions of $F$, so that the $j$th column of $V$ is $v_j$. We also denote by $\Sigma$ the infinite diagonal matrix with the singular values of $F$, i.e., $\sigma_1\ge\sigma_2\ge\cdots\ge 0$, on the diagonal. Finally, for a fixed $k\ge 1$, we define the $D_1\times k$ quasimatrix $V_1$ as the truncation of $V$ after the first $k$ columns and $V_2$ as the remainder.
Similarly, we split $\Sigma$ into two parts: the $k\times k$ diagonal block $\Sigma_1$ containing $\sigma_1,\dots,\sigma_k$ and the diagonal block $\Sigma_2$ containing the remaining singular values $\sigma_{k+1},\sigma_{k+2},\dots$.
We are ready to prove an infinite-dimensional analogue of [22, Thm. 9.1] for HS operators.
Theorem 2 (Deterministic error bound) Let $F: L^2(D_1)\to L^2(D_2)$ be an HS operator with SVD given in Eq. (4). Let $\Omega$ be a $D_1\times\ell$ quasimatrix with $\ell\ge k$, and set $\mathbf{Y} = F\Omega$, $\Omega_1 = V_1^*\Omega$, and $\Omega_2 = V_2^*\Omega$. Then, assuming $\Omega_1$ has full rank, we have
$$\|F - P_{\mathbf{Y}}F\|_{\mathrm{HS}}^2 \le \|\Sigma_2\|_{\mathrm{HS}}^2 + \|\Sigma_2\Omega_2\Omega_1^\dagger\|_{\mathrm{HS}}^2,$$
where $P_{\mathbf{Y}} = \mathbf{Y}(\mathbf{Y}^*\mathbf{Y})^\dagger\mathbf{Y}^*$ is the orthogonal projection onto the space spanned by the columns of $\mathbf{Y}$ and $\Omega_1^\dagger$ denotes the pseudoinverse of $\Omega_1$.
Proof First, note that because $UU^*$ is the orthogonal projection onto the range of $F$ and $U$ is a basis for that range, Parseval's theorem [51, Thm. 4.18] gives the equality $\|F - P_{\mathbf{Y}}F\|_{\mathrm{HS}} = \|(\mathrm{I} - P_{U^*\mathbf{Y}})U^*FV\|_{\mathrm{HS}}$, because the inner product $\langle\sum_{j=1}^\infty\alpha_j u_j, \sum_{j=1}^\infty\beta_j u_j\rangle = 0$ if and only if $\sum_{j=1}^\infty\alpha_j\beta_j = 0$. We now take $A = U^*FV$, which is a bounded infinite matrix such that $\|A\|_F = \|F\|_{\mathrm{HS}} < \infty$. The statement of the theorem immediately follows from the proof of [22, Thm. 9.1]. ⊓⊔
This theorem shows that the bound on the approximation error $\|F - P_{\mathbf{Y}}F\|_{\mathrm{HS}}$ depends on the singular values of the HS operator and the test quasimatrix $\Omega$.

Probability distribution of Ω 1
If the columns of $\Omega$ are independent and identically distributed as $\mathcal{GP}(0,K)$, then the matrix $\Omega_1$ in Theorem 2 is of size $k\times\ell$ with entries that follow a Gaussian distribution. To see this, note that $(\Omega_1)_{ij} = \langle v_i,\omega_j\rangle$ for $1\le i\le k$ and $1\le j\le\ell$. If $\omega\sim\mathcal{GP}(0,K)$ with $K$ given in Eq. (5), then we find that $\langle v,\omega\rangle\sim\mathcal{N}\big(0,\sum_{j=1}^\infty\lambda_j\langle v,\psi_j\rangle^2\big)$, so we conclude that $\Omega_1$ has Gaussian entries with zero mean. Finding the covariances between the entries is more involved.
Lemma 1 With the same setup as Theorem 2, suppose that the columns of $\Omega$ are independent and identically distributed as $\mathcal{GP}(0,K)$. Then, the matrix $\Omega_1 = V_1^*\Omega$ in Theorem 2 has independent columns, and each column is identically distributed as a multivariate Gaussian with positive definite covariance matrix $C$ given by
$$C_{ij} = \int_{D_1\times D_1} v_i(x)K(x,y)v_j(y)\,dx\,dy, \quad 1\le i,j\le k. \qquad (9)$$
Proof We already know that the entries of $\Omega_1$ are Gaussian with mean zero. Moreover, the columns are independent because $\omega_1,\dots,\omega_\ell$ are independent. Therefore, we focus on the covariance matrix. Let $1\le i,i'\le k$ and $1\le j,j'\le\ell$. Since $\mathbb{E}[\langle v_i,\omega_j\rangle] = 0$, the covariance $\operatorname{cov}\big((\Omega_1)_{ij},(\Omega_1)_{i'j'}\big) = \mathbb{E}\big[\langle v_i,\omega_j\rangle\langle v_{i'},\omega_{j'}\rangle\big]$ can be computed by expanding $\omega_j$ and $\omega_{j'}$ in the eigenbasis of $K$ and passing to the limit in the truncated expansions; the exchange of limit and expectation is justified by the Cauchy–Schwarz inequality, since the second moments of the truncations are bounded by $\sum_{n=1}^\infty\lambda_n < \infty$. The cross terms vanish whenever $n\ne n'$ or $j\ne j'$, because the coefficients $c_n^{(j)}$ and $c_{n'}^{(j')}$ are then independent random variables with mean zero. Since $\mathbb{E}[(c_n^{(j)})^2] = 1$, we obtain
$$\operatorname{cov}\big((\Omega_1)_{ij},(\Omega_1)_{i'j'}\big) = \begin{cases}\sum_{n=1}^\infty\lambda_n\langle v_i,\psi_n\rangle\langle v_{i'},\psi_n\rangle, & j = j',\\ 0, & \text{otherwise},\end{cases}$$
and the infinite sum is equal to the integral in Eq. (9). To see that $C$ is positive definite, let $a\in\mathbb{R}^k$; then $a^*Ca = \mathbb{E}[Z_a^2]\ge 0$, where $Z_a = \sum_{i=1}^k a_i\langle v_i,\omega\rangle$ is a mean-zero Gaussian random variable. Moreover, $a^*Ca = 0$ implies that $a = 0$ because $v_1,\dots,v_k$ are orthonormal and $\{\psi_n\}$ is an orthonormal basis of $L^2(D_1)$. ⊓⊔
Lemma 1 gives the distribution of the matrix $\Omega_1$, which is essential to prove Theorem 1 in Section 3.6. In particular, $\Omega_1$ has independent columns that are each distributed as a multivariate Gaussian with covariance matrix given in Eq. (9).

Quality of the covariance kernel
To investigate the quality of the kernel, we introduce the Wishart distribution, which is a family of probability distributions over symmetric nonnegative definite matrices that often appears in the context of covariance matrices [61]. If $\Omega_1$ is a $k\times\ell$ random matrix with independent columns, where each column follows a multivariate Gaussian distribution with mean $(0,\dots,0)^\top$ and covariance $C$, then $A = \Omega_1\Omega_1^*$ has a Wishart distribution [61]. We write $A\sim W_k(\ell,C)$. We note that $\|\Omega_1^\dagger\|_F^2 = \operatorname{Tr}[(\Omega_1^\dagger)^*\Omega_1^\dagger] = \operatorname{Tr}(A^{-1})$, where the second equality holds with probability one because the matrix $A = \Omega_1\Omega_1^*$ is invertible with probability one (see [41, Thm. 3.1.4]). By [41, Thm. 3.2.12], for $\ell-k\ge 2$ we have $\mathbb{E}[A^{-1}] = \frac{1}{\ell-k-1}C^{-1}$, hence $\mathbb{E}[\operatorname{Tr}(A^{-1})] = \operatorname{Tr}(C^{-1})/(\ell-k-1)$, and we conclude that
$$\mathbb{E}\,\|\Omega_1^\dagger\|_F^2 = \frac{\operatorname{Tr}(C^{-1})}{\ell-k-1}. \qquad (10)$$
The quantity $\gamma_k = k/(\lambda_1\operatorname{Tr}(C^{-1}))$ can be viewed as measuring the quality of the covariance kernel $K$ for learning the HS operator $F$ (see Theorem 1). First, $0<\gamma_k<\infty$ because $C$ is symmetric positive definite. Moreover, for $1\le j\le k$, the $j$th largest eigenvalue of $C$ is bounded by the $j$th largest eigenvalue of $K$, as $C$ is a principal submatrix of $V^*KV$ [27, Sec. III.5]. Therefore, the following inequality holds,
$$\gamma_k \le \frac{k}{\sum_{j=1}^k\lambda_1/\lambda_j}, \qquad (11)$$
and the harmonic mean of the first $k$ scaled eigenvalues $\lambda_j/\lambda_1$ of $K$ is an upper bound for $\gamma_k$. In the ideal situation, the eigenfunctions of $K$ are the right singular functions of $F$, i.e., $\psi_n = v_n$, $C$ is a diagonal matrix with entries $\lambda_1,\dots,\lambda_k$, and $\gamma_k = k/\big(\sum_{j=1}^k\lambda_1/\lambda_j\big)$ is as large as possible. We now provide a useful lower bound on $\gamma_k$ in a more general setting.
Lemma 2 Let $V_1$ be a $D_1\times k$ quasimatrix with orthonormal columns and assume that there exists $m\in\mathbb{N}$ such that the columns of $V_1$ are spanned by the first $k+m$ eigenfunctions of the continuous positive definite kernel $K: D_1\times D_1\to\mathbb{R}$. Then
$$\gamma_k \ge \frac{k}{\sum_{j=1}^k\lambda_1/\lambda_{m+j}},$$
where $\lambda_1\ge\lambda_2\ge\cdots>0$ are the eigenvalues of $K$. This bound is tight in the sense that the inequality can be attained as an equality.
Proof Let $Q = [\psi_1\,|\,\cdots\,|\,\psi_{k+m}]$ be a quasimatrix whose columns form an orthonormal basis of $\mathrm{Span}(\psi_1,\dots,\psi_{k+m})$. Then, the column space of $Q$ is an invariant subspace of $K$ and $C$ is a principal submatrix of $Q^*KQ$, which has eigenvalues $\lambda_1\ge\cdots\ge\lambda_{k+m}$. By [27, Thm. 6.46], the $k$ eigenvalues of $C$, denoted by $\mu_1\ge\cdots\ge\mu_k$, satisfy $\mu_j\ge\lambda_{m+j}$ for $1\le j\le k$. The result follows because the trace of a matrix is the sum of its eigenvalues, so that $\operatorname{Tr}(C^{-1}) = \sum_{j=1}^k\mu_j^{-1}\le\sum_{j=1}^k\lambda_{m+j}^{-1}$. ⊓⊔
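The quantities in this subsection are easy to evaluate numerically in a discretized setting. The Python sketch below, given purely as an illustration, computes $\gamma_k = k/(\lambda_1\operatorname{Tr}(C^{-1}))$ for the squared-exponential kernel in the ideal case where the right singular functions coincide with the leading eigenfunctions of $K$, so that the bound of Lemma 2 (with $m=0$) is attained.

```python
import numpy as np

n, k = 300, 10
x = np.linspace(-1, 1, n)
h = x[1] - x[0]
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))     # squared-exponential kernel
lam, Psi = np.linalg.eigh(h * K)
lam, Psi = lam[::-1], Psi[:, ::-1]                                # sort eigenvalues decreasingly
V1 = Psi[:, :k] / np.sqrt(h)                                      # v_j = psi_j (ideal case)
C = h * h * V1.T @ K @ V1                                         # C_ij = int int v_i K v_j
gamma_k = k / (lam[0] * np.trace(np.linalg.inv(C)))
print(gamma_k, k / np.sum(lam[0] / lam[:k]))                      # both sides agree when m = 0
```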

Probabilistic error bounds
As discussed in Section 3.1, we need to extend the probability bounds of the randomized SVD to allow for non-standard Gaussian random vectors. The corresponding generalization of [22, Thm. A.7], Lemma 3, provides a tail bound for $\|\Omega_1^\dagger\|_F^2$ that fails with probability at most $t^{-p}$ for $t\ge 1$.
Under the assumption of Lemma 2, combining Lemma 3 with the bound $\operatorname{Tr}(C^{-1})\le\sum_{j=1}^k\lambda_{m+j}^{-1}$ expresses the resulting tail bound on $\|\Omega_1^\dagger\|_F^2$ in terms of the eigenvalues $\lambda_{m+1},\dots,\lambda_{m+k}$ of $K$.
To obtain the probability statement found in Eq. (13), we require control of the tail of the distribution of a Gaussian quasimatrix with non-standard covariance kernel (see Section 3.6). In the theory of the randomized SVD, one relies on concentration of measure results [22, Prop. 10.3]. However, we need to employ a different strategy and instead directly bound the HS norm of $\Omega_2$. One difficulty is that the norm of this quasimatrix must be controlled in infinite dimensions, which leads to a weaker probability bound than [22]. While it is possible to apply Markov's inequality to obtain deviation bounds, we highlight that Lemma 4 provides a Chernoff-type bound, i.e., exponential decay of the tail distribution of $\|\Omega_2\|_{\mathrm{HS}}$, which is crucial for approximating Green's functions (see Section 4.4.3).
Lemma 4 With the same notation as in Theorem 2, let $\ell\ge k\ge 1$. For all $s\ge 1$, we have
$$\mathbb{P}\Big\{\|\Omega_2\|_{\mathrm{HS}}^2 > s^2\,\ell\operatorname{Tr}(K)\Big\} \le \big[s e^{-(s^2-1)/2}\big]^{\ell}.$$
Proof We first remark that $\|\Omega_2\|_{\mathrm{HS}}^2 = \sum_{j=1}^\ell Z_j$ with $Z_j = \|V_2^*\omega_j\|_2^2$, where the $Z_j$ are independent and identically distributed (i.i.d.) because the $\omega_j\sim\mathcal{GP}(0,K)$ are i.i.d. For $1\le j\le\ell$, we have $\omega_j = \sum_{m=1}^\infty\sqrt{\lambda_m}\,c_m^{(j)}\psi_m$ (cf. Section 2.3), where the $c_m^{(j)}\sim\mathcal{N}(0,1)$ are i.i.d. for $m\ge 1$ and $1\le j\le\ell$. Since this series converges absolutely, we have $Z_j\le\sum_{m=1}^\infty X_m$, where the $X_m$ are independent random variables with $X_m\sim\lambda_m\chi^2$ for $m\ge 1$. Here, $\chi^2$ denotes the chi-squared distribution with one degree of freedom [40, Chapt. 4.3].
Let $N\ge 1$ and $0<\theta<1/(2\operatorname{Tr}(K))$. We can bound the moment generating function of $\sum_{m=1}^N X_m$ as
$$\mathbb{E}\Big[e^{\theta\sum_{m=1}^N X_m}\Big] = \prod_{m=1}^N(1-2\theta\lambda_m)^{-1/2} \le \big(1 - 2\theta\operatorname{Tr}(K)\big)^{-1/2},$$
because the $X_m/\lambda_m$ are independent random variables that follow a chi-squared distribution. Using the monotone convergence theorem, the same bound holds for $\sum_{m=1}^\infty X_m$, and a Chernoff-type argument then yields an exponential tail bound for $\|\Omega_2\|_{\mathrm{HS}}^2$ depending on $\theta$. We can minimize this upper bound over $0<\theta<1/(2\operatorname{Tr}(K))$ by choosing $\theta = \tilde{s}/\big(2(1+\tilde{s})\operatorname{Tr}(K)\big)$ for $\tilde{s}\ge 0$; the substitution $s = \sqrt{1+\tilde{s}}\ge 1$ concludes the proof. ⊓⊔
Lemma 4 can be refined further to take into account the interaction between the Hilbert–Schmidt operator $F$ and the covariance kernel $K$ (see [9, Lem. 7]).

Randomized SVD algorithm for HS operators
We first prove an intermediary result, which generalizes [22, Prop. 10.1] to HS operators. Note that one may obtain sharper bounds using a suitably chosen covariance kernel that yields a lower approximation error [9].
Lemma 5 Let $\Sigma_2$, $V_2$, and $\Omega$ be defined as in Theorem 2, and let $T$ be an $\ell\times k$ matrix, where $\ell\ge k\ge 1$. Then
$$\mathbb{E}\,\|\Sigma_2\Omega_2 T\|_{\mathrm{HS}}^2 \le \lambda_1\,\|\Sigma_2\|_{\mathrm{HS}}^2\,\|T\|_F^2,$$
where $\Omega_2 = V_2^*\Omega$.
Proof The quasimatrix $\Omega_2$ has independent columns and, using Lemma 1, each column is a mean-zero Gaussian whose covariance is the restriction of $K$ to the span of the remaining right singular functions $v_{k+1},v_{k+2},\dots$; written as a Rayleigh quotient, the largest eigenvalue of this covariance is at most $\lambda_1$. Expanding $\|\Sigma_2\Omega_2 T\|_{\mathrm{HS}}^2$ over the columns of $T$, taking expectations, and using the monotone convergence theorem for non-negative random variables, we obtain
$$\mathbb{E}\,\|\Sigma_2\Omega_2 T\|_{\mathrm{HS}}^2 \le \lambda_1\Bigg(\sum_{j=k+1}^\infty\sigma_j^2\Bigg)\|T\|_F^2,$$
where $\sigma_{k+1},\sigma_{k+2},\dots$ are the diagonal elements of $\Sigma_2$. ⊓⊔
We are now ready to prove Theorem 1, which shows that the randomized SVD can be generalized to HS operators.

Proof (Proof of Theorem 1)
Let $\Omega_1$, $\Omega_2$ be the quasimatrices defined in Theorem 2. The $k\times(k+p)$ matrix $\Omega_1$ has full rank with probability one and, by Theorem 2, we have
$$\mathbb{E}\,\|F - P_{\mathbf{Y}}F\|_{\mathrm{HS}} \le \mathbb{E}\Big[\big(\|\Sigma_2\|_{\mathrm{HS}}^2 + \|\Sigma_2\Omega_2\Omega_1^\dagger\|_{\mathrm{HS}}^2\big)^{1/2}\Big] \le \big(\|\Sigma_2\|_{\mathrm{HS}}^2 + \mathbb{E}\,\|\Sigma_2\Omega_2\Omega_1^\dagger\|_{\mathrm{HS}}^2\big)^{1/2},$$
where the last inequality follows from the Cauchy–Schwarz inequality. Then, using Lemma 5 and Eq. (10), we have
$$\mathbb{E}\,\|\Sigma_2\Omega_2\Omega_1^\dagger\|_{\mathrm{HS}}^2 \le \lambda_1\,\frac{\operatorname{Tr}(C^{-1})}{p-1}\,\|\Sigma_2\|_{\mathrm{HS}}^2 = \frac{k}{\gamma_k(p-1)}\,\|\Sigma_2\|_{\mathrm{HS}}^2,$$
where $\gamma_k$ is defined in Section 3.4. The observation that $\|\Sigma_2\|_{\mathrm{HS}}^2 = \sum_{j=k+1}^\infty\sigma_j^2$ concludes the proof of Eq. (6).
For the probabilistic bound in Eq. (7), we note that, by Theorem 2, we have
$$\|F - P_{\mathbf{Y}}F\|_{\mathrm{HS}}^2 \le \|\Sigma_2\|_{\mathrm{HS}}^2 + \|\Sigma_2\Omega_2\Omega_1^\dagger\|_{\mathrm{HS}}^2 \le \|\Sigma_2\|_{\mathrm{HS}}^2\big(1 + \|\Omega_2\|_{\mathrm{HS}}^2\,\|\Omega_1^\dagger\|_F^2\big),$$
where the second inequality uses the submultiplicativity of the HS norm. The bound follows from bounding $\|\Omega_1^\dagger\|_F^2$ and $\|\Omega_2\|_{\mathrm{HS}}^2$ using Lemmas 3 and 4, respectively. ⊓⊔

Recovering the Green's function from input-output pairs
It is known that the Green's function associated with Eq. (2) always exists, is unique, and is a nonnegative function $G: D\times D\to\mathbb{R}^+\cup\{\infty\}$ such that, for each $y\in D$ and any $r>0$, we have $G(\cdot,y)\in\mathcal{H}^1(D\setminus B_r(y))\cap\mathcal{W}_0^{1,1}(D)$ [19]. Since the PDE in Eq. (2) is self-adjoint, we also know that, for almost every $x,y\in D$, we have $G(x,y) = G(y,x)$ [19].
We now state Theorem 3, which shows that if $N = \mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$ and one has $N$ input–output pairs $\{(f_j,u_j)\}_{j=1}^N$ with algorithmically selected $f_j$, then the Green's function associated with $\mathcal{L}$ in Eq. (2) can be recovered to within an accuracy of $\mathcal{O}(\Gamma_\epsilon^{-1/2}\log^3(1/\epsilon)\,\epsilon)$ with high probability. Here, the quantity $0<\Gamma_\epsilon\le 1$ measures the quality of the random input functions $\{f_j\}_{j=1}^N$ (see Section 4.4.2).
Theorem 3 Let $0<\epsilon<1$, let $D\subset\mathbb{R}^3$ be a bounded Lipschitz domain, and let $\mathcal{L}$ be given in Eq. (2). If $G$ is the Green's function associated with $\mathcal{L}$, then there is a randomized algorithm that constructs an approximation $\tilde{G}$ of $G$ using $\mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$ input–output pairs such that, as $\epsilon\to 0$, we have
$$\|G - \tilde{G}\|_{L^2(D\times D)} = \mathcal{O}\big(\Gamma_\epsilon^{-1/2}\log^3(1/\epsilon)\,\epsilon\big)\,\|G\|_{L^2(D\times D)}$$
with probability $\ge 1 - \mathcal{O}(\epsilon^{\log(1/\epsilon)-6})$. The term $\Gamma_\epsilon$ is defined by Eq. (25).
Our algorithm that leads to the proof of Theorem 3 relies on the extension of the randomized SVD to HS operators (see Section 3) and on a hierarchical partition of the domain of $G$ into "well-separated" domains.

Recovering the Green's function on admissible domains
Roughly speaking, as $\|x-y\|_2$ increases, $G$ becomes smoother about $(x,y)$, which can be made precise using so-called admissible domains [1,2,21]. Let $\operatorname{diam} X := \sup_{x,y\in X}\|x-y\|_2$ be the diameter of $X$, $\operatorname{dist}(X,Y) := \inf_{x\in X,\,y\in Y}\|x-y\|_2$ be the shortest distance between $X$ and $Y$, and $\rho>0$ be a fixed constant. If $X,Y\subset\mathbb{R}^3$ are bounded domains, then we say that $X\times Y$ is an admissible domain if $\operatorname{dist}(X,Y)\ge\rho\max\{\operatorname{diam}X,\operatorname{diam}Y\}$; otherwise, we say that $X\times Y$ is non-admissible. There is a weaker definition of admissible domains, $\operatorname{dist}(X,Y)\ge\rho\min\{\operatorname{diam}X,\operatorname{diam}Y\}$ [21, p. 59], but we do not consider it.

Approximation theory on admissible domains
It turns out that the Green's function associated with Eq. (2) has rapidly decaying singular values when restricted to admissible domains. Roughly speaking, if $X,Y\subset D$ are such that $X\times Y$ is an admissible domain, then $G$ is well approximated on $X\times Y$ by a function of the form [3]
$$G_k(x,y) = \sum_{j=1}^k g_j(x)h_j(y), \qquad (14)$$
for some functions $g_1,\dots,g_k\in L^2(X)$ and $h_1,\dots,h_k\in L^2(Y)$. This is summarized in Theorem 4, which is a corollary of [3, Thm. 2.8].
Theorem 4 Let $G$ be the Green's function associated with Eq. (2) and $\rho>0$. Let $X,Y\subset D$ be such that $\operatorname{dist}(X,Y)\ge\rho\max\{\operatorname{diam}X,\operatorname{diam}Y\}$. Then, for any $0<\epsilon<1$, there exist $k\le k_\epsilon := \lceil c(\rho,\operatorname{diam}D,\kappa_C)\rceil\lceil\log(1/\epsilon)\rceil^4 + \lceil\log(1/\epsilon)\rceil$ and an approximant $G_k$ of $G$ in the form given in Eq. (14) such that
$$\|G - G_k\|_{L^2(X\times Y)} \le \epsilon\,\|G\|_{L^2(X\times\hat{Y})},$$
where $\hat{Y} := \{y\in D : \operatorname{dist}(y,Y)\le\frac{\rho}{2}\operatorname{diam}Y\}$, $\kappa_C = \lambda_{\max}/\lambda_{\min}$ is the spectral condition number of the coefficient matrix $A(x)$ in Eq. (2), and $c$ is a constant that only depends on $\rho$, $\operatorname{diam}D$, and $\kappa_C$.
Proof In [3, Thm. 2.8], it is shown that if $Y = \tilde{Y}\cap D$ with $\tilde{Y}$ convex, then there exist $k\le c_{\rho/2}^3\lceil\log(1/\epsilon)\rceil^4 + \lceil\log(1/\epsilon)\rceil$ and an approximant $G_k$ of $G$ such that
$$\|G(x,\cdot) - G_k(x,\cdot)\|_{L^2(Y)} \le \epsilon\,\|G(x,\cdot)\|_{L^2(\hat{Y})}, \qquad (15)$$
where $\hat{Y}$ is defined as in Theorem 4 and $c_{\rho/2}$ is a constant that only depends on $\rho$, $\operatorname{diam}Y$, and $\kappa_C$. As remarked in [3], $\tilde{Y}$ can be included in a convex domain of diameter $\operatorname{diam}D$ that contains $D$, which gives the constant $c(\rho,\operatorname{diam}D,\kappa_C)$. The statement follows by integrating the error bound in Eq. (15) over $X$. ⊓⊔
Since the truncated SVD of $G$ on $X\times Y$ gives the best rank $k\ge k_\epsilon$ approximation to $G$, Theorem 4 also gives a bound on the singular values of $G$ restricted to $X\times Y$:
$$\Bigg(\sum_{j=k_\epsilon+1}^\infty\sigma_{j,X\times Y}^2\Bigg)^{1/2} \le \epsilon\,\|G\|_{L^2(X\times\hat{Y})},$$
where $\sigma_{j,X\times Y}$ is the $j$th singular value of $G$ restricted to $X\times Y$. Since $k_\epsilon = \mathcal{O}(\log^4(1/\epsilon))$, we conclude that the singular values of $G$ restricted to admissible domains $X\times Y$ rapidly decay to zero.
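This rapid decay is easy to observe numerically. The Python sketch below uses the kernel $1/|x-y|$ as a stand-in for a Green's function with an off-diagonal singularity, restricts it to two well-separated intervals, and prints the normalized singular values, which decay geometrically; the specific kernel and intervals are illustrative only.

```python
import numpy as np

# Singular values of a singular kernel restricted to well-separated sets decay rapidly.
x = np.linspace(0.0, 0.2, 100)                       # X
y = np.linspace(0.7, 0.9, 100)                       # Y, with dist(X, Y) > max diam
G_XY = 1.0 / np.abs(x[:, None] - y[None, :])         # stand-in kernel restricted to X x Y
s = np.linalg.svd(G_XY, compute_uv=False)
print(s[:6] / s[0])                                  # rapid (geometric) decay
```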

Randomized SVD for admissible domains
Since $G$ has rapidly decaying singular values on admissible domains $X\times Y$, we use the randomized SVD for HS operators to learn $G$ on $X\times Y$ with high probability (see Section 3). We start by defining a GP on the domain $Y$. Let $R_{Y\times Y}K$ be the restriction of the covariance kernel $K$ to the domain $Y\times Y$, which is a continuous symmetric positive definite kernel, so that $\mathcal{GP}(0,R_{Y\times Y}K)$ defines a GP on $Y$. We choose a target rank $k\ge 1$ and an oversampling parameter $p\ge 2$, and form a quasimatrix $\Omega = [f_1\,|\,\cdots\,|\,f_{k+p}]$ such that $f_j\in L^2(Y)$ and the $f_j\sim\mathcal{GP}(0,R_{Y\times Y}K)$ are independent and identically distributed. We then extend each column of $\Omega$ by zero from $L^2(Y)$ to $L^2(D)$. Given the training data $\mathbf{Y} = [u_1\,|\,\cdots\,|\,u_{k+p}]$ such that $\mathcal{L}u_j = f_j$ and $u_j|_{\partial D} = 0$, we now construct an approximation to $G$ on $X\times Y$ using the randomized SVD (see Section 3). Following Theorem 1 applied to the restricted operator, we obtain, for $t\ge 1$ and $s\ge 2$, approximation error bounds (Eqs. (16) and (17)) that hold with probability greater than $1 - t^{-p} - e^{-s^2(k+p)}$. Here, $\lambda_1\ge\lambda_2\ge\cdots>0$ are the eigenvalues of $K$, $\tilde{G}_{X\times Y} = P_{R_X\mathbf{Y}}R_X F R_Y$, and $P_{R_X\mathbf{Y}} = R_X\mathbf{Y}\big((R_X\mathbf{Y})^*R_X\mathbf{Y}\big)^\dagger(R_X\mathbf{Y})^*$ is the orthogonal projection onto the space spanned by the columns of $R_X\mathbf{Y}$. Moreover, $\gamma_{k,X\times Y}$ is a measure of the quality of the covariance kernel of $\mathcal{GP}(0,R_{Y\times Y}K)$ (see Section 3.4), defined as $\gamma_{k,X\times Y} = k/(\lambda_1\operatorname{Tr}(C_{X\times Y}^{-1}))$, where
$$(C_{X\times Y})_{ij} = \int_{Y\times Y} v_{i,X\times Y}(x)K(x,y)v_{j,X\times Y}(y)\,dx\,dy, \quad 1\le i,j\le k,$$
and $v_{1,X\times Y},\dots,v_{k,X\times Y}\in L^2(Y)$ are the first $k$ right singular functions of $G$ restricted to $X\times Y$. Unfortunately, there is a problem with the formula $\tilde{G}_{X\times Y} = P_{R_X\mathbf{Y}}R_X F R_Y$: it cannot be formed because we only have access to input–output data, so we have no mechanism for composing $P_{R_X\mathbf{Y}}$ on the left of $R_X F R_Y$. Instead, we note that, since the partial differential operator in Eq. (2) is self-adjoint, $F$ is self-adjoint and $G$ is symmetric. We can use this to write down a formula for $\tilde{G}_{Y\times X}$ instead:
$$\tilde{G}_{Y\times X} = \big(P_{R_X\mathbf{Y}}\,R_X F R_Y\big)^* = R_Y F R_X\,P_{R_X\mathbf{Y}},$$
where we used the fact that $P_{R_X\mathbf{Y}}$ is also self-adjoint. This means we can construct $\tilde{G}_{Y\times X}$ by asking for more input–output data, namely by applying $F$ to the columns of $R_X\mathbf{Y}$ extended by zero to $D$. Of course, to compute $\tilde{G}_{X\times Y}$, we can swap the roles of $X$ and $Y$ in the above argument.
With a target rank of $k = k_\epsilon = \lceil c(\rho,\operatorname{diam}D,\kappa_C)\rceil\lceil\log(1/\epsilon)\rceil^4 + \lceil\log(1/\epsilon)\rceil$ and an oversampling parameter of $p = k_\epsilon$, we can combine Theorem 4 and Eqs. (16) and (17) to obtain an approximation error bound for $\tilde{G}_{X\times Y}$ on $X\times Y$ that holds with probability greater than $1 - t^{-k_\epsilon} - e^{-2s^2k_\epsilon}$. A similar approximation error holds for $\tilde{G}_{Y\times X}$ without additional evaluations of $F$. We conclude that our algorithm requires $N_{\epsilon,X\times Y} = 2(k_\epsilon+p) = \mathcal{O}(\log^4(1/\epsilon))$ input–output pairs to learn an approximant to $G$ on $X\times Y$ and $Y\times X$.
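The two-stage use of input–output products described above, first sketching the range on $X$ and then exploiting the symmetry of $G$ to recover the block, can be illustrated in a discretized setting. In the Python sketch below, the kernel `G`, the index sets `X` and `Y`, and the covariance are all placeholders; the point is that the block approximation is assembled from products of the form $Gf$ only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = np.linspace(0, 1, n)
G = 1.0 / (0.05 + np.abs(x[:, None] - x[None, :]))        # stand-in symmetric kernel
X, Y = np.arange(0, 60), np.arange(200, 260)               # well-separated index sets
K_Y = np.exp(-(x[Y][:, None] - x[Y][None, :]) ** 2 / (2 * 0.1 ** 2))
L = np.linalg.cholesky(K_Y + 1e-8 * np.eye(len(Y)))
k, p = 10, 10
F_in = np.zeros((n, k + p))
F_in[Y, :] = L @ rng.standard_normal((len(Y), k + p))      # f_j drawn on Y, zero elsewhere
U_out = G @ F_in                                           # first round: u_j = G f_j
Q, _ = np.linalg.qr(U_out[X, :])                           # basis for the range of G[X, Y]
Q_ext = np.zeros((n, Q.shape[1]))
Q_ext[X, :] = Q                                            # extend columns of Q by zero
Z = G @ Q_ext                                              # second round of products
G_XY = Q @ Z[Y, :].T                                       # = Q Q^T G[X, Y] by symmetry of G
block = G[np.ix_(X, Y)]
print(np.linalg.norm(block - G_XY) / np.linalg.norm(block))   # small relative error
```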

Ignoring the Green's function on non-admissible domains
When the Green's function is restricted to non-admissible domains, its singular values may not decay. Instead, to learn $G$ there, we take advantage of its off-diagonal decay property. It is known that, for almost every $x\ne y\in D$,
$$G(x,y) \le \frac{c_{\kappa_C}}{\lambda_{\min}\,\|x-y\|_2}, \qquad (18)$$
where $c_{\kappa_C}$ is an implicit constant that only depends on $\kappa_C$ (see [19, Thm. 1.1]). If $X\times Y$ is a non-admissible domain, then for any $(x,y)\in X\times Y$ we find that $\|x-y\|_2\le\operatorname{dist}(X,Y)+\operatorname{diam}X+\operatorname{diam}Y < (2+\rho)\max\{\operatorname{diam}X,\operatorname{diam}Y\}$, because $\operatorname{dist}(X,Y)<\rho\max\{\operatorname{diam}X,\operatorname{diam}Y\}$. This means that $x\in B_r(y)\cap D$, where $r = (2+\rho)\max\{\operatorname{diam}X,\operatorname{diam}Y\}$. Using Eq. (18) and noting that $\operatorname{diam}Y\le r/(2+\rho)$ and $\int_Y 1\,dy\le 4\pi(\operatorname{diam}Y/2)^3/3$, we have the following inequality for non-admissible domains $X\times Y$:
$$\|G\|_{L^2(X\times Y)}^2 \le C(\rho,\kappa_C,\lambda_{\min})\,r^4, \qquad (19)$$
where $r = (2+\rho)\max\{\operatorname{diam}X,\operatorname{diam}Y\}$ and $C(\rho,\kappa_C,\lambda_{\min})$ is a constant depending only on $\rho$, $\kappa_C$, and $\lambda_{\min}$. We conclude that the Green's function restricted to a non-admissible domain has a relatively small norm when the domain itself is small. Therefore, in our approximant $\tilde{G}$ of $G$, we ignore $G$ on non-admissible domains by setting $\tilde{G}$ to zero there.
Fig. 2 Two levels of hierarchical partitioning of $[0,1]^3$. The blue and green domains are admissible, while the blue and red domains are non-admissible.

Hierarchical admissible partition of domain
We now describe a hierarchical partitioning of $D\times D$ such that many subdomains are admissible and the non-admissible domains are all small. For ease of notation, we may assume, without loss of generality, that $\operatorname{diam}D = 1$ and $D\subset[0,1]^3$; otherwise, one should shift and scale $D$. Moreover, partitioning $[0,1]^3$ and restricting the partition to $D$ is easier than partitioning $D$ directly. For the definition of admissible domains, we find it convenient to select $\rho = 1/\sqrt{3}$. Let $I = [0,1]^3$. The hierarchical partitioning for $n$ levels is defined recursively as follows:
- $I_{1\times 1\times 1} := I_1\times I_1\times I_1 = [0,1]^3$ is the root for level $L = 0$.
- At a given level $0\le L\le n-1$, if $I_{j_1\times j_2\times j_3} := I_{j_1}\times I_{j_2}\times I_{j_3}$ is a node of the tree, then it has 8 children defined as $\{I_{2j_1+n_j(1)}\times I_{2j_2+n_j(2)}\times I_{2j_3+n_j(3)}\mid n_j\in\{0,1\}^3\}$.
Here, if I j " ra, bs, 0 ď a ă b ď 1, then The set of non-admissible domains can be given by this unwieldy expression where ^is the logical "and" operator.The set of admissible domains is given by ΛpP non-adm pL ´1qqzP non-adm pLqq, where P non-adm pLq is the set of non-admissible domain for a hierarchical level of L and ΛpP non-adm pL ´1qq " ď Using Eq. ( 20)-Eq.( 21), the number of admissible and non-admissible domains are precisely |P non-adm | " p3 ˆ2n ´2q 3 and |P adm | " ř n "1 2 6 p3 ˆ2L´1 ´2q 3 ´p3 ˆ2L ´2q 3 .In particular, the size of the 1D 3D Fig. 3 For illustration purposes, we include the hierarchical structure of the Green's functions in 1D after 4 levels (left) and in 3D after 2 levels (right).The hierarchical structure in 3D is complicated as this is physically a 6dimensional tensor that has been rearranged so it can be visualized.
partition at the hierarchical level 0 ď L ď n is equal to 8 L and the tree has a total of p8 n`1 ´1q{7 nodes (see Fig. 3).Finally, the hierarchical partition of D ˆD can be defined via the partition P " P adm Y P non-adm of r0, 1s 3 by doing the following: The sets of admissible and non-admissible domains of D ˆD are denoted by P adm and P non-adm in the next sections.
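As a complement to the description above, the following Python sketch, an illustration rather than the paper's exact construction, builds the level-$n$ boxes of the octree on $[0,1]^3$ and classifies pairs of boxes at that level as admissible or non-admissible with $\rho = 1/\sqrt{3}$.

```python
import numpy as np
from itertools import product

def level_boxes(level):
    """All boxes of the uniform octree partition of [0, 1]^3 at a given level."""
    boxes = [np.array([[0.0, 1.0]] * 3)]
    for _ in range(level):
        children = []
        for b in boxes:
            mid = b.mean(axis=1)
            for corner in product(range(2), repeat=3):
                children.append(np.array([[b[i, 0], mid[i]] if corner[i] == 0
                                          else [mid[i], b[i, 1]] for i in range(3)]))
        boxes = children
    return boxes

def diam(b):
    return np.linalg.norm(b[:, 1] - b[:, 0])

def dist(b1, b2):
    gaps = np.maximum(0.0, np.maximum(b1[:, 0] - b2[:, 1], b2[:, 0] - b1[:, 1]))
    return np.linalg.norm(gaps)

rho = 1 / np.sqrt(3)
boxes = level_boxes(2)                                           # 64 boxes at level 2
adm = sum(dist(b1, b2) >= rho * max(diam(b1), diam(b2)) - 1e-12  # tolerance for ties
          for b1 in boxes for b2 in boxes)
print(len(boxes), adm, len(boxes) ** 2 - adm)    # boxes, admissible and non-admissible pairs
```

The non-admissible count at level $n$ can be compared with the formula $(3\cdot 2^n - 2)^3$ from the text, which equals $1000$ for $n = 2$.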

Recovering the Green's function on the entire domain
We now show that we can recover $G$ on the entire domain $D\times D$.

Global approximation on the non-admissible set
Let $n$ be the number of levels in the hierarchical partition of $D\times D$ (see Section 4.3). We want to make sure that the norm of the Green's function on all non-admissible domains is small, so that we can safely ignore that part of $G$ (see Section 4.2). As one increases the number of hierarchical partitioning levels, the volume of the non-admissible domains gets smaller (see Fig. 4). Let $X\times Y\in P_{\text{non-adm}}$ be a non-admissible domain; the two domains $X$ and $Y$ have diameter bounded by $\sqrt{3}/2^n$ because they are contained in cubes of side length $1/2^n$ (see Section 4.3). Combining this with Eq. (19) yields $\|G\|_{L^2(X\times Y)}^2 = \mathcal{O}(2^{-4n})$ for each non-admissible domain.
Fig. 4 For illustration purposes, we include the hierarchical structure of the Green's function in 1D. The green blocks are admissible domains at that level, the gray blocks are admissible at a higher level, and the red blocks are the non-admissible domains at that level. The area of the non-admissible domains decreases at deeper levels.
Therefore, the $L^2$-norm of $G$ on the non-admissible set $P_{\text{non-adm}}$ satisfies
$$\|G\|_{L^2(P_{\text{non-adm}})}^2 = \mathcal{O}\big(2^{3n}\cdot 2^{-4n}\big) = \mathcal{O}\big(2^{-n}\big),$$
where we used $|P_{\text{non-adm}}| = (3\cdot 2^n - 2)^3\le 27\cdot 2^{3n}$. This means that if we select the number of levels to be $n = n_\epsilon$, with $n_\epsilon$ growing proportionally to $\log_2(1/\epsilon)$, then we guarantee that $\|G\|_{L^2(P_{\text{non-adm}})}\le\epsilon\,\|G\|_{L^2(D\times D)}$. We can safely ignore $G$ on non-admissible domains, by taking the zero approximant there, while approximating $G$ to within $\epsilon$.

Learning rate of the Green's function
Following Section 4.1.2, we can construct an approximant $\tilde{G}_{X\times Y}$ to the Green's function on an admissible domain $X\times Y$ of the hierarchical partitioning using the HS randomized SVD algorithm, which requires $N_{\epsilon,X\times Y} = \mathcal{O}(\log^4(1/\epsilon))$ input–output training pairs (see Section 4.1.2). Therefore, the number of training input–output pairs needed to construct an approximant to $G$ on all admissible domains is given by
$$N_\epsilon = \sum_{X\times Y\in P_{\text{adm}}} N_{\epsilon,X\times Y},$$
where $|P_{\text{adm}}|$ denotes the total number of admissible domains at the hierarchical level $n_\epsilon$, which is given by Eq. (22). Then, we have $2^{3n_\epsilon} = \mathcal{O}(1/\epsilon^6)$ (see Section 4.3) and, using Eq. (22), we obtain $|P_{\text{adm}}| = \mathcal{O}(1/\epsilon^6)$. This means that the total number of input–output training pairs required to learn $G$ with high probability is bounded by $N_\epsilon = \mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$.

Conclusions and discussion
This paper rigorously learns the Green's function associated with a PDE rather than the partial differential operator (PDO). By extending the randomized SVD to HS operators, we can identify a learning rate associated with elliptic PDOs in three dimensions and bound the number of input–output training pairs required to recover a Green's function approximately. One practical outcome of this work is a measure for the quality of covariance kernels, which may be used to design efficient kernels for PDE learning tasks.
There are several possible future extensions of these results related to the recovery of hierarchical matrices, the study of other partial differential operators, and practical deep learning applications, which we discuss further in this section.

Fast and stable reconstruction of hierarchical matrices
We described an algorithm for reconstructing the Green's function on the admissible domains of a hierarchical partition of $D\times D$ that requires performing the HS randomized SVD $\mathcal{O}(\epsilon^{-6})$ times. We would like to reduce this cost to $\mathcal{O}(\mathrm{polylog}(1/\epsilon))$ applications.
For $n\times n$ hierarchical matrices, there are several existing algorithms for recovering the matrix based on matrix–vector products [6,32,36,37]. There are two main approaches: (1) the "bottom-up" approach, in which one begins at the lowest level of the hierarchy and moves up, and (2) the "top-down" approach, in which one updates the approximant by peeling off the off-diagonal blocks and going down the hierarchy. The bottom-up approach requires $\mathcal{O}(n)$ applications of the randomized SVD algorithm [36]. There are lower-complexity alternatives that only require $\mathcal{O}(\log n)$ matrix–vector products with random vectors [32]. However, the algorithm in [32] is not yet proven to be theoretically stable, as errors from low-rank approximations could potentially accumulate exponentially, though this is not observed in practice. For symmetric positive semidefinite matrices, it may be possible to employ a sparse Cholesky factorization [54,55]. This leads us to formulate the following algorithmic challenge: can one design a provably stable algorithm that recovers an $n\times n$ hierarchical matrix using $\mathcal{O}(\log n)$ matrix–vector products with high probability?
If one can design such an algorithm and it can be extended to HS operators, then the $\mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$ term in Theorem 3 may improve to $\mathcal{O}(\mathrm{polylog}(1/\epsilon))$. This would mean that the learning rate of partial differential operators of the form of Eq. (2) is polynomial in $\log(1/\epsilon)$ and grows sublinearly with respect to $1/\epsilon$.

Extension to other partial differential operators
Our learning rate for elliptic PDOs in three variables (see Section 4) depends on the decay of the singular values of the Green's function on admissible domains [3]. We expect that one can also find the learning rate for other PDOs.
It is known that the Green's functions associated with elliptic PDOs in two dimensions exist and satisfy a pointwise estimate of logarithmic type [12] (Eq. (27)), involving $d_x = \operatorname{dist}(x,\partial D)$, a constant $\gamma$ depending on the size of the domain $D$, and an implicit constant $C$. One can conclude that $G(x,\cdot)$ is locally integrable for all $x\in D$, with $\|G(x,\cdot)\|_{L^p(B_r(x)\cap D)}<\infty$ for $r>0$ and $1\le p<\infty$. We believe that the pointwise estimate in Eq. (27) implies the off-diagonal low-rank structure of $G$ in this setting as well, as suggested in [3]. Therefore, we expect that the results in this paper can be extended to elliptic PDOs in two variables. PDOs in four or more variables are far more challenging, since we rely on the following bound on the Green's function on non-admissible domains [19]:
$$G(x,y) \le \frac{c(d,\kappa_C)}{\lambda_{\min}}\,\|x-y\|_2^{2-d}, \quad x\ne y\in D,$$
where $D\subset\mathbb{R}^d$, $d\ge 3$ is the dimension, and $c$ is a constant depending only on $d$ and $\kappa_C$. This inequality implies that the $L^p$-norm of $G$ on non-admissible domains is finite when $0\le p<d/(d-2)$. However, for dimension $d\ge 4$, we have $p<2$ and one cannot ensure that the $L^2$-norm of $G$ is finite. Therefore, the Green's function may not be compatible with the HS randomized SVD. It should also be possible to characterize the learning rate for elliptic PDOs with lower-order terms (under reasonable conditions) [13,24,28] and many parabolic operators [29], as the associated Green's functions have similar regularity and pointwise estimates. The main task is to extend [3, Thm. 2.8] to construct separable approximations of the Green's functions on admissible domains. In contrast, we believe that deriving a theoretical learning rate for hyperbolic PDOs remains a significant research challenge for many reasons. The first roadblock is that the Green's functions associated with hyperbolic PDOs do not necessarily lie in $L^2(D\times D)$. For example, the Green's function associated with the wave equation in three variables, i.e., $\mathcal{L} = \partial_t^2 - \nabla^2$, is not square-integrable, as
$$G(x,t,y,s) = \frac{\delta\big(t - s - \|x-y\|_2\big)}{4\pi\,\|x-y\|_2}, \quad (x,t),(y,s)\in\mathbb{R}^3\times[0,\infty),$$
where $\delta(\cdot)$ is the Dirac delta function.

Connection with neural networks
There are many possible connections between this work and neural networks (NNs) from practical and theoretical viewpoints. The proof of Theorem 3 relies on the construction of a hierarchical partition of the domain $D\times D$ and on the HS randomized SVD algorithm applied on each admissible domain. This gives an algorithm for approximating Green's functions with high probability. However, there are more practical approaches that currently do not have theoretical guarantees [17,18]. A promising opportunity is to design a NN that can learn and approximate Green's functions using input–output training pairs $\{(f_j,u_j)\}_{j=1}^N$ [7]. Once a neural network $N$ has been trained such that $\|N - G\|_{L^2}\le\epsilon\,\|G\|_{L^2}$, the solution to $\mathcal{L}u = f$ can be obtained by computing the integral
$$u(x) = \int_D N(x,y)f(y)\,dy.$$
Therefore, this may give an efficient computational approach for discovering operators, since a NN only needs to be trained once. Incorporating a priori knowledge of the Green's function into the network architecture design could be particularly beneficial. One could also wrap the selection of the kernel in the GP for generating random functions and training data into a Bayesian framework.
Finally, we wonder how many parameters in a NN are needed to approximate a Green's function associated with elliptic PDOs to within a tolerance of $0<\epsilon<1$. Can one exploit the off-diagonal low-rank structure of Green's functions to reduce the number of parameters? We expect that the recent work on the characterization of the approximation power of ReLU NNs will be useful here [20,44,62]. The use of NNs with high approximation power, such as rational NNs, might also be of interest for approximating the singularities of the Green's function near the diagonal [8].