1 Introduction

Physics-inspired machine learning is an actively studied area with two main approaches to learning. One approach focuses on determining solutions of partial differential equations (PDEs) for a fixed PDE and fixed boundary conditions; it includes the deep Ritz method (Weinan & Yu, 2018), physics-informed neural networks (Raissi et al., 2019), and least-squares ReLU neural networks (Cai et al., 2021). The other approach learns the operators between function spaces and includes DeepONets (Lu et al., 2021), the multiwavelet-based operator (Gupta et al., 2021), the graph neural operator (Li et al., 2020), and the Fourier neural operator (FNO) (Li et al., 2021). This study focuses on the FNO, which uses the Fourier transform to handle the convolution of two functions quickly and practically. One advantage of the FNO is its computational efficiency; moreover, unlike DeepONet, its representation is not limited to a finite-dimensional space spanned by a few basis functions. Previous studies (Li et al., 2021; Pathak et al., 2022) have confirmed that the FNO successfully approximates numerical solvers and real-world data, verifying its computational efficiency and potential applicability.

Unlike typical real-world machine-learning problems, approximating the solution operator of a PDE is deterministic and concrete. The universal approximation property of the FNO and its approximation error for certain PDE problems have been established (Kovachki et al., 2021); however, there are no results estimating the generalization error of the FNO. Moreover, although approximating the solution operator of a PDE is a deterministic problem, only a finite number of samples can be provided to the FNO, so accurate inference on unseen data is another problem to consider. Several approaches have been proposed for bounding the generalization error of deep neural networks, such as the group norm of weights (Neyshabur et al., 2015), the spectral norm (Bartlett et al., 2021), the path norm (Neyshabur et al., 2015), the Fisher-Rao norm (Liang et al., 2019), and relative flatness (Petzka et al., 2021). In this study, we investigate generalization error bounds within the probably approximately correct (PAC) learning framework. In particular, we bound the Rademacher complexity of the FNO.

1.1 Overview of FNOs

Fig. 1 a Sketch of the overall architecture of the Fourier neural operator (FNO); b detailed diagram of the Fourier layers

Figure 1 illustrates the FNO architecture. The network input is an \(\mathbb {R}^{d_{a}}\)-valued function on the domain \(\tilde{D} \subset \mathbb {R}^{d}\). We denote the input function space of the FNO by \(\mathcal {A}(\tilde{D};\mathbb {R}^{d_{a}})\). The vector value of the input function is lifted to a \(d_{v}\)-dimensional vector by a layer denoted \(\mathcal {N}_{P}\); while passing iteratively through the Fourier layers (denoted \(\mathcal {A}_{i}\) in the diagram), it is processed as an \(\mathbb {R}^{d_{v}}\)-valued function. Each Fourier layer applies an activation function to the sum of a neural network applied to the input function and a convolution of the input function with a kernel parameterized by the weight \(R_{i}\). After the Fourier layers, the vector value of the \(\mathbb {R}^{d_{v}}\)-valued function \(v_{D}\) is projected onto a \(d_{u}\)-dimensional vector by \(\mathcal {N}_{Q}\). We denote the output function space of the FNO by \(\mathcal {U}(\tilde{D};\mathbb {R}^{d_{u}})\). The neural network \(A_{i}\) in the Fourier layers can be chosen arbitrarily; in our results, we choose \(A_{i}\) to be a fully connected network (FCN) or a convolutional neural network (CNN). Because computers cannot handle infinite-dimensional data, we construct the FNO model with finitely many parameters based on the above concept, reflecting real-world implementations.

1.2 Probably approximately correct learning

PAC learning is a framework of statistical learning theory proposed by Valiant (1984). One of its main concepts is the no-free-lunch (NFL) theorem, which states that low approximation and estimation errors cannot be achieved simultaneously. The trade-off between the two errors is closely related to the complexity of the hypothesis class. Various quantities related to this complexity, such as the VC dimension, Rademacher complexity, and Gaussian complexity, determine learnability and the decay of the estimation error. These complexities are related, but there are several differences; for example, the VC dimension is independent of the training set, whereas the others are not. Neural networks and deep learning fit into PAC learning theory as a subcategory of machine learning. Recently, various studies have investigated bounds on the Rademacher complexity and the VC dimension of hypothesis classes of neural networks. For instance, bounds on the Rademacher complexity have been obtained for FCNs (Neyshabur et al., 2015), RNNs (Minshuo et al., 2020), and GCNs (Lv, 2021), and the VC dimension of neural networks has been analyzed (Sontag, 1998). In addition, a bound on the Rademacher complexity of DeepONet, a related neural operator, is available (Gopalani et al., 2022). Weinan et al. (2020) derived both a priori and a posteriori estimates of the generalization error of ResNet.

1.3 Our contributions

In this study, we define capacities of FNO models based on certain group norms. We bound the Rademacher complexity of the hypothesis class in terms of these capacities for two types of FNOs (Fourier layers with an FCN or a CNN) and derive a posterior bound on the generalization error of FNO models. In Sect. 4, we experiment with data generated from a Burgers equation problem and verify the correlation between our bounds and the empirical generalization errors. Through these experiments, we gain insight into how capacities with different p and q reflect the model architecture and weights. We also qualitatively confirm that the empirical generalization errors depend on the number of modes used in the FNO model. Furthermore, in experiments with varying dataset sizes, we confirm a strong correlation between our capacity factored by the dataset size and the empirical generalization error. We replicate the experiments on other PDE problems. Finally, we compare our capacity with the Hessian trace, the Fisher-Rao norm, and relative flatness, showing that our capacity is efficient in time and memory while remaining effective.

2 Preliminary

Notation Several indices appear in the discussion. To simplify the formulas, we denote \(x_{1}\dots x_{d}\) by \({\textbf {x}}\) and \(k_{1}\dots k_{d}\) by \({\textbf {k}}\). In addition, for a multi-index tensor inside a norm, the indices replaced by \(\cdot\) are the ones over which the norm is computed; for example,

$$\begin{aligned} \Vert A_{xy\cdot }\Vert _{p} = \root p \of {\sum _{i}{\Big (A_{xyi}\Big )^{p}}}. \end{aligned}$$

Discretization of data Because the function space is infinite-dimensional, to treat the data and the operator numerically, we discretize the function domain and regard each function as a finite-dimensional vector. Let \(\tilde{D}_{N}=\{x_{1},...,x_{N}\}\) be a discretization of the domain \(\tilde{D}\subset \mathbb {R}^{d}\). Then, an \(\mathbb {R}^{m}\)-valued function f is discretized into \((f(x_{1}),...,f(x_{N}))\in \mathbb {R}^{N\times m}\). Accordingly, we discretize \(\mathcal {A}(\tilde{D};\mathbb {R}^{d_{a}})\) and \(\mathcal {U}(\tilde{D};\mathbb {R}^{d_{u}})\) as \(\mathbb {R}^{N\times d_{a}}\) and \(\mathbb {R}^{N\times d_{u}}\), respectively. A sample is then an element \(((a_{jk}),(u_{jk}))\in \mathbb {R}^{N\times d_{a}}\times \mathbb {R}^{N\times d_{u}}\).

Fourier transform From Fourier analysis, we know that the Fourier transform maps convolution to pointwise multiplication. For functions on a domain \(\tilde{D} \subset \mathbb {R}^{d}\), let \(\mathcal {F}\) and \(\mathcal {F}^{-1}\) be the Fourier and inverse Fourier transforms over \(\tilde{D}\), respectively. Thus, we obtain the following relationship:

$$\begin{aligned} f*k=\mathcal {F}^{-1}(\mathcal {F}(k)\cdot \mathcal {F}(f)). \end{aligned}$$

For our analysis, we select \(\tilde{D}\) as \([0,2\pi ]^{d}\). Because we treat functions as discretized vectors, we can treat the Fourier transform as a discrete Fourier transform, and if the discretization of \(\tilde{D}\) is uniform, it can be computed with a fast Fourier transform (FFT). Suppose \(\tilde{D}\) is discretized uniformly with resolution \(N_{1}\times \dots \times N_{d} = N\); then, for a discretized function \(f \in \mathbb {R}^{N}\), its FFT \(\mathcal {F}(f)(k)\) and inverse FFT (IFFT) \(\mathcal {F}^{-1}(f)(k)\) are defined as follows:

$$\begin{aligned}{} & {} \mathcal {F}(f)(k)=\frac{1}{\sqrt{N_{1}\dots N_{d}}}\sum _{x_{1}}^{N_{1}}\dots \sum _{x_{d}} ^{N_{d}}f(x_{1},...,x_{d})e^{-2i\pi \sum _{j=1}^{d}{\frac{x_{j}k_{j}}{N_{j}}}} \\{} & {} \quad \mathcal {F}^{-1}(f)(k)=\frac{1}{\sqrt{N_{1}\dots N_{d}}}\sum _{x_{1}}^{N_{1}}\dots \sum _{x_{d}}^{N_{d}}f(x_{1},...,x_{d})e^ {2i\pi \sum _{j=1}^{d}{\frac{x_{j}k_{j}}{N_{j}}}}. \end{aligned}$$

In our analysis, we denote the components of FFT and IFFT tensors as follows: \(F_{{\textbf {k}}{} {\textbf {x}}}=\frac{1}{\sqrt{N_{1}\dots N_{d}}}e^{-2i\pi \sum _{j=1}^{d}{\frac{x_{j}k_{j}}{N_{j}}}}\), \(F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}=\frac{1}{\sqrt{N_{1}\dots N_{d}}}e^{2i\pi \sum _{j=1}^{d}{\frac{x_{j}k_{j}}{N_{j}}}}\), respectively.
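The relationship above can be checked numerically; the following short example (ours, not from the original implementation) verifies with NumPy that the discrete Fourier transform turns circular convolution into pointwise multiplication, using NumPy's default (non-unitary) normalization.

```python
import numpy as np

N = 128
rng = np.random.default_rng(0)
f = rng.standard_normal(N)
k = rng.standard_normal(N)

# Direct circular convolution: (f * k)[n] = sum_m f[m] k[(n - m) mod N]
direct = np.array([sum(f[m] * k[(n - m) % N] for m in range(N)) for n in range(N)])

# Convolution via the FFT: f * k = IFFT(FFT(f) . FFT(k))
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(k)))

assert np.allclose(direct, via_fft)
```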

Definition 1

(General FNO) Let \(\tilde{D}_{N}\) be the discretized domain in \(\mathbb {R}^{d}\); then, \({\textbf {FNO}}: \mathbb {R}^{N\times d_{a}} \rightarrow \mathbb {R}^{N\times d_{u}}\) is defined as follows:

$$\begin{aligned} {\textbf {FNO}}=\mathcal {N}_{Q}\circ \mathcal {A}_{D}\circ \mathcal {A}_{D-1}\cdots \circ \mathcal {A}_{1}\circ \mathcal {N}_{P}, \end{aligned}$$

where \(\mathcal {N}_{P}\) and \(\mathcal {N}_{Q}\) denote the neural networks used for lifting and projection, respectively, and each \(\mathcal {A}_{i}\) is a Fourier layer. For simplicity, we assume that \(\mathcal {N}_{Q}\) and \(\mathcal {N}_{P}\) are linear maps. Each Fourier layer applies the activation function to the sum of a linear map and a convolution with a parameterized kernel. Only a subset of frequencies is used in the Fourier layers, indexed by the set \(K=\{(k_{1},...,k_{d})\in \mathbb {Z}^{d}: 0 \le k_{j} \le k_{max,j}, j=1,...,d\}\). The formula for the FNO is

$$\begin{aligned} v_{0}&:=\mathcal {N}_{P}(a)=\sum _{k}{P_{jk}a_{{\textbf {x}}k}} \\ v_{t+1}&:=\mathcal {A}_{t+1}(v_{t})=\sigma \bigg (A_{t+1}v_{t}+\mathcal {F}^{-1} \Big (R_{t+1}\cdot (\mathcal {F}(v_{t}))\Big )\bigg ). \\&=\sigma \Big (\sum _{{\textbf {z}},k}{{A_{t+1,{\textbf {x}}{} {\textbf {z}}jk}v_{t,{\textbf {z}}k}}} + \sum _{{\textbf {z}},{\textbf {k}}\in K,k} {F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{t+1,{\textbf {k}} ,jk}F_{{\textbf {k}}{} {\textbf {z}}}v_{t,{\textbf {z}}k}}\Big ) \quad (t=0,...,D-1) \\ u&:=\sum _{k}{v_{D,{\textbf {x}}k}Q_{kj}}. \end{aligned}$$
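As a concrete illustration of Definition 1, the following is a minimal PyTorch sketch for the case d = 1. The class names (SpectralConv1d, SimpleFNO), the random weight initialization, and the GELU activation are our assumptions for illustration, not the authors' implementation; the linear part \(A_{t}\) is applied pointwise across positions, which is a common implementation choice and a special case of the general linear map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralConv1d(nn.Module):
    """Convolution via the FFT: keep only the lowest k_max Fourier modes and
    multiply them by a learned complex weight R (assumes k_max <= N//2 + 1)."""
    def __init__(self, width, k_max):
        super().__init__()
        self.k_max = k_max
        scale = 1.0 / (width * width)
        self.R = nn.Parameter(scale * torch.randn(k_max, width, width, dtype=torch.cfloat))

    def forward(self, v):                              # v: (batch, N, width)
        v_hat = torch.fft.rfft(v, dim=1)               # (batch, N//2 + 1, width)
        out_hat = torch.zeros_like(v_hat)
        out_hat[:, :self.k_max] = torch.einsum(
            "bkc,kcd->bkd", v_hat[:, :self.k_max], self.R)
        return torch.fft.irfft(out_hat, n=v.size(1), dim=1)

class SimpleFNO(nn.Module):
    """N_Q o A_D o ... o A_1 o N_P with linear lifting and projection."""
    def __init__(self, d_a=1, d_u=1, width=64, depth=2, k_max=16):
        super().__init__()
        self.P = nn.Linear(d_a, width)                 # lifting N_P
        self.A = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
        self.R = nn.ModuleList([SpectralConv1d(width, k_max) for _ in range(depth)])
        self.Q = nn.Linear(width, d_u)                 # projection N_Q

    def forward(self, a):                              # a: (batch, N, d_a)
        v = self.P(a)
        for lin, spec in zip(self.A, self.R):
            v = F.gelu(lin(v) + spec(v))               # Fourier layer A_t
        return self.Q(v)
```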

CNN layer For each Fourier layer, a general linear map can be replaced by a CNN layer. A schematic of the convolution with 2D data and a kernel is shown in Fig. 2.

Fig. 2 Schematic of a 2D CNN layer

A kernel of fixed size sweeps over the input tensor so that each output entry is the inner product of the kernel with the local components of the input tensor centered at that index. For example, for a rank-d input tensor of size \(N_{1}\times \cdots \times N_{d}\), we consider a rank-d kernel tensor K of size \(c_{1}\times \cdots \times c_{d}\), with each \(c_{i}\) less than \(N_{i}\). Let us denote this CNN layer with kernel K by \(C(c_{1},\dots ,c_{d})\); then, the tensor that passes through the CNN layer is defined as follows:

$$\begin{aligned} C(c_{1},\dots ,c_{d})(x_{x_{1}\cdots x_{d}})_{z_{1}\cdots z_{d}} = \sum _{j_{1}=0}^{c_{1}-1}\cdots \sum _{j_{d}=0}^{c_{d}-1} K_{j_{1},\dots ,j_{d}}x_{z_{1}+j_{1},\dots ,z_{d}+j_{d}}. \end{aligned}$$

The CNN layers are restricted to kernels of odd size to maintain the positional dimension of the tensor. Padding is applied to the input tensor of the CNN layer to match the dimensions. For example, for an \(N_{1}\times \cdots \times N_{d}\)-dimensional tensor \(x_{x_{1}\cdots x_{d}}\) and CNN layer \(C(c_{1},\dots ,c_{d})\), we pad \(\frac{c_{i}-1}{2}\) zeros on each side of the input tensor along the i-th axis. We denote this padded tensor by \(\tilde{x}\). Then, \(C(c_{1},\dots ,c_{d})(\tilde{x}_{x_{1}\cdots x_{d}})\) has the same dimensions as the input tensor. As the number of channels in the Fourier layers is fixed, we use the same notation \(C(c_{1},\dots ,c_{d})\) for a CNN layer with multiple channels. The formula for the multi-channel CNN layer is as follows:

$$\begin{aligned} C(c_{1},\dots ,c_{d})(x_{x_{1}\cdots x_{d}})_{z_{1}\cdots z_{d}j} = \sum _{k=1}^{d_{u}}\sum _{j_{1}=0}^{c_{1}-1}\cdots \sum _{j_{d}=0}^{c_{d}-1} K_{j_{1},\dots ,j_{d},k,j}x_{z_{1}+j_{1},\dots ,z_{d}+j_{d},k}. \end{aligned}$$
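The following brief PyTorch check (ours) illustrates the padding convention described above: with an odd kernel size c and zero padding of (c−1)/2 on each side, the positional dimension of the tensor is preserved.

```python
import torch
import torch.nn as nn

c, width, N = 5, 64, 256
conv = nn.Conv1d(in_channels=width, out_channels=width,
                 kernel_size=c, padding=(c - 1) // 2, bias=False)
v = torch.randn(8, width, N)             # (batch, channels, positions)
assert conv(v).shape == v.shape          # positional dimension N is preserved
```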

Definition 2

(FNO with CNN layer) Consider the setting of the FNO above; the only difference is that in each Fourier layer the linear map is replaced by a CNN layer, so the layer is the activation of the sum of a CNN layer and a convolution with a parameterized kernel.

$$\begin{aligned} v_{t+1}&:=\mathcal {A}_{t+1}(v_{t})=\sigma \bigg (C_{t+1}(c_{1},\dots ,c_{d})(\tilde{v_{t}}) +\mathcal {F}^{-1}\Big (R_{t+1}\cdot (\mathcal {F}(v_{t}))\Big )\bigg ) \\&=\sigma \Big (\sum _{k=1}^{d_{u}}\sum _{j_{1}=0}^{c_{1}-1}\cdots \sum _{j_{d}=0}^{c_{d}-1} K_{t+1,jk,j_{1},\dots ,j_{d}}\tilde{v_{t}}_{x_{1}+j_{1},\dots ,x_{d}+j_{d},k} \\&\quad +\sum _{{\textbf {z}},{\textbf {k}}\in K,k}{{F_{{\textbf {x}}{} {\textbf {k}}}}^{\dag }R_{t+1,{\textbf {k}},jk}F_{{\textbf {k}}{} {\textbf {z}}}v_ {t,{\textbf {z}}k}}\Big ). \end{aligned}$$

An ideal operator should infer a solution from any function in the input function space. However, for practicality and ease of implementation, finitely many training samples are drawn from a distribution on the discretized function space. Suppose \(\mathcal {D}\) is a distribution on \(\mathbb {R}^{N\times d_{a}}\times \mathbb {R}^{N\times d_{u}}\). Then, we define the loss function as follows:

Definition 3

(Loss for FNO) Suppose that the training dataset is given by

$$\begin{aligned} S:=\{((a_{i,jk}),(u_{i,jk}))\in \mathbb {R}^{N\times d_{a}}\times \mathbb {R}^{N\times d_{u}}:i=1,...,m\}, \end{aligned}$$

where each sample is chosen independently from the distribution \(\mathcal {D}\). The training loss is defined as follows:

$$\begin{aligned} \mathcal {L}_{S}:=\frac{1}{m}\sum _{i=1}^{m}\|u_{i,\cdot\cdot}-{\textbf {FNO}}(a_{i,\cdot\cdot})\|^{2}. \end{aligned}$$

Let p be the probability measure of \(\mathcal {D}\) on \(\mathbb {R}^{N\times d_{a}}\times \mathbb {R}^{N\times d_{u}}\). Then, the loss over the entire distribution \(\mathcal {D}\) is defined as follows:

$$\begin{aligned} \mathcal {L}_{\mathcal {D}}:=\int _{\mathbb {R}^{N\times d_{a}}\times \mathbb {R}^{N\times d_{u}}}{\|u_{\cdot\cdot}-{\textbf {FNO}}(a_{\cdot\cdot})\|^{2}}dp(a,u). \end{aligned}$$
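A minimal sketch (ours) of the empirical loss \(\mathcal {L}_{S}\) of Definition 3, assuming a model such as the SimpleFNO sketch above and batched tensors a of shape (m, N, d_a) and u of shape (m, N, d_u).

```python
import torch

def empirical_loss(model, a, u):
    """L_S = (1/m) sum_i || u_i - FNO(a_i) ||^2 (plain squared norm)."""
    residual = u - model(a)                        # (m, N, d_u)
    return (residual ** 2).sum(dim=(1, 2)).mean()  # mean over the m samples
```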

3 Generalization bound for FNOs

In this section, we derive an upper bound on the Rademacher complexity of the FNO and use it to estimate the generalization bound. The proof of the main theorems rests on two main lemmas: an inequality for the Rademacher complexity and a bound on the sup-norm of FNO outputs. Using these lemmas, we prove the main results.

3.1 Mathematical setup

Definition 4

(Rademacher complexity) Let \(\mathcal {F}\) be a class of mappings from \(\mathcal {X}\) to \(\mathbb {R}\). Suppose \(\{x_{i}\in \mathcal {X}:i=1,...,m\}\) is a given sample set and the \(\epsilon _{i}\) are independent, uniform, \(\{+1,-1\}\)-valued random variables. The empirical Rademacher complexity of \(\mathcal {F}\) for the given sample set is defined as follows:

$$\begin{aligned} \mathcal {R}_{m}(\mathcal {F})=\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{f\in \mathcal {F}}\sum _{i=1}^{m}\epsilon _{i}f(x_{i})\bigg ]. \end{aligned}$$
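The supremum in Definition 4 cannot be evaluated directly over an infinite hypothesis class such as that of FNOs; the following sketch (ours) only illustrates the definition by a Monte Carlo estimate over a small finite set of hypotheses.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """preds: array (n_hypotheses, m) holding f(x_i) for each hypothesis f."""
    rng = np.random.default_rng(seed)
    m = preds.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))     # Rademacher signs
    # sup over the finite class, then average over the sign draws
    return np.mean(np.max(preds @ eps.T, axis=0)) / m
```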

The main components of our results are as follows:

Definition 5

(Weight norms and capacity) For a multi-rank tensor \(M_{i_{1},...,i_{m},j_{1},...,j_{k}}\), we define the following weight norm:

$$\begin{aligned} \Vert M_{i_{1},...,i_{m},j_{1},...,j_{k}}\Vert _{p:\{i_{1},...,i_{m}\},q:\{j_{1},...,j_{k}\}} :=\root q \of {\sum _{j_{1}...j_{k}}{\bigg (\root p \of {\sum _{i_{1}...i_{m}}{\vert M_{i_{1},...,i_{m},j_{1},...,j_{k}}\vert }^{p}}\bigg )^{q}}}. \end{aligned}$$

For \(p=\infty\) or \(q=\infty\), we use the sup-norm instead of the definition above. Now, for an FNO whose Fourier layers have depth D, denote by Q and P the projection and lifting weight matrices, respectively, and by \(A_{i}\) and \(R_{i}\) the weight tensors of the Fourier layers. We write \(\Vert \cdot \Vert _{p,q}\) for the norm above, where p groups the position, frequency, and input indices, and q the output index. The following norm is defined for the Fourier layer:

$$\begin{aligned} \Vert (A_{i},R_{i})\Vert _{p,q} := \Vert A_{i}\Vert _{p,q} + \Vert R_{i}\Vert _{p,q}\frac{\root p* \of {k_{max,1}...k_{max,d}}}{N^{ \lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+} }}. \end{aligned}$$

The capacity of an FNO model h is defined as the product of the norms of its layer weights:

$$\begin{aligned} \gamma _{p,q}(h):=\Vert P\Vert _{p,q}\Vert Q\Vert _{p,q}\prod _{i=1}^{D}\Vert (A_{i},R_{i})\Vert _{p,q}. \end{aligned}$$

Next, for the kernel tensor K of a CNN layer, we define the corresponding weight norm and the capacity of the entire network. In the \(\Vert \cdot \Vert _{p,q}\) norm of the kernel tensor, p groups the kernel and input indices, and q the output index.

$$\begin{aligned}{} & {} \Vert (K_{i},R_{i})\Vert _{p,q}:=\Vert K_{i}\Vert _{p,q}\root p* \of {c_{1}\dots c_{d}} + \root p* \of {k_{max,1}...k_{max,d}}\Vert R_{i}\Vert _{p,q} \\{} & {} \quad {\gamma _{CNN}}_{p,q}(h_{CNN}):=\Vert P\Vert _{p,q}\Vert Q\Vert _{p,q}\prod _{i=1}^{D}\Vert (K_{i},R_{i}) \Vert _{p,q}. \end{aligned}$$
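The following sketch (ours) computes the group norm of Definition 5 and the capacity \(\gamma _{p,q}\) for the SimpleFNO sketch of Sect. 2; the assignment of tensor axes to the position/frequency/input group versus the output group reflects our assumed weight layout, not the authors' code.

```python
import numpy as np

def group_norm(M, in_axes, p, q):
    """||M||_{p: in_axes, q: remaining axes}; p or q = np.inf means a maximum."""
    M = np.abs(np.asarray(M))
    inner = M.max(axis=in_axes) if np.isinf(p) else (M ** p).sum(axis=in_axes) ** (1 / p)
    return inner.max() if np.isinf(q) else (inner ** q).sum() ** (1 / q)

def fno_capacity(model, N, p=2.0, q=2.0):
    """gamma_{p,q}(h) = ||P|| ||Q|| prod_i ( ||A_i|| + scale * ||R_i|| )."""
    inv_pstar = 1.0 - 1.0 / p                       # 1/p*, since 1/p + 1/p* = 1
    k_max = model.R[0].k_max
    scale = k_max ** inv_pstar / N ** max(inv_pstar - 1.0 / q, 0.0)
    gamma = group_norm(model.P.weight.detach().numpy(), in_axes=(1,), p=p, q=q)
    gamma *= group_norm(model.Q.weight.detach().numpy(), in_axes=(1,), p=p, q=q)
    for lin, spec in zip(model.A, model.R):
        nA = group_norm(lin.weight.detach().numpy(), in_axes=(1,), p=p, q=q)
        nR = group_norm(spec.R.detach().numpy(), in_axes=(0, 1), p=p, q=q)
        gamma *= nA + scale * nR
    return gamma
```

For example, fno_capacity(SimpleFNO(), N=1024, p=2.0, q=2.0) evaluates \(\gamma _{2,2}\) for a freshly initialized model under the layout assumptions above.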

Next, we define the hypothesis classes whose Rademacher complexity is bounded in our results. A hypothesis class is a collection of functions from which a learning algorithm selects a function.

Definition 6

(Hypothesis classes of FNO) Consider FNOs whose Fourier layers have depth D and maximal modes \(k_{max,1},...,k_{max,d}\). The width, the sizes of the input and output vectors, and the activation function are fixed. Then, the hypothesis class of general FNOs is defined as follows:

$$\begin{aligned}{} & {} \mathcal {H}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}:=\{{\textbf {FNO}}:\Vert P\Vert _{p,q}\le C_{P},\\{} & {} \quad \Vert (A_{i},R_{i})\Vert _{p,q} \le C_{i} (i=1,...,D),\Vert Q\Vert _{p,\infty } \le C_{Q} \}. \end{aligned}$$

Finally, we define the hypothesis class of the FNO with CNN layers as follows:

$$\begin{aligned}{} & {} {\mathcal {H}_{CNN}}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}:=\{{\textbf {FNO}}:\Vert P\Vert _{p,q}\le C_{P},\\{} & {} \quad \Vert (K_{i},R_{i})\Vert _{p,q} \le C_{i} (i=1,...,D), \Vert Q\Vert _{p,\infty } \le C_{Q} \}. \end{aligned}$$

We also define the following auxiliary hypothesis class of sub-networks of FNO models whose terminal layer is the i-th Fourier layer (denoted as \({\textbf {FNO}}_{sub:i}\)).

$$\begin{aligned} \mathcal {H}_{C_{P},C_{1},...,C_{i}}^{d_{in}}:=\{{\textbf {FNO}}_{sub:i}:\Vert P\Vert _{p,q}\le C_{P}, \Vert (A_{t},R_{t})\Vert _{p,q} \le C_{t},(t=1,...,i) \}. \end{aligned}$$

Similarly, we define \({\mathcal {H}_{CNN}}_{C_{P},C_{1},...,C_{i}}^{d_{in}}\).

3.2 Main results

The notation in each lemma and theorem follows the definitions in Sect. 3.1. The activation function is Lipschitz continuous and passes through the origin (\(\sigma (0)=0\)). Moreover, we set our notation as follows: for a given sample \(S=\{a_{i}\}_{i=1,\dots ,m}\) (with input data \(a_{i}\)) and hypothesis class \(\mathcal {H}_{C_{P},C_{1},...,C_{t}}^{d_{in}}\), we denote \(h(a_{i})\) by \(v_{t,i}\), where \(h\in \mathcal {H}_{C_{P},C_{1},...,C_{t}}^{d_{in}}\). Its components are denoted by \(v_{t,i,{\textbf {x}}j}\).

The following lemma regarding \(l_{p}\) norms is frequently used in our proofs.

Lemma 1

(Norm inequality) If \(1\le p\le q \le \infty\), then for \(v \in \mathbb {R}^N\) we obtain the following inequality:

$$\begin{aligned} \Vert v\Vert _{q}\le \Vert v\Vert _{p} \le \Vert v\Vert _{q}N^{\frac{1}{p}-\frac{1}{q}}. \end{aligned}$$

Let \(\lfloor \cdot \rfloor _{+}\) denote the ReLU function, \(\lfloor x \rfloor _{+}=\max \{x,0\}\). Then, for arbitrary \(1\le p,q\), the inequality can be written as

$$\begin{aligned} \Vert v\Vert _{p} \le \Vert v\Vert _{q}N^{\lfloor \frac{1}{p}-\frac{1}{q}\rfloor _{+}}. \end{aligned}$$
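A quick numerical illustration (ours, not a proof) of the first inequality for \(1\le p\le q\):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(1000)
p, q, N = 1.5, 3.0, v.size
lp = np.sum(np.abs(v) ** p) ** (1 / p)
lq = np.sum(np.abs(v) ** q) ** (1 / q)
assert lq <= lp <= lq * N ** (1 / p - 1 / q)
```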

The following lemma handles the nonlinearity in our proof (for its proof, see Maurer (2016)).

Lemma 2

(Vector-contraction inequality for Rademacher complexity) Assume that \(\sigma\) is a Lipschitz continuous function with Lipschitz constant L and \(\mathcal {F}\) is a hypothesis class of \(\mathbb {R}^{N}\)-valued functions. Then, we obtain the following inequality:

$$\begin{aligned} \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{f\in \mathcal {F}}\sum _{i=1}^{m}\epsilon _{i}\sigma (f(x_{i}))\bigg ]\le \sqrt{2}L\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{f\in \mathcal {F}}\sum _{i,k}\epsilon _{ik}f_{k}(x_{i})\bigg ]. \end{aligned}$$

Here, we present the main results. The proof has two parts. First, we obtain an upper bound on the \(p*\)-norm of the output of FNO models. Second, we bound the Rademacher complexity of the FNO model on the samples using this upper bound. We assume that the projection and lifting layers are linear maps; however, this can easily be generalized to a general FCN.

Lemmas 3 and 3’ are the key ingredients of our results; in their proofs, the Fourier layers are peeled off inductively.

Lemma 3

Suppose \(\mathcal {H} = \mathcal {H}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}\) is the FNO hypothesis class with constants \(C_{P},C_{1},\dots ,C_{D},C_{Q}\). Then, for the sample \(a \in \mathbb {R}^{N\times d_{a}}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \sup _{h \in \mathcal {H}}\Vert h(a)_{\cdot \cdot }\Vert _{p*,\infty } \\{} & {} \quad \le L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}}C_{Q}C_{D}\dots C_{1}C_{P}\Vert a\Vert _{p*} \end{aligned}$$

Proof

$$\begin{aligned}&h(a)_{{\textbf {x}}j} \nonumber \\&=\sum _{k}v_{D,{\textbf {x}}k}Q_{kj} \nonumber \\&\le \Vert v_{D,{\textbf {x}}\cdot }\Vert _{p*}\Vert Q_{\cdot j}\Vert _{p} \end{aligned}$$
(1)

Then, we have the following:

$$\begin{aligned}&\Vert h(a)_{\cdot \cdot }\Vert _{p*,\infty } \\&\le \sup _{j} \root p* \of {\sum _{{\textbf {x}}}\Vert v_{D,{\textbf {x}} \cdot }\Vert ^{p*}_{p*}\Vert Q_{\cdot j}\Vert ^{p*}_{p}} \\&\le \Vert v_{D,\cdot \cdot }\Vert _{p*}C_{Q} \end{aligned}$$

Subsequently, we peel off the Fourier layers.

$$\begin{aligned}&\sigma \Big (A_{D}(a)+\mathcal {F}^{-1}(R_{D}\cdot (\mathcal {F}(a)))\Big )_{{\textbf {x}}j}\nonumber \\&=\sigma \Big (\sum _{{\textbf {z}},k}{A_{D,{\textbf {x}}{} {\textbf {z}}kj}v_{D-1,{\textbf {z}}k}}+ \sum _{{\textbf {k}},{\textbf {z}},k}{F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},jk}F_{{\textbf {k}} {\textbf {z}}}v_{D-1{\textbf {z}}k}}\Big )\nonumber \\&\le L\Big \vert \sum _{{\textbf {z}},k}{A_{D,{\textbf {x}}{} {\textbf {z}}kj}v_{D-1,{\textbf {z}}k}} + \sum _{{\textbf {k}},{\textbf {z}},k}{{F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},jk}F_ {{\textbf {k}}{} {\textbf {z}}}v_{D-1,{\textbf {z}}k}}}\Big \vert \nonumber \\&\le L\bigg (\Vert A_{D,{\textbf {x}}\cdot \cdot j}\Vert _{p}+ \Big \Vert \sum F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},j\cdot }F_{{\textbf {k}}\cdot }\Big \Vert _{p}\bigg )\Big \Vert v_{D-1,\cdot \cdot }\Big \Vert _{p*}, \end{aligned}$$
(2)

For \(\Big \Vert \sum F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},j\cdot }F_{{\textbf {k}}\cdot }\Big \Vert _{p}\) in (2),

$$\begin{aligned} \Big \Vert \sum _{{\textbf {k}}} F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},j\cdot }F_{{\textbf {k}}\cdot }\Big \Vert _{p} =\root p \of {\sum _{{\textbf {z}},k}{\Big (\sum _{{\textbf {k}}} F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},jk}F_{{\textbf {k}}{} {\textbf {z}}}\Big )^{p}}}. \end{aligned}$$

For fixed \({\textbf {x}}, {\textbf {z}}, k\), \(\Big (F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}F_{{\textbf {k}}{} {\textbf {z}}}\Big )_{{\textbf {k}}}\) is a \(k_{max,1}\times \dots \times k_{max,d}\)-dimensional vector, each component of which has the form \(\frac{e^{ib}}{N}\) for some real b. Thus, by applying Hölder’s inequality, we obtain the following inequality:

$$\begin{aligned}&\Big \Vert \sum _{{\textbf {k}}} F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},j\cdot }F_{{\textbf {k}}\cdot }\Big \Vert _{p} \\&\le \root p \of {\sum _{{\textbf {z}},k}{\Big ({\frac{\root p* \of {k_{max,1}...k_{max,d}}}{N}\Vert R_{D,\cdot ,jk}\Vert _{p}\Big )^{p}}}} \\&= \frac{\root p* \of {k_{max,1}...k_{max,d}}}{N}\root p \of {N\sum _{{\textbf {k}},k}{R_{D,{\textbf {k}},jk}^{p}}} \\&= \root p* \of {\frac{k_{max,1}...k_{max,d}}{N}}\Vert R_{D,\cdot ,j,\cdot }\Vert _{p}. \end{aligned}$$

Subsequently, the following bound is obtained:

$$\begin{aligned}&\sigma \Big (A_{D}(a)+\mathcal {F}^{-1}(R_{D}\cdot (\mathcal {F}(a)))\Big )_{{\textbf {x}}j}\\&\le L\Big (\Vert A_{D,{\textbf {x}}\cdot \cdot j}\Vert _{p}+\root p* \of {\frac{k_{max,1}...k_{max,d}}{N}} \Vert R_{D,\cdot ,j,\cdot }\Vert _{p}\Big ) \Big \Vert v_{D-1,\cdot \cdot }\Big \Vert _{p*}. \end{aligned}$$

We iteratively apply the above bound to obtain the following inequality:

$$\begin{aligned}&\sup _{h \in \mathcal {H}_{C_{P},C_{1},...,C_{D}}}\Vert v_{D,\cdot \cdot }\Vert _{p*} \nonumber \\&\le \sup _{h \in \mathcal {H}_{C_{P},C_{1},...,C_{D}}}L\Big (\Vert A_{D}\Vert _{p,p*} + \root p* \of {k_{max,1}...k_{max,d}} \Vert R_{D}\Vert _{p,p*}\Big ) \Big \Vert v_{D-1,\cdot \cdot }\Big \Vert _{p*}\nonumber \\&\le (NH)^{\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}} \sup _{h \in \mathcal {H}_{C_{P}, C_{1},...,C_{D}}} \nonumber \\&L\Big (\Vert A_{D}\Vert _{p,q} + \frac{\root p* \of {k_{max,1}...k_{max,d}}}{N^{ \lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}} \Vert R_{D}\Vert _{p,q}\Big ) \Big \Vert v_{D-1,\cdot \cdot }\Big \Vert _{p*} \end{aligned}$$
(3)
$$\begin{aligned}&\le L(NH)^{\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}C_{D}\sup _{h \in \mathcal {H}_ {C_{P},C_{1},...,C_{D-1}}}\Big \Vert v_{D-1,\cdot \cdot }\Big \Vert _{p*} \nonumber \\&\le ... \nonumber \\&\le L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}C_{D}\dots C_{1} \sup _{h \in \mathcal {H}_{C_{P}}}\Big \Vert v_{1,\cdot \cdot }\Big \Vert _{p*} \nonumber \\&\le L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}}C_{D}\dots C_{1}C_{P}\Vert a\Vert _{p*} . \end{aligned}$$
(4)

By combining the two inequalities, we obtain the following inequality.

$$\begin{aligned}&\Vert h(a)_{\cdot \cdot }\Vert _{p*,\infty } \\&\le \Vert v_{D,\cdot \cdot }\Vert _{p*}C_{Q} \\&\le L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}}C_{Q}C_{D}\dots C_{1}C_{P}\Vert a\Vert _{p*} \end{aligned}$$

We use Hölder’s inequality in (1) and (2) and norm inequalities in (3) and (4). \(\square\)

The proof of the following lemma is similar to that of Lemma 3; however, here the hypothesis class consists of FNOs with CNN layers.

Lemma 3’

Suppose \(\mathcal {H} = {\mathcal {H}_{CNN}}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}\) is the hypothesis class of FNOs with CNN layers and constants \(C_{P},C_{1},\dots ,C_{D},C_{Q}\). Then, for a sample \(a\in \mathbb {R}^{N\times d_{a}}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \sup _{h \in \mathcal {H}}\Vert h(a)_{\cdot \cdot }\Vert _{p*,\infty } \\{} & {} \quad \le L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}C_{Q}C_{D}\dots C_{1}C_{P}\Vert a\Vert _{p*} \end{aligned}$$

Proof

We modify the induction step over the Fourier layers in the proof of Lemma 3.

$$\begin{aligned}&\sigma \Big (C_{D}(a)+\mathcal {F}^{-1}(R_{D}\cdot (\mathcal {F}(a)))\Big )_{{\textbf {x}}j}\nonumber \\&=\sigma \Big (\sum _{j_{1}=0}^{c_{1}-1}\cdots \sum _{j_{d}=0}^{c_{d}-1}\sum _{k=1}^ {d_{v}}K_{D,jk,j_{1},\dots ,j_{d}} v_{D-1,x_{1}+j_{1},\dots ,x_{d}+j_{d}k} \nonumber \\&\quad +\sum {F^{\dag }_{{\textbf {x}}{} {\textbf {k}}}R_{D,{\textbf {k}},jk}F_{{\textbf {k}}z_{1}...z_{d}} v_{D-1,i,z_{1}...z_{d}k}}\Big ) \nonumber \\&\le L\bigg (\Vert K_{D,j,\cdots }\Vert _{p}\bigg \Vert v_{D-1,x_{1}+\cdot ,\dots ,x_{d}+\cdot ,\cdot } \bigg \Vert _{p*} \nonumber \\& \quad +\root p* \of {\frac{k_{max,1}...k_{max,d}}{N}}\Vert R_{D,\cdot ,j,\cdot } \Vert _{p}\bigg \Vert v_{D-1,\cdot \cdot }\bigg \Vert _{p*}\bigg ). \end{aligned}$$
(5)

where we use Hölder’s inequality in (5). Subsequently, by applying the \(p*\)-norm over \({\textbf {x}},j\) to the inequality above and using the norm inequality, we obtain the following inequality:

$$\begin{aligned}&\bigg \Vert \sigma \Big (C_{D}(a)+\mathcal {F}^{-1}(R_{D}\cdot (\mathcal {F}(a)))\Big )_ {\cdot \cdot }\bigg \Vert _{p*} \\&\le L\bigg (\root p* \of {\sum _{j}\Vert K_{D,j,\cdots }\Vert _{p}^{p*}\sum _ {{\textbf {x}}}\sum _{j_{1}=0}^{c_{1}-1}\cdots \sum _{j_{d}=0}^{c_{d}-1}\sum _{k=1}^{d_{v}}\Vert v_ {D-1,x_{1}+\cdot ,\dots ,x_{d}+\cdot ,\cdot }\Vert ^{p*} }\\& \quad +\root p* \of {k_{max,1}...k_{max,d}}\Vert R_{D}\Vert _{p,q} H^{\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}\bigg \Vert v_{D-1,\cdot \cdot }\bigg \Vert _{p*} \bigg ) \\&\le LH^{\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}\bigg (\root p* \of {c_ {1}\dots c_{d}}\Vert K_{D}\Vert _{p,q} +\root p* \of {k_{max,1}\dots k_{max,d}}\Vert R_{D}\Vert _{p,q} \bigg ) \bigg \Vert v_{D-1,\cdot \cdot }\bigg \Vert _{p*}. \end{aligned}$$

The remainder of this proof is similar to that of Lemma 3. \(\square\)

Lemma 4

Suppose \(\mathcal {H}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}\) is the hypothesis class of the FNO with given constants \(C_{P},C_{1},\dots ,C_{D}, C_{Q}\). Then, for samples \(S=\{a_{i}\}_{i=1,\dots ,m}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in \mathcal {H}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j} \epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \le \frac{N^{\frac{1}{p}}d_{u}}{m}\sum _{i} \sup _{h \in \mathcal {H}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}} \Vert h(a_{i})_{\cdot \cdot }\Vert _{p*,\infty }. \end{aligned}$$

Proof

$$\begin{aligned}&\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in \mathcal {H}_{C_{P},C_{1},...,C_{D}, C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \nonumber \\&\le \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in \mathcal {H}_{C_{P},C_{1},..., C_{D},C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j}\Big \vert h(a_{i})_{{\textbf {x}}j}\Big \vert \bigg ] \nonumber \\&\le \frac{N^{\frac{1}{p}}d_{u}}{m} \mathbb {E}_{\epsilon }\bigg [\sup _{h \in \mathcal {H}_ {C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}}\sum _{i}\Vert h(a_{i})_{\cdot \cdot }\Vert _{p*,\infty }\bigg ] \nonumber \\&\le \frac{N^{\frac{1}{p}}d_{u}}{m}\sum _{i} \sup _{h \in \mathcal {H}_{C_{P},C_{1},...,C_{D} ,C_{Q}}^{d_{in}}} \Vert h(a_{i})_{\cdot \cdot }\Vert _{p*,\infty } \end{aligned}$$
(6)

where we used the norm inequality in (6). \(\square\)

Theorem 1

Suppose \(\mathcal {H}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}\) is a hypothesis class with constants \(C_{P},C_{1},\dots ,C_{D},C_{Q}\). Then, for samples \(S=\{a_{i}\}_{i=1,\dots ,m}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in \mathcal {H}_{C_{P},C_{1},..., C_{D},C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j} \bigg ] \\{} & {} \quad \le L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}}N^{\frac{1}{p}}{d_{u}} C_{Q}C_{D}\dots C_{1}C_{P}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}. \end{aligned}$$

Proof

$$\begin{aligned}&\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in \mathcal {H}_{C_{P},C_{1},..., C_{D},C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j} \bigg ] \\&\le N^{\frac{1}{p}}d_{u}\frac{1}{m}\sum _{i=1}^{m}\sup _{h \in \mathcal {H}_{C_{P},C_{1} ,...,C_{D},C_{Q}}}\Vert h(a_{i})_{\cdot \cdot }\Vert _{p*,\infty }\qquad \qquad \qquad \qquad \qquad \qquad \qquad (\text {Lemma } 4) \\&\le L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}- \frac{1}{q}\rfloor _{+}}N^{\frac{1}{p}}d_{u}C_{Q}C_{D}\dots C_{1}C_{P}\frac{1}{m}\sum _{i=1} ^{m}\Vert {a_{i}}\Vert _{p*}. \quad \quad (\text {Lemma }3) \\ \end{aligned}$$

\(\square\)

Theorem 2

(FNO with CNN layer) Suppose \({\mathcal {H}_{CNN}}_{C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}\) is a hypothesis class with constants \(C_{P},C_{1},\dots ,C_{D},C_{Q}\). Then, for samples \(S=\{a_{i}\}_{i=1,\dots ,m}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{CNN}}_ {C_{P},C_{1},...,C_{D},C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i}) _{{\textbf {x}}j}\bigg ] \\{} & {} \quad \le L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u} C_{Q}C_{D}\dots C_{1}C_{P}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}. \end{aligned}$$

Proof

$$\begin{aligned}&\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{CNN}}_{C_{P},C_{1}, ...,C_{D},C_{Q}}^{d_{in}}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_ {{\textbf {x}}j}\bigg ] \\&\le N^{\frac{1}{p}}d_{u}\frac{1}{m}\sum _{i=1}^{m}\sup _{h \in {\mathcal {H}_{CNN}} _{C_{P},C_{1},...,C_{D},C_{Q}}}\Vert h(a_{i})_{\cdot \cdot }\Vert _{p*,\infty } \qquad \qquad \qquad \qquad \qquad (\text {Lemma} 4) \\&\le L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}} {d_{u}}C_{Q}C_{D}\dots C_{1}C_{P}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}. \qquad \qquad (\text {Lemma 3'}) \\ \end{aligned}$$

\(\square\)

Corollary 1

For a constant \(\gamma >0\), consider the hypothesis class \(\mathcal {H}_{\gamma _{p,q}\le \gamma }\), which is a collection of FNOs with \(\gamma _{p,q}\le \gamma\). For samples \(S=\{a_{i}\}_{i=1,\dots ,m}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{\gamma _{p,q}\le \gamma }}} \sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \\{} & {} \quad \le \gamma _{}L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}}N^{\frac{1}{p}}d_{u} \frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}. \end{aligned}$$

For the hypothesis class \({\mathcal {H}_{CNN}}_{\gamma _{p,q}\le \gamma }\), defined similarly to \(\mathcal {H}_{\gamma _{p,q}\le \gamma }\), and training samples \(S=\{a_{i}\}_{i=1,\dots ,m}\), we obtain the following inequality:

$$\begin{aligned}{} & {} \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{CNN}}_{\gamma _{p,q}\le \gamma }}\sum _{i,{\textbf {x}},j}\epsilon _ {i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \\ {}{} & {} \quad \le \gamma _{CNN}L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}. \end{aligned}$$

Proof

We have

$$\begin{aligned}&\mathcal {H}_{\gamma _{p,q}\le \gamma } \subset \bigcup _{0\le C_{P}C_{1}\dots C_{D}C_{Q}<\gamma }{\mathcal {H}_{C_{P},C_{1},...,C_{Q}}^{d_{in}} }. \end{aligned}$$

Hence, we obtain the following inequalities:

$$\begin{aligned}&\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{\gamma _{p,q} \le \gamma }}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \\&\le \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h\in \bigcup _{0\le C_{P}C_{1} \dots C_{D}C_{Q}<\gamma }{\mathcal {H}_{C_{P},C_{1},...,C_{Q}}^{d_{in}} }}\sum _ {i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j} h(a_{i})_{{\textbf {x}}j}\bigg ] \\ \end{aligned}$$

Because the upper bound on the \(p*\)-norm of the outputs of models in the hypothesis class above is the same as that in Lemma 3, we apply the same argument as in Theorem 1. Thus, we obtain the following inequality:

$$\begin{aligned}&\mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{\gamma _{p,q}\le \gamma } }}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \\&\le \gamma _{}L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}}N^{\frac{1}{p}}d_{u} \frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}. \end{aligned}$$

Similarly, based on the above proof, we obtain the inequality for FNO with CNN layers. \(\square\)

Recall the following fundamental theorem (for details, see Shalev-Shwartz and Ben-David (2014)), which bounds the generalization error of a given hypothesis class in terms of its Rademacher complexity.

Theorem 3

(Generalization error bound based on Rademacher complexity) Let \(\mathcal {H}\) be a hypothesis class and \(l:\mathcal {H}\times Z \rightarrow \mathbb {R}\) a loss function such that \(\vert l(h,z)\vert \le c\) for all \(h \in \mathcal {H}\) and \(z \in Z\). Then, with probability at least \(1-\delta\), for all \(h\in \mathcal {H}\), we obtain

$$\begin{aligned} \mathbb {E}_{\mathcal {D}}[l(h,z)]-\mathbb {E}_{S}[l(h,z)] \le 2\mathcal {R}_{m}(l \circ \mathcal { H})+c\sqrt{\frac{2\log {4/\delta }}{m}}. \end{aligned}$$

where \(\mathcal {D}\) is the probability distribution on Z and S is a training dataset sampled i.i.d. from \(\mathcal {D}\).

Before considering the generalization bound of the FNO, we choose the distribution \(\mathcal {D}\) on \(\mathbb {R}^{N\times d_{a}}\times \mathbb {R}^{N\times d_{u}}\) to have compact support. Thus, the condition \(\vert l(h,z)\vert \le c\) in Theorem 3 holds. Then, using Theorem 3 and Corollary 1, we obtain the following estimate of the generalization error bound:

Theorem 4

(Generalization error bound for FNO) For a training dataset \(S=\{(a_{i},u_{i})\}_{i=1,\dots ,m}\) sampled i.i.d. from a probability distribution \(\mathcal {D}\) and for the hypothesis class \(\mathcal {H}_{\gamma _{p,q}\le \gamma }\), let \(h^{\star }\) be the empirical risk minimizer of \(L_{S}\) and suppose \(\Vert h(a)-u\Vert _{2} \le \epsilon ^{2}\) for all \((a,u)\sim \mathcal {D}\) and \(h\in \mathcal {H}_{\gamma _{p,q}\le \gamma }\). Then, with probability at least \(1-\delta\), we obtain the following inequality:

$$\begin{aligned}{} & {} L_{\mathcal {D}}(h^{\star })-L_{S}(h^{\star }) \\{} & {} \quad \le 4\sqrt{2}\epsilon \gamma _{}L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}} N^{\frac{1}{p}}d_{u}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}+\epsilon ^{2} \sqrt{\frac{2\log {4/\delta }}{m}}. \end{aligned}$$

Similarly, for the hypothesis class \({\mathcal {H}_{CNN}}_{\gamma _{p,q}\le \gamma }\) of FNOs with CNN layers and the dataset S, let \(h^{\star }_{CNN}\) be the empirical risk minimizer of \(L_{S}\) and suppose \(\Vert h(a)-u\Vert _{2} \le \epsilon ^{2}\) for all \((a,u)\sim \mathcal {D}\) and \(h\in {\mathcal {H}_{CNN}}_{\gamma _{p,q}\le \gamma }\). Then, with probability at least \(1-\delta\), we obtain the following inequality:

$$\begin{aligned}{} & {} L_{\mathcal {D}}(h^{\star }_{CNN})-L_{S}(h^{\star }_{CNN}) \\{} & {} \quad \le 4\sqrt{2}\epsilon \gamma _{CNN}L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*} +\epsilon ^{2}\sqrt{\frac{2\log {4/\delta }}{m}}. \end{aligned}$$

Proof

We only need to bound the \(\mathcal {R}_{m}(l \circ \mathcal {H})\) term in Theorem 3.

$$\begin{aligned}&\mathcal {R}_{m}(l \circ \mathcal {H}_{\gamma _{p,q}\le \gamma }) \le 2\sqrt{2}\epsilon \mathbb {E}_{\epsilon }\bigg [\frac{1}{m}\sup _{h \in {\mathcal {H}_{\gamma _{p,q}\le \gamma }}}\sum _{i,{\textbf {x}},j}\epsilon _{i{\textbf {x}}j}h(a_{i})_{{\textbf {x}}j}\bigg ] \qquad \qquad (\text {Lemma } 2) \\&\le 2\sqrt{2}\epsilon \gamma L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}} N^{\frac{1}{p}}d_{u}\frac{1}{m}\sum _{i=1}^{m}\Vert {a_{i}}\Vert _{p*}.\qquad \qquad \qquad (\text {Corollary } 1) \end{aligned}$$

Similarly, based on the above proof, we obtain the inequality for the FNO with CNN layers. \(\square\)

If the capacity of an FNO model h is \(\gamma\), then h is included in the hypothesis class \(\mathcal {H}_{\gamma _{p,q}\le \gamma }\). Because the inequalities in Theorem 4 hold for all hypotheses in the class, we have the following posterior estimate for the FNO:

Corollary 2

(Posterior estimation of the generalization and expected errors) Let \(N, H, d_{u}, d_{a}, L\) be the architecture parameters and \(\{(a_{i},u_{i})\}_{i=1,\dots ,m}\) the training samples, with \(\Vert a_{i}\Vert _{p*} \le B\) for all i. Suppose h is a trained FNO (Fourier layers with an FCN or a CNN) such that \(\Vert h(a)-u\Vert _{2} \le \epsilon ^{2}\) for all training samples. Then, with confidence level at least \(1-\delta\), we obtain the following estimates:

$$\begin{aligned}{} & {} L_{\mathcal {D}}(h)-L_{S}(h) \\{} & {} \quad \le 4\sqrt{2}\epsilon L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}} N^{\frac{1}{p}}d_{u}\gamma _{p,q}(h)B +\epsilon ^{2}\sqrt{\frac{2\log {4/\delta }}{m}}. \\{} & {} \Longrightarrow L_{\mathcal {D}}(h) \le 4\sqrt{2}\epsilon L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}} N^{\frac{1}{p}}d_{u}\gamma _{p,q}(h)B \\ {}{} & {} \qquad +\epsilon ^{2}\bigg (1+\sqrt{\frac{2\log {4/\delta }}{m}}\bigg ). \end{aligned}$$

For FNOs with CNN layers,

$$\begin{aligned}{} & {} L_{\mathcal {D}}(h_{CNN})-L_{S}(h_{CNN}) \\ {}{} & {} \quad \le 4\sqrt{2}\epsilon L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u}{\gamma _{CNN}}_{p,q}(h_{CNN})B +\epsilon ^{2}\sqrt{\frac{2\log {4/\delta }}{m}}. \\{} & {} \Longrightarrow L_{\mathcal {D}}(h_{CNN}) \le 4\sqrt{2}\epsilon L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u}{\gamma _{CNN}}_{p,q}(h_{CNN})B \\ {}{} & {} \qquad +\epsilon ^{2}\bigg (1+\sqrt{\frac{2\log {4/\delta }}{m}}\bigg ). \end{aligned}$$
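The following sketch (ours) evaluates the right-hand side of the first bound in Corollary 2 (the FCN case) from user-supplied quantities; all arguments follow the notation of the corollary, and the capacity gamma can be computed as in the sketch of Sect. 3.1.

```python
import numpy as np

def posterior_bound_fcn(gamma, L, D, N, H, d_u, B, eps, delta, m, p, q):
    """Right-hand side of the L_D(h) - L_S(h) bound in Corollary 2 (FCN case)."""
    inv_pstar = 1.0 - 1.0 / p                 # 1/p*
    r = max(inv_pstar - 1.0 / q, 0.0)         # the bracket |1/p* - 1/q|_+
    complexity = (4 * np.sqrt(2) * eps * L ** D * (N * H) ** (D * r)
                  * H ** r * N ** (1.0 / p) * d_u * gamma * B)
    return complexity + eps ** 2 * np.sqrt(2 * np.log(4 / delta) / m)
```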

Our definitions of capacity and our results are motivated by the group-norm capacity of an FCN (Neyshabur et al., 2015). The results in Neyshabur et al. (2015) bound the generalization error by the capacity times a factor of \(1/\sqrt{m}\); the proof of that bound relies on the homogeneity of the ReLU activation function, whereas our results rely only on the Lipschitz continuity of the activation. Although our results give only an O(1) bound in m, if we restrict to the ReLU activation, we obtain the following bound by combining our capacity derivation with the theorems in Neyshabur et al. (2015).

Corollary 3

(Posterior estimation of the generalization and expected errors in the ReLU activation case) Let \(N, H, d_{u}, d_{a}, L\) be the architecture parameters and \(\{(a_{i},u_{i})\}_{i=1,\dots ,m}\) the training samples, with \(\Vert a_{i}\Vert _{p*} \le B\) for all i. Suppose h is a trained FNO (Fourier layers with an FCN or a CNN) such that \(\Vert h(a)-u\Vert _{2} \le \epsilon ^{2}\) for all training samples, and \(1\le p \le 2\), \(1\le q \le p*\). Then, with confidence level at least \(1-\delta\), we obtain the following estimates:

$$\begin{aligned}{} & {} L_{\mathcal {D}}(h)-L_{S}(h) \\{} & {} \quad \le 4\sqrt{2}\epsilon L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}} N^{\frac{1}{p}}d_{u}\gamma _{p,q}(h)\frac{min\{p*,4log(2d_{a})\}B}{\sqrt{m}} \\ {}{} & {} \qquad +\epsilon ^{2}\sqrt{\frac{2\log {4/\delta }}{m}}. \\{} & {} \Longrightarrow L_{\mathcal {D}}(h) \le 4\sqrt{2}\epsilon L^{D}(NH)^{D\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}H^{\lfloor \frac{1}{p*}-\frac{1}{q}\rfloor _{+}} N^{\frac{1}{p}}d_{u}\gamma _{p,q}(h)\frac{min\{p*,4log(2d_{a})\}B}{\sqrt{m}} \\ {}{} & {} \qquad +\epsilon ^{2}\bigg (1+\sqrt{\frac{2\log {4/\delta }}{m}}\bigg ). \end{aligned}$$

For FNOs with CNN,

$$\begin{aligned}{} & {} L_{\mathcal {D}}(h_{CNN})-L_{S}(h_{CNN}) \\{} & {} \quad \le 4\sqrt{2}\epsilon L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u}{\gamma _{CNN}}_{p,q}(h_{CNN})\frac{min\{p*,4log(2d_{a})\}B}{\sqrt{m}}\\{} & {} \qquad +\epsilon ^{2}\sqrt{\frac{2\log {4/\delta }}{m}}. \\{} & {} \Longrightarrow L_{\mathcal {D}}(h_{CNN}) \le 4\sqrt{2}\epsilon L^{D}H^{(D+1)\lfloor \frac{1}{p*}-\frac{1}{q} \rfloor _{+}}N^{\frac{1}{p}}d_{u}{\gamma _{CNN}}_{p,q}(h_{CNN})\frac{min\{p*,4log(2d_{a}) \}B}{\sqrt{m}} \\ {}{} & {} \qquad +\epsilon ^{2}\bigg (1+\sqrt{\frac{2\log {4/\delta }}{m}}\bigg ). \end{aligned}$$

Therefore, for the ReLU activation case, we can guarantee convergence of the generalization error as the size of the training dataset increases. However, the actual convergence rate of the generalization error is faster than the theoretical bound suggests, as observed in Sect. 4.

4 Experiments

This section presents our experimental validation. In addition, we show that the capacity we defined is an effective index for estimating empirical generalization errors in various respects.

4.1 Overall correlation over various p and q

First, we investigate the correlation between our capacity and the empirical generalization errors for various values of p and q.

Data specification For the experiment, we synthesized a dataset based on the following Burgers equation:

$$\begin{aligned} u_{t}=-uu_{x}+0.01u_{xx} \end{aligned}$$

The domain of the problem is a circle, uniformly discretized with \(N=1024\). As described in Sect. 2, each data point is a pair of functions. In our experiment, the input function is an initial condition, and the target function is the solution of the above equation at \(t=0.1\). Each input function is generated from a Gaussian random field with covariance \(k(x,y)=e^{-\frac{(x-y)^2}{(0.05)^2}}\). The training and test datasets comprise 800 and 200 pairs of functions, respectively (both generated independently).
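A hedged sketch (ours) of how initial conditions with the stated covariance can be sampled on the periodic grid via an eigendecomposition of the covariance matrix; the generator actually used for the dataset may differ.

```python
import numpy as np

def sample_grf(n_samples, N=1024, length_scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    d = np.abs(x[:, None] - x[None, :])
    d = np.minimum(d, 2.0 * np.pi - d)                 # periodic distance on the circle
    K = np.exp(-d ** 2 / length_scale ** 2)            # k(x, y) = exp(-(x - y)^2 / c^2)
    w, V = np.linalg.eigh(K)                           # eigendecomposition of the covariance
    A = V * np.sqrt(np.clip(w, 0.0, None))             # K ~= A A^T, negative modes clipped
    return rng.standard_normal((n_samples, N)) @ A.T   # samples with covariance K
```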

Correlation for various values of p and q We investigated the correlation between the generalization error and the capacities. Each point in Fig. 3 represents a model trained with randomly chosen hyperparameters. The models were organized as follows: Fourier layers of depth 2, and linear lifting and projection layers without activation. The width is fixed at 64. For each training run, the weight decay was randomly chosen from 0, 2*1e−2, 4*1e−2, 6*1e−2, and 8*1e−2; \(k_{max}\) was randomly chosen from 8, 12, 16, and 20; and the kernel size was randomly chosen from 1, 3, 5, and 7; this random selection was repeated for 100 training runs.

Fig. 3 Scatter plot of generalization error versus capacity for \(p=1.2, q=1.2\)

Table 1 Correlation between empirical generalization error and capacities of various p and q for trained models with randomly chosen hyperparameters

Table 1 lists the correlations for various values of p and q. The correlation decreases with increasing p and q. This is because, as p increases, the p-norm loses information about all but the largest entries; thus, information about the model is lost in a capacity defined with high p and q. However, as p goes to \(\infty\), \(p*\) approaches 1, so the kernel size and \(k_{max}\) have a greater effect on the capacity as p increases. Therefore, we expect that capacities with high p contain more information about the model architecture. To support this argument, we conducted experiments in which \(k_{max}\) was varied and the other hyperparameters were fixed. First, to show that capacities with low p and q contain more information about the model weights than about the architecture, we trained three types of models with small differences in \(k_{max}\). Second, to demonstrate that capacities with high p and q are more closely related to the model architecture, we trained three types of models with considerable differences in \(k_{max}\). For each experiment, we trained the models 30 times for each \(k_{max}\) setting, that is, 14, 16, and 18 in the left column of Fig. 4 and 10, 30, and 50 in the right column. Hyperparameters other than \(k_{max}\) were fixed: the kernel size of the CNN layer was 1, the width was 64, and the depth of the Fourier layers was 2. As shown in the left column of Fig. 4, for models with small gaps in \(k_{max}\), the correlation between the generalization gap and capacity is lost as p and q increase. In the right column, however, the highest correlation between capacity and generalization error is obtained for higher p and q values than in the left column, and the correlation is maintained at approximately 0.89 for the \(p,q =\infty\) case.

Fig. 4 Left: Scatter plot, correlation, and linear regression between generalization error and capacities of various p and q for 30 trained models with \(k_{max}=14, 16, 18\); Right: Scatter plot, correlation, and linear regression between generalization error and capacities of various p and q for 30 trained models with \(k_{max}=10, 30, 50\)

4.2 Dependency of generalization errors on architectures and datasets

4.2.1 Dependency on \(k_{max}\)

Next, we examine the dependency of the generalization error on the model architecture. In these experiments, hyperparameters other than \(k_{max}\) are fixed. We consider two cases: Fourier layers of depth 1 and depth 2.

A low \(k_{max}\) implies that the learning dynamics are unpredictable and chaotic (Seleznova & Kutyniok, 2021); therefore, we did not consider models with extremely small values of \(k_{max}\). We varied \(k_{max}\) from 13 to 39 in steps of two. For a detailed analysis, we removed the CNN parts from the Fourier layers, so that the capacity depends only on the weight norms of the \(R_{i}\). To isolate the influence of \(k_{max}\), we divided the generalization error by the norms of the \(R_{i}\). As the defined capacity is correlated with the generalization error, the divided generalization error is expected to correlate with \(\root p* \of {k_{max}}\) at depth 1 and with \(\root \frac{p*}{2} \of {k_{max}}\) at depth 2. As listed in Tables 2 and 3, the generalization error divided by the norms is indeed correlated with \(\root p* \of {k_{max}}\) and \(\root \frac{p*}{2} \of {k_{max}}\). We observe that the correlation is low for capacities with low p and q, where the desired dependency on \(k_{max}\) is unclear. As p and q increase, the correlation first increases and then decreases slightly. Based on these data, we conclude that capacities with higher p and q contain more information about the model architecture (\(k_{max}\)), whereas capacities with very high p and q may lose specific information about each model, so the correlation decreases. Figure 5 shows the scatter plot and regression for a few experimental cases. The dependence of the generalization error on \(k_{max}\) is more convex at depth 2. Based on our definition of capacity, the exponent of \(k_{max}\) is proportional to the depth of the Fourier layers; therefore, the increased convexity illustrated on the right side of the figure qualitatively validates the results.

Table 2 Correlation between empirical generalization error divided by weight norm of Fourier layers and \(\root p* \of {k_{max}}\) for FNO with depth 1 of Fourier layers
Table 3 Correlation between empirical generalization error divided by weight norm of Fourier layers and \(\root \frac{p*}{2} \of {k_{max}}\) for FNO with depth 2 of Fourier layers
Fig. 5 Left: Scatter plot and regression between generalization error divided by norms of Fourier layers and \(\root p* \of {k_{max}}\) for various p, q where the depth of the Fourier layers is 1; Right: Scatter plot and regression between generalization error divided by norms of Fourier layers and \(\root \frac{p*}{2} \of {k_{max}}\) for various p, q where the depth of the Fourier layers is 2

4.2.2 Empirical dependency on the size of training samples

As in Corollary 3 and other results that bound the generalization error of neural networks by capacity norms, the generalization error should converge as the training dataset grows, with a convergence rate of \(O(m^{-l})\) where \(0\le l \le 0.5\). However, we found that the real convergence rate is much faster than this theoretical rate. Moreover, our capacity is a suitable indicator: the capacity combined with a factor depending on the training dataset size is highly correlated with the empirical generalization error. First, we show that we also obtain a high correlation for a nonlinear activation other than ReLU. The model architecture is as follows: all hyperparameters are the same as in Sect. 4.1, and the weight decay and \(k_{max}\) are randomly chosen from [0, 1e−3,..., 4e−3] and [10, 12,..., 20], respectively. The results are shown in Fig. 6.

Fig. 6 Left: Scatter plot of generalization error for the ReLU case; Right: Scatter plot of generalization error for the GELU case

Next, we demonstrate a high correlation between our empirically determined formula and the generalization errors in our experiments. The model architecture was fixed at \(k_{max}=14\), with the other hyperparameters equal to those in the experiments above; the only varying parameter was the training dataset size [200, 400,..., 10,000]. We found that the actual convergence rates are much faster than those implied by Corollary 3 (which gives an upper bound of \(O(m^{-0.5})\) on the generalization error) and that the convergence rates depend on the activation function: \(O(m^{-1.1})\) for ReLU and \(O(m^{-1.55})\) for GELU. The results are presented in Figs. 7 and 8. The shaded areas in the right panels indicate the variance of the empirical generalization error. We also calculated the variance of our estimate; when normalized to the scale of the empirical generalization error, it is significantly smaller. Therefore, our capacity is a stable index for estimating generalization errors.
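The empirical rates quoted above can be obtained by a log-log fit of the generalization error against the training dataset size; the following is a short sketch (ours) of such a fit.

```python
import numpy as np

def fit_rate(sizes, gen_errors):
    """Fit gen_error ~ C * m^(-l) and return the estimated exponent l."""
    slope, _ = np.polyfit(np.log(sizes), np.log(gen_errors), deg=1)
    return -slope

# Hypothetical numbers, for illustration only:
# fit_rate([200, 400, 800], [0.10, 0.047, 0.022]) is roughly 1.1
```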

Fig. 7 Left: Scatter plot of generalization error for the ReLU case; Right: Graph of empirical generalization error and normalized capacity factored by the size of the training dataset

Fig. 8 Left: Scatter plot of generalization error for the GELU case; Right: Graph of empirical generalization error and normalized capacity factored by the size of the training dataset

4.3 Correlation for test samples with different resolutions and out-of-distribution samples

The upper bound on the generalization error in Sect. 3 depends on the discretization and the dataset structure. To investigate how our capacity correlates with the generalization error on test samples generated from distributions different from that of the original training dataset, we constructed several additional test datasets. For this experiment, we trained our models with the same settings as in Sect. 4.2.2 and evaluated them on the various test datasets. Recall that the covariance of the GRF is \(k(x,y)=e^{-\frac{(x-y)^2}{c^2}}\), where c is a coefficient. Our test datasets cover four cases: \(c=0.05\) with discretizations \(N=512\) and \(N=2048\), and \(N=1024\) with \(c=0.025\) and \(c=0.1\). The capacity is calculated with \(p=2, q=2\). The experimental results are shown in Fig. 9 and Table 4. Interestingly, although the upper bounds in Corollaries 2 and 3 depend explicitly on the discretization size N, in the empirical experiment our capacity is resolution-invariant, showing almost the same tendency in the \(N=2048\) and \(N=512\) cases as in the original case. When \(c=0.1\), the tendencies of the regression line and the data are similar to those in the \(c=0.05\) case, and the empirical generalization errors are even lower. However, when \(c=0.025\), the data points deviate from this tendency and exhibit higher generalization errors. We assume that the main reason for this phenomenon is the difference in the frequency content of the data distributions: for \(c=0.025\), each function pair has more high-frequency components, whereas the \(c=0.1\) case has even fewer high-frequency components than the \(c=0.05\) case.

Fig. 9 Scatter plots and correlations on the various test datasets

Table 4 Correlation between the empirical generalization error and our capacity for the various test datasets; the c = 0.05, N = 1024 case is i.i.d. with the training dataset
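The out-of-distribution test inputs described above are draws from a GRF with the squared-exponential covariance \(k(x,y)=e^{-(x-y)^2/c^2}\) at various resolutions N and length scales c. The sketch below shows one possible way to draw such samples via an eigendecomposition of the dense covariance matrix; the datasets used in the experiments may have been generated with a different (e.g., spectral) sampler with the same covariance.

```python
import numpy as np

def sample_grf_1d(n_samples, N=1024, c=0.05, seed=0):
    """Draw samples of a GRF on a uniform grid of [0, 1) with covariance
    k(x, y) = exp(-(x - y)^2 / c^2), via an eigendecomposition of the
    covariance matrix (clipping tiny negative eigenvalues for stability)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, N, endpoint=False)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / c ** 2)
    w, V = np.linalg.eigh(K)
    A = V * np.sqrt(np.clip(w, 0.0, None))          # A @ A.T approximates K
    return (A @ rng.standard_normal((N, n_samples))).T

# e.g., inputs for the four out-of-distribution test settings
u_fine   = sample_grf_1d(200, N=2048, c=0.05)
u_coarse = sample_grf_1d(200, N=512,  c=0.05)
u_rough  = sample_grf_1d(200, N=1024, c=0.025)
u_smooth = sample_grf_1d(200, N=1024, c=0.1)
```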

4.4 Additional experiments on other PDEs

To show that our capacity is an effective indicator for estimating the generalization error, we experimented with additional problems: 1-d integration, the 1-d heat equation, the 2-d heat equation, and the 2-d Navier–Stokes equation. We verified the correlation between the empirical generalization error and the defined capacity. As in the Burgers equation experiments described in Sect. 4.2, we checked the correlation between the empirical generalization error and the capacity factored by the dataset size for a fixed model architecture and varying dataset sizes (Fig. 10). As discussed in Sect. 4.2, GELU performs better than ReLU activation, so we fixed the activation as GELU throughout these experiments. We omitted the CNN layer to simplify the experiments.

Fig. 10 Left: Scatter plots of generalization error for the various PDE problems. Right: Empirical generalization error and normalized capacity factored by the size of the training dataset

4.4.1 1-d integration

1-d integration is one of the simplest operators between functions. The experimental settings were as follows:

$$\begin{aligned} u_{x}&=10u_{0}, \\ u(0)&= 0 \end{aligned}$$

where \(u_{0}\) is the input function and u is its scaled antiderivative. As in the Burgers equation setting of Sects. 4.1 and 4.2, the domain is a circle uniformly discretized with \(N=1024\). The integral is scaled by a factor of 10 for normalization. Each input function is generated from a Gaussian random field with covariance \(k(x,y)=e^{-\frac{(x-y)^2}{(0.05)^2}}\), as for the Burgers equation. The training and test datasets comprise 10,000 and 200 pairs of functions, respectively.
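As an illustration, the target functions can be obtained from sampled inputs by a cumulative quadrature rule; the sketch below uses a cumulative trapezoidal rule on the uniform grid, although the original data generation may use a different scheme.

```python
import numpy as np

def scaled_integral(u0, dx):
    """Return u(x) = 10 * int_0^x u0(s) ds with u(0) = 0, computed with a
    cumulative trapezoidal rule; u0 has shape (n_samples, N)."""
    inc = 0.5 * (u0[:, 1:] + u0[:, :-1]) * dx            # per-interval areas
    zeros = np.zeros((u0.shape[0], 1))
    return 10.0 * np.concatenate([zeros, np.cumsum(inc, axis=1)], axis=1)

N = 1024
u0 = np.random.default_rng(0).standard_normal((4, N))    # stand-in for GRF inputs
u = scaled_integral(u0, dx=1.0 / N)                      # targets, shape (4, N)
```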

Training settings For fixed hyperparameters and varying sizes of the training dataset, we fixed the model architecture as \(k_{max}=14\), with a width of 64, a depth of 2 for the Fourier layers, and a weight decay of 1e−3. The training dataset size varies over [200, 400,..., 10,000]. For each training dataset size, the training was repeated five times.

4.4.2 1-d heat equation

We experimented with another 1-d time-dependent PDE, the 1-d heat equation, which describes the heat distribution in a physical object such as a steel rod. The governing equation is as follows.

$$\begin{aligned} u_{t}=u_{xx} \end{aligned}$$

The domain of the problem is a circle, uniformly discretized with N = 1024. The initial condition was generated by the same Gaussian random field as in the Burgers equation and 1-d integration settings. The target function is the time slice at T = 0.1 of the solution of the heat equation for a given initial condition. The training and test datasets have 10,000 and 200 pairs of functions, respectively.
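Because the domain is periodic, the heat equation can be solved exactly in Fourier space: each Fourier mode of the initial condition decays by a factor \(e^{-(2\pi k)^{2}T}\). The sketch below assumes a circle of circumference 1; the data used in the experiments may have been generated with a different solver or domain scaling.

```python
import numpy as np

def heat_1d_periodic(u0, T=0.1, length=1.0):
    """Exact spectral solution of u_t = u_xx on a circle of circumference
    `length`; u0 is sampled on a uniform grid, shape (n_samples, N)."""
    N = u0.shape[-1]
    freq = np.fft.fftfreq(N, d=length / N)                 # cycles per unit length
    decay = np.exp(-((2.0 * np.pi * freq) ** 2) * T)       # multiplier of exp(T d^2/dx^2)
    return np.fft.ifft(np.fft.fft(u0, axis=-1) * decay, axis=-1).real

# target slices at T = 0.1 for GRF-sampled initial conditions u0
# u_T = heat_1d_periodic(u0, T=0.1)
```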

Training settings The training settings are the same as in the 1-d integration problem.

4.4.3 2-d heat equation

Time-dependent heat equations are parabolic PDEs that describe heat diffusion. We experimented with the 2-d time-dependent heat equation to verify that our capacity is effective for a linear 2-dimensional PDE. The governing equation is as follows:

$$\begin{aligned} u_{t}= \nabla ^{2} u \end{aligned}$$

The domain of the problem is a torus, that is, the problem is periodic along the x- and y-axes. The domain was uniformly discretized with N=64 in both coordinates. The initial condition is generated by a 2-d Gaussian random field with distribution \(\mu = \mathcal {N}(0,7^{3/2}(-\Delta + 49I)^{-2.5})\), the same GRF as in A.3.3 of Li et al. (2021). The target function is the time slice at T = 0.005 of the solution of the heat equation for a given initial condition. The training and test datasets have 5000 and 200 pairs of functions, respectively.
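Samples from this Gaussian measure can be drawn spectrally, by scaling white noise in Fourier space with the square roots of the covariance eigenvalues \(7^{3/2}(4\pi ^{2}|k|^{2}+49)^{-2.5}\). The following is a minimal sketch under that assumption; the overall constant depends on the FFT normalization convention, and implementations (e.g., the reference code of Li et al., 2021) may differ in this scaling.

```python
import numpy as np

def sample_grf_2d(n_samples, N=64, alpha=2.5, tau=7.0, seed=0):
    """Sample (up to an FFT-convention-dependent constant) from
    N(0, tau^{3/2} (-Laplacian + tau^2 I)^{-alpha}) on the unit torus,
    by scaling white noise in Fourier space."""
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(N, d=1.0 / N)                     # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    # square roots of the covariance eigenvalues on the Fourier basis
    sqrt_eig = tau ** 0.75 * (4.0 * np.pi ** 2 * (kx ** 2 + ky ** 2) + tau ** 2) ** (-alpha / 2.0)
    xi = rng.standard_normal((n_samples, N, N))
    u_hat = sqrt_eig * np.fft.fft2(xi, axes=(-2, -1))
    return np.fft.ifft2(u_hat, axes=(-2, -1)).real

# initial conditions for the 2-d heat and Navier-Stokes experiments
# w0 = sample_grf_2d(5000, N=64)
```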

Training settings For fixed hyperparameters and varying sizes of the training dataset, we fixed the model architecture as \((k_{max,1},k_{max,2}) = (14,14)\), with a width of 32, a depth of 2, and a weight decay of 1e−3. The training dataset size varies over [100, 200,..., 5000]. For each training dataset size, the training was repeated five times.

4.4.4 2-d Navier–Stokes equation

We consider the viscous, incompressible 2-d Navier–Stokes equation in vorticity form. The governing equations are as follows.

$$\begin{aligned} u_{t} + v \cdot \nabla u&= \nu \Delta u + f, \\ \nabla \cdot v&= 0, \\ u(x,0)&= u_{0} \end{aligned}$$

where u denotes the vorticity of the velocity field v (\(u = \nabla \times v\)), defined on the product of the torus and the time interval \([0,T]\). \(\nu \ge 0\) is the viscosity coefficient, and f is a forcing function fixed as \(f(x)=0.1(\sin (2\pi (x_{1}+x_{2})) + \cos (2\pi (x_{1}+x_{2})))\). We used one of the 2-d Navier–Stokes training dataset samples constructed in Li et al. (2021); therefore, the initial vorticity is generated from \(\mu = \mathcal {N}(0,7^{3/2}(-\Delta + 49I)^{-2.5})\), as for the heat equation. The viscosity coefficient \(\nu\) is 1e−4. We set the target of our model as the vorticity at time \(T=5\), i.e., \(u(\cdot , 5)\), for a given initial vorticity.

Training settings The training settings are the same as in the 2-d heat equation problem.

4.5 Comparison with other capacities

In this subsection, we compare our capacity with recently developed capacity norms: the Fisher-Rao norm (Liang et al., 2019), the Hessian trace norm (Petzka et al., 2021), and relative flatness (Petzka et al., 2021). For the experiment, we trained models with a width of 16 and Fourier layers of depth 2; \(k_{max}\) was chosen from [10,12,...,20] and the weight decay from [0,1e−3,...,4e−3], on a training dataset with 800 pairs of functions, as in the previous experimental settings. From Table 6 and Fig. 11, we infer that our capacity has the highest correlation among the norms, presumably because it contains information about the hyperparameters of the model's architecture, whereas the other norms do not. In addition, no complicated first- or second-derivative information is required to calculate our capacity. Therefore, as shown in Table 5, the calculation time of our capacity is much shorter than that of the other norms, and the calculation is more memory-efficient.
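The low cost stems from the fact that the capacity is built only from group norms of the weight tensors and architectural constants, so it can be evaluated in a single pass over the parameters without any gradients. Purely as an illustration (not the exact expression from Sect. 3), a (p, q) group norm of a rank-3 Fourier weight tensor of shape \((k_{max}, d_{v}, d_{v})\) could be computed as follows:

```python
import numpy as np

def group_norm(R, p=2.0, q=2.0):
    """(p, q) group norm of a rank-3 (complex) weight tensor R of shape
    (k_max, d_v, d_v): an l_p norm over each Fourier mode's matrix entries,
    followed by an l_q norm over the modes.  Illustrative only; the capacity
    of Sect. 3 combines such norms with architecture-dependent factors."""
    per_mode = np.sum(np.abs(R) ** p, axis=(1, 2)) ** (1.0 / p)
    return np.sum(per_mode ** q) ** (1.0 / q)

rng = np.random.default_rng(0)
R = rng.standard_normal((14, 64, 64)) + 1j * rng.standard_normal((14, 64, 64))
print(group_norm(R, p=2, q=2))   # for p = q = 2 this equals the Frobenius norm of R
```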

Table 5 Time and memory cost for the calculation of each norm
Table 6 Correlation between empirical generalization error and various norms for different test datasets
Fig. 11 Scatter plots and correlations between the empirical generalization error and the various norms

5 Conclusion

We investigated bounding the Rademacher complexity of the FNO and defined its capacity, which depends on the model architecture and the group norms of the weights. Although several results exist on bounding the Rademacher complexity of various types of neural networks, the FNO possesses weight tensors of rank higher than two; therefore, our study may also be useful for other networks containing higher-rank tensors. We validated our results experimentally. From these experiments, we gained insight into the impact of the p and q values and into how information about the model weights and architecture is captured by the capacity. Through experiments on other PDEs and on the dependency on the architecture and the dataset size, we validated that our capacity norm is effective for estimating the empirical generalization error. By comparing it with other sophisticated capacity norms, we empirically showed that our capacity norm is an efficient and effective index among them. Moreover, although various neural operators have been developed, including the FNO and DeepONet, a detailed PAC-learning analysis of these neural operators has not yet been performed; thus, this study may serve as a guide for such analyses. In this study, we assumed that the activation function is fixed; for a model containing parameterized activations, such as PReLU, our analysis would need to be modified. Although the Rademacher complexity contains information about the dataset, our bounds lack a specific dependency on each problem, and in our experiments the performance of the FNO varied across PDE problems. Therefore, the complexity bounds should be extended to incorporate dataset information. In addition, although we empirically observed convergence faster than the theoretical bound, justifying this fast empirical convergence of the generalization error requires careful theoretical analysis.