Bounding the Rademacher Complexity of Fourier Neural Operators

The Fourier neural operator (FNO) is a physics-inspired machine learning method; in particular, it is a neural operator. Several types of neural operators have recently been developed, e.g., deep operator networks, the graph neural operator (GNO), and the multiwavelet-based operator (MWTO). Compared with these models, the FNO is computationally efficient and can learn nonlinear operators between function spaces without depending on a particular finite basis. In this study, we bound the Rademacher complexity of the FNO in terms of specific group norms. Using the capacity induced by these norms, we bound the generalization error of the model. In addition, we investigate the correlation between the empirical generalization error and the proposed capacity of the FNO. From our results, we infer that the type of group norm determines how much information about the weights and architecture of the FNO is stored in the capacity, and we confirm this inference through experiments. Based on this observation, we gain insight into the impact of the number of modes used in the FNO on the generalization error, and our experimental results follow these insights.


Introduction
Physics-inspired machine learning is an actively studied area in which two main approaches exist. One approach includes the deep Ritz method (Weinan and Yu (2018)), PINNs (Raissi et al. (2019)), and LSNN (Cai et al. (2021)); the other includes DeepONets (Lu et al. (2021)), MWTO (G. Gupta and Bogdan (2021)), GNO (Li et al. (2020)), and the Fourier neural operator (FNO) (Li et al. (2021)). The former approach focuses on determining solutions of a PDE for fixed PDE and boundary conditions, whereas the latter focuses on learning operators between function spaces. In this study, we focus on the FNO, which uses the Fourier transform to manage the convolution of two functions quickly and practically. One advantage of the FNO is its computational efficiency compared with other methods; moreover, unlike DeepONet, its representation is not limited to a finite-dimensional space spanned by a few basis functions. Previous studies (Li et al. (2021) and Pathak et al. (2022)) confirmed that the FNO can successfully approximate numerical solvers and real-world data, indicating its computational efficiency and potential applicability. Unlike many real-world machine-learning problems, approximating the solver operator of a PDE is deterministic and concrete. There is a result on the universal approximation property of the FNO and its approximation error for certain PDE problems (Kovachki et al. (2021a)). However, there is no result on estimating the generalization error of the FNO. Although approximating the solver operator of a PDE is a deterministic problem, we can provide only a finite number of samples to the FNO; therefore, the accuracy of inference on unseen data is another problem that must be considered. Several approaches to bounding the generalization error of deep neural networks exist, e.g., via the group norm of weights (Neyshabur et al. (2015)), the spectral norm (Bartlett et al. (2021)), the path norm (Neyshabur et al. (2015)), the Fisher-Rao norm (Liang et al. (2019)), and relative flatness (Petzka et al. (2021)). In this study, we investigate generalization error bounds in the framework of PAC learning theory; in particular, we bound the Rademacher complexity of the FNO. Figure 1 shows the overall structure of the FNO architecture. The input of the network is an R^{d_a}-valued function on a domain D ⊂ R^d. We denote the input function space of the FNO by A(D; R^{d_a}). The vector value of the input function is lifted to a d_v-dimensional vector using a layer denoted N_P. While passing iteratively through the Fourier layers (denoted A_i in the diagram), it is processed as an R^{d_v}-valued function. Each Fourier layer applies an activation function to the sum of a neural network applied to the input function and the convolution of the input function with a kernel parameterized by the weight R_i. After passing through the Fourier layers, the vector value of the R^{d_v}-valued function v_D is projected onto a d_u-dimensional vector using N_Q. We denote the output function space of the FNO by U(D; R^{d_u}). The neural network in each Fourier layer can be chosen arbitrarily; in our results, we choose it as a fully connected network (FCN) or a convolutional neural network (CNN). As computational machines cannot handle infinite-dimensional data, we construct the FNO model with finitely many parameters based on the aforementioned concept for real-world implementation.

Probably approximately correct (PAC) learning
PAC learning is a framework of statistical learning theory proposed by Leslie Valiant in 1984 (Valiant (1984)). One of the main concepts of PAC learning theory is the no-free-lunch (NFL) theorem, which states that it is not possible to simultaneously achieve low approximation and estimation errors. The tradeoff between these errors is closely related to the complexity of the hypothesis class. Various quantities related to this complexity determine learnability and the decay of the estimation error, e.g., the VC dimension, Rademacher complexity, and Gaussian complexity. These complexities are related, but there are several differences; for example, the VC dimension is independent of the training set, whereas the others are not. Neural networks and deep learning, as a subcategory of machine learning, can be analyzed within PAC learning theory. Recently, various studies have bounded the Rademacher complexity and VC dimension of hypothesis classes of neural networks. For instance, bounds on the Rademacher complexity of FCNs (Neyshabur et al. (2015)), RNNs (Minshuo et al. (2020)), and GCNs (Lv (2021)), and analyses of the VC dimension of neural networks (Sontag (1998)), have been obtained. In addition, there is a bound on the Rademacher complexity of DeepONet (Gopalani et al. (2022)), which is also a kind of neural operator. Weinan et al. (2020) estimated the generalization error of ResNet via a priori and a posteriori estimates.

Our Contributions
In this study, we define capacities of FNO models based on certain types of group norms. We bound the Rademacher complexity of the hypothesis class in terms of these capacities for two kinds of FNOs (Fourier layers with an FCN and with a CNN) and derive a posterior bound on the generalization error of FNO models. In Section 4, we experiment with data generated from a Burgers equation problem and verify the correlation between our bounds and empirical generalization errors. Through these experiments, we gain insight into how information about the model architecture and model weights is contained in the various types of capacity. We also qualitatively confirm that the empirical generalization error depends on the number of modes used in the FNO model.

Preliminary
Notation  Several indices are used in our discussion; to simplify formulas, we denote x_1, ..., x_d by x and k_1, ..., k_d by k. In addition, for a multi-index tensor inside a norm, the indices denoted by • are the ones over which the norm is computed.

Discretization of data  As the function space is infinite-dimensional, to treat the data and the operator numerically, we discretize the domain of the function and regard the function as a finite-dimensional vector. Let D_N = {x_1, ..., x_N} be a discretization of the domain D ⊂ R^d. Then, an R^m-valued function f is discretized into (f(x_1), ..., f(x_N)) ∈ R^{N×m}. Accordingly, we discretize A(D; R^{d_a}) and U(D; R^{d_u}) as R^{N×d_a} and R^{N×d_u}, respectively, and a sample is an element ((a_{jk}), (u_{jk})) ∈ R^{N×d_a} × R^{N×d_u}.
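As a concrete illustration (our own sketch, with an assumed grid and test function not taken from the paper), discretizing a scalar function on a uniform grid of N points turns it into a vector in R^{N×1}:

```python
import numpy as np

# Hypothetical example: discretize f(x) = sin(x) on D = [0, 2*pi)
# with a uniform grid of N points, giving an element of R^{N x m}, m = 1.
N = 1024
x = 2 * np.pi * np.arange(N) / N      # grid points x_1, ..., x_N
f = np.sin(x)                         # samples f(x_1), ..., f(x_N)
f_discretized = f.reshape(N, 1)       # discretized function in R^{N x 1}

print(f_discretized.shape)            # (1024, 1)
```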
Fourier transform  From Fourier analysis, we know that the Fourier transform carries the convolution operation to pointwise multiplication. For functions on a domain D ⊂ R^d, let F and F^{-1} be the Fourier and inverse Fourier transforms over D, respectively; thus, we obtain the relation F(f ∗ g) = F(f) · F(g). For our analysis, we take D = [0, 2π]^d. As we treat functions as discretized vectors, we can treat the Fourier transform as a discrete Fourier transform; if the discretization of D is uniform, it can be computed with a fast Fourier transform. Suppose D is discretized uniformly with resolution N_1 × ··· × N_d = N. Then, for a discretized function f ∈ R^N, its FFT F(f)(k) and IFFT F^{-1}(f)(x) are defined as

F(f)(k) = (1/N) Σ_x f(x) exp(−2πi Σ_j x_j k_j / N_j),   F^{-1}(f)(x) = Σ_k f(k) exp(2πi Σ_j x_j k_j / N_j).

Accordingly, we denote the components of the FFT and IFFT tensors by F_{kx} = (1/N) exp(−2πi Σ_j x_j k_j / N_j) and F†_{xk} = exp(2πi Σ_j x_j k_j / N_j), respectively.
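The convolution theorem underlying the Fourier layer can be checked numerically. The sketch below uses NumPy's FFT, whose normalization convention may differ from the paper's; the identity is unaffected by the convention:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
f = rng.standard_normal(N)
g = rng.standard_normal(N)

# Circular (periodic) convolution computed directly from the definition ...
direct = np.array([sum(f[j] * g[(i - j) % N] for j in range(N))
                   for i in range(N)])

# ... equals the inverse FFT of the pointwise product of the FFTs.
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.allclose(direct, via_fft))  # True
```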
Definition 1 (General FNO)  Let D_N be a discretized domain in R^d; then, an FNO is a map FNO : R^{N×d_a} → R^{N×d_u} of the form

FNO(a) = N_Q ∘ A_D ∘ ··· ∘ A_1 ∘ N_P (a),

where N_P and N_Q denote the neural networks for lifting and projection, respectively, and each A_i is a Fourier layer. For simplicity, we assume that N_Q and N_P are linear maps. Each Fourier layer is a composition of the activation function with the sum of a linear map and a convolution with a parameterized function. Only partial frequencies are used in the Fourier layers; the frequencies used in the model are collected in an index set of maximal modes k_max,1, ..., k_max,d. In detail, each Fourier layer acts as v ↦ σ(A_i v + F^{-1}(R_i · F(v))), where A_i is the weight of the linear part and R_i is the weight tensor acting on the retained Fourier modes.

CNN layer  For each Fourier layer, we can replace the general linear map with a CNN layer. A schematic diagram of the convolution of 2D data with a kernel is shown in Figure 2: a kernel of a certain size sweeps over the input tensor so that each index of the output is the inner product of the kernel with the local components of the input tensor centered at that index. For a d-rank input tensor and a kernel K, the tensor that passes through the CNN layer with K is defined componentwise by this sliding inner product.

Definition 2 (FNO with the CNN layer)  Consider the settings of the aforementioned FNO; the only difference is that each Fourier layer is the sum of a CNN layer and a convolution with parameterized functions.
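A minimal sketch of a single Fourier layer in the spirit of Definition 1 (our own illustrative implementation, not the authors' code; the shapes, ReLU activation, and truncation-to-lowest-modes choice are assumptions): the input is transformed with the FFT, only k_max modes are multiplied by the spectral weight R, the result is transformed back, a pointwise linear map W is added, and the activation is applied.

```python
import numpy as np

def fourier_layer(v, R, W, k_max):
    """One Fourier layer: v -> sigma(W v + IFFT(R * truncate(FFT(v)))).

    v: (N, d_v) real array, a discretized R^{d_v}-valued function.
    R: (k_max, d_v, d_v) complex array, spectral weights for the kept modes.
    W: (d_v, d_v) real array, the pointwise linear map.
    """
    v_hat = np.fft.fft(v, axis=0)                  # spectrum, shape (N, d_v)
    out_hat = np.zeros_like(v_hat)
    # Multiply only the first k_max frequencies by R (mode truncation).
    out_hat[:k_max] = np.einsum("kio,ki->ko", R, v_hat[:k_max])
    spectral = np.fft.ifft(out_hat, axis=0).real   # back to physical space
    return np.maximum(v @ W.T + spectral, 0.0)     # ReLU activation

rng = np.random.default_rng(0)
v = rng.standard_normal((64, 8))
R = rng.standard_normal((12, 8, 8)) + 1j * rng.standard_normal((12, 8, 8))
W = rng.standard_normal((8, 8))
out = fourier_layer(v, R, W, k_max=12)
print(out.shape)  # (64, 8)
```

Stacking D such layers between a lifting map N_P and a projection map N_Q gives the full FNO of Definition 1.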
An ideal operator would infer the solution from every function in the input function space; however, for practical, implementation-related reasons, finitely many training samples are drawn from a distribution on the vector space obtained by discretizing the function space. Suppose D is a distribution on R^{N×d_a} × R^{N×d_u}; then, we define the loss functions as follows.

Definition 3 (Loss for FNO)  Suppose the training dataset S = {(a_j, u_j)}_{j=1,...,m} is given, where each sample is chosen independently from the distribution D. The training loss is defined as

L_S(h) = (1/m) Σ_{j=1}^m ‖h(a_j) − u_j‖²_2.

Let p be the probability distribution of D on R^{N×d_a} × R^{N×d_u}. Then, the loss over the entire distribution D is defined as

L_D(h) = E_{(a,u)∼D} ‖h(a) − u‖²_2.
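Under the squared-L2 per-sample loss used above, the training loss is simply an empirical mean over the m discretized function pairs; a sketch (the `model` callable and toy samples here are hypothetical):

```python
import numpy as np

def training_loss(model, samples):
    """Empirical loss L_S(h) = (1/m) * sum_j ||h(a_j) - u_j||_2^2.

    samples: list of (a_j, u_j) pairs of discretized functions.
    """
    return np.mean([np.sum((model(a) - u) ** 2) for a, u in samples])

# Hypothetical usage with an identity "model" and two toy samples.
samples = [(np.ones(4), np.zeros(4)), (np.zeros(4), np.zeros(4))]
loss = training_loss(lambda a: a, samples)
print(loss)  # 2.0
```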

Generalization bound for FNOs
In this section, we calculate an upper bound on the Rademacher complexity of the FNO and, based on this bound, estimate the generalization bound. We first establish several lemmas underlying our main results. The proof of the main theorems comprises two main lemmas: an inequality for the Rademacher complexity and a bound on the supremum of the norm of FNO models. Using these lemmas, we prove our main results.

Mathematical Setup
Definition 4 (Rademacher complexity)  Let F be a class of mappings from X to R, and suppose {x_i ∈ X : i = 1, ..., m} is given. Let σ_1, ..., σ_m be independent, uniform, {+1, −1}-valued random variables. The empirical Rademacher complexity of F on the given sample set is defined as

R_S(F) = E_σ [ sup_{f∈F} (1/m) Σ_{i=1}^m σ_i f(x_i) ].

The following definitions are the main components of our results.

Definition 5 (weight norms and capacity)  For a multi-rank tensor M_{i_1,...,i_m,j_1,...,j_k}, we define the weight norm

‖M‖_{p:{i_1,...,i_m}, q:{j_1,...,j_k}} := ( Σ_{j_1,...,j_k} ( Σ_{i_1,...,i_m} |M_{i_1,...,i_m,j_1,...,j_k}|^p )^{q/p} )^{1/q}.

For p = ∞ or q = ∞, we take the sup-norm in place of the corresponding sum. Now, for an FNO with Fourier layers of depth D, we denote by Q the weight matrix of the projection, by P the weight matrix of the lifting, and by A_i and R_i the weight tensors of the Fourier layers. We then define ‖•‖_{p,q}, where p is the index for positions, frequencies, and inputs, and q is the index for outputs, together with a corresponding norm for each Fourier layer. The capacity of an FNO model h is defined as the product of the norms of the weights of its layers. We define norms for the CNN layers analogously: for the kernel tensor K of a CNN layer, we define norms of the weights and the capacity of the entire neural network, where in the ‖•‖_{p,q} norm for the kernel tensor, p is the index of the kernel and input, and q is the index of the output.
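A sketch of the (p, q) group norm of Definition 5 for a weight tensor, under our reading of the definition (inner l_p norm over the "input" index group, outer l_q norm over the "output" index group):

```python
import numpy as np

def group_norm(M, p, q, n_in_axes):
    """||M||_{p:(input axes), q:(output axes)} for a multi-rank tensor M.

    The first n_in_axes axes of M are treated as input indices (inner l_p
    norm); the remaining axes are output indices (outer l_q norm).
    Finite p, q only; p = inf or q = inf would use max() instead of sums.
    """
    in_axes = tuple(range(n_in_axes))
    inner = np.sum(np.abs(M) ** p, axis=in_axes) ** (1.0 / p)  # l_p over inputs
    return np.sum(inner ** q) ** (1.0 / q)                     # l_q over outputs

# Sanity check: for p = q = 2 this reduces to the Frobenius norm.
M = np.arange(6, dtype=float).reshape(2, 3)
print(np.isclose(group_norm(M, 2, 2, 1), np.linalg.norm(M)))  # True
```

The capacity of a model is then the product of such norms over its layers (P, the Fourier-layer weights, and Q).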
Next, we define hypothesis classes, of which the Rademacher complexity is bounded in our results.A hypothesis class is a collection of functions, from which a learning algorithm selects a function.
Definition 6 (Hypothesis classes of FNO)  Consider the function classes of FNOs with depth D and maximal modes k_max,1, ..., k_max,d of the Fourier layers, where the width, the size of the input vector, the size of the output vector, and the activation function are fixed. We define the hypothesis class for a general FNO as the set of such FNOs whose layerwise norms are bounded by given constants, and, finally, the hypothesis class of FNOs with CNN layers analogously. We also define an auxiliary hypothesis class of sub-neural networks of FNO models whose terminal layer is the i-th Fourier layer (denoted FNO_{sub:i}).

Main Results
The notation in each lemma and theorem follows the definitions of Section 3.1, and the activation function is assumed to be Lipschitz continuous. We set our notation as follows: for a given sample S = {a_i}_{i=1,...,m} (where the a_i are the input data) and hypothesis class H^{d_in}_{C_P,C_0,...,C_t}, we denote h(a_i) by v_{t,i} for h ∈ H^{d_in}_{C_P,C_0,...,C_t}, with components denoted v_{t,i,x_j}.
The following lemma on l_p norms is used frequently in our proofs.

Lemma 1 (norm inequality)  If 1 ≤ p ≤ q ≤ ∞, then for v ∈ R^N we have

‖v‖_q ≤ ‖v‖_p ≤ N^{1/p − 1/q} ‖v‖_q.

Let (•)_+ denote the ReLU function. Then, for arbitrary 1 ≤ p, q ≤ ∞, we have ‖(v)_+‖_{p,q} ≤ ‖v‖_{p,q}, since |(x)_+| ≤ |x| componentwise.

The following lemma is required to handle the nonlinear loss in our proof; a proof can be found in (Maurer (2016)).

Lemma 2 (Vector-contraction inequality for the Rademacher complexity)  Assume that σ is a Lipschitz continuous function with Lipschitz constant L, and F is a hypothesis class of R^N-valued functions. Then

E sup_{f∈F} Σ_i ε_i σ(f(x_i)) ≤ √2 L E sup_{f∈F} Σ_{i,k} ε_{ik} f_k(x_i),

where the ε_i and ε_{ik} are independent Rademacher variables.
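Lemma 1 can be checked numerically; the sketch below verifies the l_p-embedding inequality ‖v‖_q ≤ ‖v‖_p ≤ N^{1/p−1/q}‖v‖_q (as we have stated it) on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 100, 1.5, 4.0          # requires 1 <= p <= q

ok = True
for _ in range(1000):
    v = rng.standard_normal(N)
    norm_p = np.sum(np.abs(v) ** p) ** (1 / p)
    norm_q = np.sum(np.abs(v) ** q) ** (1 / q)
    # ||v||_q <= ||v||_p <= N^(1/p - 1/q) * ||v||_q
    ok &= norm_q <= norm_p <= N ** (1 / p - 1 / q) * norm_q + 1e-12
print(ok)  # True
```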
We now prove our main results. The proof comprises two parts: first, we obtain an upper bound on the p*-norm of the output of FNO models; then, we bound the Rademacher complexity of the FNO on samples using this upper bound. In our discussion, we assume that the projection and lifting layers are linear maps; however, this can easily be generalized to a general FCN.
Lemmas 3 and 3' are the main ingredients of our result; in them, the Fourier layers are inductively peeled off.
Lemma 3  Suppose H = H^{d_in}_{C_P,C_1,...,C_D,C_Q} is the hypothesis class of the FNO with constants C_P, C_1, ..., C_D, C_Q. Then, for a sample a ∈ R^{N×d_a}, we obtain the following inequality on ‖h(a)‖_{p*}. For fixed x, z, k, the quantity F†_{xk} F_{kz_k} is a (k_max,1, ..., k_max,d)-dimensional vector, each component of which is of the form e^{ib}/N. Thus, by applying Hölder's inequality, we obtain a bound on the output of each Fourier layer. Applying this bound iteratively over the layers, and combining the two inequalities so obtained, yields the claimed bound with a factor (N H). Hölder's inequality is used in (1) and (2), and the norm inequality in (3) and (4).
The proof of the following lemma is similar to that of Lemma 3; here, however, the hypothesis class comprises FNOs with CNN layers.

Lemma 3'  Suppose H = H^{CNN,d_in}_{C_P,C_1,...,C_D,C_Q} is the hypothesis class of an FNO with CNN layers with constants C_P, C_1, ..., C_D, C_Q. Then, for a sample a ∈ R^{N×d_a}, we obtain the analogous inequality. We need only modify the induction over the Fourier layers in the proof of Lemma 3: we use Hölder's inequality in (5), and then apply the p*-norm over x, j together with the norm inequality to obtain the desired bound. The remainder of the proof is similar to that of Lemma 3.
Lemma 4  Suppose H^{d_in}_{C_P,C_1,...,C_D,C_Q} is the hypothesis class of the FNO with given constants C_P, C_1, ..., C_D, C_Q. Then, for samples S = {a_i}_{i=1,...,m}, we obtain the following inequality, where the norm inequality is used in (6).
Theorem 1  Suppose H^{d_in}_{C_P,C_1,...,C_D,C_Q} is a hypothesis class with constants C_P, C_1, ..., C_D, C_Q. Then, for samples S = {a_i}_{i=1,...,m}, we obtain the following inequality.

Theorem 2 (FNO with a CNN layer)  Suppose H^{CNN,d_in}_{C_P,C_1,...,C_D,C_Q} is a hypothesis class with constants C_P, C_1, ..., C_D, C_Q. Then, for samples S = {a_i}_{i=1,...,m}, we obtain the following inequality.

Corollary 1  For a constant γ > 0, consider the hypothesis class H_{γ_{p,q} ≤ γ}, the collection of FNOs with capacity γ_{p,q} ≤ γ. For samples S = {a_i}_{i=1,...,m}, we obtain the following inequality; for the analogously defined hypothesis class H^{CNN}_{γ_{p,q} ≤ γ} and training samples S = {a_i}_{i=1,...,m}, we obtain the corresponding inequality.

Proof  Since the upper bound on the p*-norm of the models in this hypothesis class is the same as in Lemma 3, we can apply the same logic as in Theorem 1, which gives the first inequality. Similarly, based on the same argument, we obtain the inequality for the FNO with CNN layers.
Recall the following well-known theorem, which gives a statistical bound on the generalization error of a given hypothesis class in terms of its Rademacher complexity; this fundamental theorem can be found in (Shalev-Shwartz and Ben-David (2014)).
Theorem 3 (Generalization error bound based on the Rademacher complexity)  Let H be a hypothesis class and l : H × Z → R a loss function such that |l(h, z)| ≤ c for all h ∈ H and z ∈ Z. Then, with probability at least 1 − δ, for all h ∈ H, we obtain the stated bound, where D is a probability distribution on Z and S is a training dataset sampled i.i.d. from D.
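For reference, one standard form of this bound from (Shalev-Shwartz and Ben-David (2014)), written here as we reconstruct it (the displayed formula did not survive extraction), reads:

```latex
% With probability at least 1 - \delta over S ~ D^m, for all h in H,
% where the loss is bounded by c and R_S denotes the empirical
% Rademacher complexity of the loss class on S:
L_{\mathcal{D}}(h) \;\le\; L_{S}(h)
  \;+\; 2\,\mathbb{E}_{S}\,\mathfrak{R}_{S}\!\left(\ell \circ \mathcal{H}\right)
  \;+\; c\sqrt{\frac{2\ln(2/\delta)}{m}}
```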
Before considering the generalization bound of the FNO, we choose the distribution D on R^{N×d_a} × R^{N×d_u} to have compact support, so that the condition |l(h, z)| ≤ c of Theorem 3 holds. Then, using Theorem 3 and Corollary 1, we obtain the following estimate of the generalization error bound.

Theorem 4 (Generalization error bound for FNO)  For a training dataset S = {(a_i, u_i)}_{i=1,...,m} sampled i.i.d. from a probability distribution D, and for the hypothesis class H_{γ_{p,q} ≤ γ}, let h be the ERM minimizer of L_S, and suppose ‖h(a) − u‖_2 ≤ 2 for all (a, u) ∼ D and h ∈ H_{γ_{p,q} ≤ γ}. Then, with probability at least 1 − δ, we obtain the following inequality. Similarly, for the hypothesis class H^{CNN}_{γ_{p,q} ≤ γ} of FNOs with CNN layers and the dataset S, let h^{CNN} be the ERM minimizer of L_S, and suppose ‖h(a) − u‖_2 ≤ 2 for all (a, u) ∼ D and h ∈ H^{CNN}_{γ_{p,q} ≤ γ}. Then, with probability at least 1 − δ, we obtain the corresponding inequality.

Experiments
In this section, we validate our results with experiments. First, we investigate the correlation between our capacity and the empirical generalization error for various choices of p and q. Then, we check the dependency of the generalization error on the model architecture by varying k_max.

Data specification  For our experiments, we synthesize a dataset from the following Burgers equation problem. The domain of the problem is a circle, which we discretize uniformly with N = 1024. As described in Section 2, each data point is a pair of functions: the input function is an initial condition, and the target function is the solution of the equation at t = 0.5. Each input function is generated from a Gaussian random field with covariance k(x, y) = exp(−(x − y)²/(0.05)²). The training dataset comprises 800 pairs of functions and the test dataset 200 pairs (both generated independently).

Correlation for various p and q  We checked the correlation between the generalization error and the capacities. Each point in Figure 3 represents a model trained with randomly chosen hyperparameters. The models in this experiment are organized as follows: Fourier layers of depth 2, linear layers without activation for the projection and lifting layers, and a fixed width of 64. For each of 100 training runs, the weight decay is randomly chosen from {0, 0.02, 0.04, 0.06, 0.08}, k_max is randomly chosen from {8, 12, 16, 20}, and the kernel size is randomly chosen from {1, 3, 5, 7}.
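Input functions with the squared-exponential covariance above can be sampled on the periodic grid by a spectral method; a sketch with the paper's parameters (N = 1024, length scale 0.05), though the authors' exact sampler is not specified:

```python
import numpy as np

def sample_grf(n_samples, N=1024, length_scale=0.05, seed=0):
    """Sample periodic Gaussian random fields on [0, 1) with
    covariance k(x, y) = exp(-(x - y)^2 / length_scale^2)."""
    rng = np.random.default_rng(seed)
    x = np.arange(N) / N
    # First row of the circulant covariance matrix (periodic distance).
    d = np.minimum(x, 1 - x)
    c = np.exp(-(d ** 2) / length_scale ** 2)
    # Eigenvalues of a circulant matrix are the FFT of its first row.
    eig = np.fft.fft(c).real.clip(min=0)
    noise = (rng.standard_normal((n_samples, N))
             + 1j * rng.standard_normal((n_samples, N)))
    # Color complex white noise by sqrt of the spectrum; the sqrt(N)
    # factor compensates for NumPy's ifft normalization.
    samples = np.fft.ifft(noise * np.sqrt(eig), axis=1).real * np.sqrt(N)
    return samples

u0 = sample_grf(4)
print(u0.shape)  # (4, 1024)
```

Each row of the returned array is one discretized initial condition; feeding these to a numerical Burgers solver would produce the target functions at t = 0.5.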
Figure 3: Scatter plot of generalization error vs. capacity for p = 1.2, q = 1.2

          p = 1    p = 1.2  p = 1.6  p = 2    p = 4    p = ∞
  q = 1   0.8757   0.9137   0.7794   0.7595   0.7542   0.7285
  q = 1.2 0.8395   0.9358   0.8007   0.7635   0.7526   0.7265
  q = 1.6 0.8127   0.9007   0.8476   0.7750   0.7495   0.7231
  q = 2   0.8037   0.8720   0.8815   0.7860   0.7466   0.7204
  q = 4   0.7919   0.8417   0.9084   0.7938   0.7322   0.7112
  q = ∞   0.7555   0.8229   0.8859   0.7765   0.7235   0.7219

Table 1: Correlation between the empirical generalization error and capacities for various p and q, for models trained with randomly chosen hyperparameters

Table 1 lists the correlations for various values of p and q. As observed, the correlation tends to decrease as p and q increase. This is because, as p increases, the p-norm loses information about all elements other than the largest one; thus, information about the model weights is lost in a capacity defined with high values of p and q. In exchange, however, as p goes to ∞, p* goes to 1, so the terms involving the kernel size and k_max affect the capacity more as p increases. Therefore, we expect capacities with high p to contain more information about the model architecture. To test this, we conduct experiments in which k_max varies while the other hyperparameters are fixed. First, to show that capacities with low p and q carry more information about the model's weights than about its architecture, we trained three types of models with negligible differences in k_max. Second, to show that capacities with high p and q are more related to the model architecture, we trained three types of models with considerable differences in k_max. In each experiment, we trained 30 models for each k_max setting, i.e., 14, 16, and 18 for the left column of Figure 4, and 10, 30, and 50 for the right column. The hyperparameters other than k_max are fixed: the kernel size of the CNN layer is 1, the width is 64, and the depth of the Fourier layers is 2.
As evident in the left column of Figure 4, models with small gaps in k max lose the correlation between the generalization gap and capacity as the p and q values increase.However, as shown in the right column of Figure 4, the highest correlation between capacity and generalization error is obtained for higher p and q values compared with those of the left column case.The correlation is maintained at 0.89 for the p, q = ∞ case.
Figure 4: Left: Scatter plot, correlation and linear regression between generalization error and capacities of various p and q for 30 trained models for each k max = 14, 16, 18; Right: Scatter plot, correlation and linear regression between generalization error and capacities of various p and q for 30 trained models for each k max = 10, 30, 50.


Conclusion
We investigated bounds on the Rademacher complexity of the FNO and defined its capacity, which depends on the model architecture and on group norms of the weights. Although several results already exist on bounding the Rademacher complexity of various types of neural networks, the FNO possesses weight tensors of rank higher than two; therefore, our study may be helpful for other neural networks containing higher-rank tensors. We validated our results through experiments, from which we gained insight into the impact of the values of p and q and into the information about model weights and architecture stored in the capacities. Various neural operators have been developed, including the FNO and DeepONet; however, a detailed PAC-learning analysis of these neural operators has not yet been performed, so this study may serve as a guide for such analyses. The original FNO of (Li et al. (2021)) was implemented with the GELU activation function, which contains parameters; in this study, we assumed the activation function to be fixed, and for a general model containing a parameterized activation such as GELU, our analysis would need to be modified. Finally, although the Rademacher complexity contains information about the dataset, our bounds lack a specific dependency on each problem; as we experimented with various PDE problems, the performance of the FNO varied across problems, so the complexities need to be extended to include information about the datasets.

Figure 1: (a) Sketch of the overall architecture of the FNO; (b) detailed diagram of the Fourier layers

Figure 5: Left: Scatter plot and regression between the generalization error divided by the norms of the Fourier layers and p*√k_max for various p, q, where the depth of the Fourier layer is 1; Right: the same, where the depth of the Fourier layer is 2.

Similarly, based on the aforementioned proof, we obtain the inequality for the FNO with CNN layers. Now, if the capacity of an FNO model h is γ, it is included in the hypothesis class H_{γ_{p,q} ≤ γ}. Since the inequalities in Theorem 4 hold for all hypotheses in the class, we have the following posterior estimate of the FNO for given architecture parameters N, H, d_u, d_a, L and training samples {(a_i, u_i)}_{i=1,...,m} with ‖a_i‖_{p*} ≤ B for all i, where h is the trained FNO (Fourier layers with an FCN or a CNN).