Stability and Generalization of Hypergraph Collaborative Networks

Graph neural networks have been shown to be very effective in utilizing pairwise relationships across samples. Recently, there have been several successful proposals to generalize graph neural networks to hypergraph neural networks to exploit more complex relationships. In particular, the hypergraph collaborative networks yield superior results compared to other hypergraph neural networks for various semi-supervised learning tasks. The collaborative network can provide high quality vertex embeddings and hyperedge embeddings together by formulating them as a joint optimization problem and by using their consistency in reconstructing the given hypergraph. In this paper, we aim to establish the algorithmic stability of the core layer of the collaborative network and provide generalization guarantees. The analysis sheds light on the design of hypergraph filters in collaborative networks, for instance, how the data and hypergraph filters should be scaled to achieve uniform stability of the learning process. Some experimental results on real-world datasets are presented to illustrate the theory.


Introduction
Many real-world applications involve datasets with graph structures that depict pairwise relationships between vertices. These applications span a wide range of domains, including text analysis [1,2], social network analysis [3], molecule classification in chemistry [4], point cloud processing [5], mesh generation [6], and knowledge graphs [7], to name a few. In some applications, the only available data are graphs, while in others, features of the vertices and edges are also available. Naturally, machine learning algorithms that can exploit both graphs and features often yield better results than algorithms that rely solely on graph structures. Of particular interest is the class of graph convolution network (GCN) methods [8][9][10]. These methods use neural network layers whose output at a vertex depends mostly on other vertices that are deemed relevant according to the graph structure. A common point of view is that a GCN generalizes traditional spatial filters, where graphs can represent neighborhoods that are not limited to spatial closeness.
While graphs depict pairwise relationships, hypergraphs can represent relationships among multiple vertices. A hypergraph is a generalization of a graph in which each edge, called a hyperedge, can connect any number of vertices. Hypergraphs can therefore represent more complex relationships. For example, in traditional paper-authorship networks, two articles (vertices) are connected if they share one or more coauthors. In this way, authorship information, which can provide important clues to the topic of the articles, is lost. Hypergraphs come to the rescue by treating each author as a hyperedge and each article as a vertex. Several works have been devoted to hypergraph learning. In [11], spectral clustering and semi-supervised parametric models on graphs were extended to hypergraphs. In [12], a tensor representation of hypergraphs was proposed to make optimization on hypergraphs more amenable. Generalizations of graph convolution networks to hypergraph convolution networks have been studied in [13][14][15][16]. The main task is to define a convolution operator on a hypergraph such that the transition probability between two vertices can be measured and the embeddings (or features) of each vertex can be propagated through a hypergraph convolution network. More propagation should take place between vertices connected by a common hyperedge.
Recently, [17] and [18] studied and developed the convolution of vertex and hyperedge features to suitably aggregate values in hypergraphs. It was shown that such hypergraph collaborative networks (HCoN) obtain superior results in some semi-supervised learning problems compared to other hypergraph neural network models. The proposal is to formulate the learning of vertex and hyperedge embeddings as a joint optimization problem, so that the vertex and hyperedge embeddings are updated simultaneously. The authors showed that the performance can be further boosted by incorporating a hypergraph reconstruction error as one of the objectives.
Apart from the development of machine learning algorithms and applications, efforts on the theory have also been made in the past decades, from general learning theory to specific properties of a class of algorithms. In this paper, we aim to study the algorithmic stability and generalization of the class of HCoN. We show that, in the single-layer case, the HCoN is algorithmically stable. As a result, the generalization gap (the difference between training error and testing error) converges to zero as the size of the training set increases. We explain the notions of training and testing in our context in Section 3.6 below. The analysis sheds light on the design of hypergraph filters in HCoNs, for instance, how the data and filters should be scaled to achieve the uniform stability of the learning process. Our generalization result is also valid for hypergraph convolution networks with the propagation of vertex embeddings only.
The rest of the paper is organized as follows. In Section 2, we review some works related to hypergraph neural networks and generalization guarantee studies. In Section 3, we introduce the HCoN model, some assumptions about the loss function and activation function, and some preliminary results. The results on the algorithmic stability of the HCoN and the generalization guarantee are established in Section 4. In Section 5, we present some experimental studies on the generalization gap to illustrate the theory. Some concluding remarks are given in Section 6.

Related Work
Generalization guarantees concern the expected difference between training and testing errors. Bousquet and Elisseeff [19] and Mukherjee et al. [20] showed that under suitable regularity assumptions, the stability of an algorithm implies generalization. In addition to the general theory, Bousquet and Elisseeff [19] also studied the stability and generalization of the global minimizer of some regularized learning models. Such results are therefore independent of the particular algorithm used to compute an approximate optimizer. Stochastic gradient descent (SGD) is a popular algorithm to obtain a suboptimal solution to optimization problems. Some classical results on the generalization of SGD in the case of a single pass of data are reported in [21]. The analysis in the case of multiple passes of SGD is reported in [22].
Stability and generalization analysis of SGD applied to graph convolution networks (GCNs) is studied in [23]. The stability is established in terms of the dominant eigenvalue of the graph filter. A difference between [23] and [19] is that the former considers the model parameters computed via SGD, whereas the latter considers the theoretical optimal model parameters. Our work generalizes the analysis of GCNs in [23] to hypergraph collaborative networks [17,18], which also include some edge features. A technical difference between our work and [23] is that the latter assumes some Lipschitz conditions on the composition of the loss function and the neural network outputs, whereas we require some Lipschitz conditions on the loss function only. Our assumptions are therefore easier to verify and more fundamental.
Various kinds of stability have been proposed and studied in the literature [19,20]. Similar to [23], we use uniform stability, which yields tighter bounds than other forms of stability, such as error stability, hypothesis stability and pointwise hypothesis stability [19]. Finally, another related work is [24], which studied regularized graph learning problems and devised some generalization guarantees. However, the resulting generalization gap is inversely proportional to the second smallest eigenvalue of the graph Laplacian matrix. Thus the bound can grow with the size of the graph. The bounds devised by us and [23] are independent of the graph size, and the generalization gap converges to zero with the sample size and the graph size.

Hypergraph Convolution Networks

Hypergraphs
A hypergraph is a generalization of a graph in which a hyperedge can connect any number of vertices. In the undirected version, each hyperedge is represented by a subset of vertices. A hypergraph is denoted by $G = (V, E)$, where $V$ is the set of vertices and $E = \{e : e \subset V\}$ is the set of hyperedges. The number of vertices and the number of hyperedges are denoted by $N = |V|$ and $M = |E|$, respectively. A hypergraph can also be represented by an incidence matrix $H \in \mathbb{R}^{N \times M}$, whose $(i, j)$-th entry equals 1 if the $j$-th hyperedge is connected to the $i$-th vertex, and equals 0 otherwise. For simplicity, we consider unweighted hypergraphs, but weights can be introduced into the hypergraph filters easily, as done in [13][14][15][16].
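As a small illustration of the incidence matrix, the following sketch builds $H$ for a toy authorship hypergraph. The function name `incidence_matrix` and the toy data are hypothetical and are introduced here only for illustration; the row and column sums are computed simply to show how vertex and hyperedge degrees can be read off from $H$.

```python
import numpy as np

def incidence_matrix(num_vertices, hyperedges):
    """Return the N x M 0/1 incidence matrix of a hypergraph.

    `hyperedges` is a list of vertex-index collections; column j has a 1
    in row i exactly when vertex i belongs to hyperedge j.
    """
    H = np.zeros((num_vertices, len(hyperedges)))
    for j, edge in enumerate(hyperedges):
        H[list(edge), j] = 1.0
    return H

# Hypothetical toy data: 4 articles (vertices) and 3 authors (hyperedges).
authors = [{0, 1}, {1, 2, 3}, {3}]
H = incidence_matrix(4, authors)
vertex_degrees = H.sum(axis=1)     # number of hyperedges containing each vertex
hyperedge_degrees = H.sum(axis=0)  # number of vertices in each hyperedge
```

Here each author is a hyperedge and each article is a vertex, so an article written by two of the authors has vertex degree 2.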
Generalizations of graph convolution networks to hypergraph convolution networks have been studied. Given a feature matrix $X_V \in \mathbb{R}^{N \times F_V}$ (where $F_V$ is the dimension of the vertex features) and the dimension $O$ of the output embeddings, the convolution operator on a hypergraph for the propagation of vertex features (or embeddings) is constructed as in (1), where $\sigma(\cdot)$ is an activation function, $Q_V \in \mathbb{R}^{F_V \times O}$ is a matrix of learnable parameters, $H \in \mathbb{R}^{N \times M}$ is the incidence matrix of the hypergraph, and $D_V = \mathrm{diag}(H\mathbf{1})$ and $D_E = \mathrm{diag}(H^\top \mathbf{1})$ are diagonal matrices with the degrees of the vertices and the degrees of the hyperedges on the diagonals. We note in (1) that the operator mixes the feature vectors of vertices connected by a common hyperedge. In this way, the feature vectors of the neighboring vertices of a vertex are propagated into the vertex and aggregated to form a new feature vector for the vertex. This model does not make use of hyperedge features.
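The propagation described above can be sketched numerically. The normalization used below, $D_V^{-1/2} H D_E^{-1} H^\top D_V^{-1/2}$, is an assumption borrowed from commonly used hypergraph convolution layers and may differ from the exact operator in (1); the function name `hypergraph_conv`, the choice of tanh, and the random data are likewise hypothetical.

```python
import numpy as np

def hypergraph_conv(H, X_V, Q_V, activation=np.tanh):
    """One vertex-propagation step on a hypergraph (illustrative only).

    ASSUMPTION: the symmetric normalization
        D_V^{-1/2} H D_E^{-1} H^T D_V^{-1/2}
    is used; it requires every vertex and hyperedge degree to be positive.
    """
    d_v = H.sum(axis=1)            # diagonal of D_V = diag(H 1)
    d_e = H.sum(axis=0)            # diagonal of D_E = diag(H^T 1)
    H_scaled = H / np.sqrt(d_v)[:, None]            # D_V^{-1/2} H
    filt = (H_scaled / d_e[None, :]) @ H_scaled.T   # times D_E^{-1} H^T D_V^{-1/2}
    return activation(filt @ X_V @ Q_V)

rng = np.random.default_rng(0)
H = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)  # 3 vertices, 2 hyperedges
X_V = rng.normal(size=(3, 4))                        # F_V = 4 vertex features
Q_V = rng.normal(size=(4, 2))                        # O = 2 output dimensions
embeddings = hypergraph_conv(H, X_V, Q_V)
```

In this toy hypergraph, vertices 0 and 2 share no hyperedge, so their features mix only through vertex 1, which belongs to both hyperedges.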

The Hypergraph Collaborative Network Model
Unlike the hypergraph convolution networks in (1), the hypergraph collaborative networks (HCoN) in [17] and [18] aim to predict vertex-level and hyperedge-level labels together, based on the hypergraph structure and a set of vertex features and hyperedge features. The single-layer hypergraph collaborative network we consider is given by the pair of encoders $f$ in (2) and $g$ in (3). In addition to the notation in (1), here $X_E \in \mathbb{R}^{M \times F_E}$ is a given matrix of hyperedge features, $F_E$ is the dimension of the hyperedge features, and the incidence matrix is used in its normalized form. The functions $f$ and $g$ are referred to as the vertex encoder and the hyperedge encoder, respectively. They are used to produce labels at vertices and hyperedges, respectively. In the sequel, we focus on the analysis of the vertex encoder (2) because the analysis of (3) is essentially the same.
We remark that the HCoN model described in [18] takes the more general form (4) and (5), which involves vertex weights $W$, hyperedge weights $U$, and weighting constants $\alpha$ and $\beta$. In this work, we assume that $W$ and $U$ are set to identity matrices for ease of presentation. We also absorb the constants $\alpha$ and $\beta$ into the parameters $Q_V$, $Q_E$, $P_V$, and $P_E$; the degrees of freedom of the model remain unchanged.

The Activation Function
Algorithmic stability concerns the change in the loss function value with respect to a change in the data. Since SGD aggregates gradients, it is natural that stability relies on the regularity of the activation function and the loss function. The activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is assumed to be Lipschitz continuous, with constant $\alpha_\sigma$, and smooth, with constant $\nu_\sigma$. Standard activations such as sigmoid, ELU, and tanh satisfy these assumptions. ReLU fails the smoothness assumption. However, one can consider a smoothed ReLU to restore the theory. It follows from the Lipschitz continuity that the derivative $\sigma'$ is bounded, i.e., $|\sigma'(x)| \le \alpha_\sigma$.
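To get a feel for the size of the regularity constants, the following sketch estimates them for tanh and sigmoid on a grid. It assumes that $\alpha_\sigma$ can be taken as $\sup|\sigma'|$ and $\nu_\sigma$ as $\sup|\sigma''|$; the helper `estimate_constants` and the grid range are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_constants(sigma, lo=-10.0, hi=10.0, num=200001):
    """Estimate sup|sigma'| and sup|sigma''| on a grid by finite differences."""
    x = np.linspace(lo, hi, num)
    d1 = np.gradient(sigma(x), x)   # approximates sigma'
    d2 = np.gradient(d1, x)         # approximates sigma''
    return np.abs(d1).max(), np.abs(d2).max()

alpha_tanh, nu_tanh = estimate_constants(np.tanh)
alpha_sig, nu_sig = estimate_constants(sigmoid)
```

The estimates are close to the analytic values: about 1 and $4/(3\sqrt{3}) \approx 0.77$ for tanh, and $1/4$ and $1/(6\sqrt{3}) \approx 0.096$ for sigmoid. ReLU, by contrast, has a derivative that jumps at the origin, so no finite smoothness constant exists.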

The Loss Function
Let $\hat{y}$ be an estimated label and let $y$ be the true label. The loss function $\ell : [y_{\min}, y_{\max}] \times [y_{\min}, y_{\max}] \to \mathbb{R}_+$ is denoted by $\ell(\hat{y}, y)$. The following assumptions are made.
By the nonnegativity and continuity of $\ell(\cdot, \cdot)$ and the compactness of $[y_{\min}, y_{\max}] \times [y_{\min}, y_{\max}]$, the loss function is bounded: $0 \le \ell(\hat{y}, y) \le \gamma_\ell$. In the case of the binary cross-entropy $\ell(\hat{y}, y) = -y \ln \hat{y}$, the Lipschitz condition does not hold. However, this is easily remedied by rescaling or clipping $\hat{y}$ to the range $[\epsilon, 1]$ for some $0 < \epsilon < 1$, so that $\ell(\hat{y}, y) \le -\ln \epsilon$.
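The clipping remedy can be checked numerically. The sketch below uses the cross-entropy term $-y \ln \hat{y}$ as written above, with labels assumed to lie in $[0, 1]$; the function name `clipped_cross_entropy` and the value of `eps` are hypothetical.

```python
import numpy as np

def clipped_cross_entropy(y_hat, y, eps=1e-3):
    """Cross-entropy term -y * ln(y_hat) with y_hat clipped to [eps, 1]."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -y * np.log(y_hat)

eps = 1e-3
y_hat = np.linspace(0.0, 1.0, 1001)  # predictions, including the endpoint 0
loss = clipped_cross_entropy(y_hat, y=1.0, eps=eps)
bound = -np.log(eps)                 # upper bound for labels y in [0, 1]
```

Without clipping, the loss at $\hat{y} = 0$ would be infinite; with clipping, the loss stays below $-\ln \epsilon$ and its derivative with respect to $\hat{y}$ is bounded by $1/\epsilon$ in magnitude.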
We remark that a Lipschitz condition on a composite function, involving both the loss function and the network output, is assumed in [23]. There, $\|\cdot\|_2$ denotes the Euclidean norm and $f(v|\theta)$ denotes the output of a neural network with parameters $\theta$ at vertex $v$; $\theta$ and $\theta'$ are two sets of parameters for the network, and $\nabla_\theta$ denotes the gradient operator with respect to $\theta$ (i.e., $\partial/\partial\theta$). Our assumption (9) is solely in terms of $\partial\ell/\partial\hat{y}$ and is more fundamental.

Network Outputs at a Single Vertex
In this subsection, we provide more details about the characteristics of the network outputs at each vertex and devise a bound of the hypergraph filter outputs that will be useful in the next section. We remark that a similar analysis can also be conducted for network outputs at each hyperedge. Here, we focus on the analysis of vertex propagation (2) only.
For a vertex v ∈ V, let e(v) be a binary vector whose j-th entry is 1 if v is the j-th vertex and is 0 otherwise.Denote the learnable parameters with θ where E X E .Note that A v is a linear combination of the vertex feature vectors over the neighboring vertices of v and B v is a linear combination of the edge feature vectors over the hyperedges joining v.For notational simplicity, we assume that the output dimension is O = 1 .The analysis also works for a general O with minor modifications.Let Then, we have Here, ∇ θ is the gradient operator w.r.t.θ and σ ′ is the derivative of σ.
Denote by ∥ • ∥ 2 the matrix 2-norm.Note that where µ( H) = ∥ H∥ 2 is the largest singular value of H.The last inequality follows from the fact that the diagonal entries of D E are positive integers so that ∥D It follows that Moreover, by ( 12) and ( 6), we have Finally, we remark that if the two terms in the sum in (2) are weighted by α and 1 − α respectively as described in ( 4) and ( 5), then ( 14) and ( 15) hold with

The SGD Algorithm
Let $\mathcal{D}$ be an unknown joint distribution of the vertex and the associated label. Let $S = \{(v_1, y_1), (v_2, y_2), \ldots, (v_n, y_n)\}$ be a set of $n$ i.i.d. samples from $\mathcal{D}$. The set $S$ serves as the training set for the HCoN. The objective function $L$ of an HCoN is the empirical loss over $S$. The learning task is a transductive semi-supervised task. The incidence matrix $H$ and the feature matrices $X_V$ and $X_E$ are considered given. The formation of a training set is a sampling of vertices (and the associated labels) in the network. The training process (or learning process) refers to the minimization of $L$ via SGD. A testing sample is another i.i.d. sample of a vertex in the network. The testing process refers to taking the output at the testing sample and comparing it to the known label. The sample space is therefore a finite space, to which the concentration inequality used below applies.
We consider $T$ iterations of the SGD algorithm, where the batch size is 1. At the $t$-th iteration, a sample $(v_{i_t}, y_{i_t})$ is drawn from $S$ with replacement. The parameters $\theta = (Q_V^\top, Q_E^\top)^\top$ are updated as follows:
$$\theta_t = \theta_{t-1} - \eta \nabla_\theta \ell(f(v_{i_t}|\theta_{t-1}), y_{i_t})$$
for $t = 1, 2, \ldots, T$, where $\eta > 0$ is the learning rate. The final parameters learnt are $\theta_T$, also denoted by $\theta(S, A)$. The variable $A$ denotes a particular randomization of SGD, i.e., the sequence $(i_1, i_2, \ldots, i_T)$.
Let $S'$ be a training set that differs from $S$ only in the $i^*$-th sample, and let $\theta' = (Q_V'^\top, Q_E'^\top)^\top$ be the parameters learned with $S'$. The SGD update is given by
$$\theta'_t = \theta'_{t-1} - \eta \nabla_\theta \ell(f(v'_{i_t}|\theta'_{t-1}), y'_{i_t})$$
for $t = 1, 2, \ldots, T$. The initial parameters for $S$ and $S'$ are set equal, i.e., $\theta_0 = \theta'_0$. The parameters learned, using the same randomization $A$ as that for $S$, are denoted by $\theta'_T = \theta'(S', A)$. The difference between the parameters $\theta_t$ and $\theta'_t$ at the $t$-th SGD iteration is denoted by $\Delta\theta_t$.
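The coupled SGD runs on two training sets that differ in one sample can be sketched as follows. This is only an illustration under stated assumptions: the network output is taken to be $\tanh$ of a linear form in precomputed filter outputs (stored in the rows of a hypothetical matrix `G`), the loss is the squared loss, and all data are random; none of these choices is prescribed by the text.

```python
import numpy as np

def sgd_single_layer(G, y, idx_seq, eta=0.05, theta0=None):
    """Batch-size-1 SGD for f(v|theta) = tanh(G[v] @ theta), squared loss.

    G[v] stacks the filtered features of sample v (an assumed stand-in for
    the hypergraph filter output); returns the iterates theta_0..theta_T.
    """
    theta = np.zeros(G.shape[1]) if theta0 is None else theta0.copy()
    iterates = [theta.copy()]
    for i in idx_seq:
        d = G[i] @ theta
        f = np.tanh(d)
        # gradient of 0.5*(f - y)^2 w.r.t. theta: (f - y) * tanh'(d) * G[i]
        grad = (f - y[i]) * (1.0 - f**2) * G[i]
        theta = theta - eta * grad
        iterates.append(theta.copy())
    return np.array(iterates)

rng = np.random.default_rng(0)
n, dim, T, i_star = 50, 5, 200, 7
G = rng.normal(size=(n, dim)) / np.sqrt(dim)
y = rng.uniform(-1, 1, size=n)
# S' differs from S only in sample i_star.
G2, y2 = G.copy(), y.copy()
G2[i_star] = rng.normal(size=dim) / np.sqrt(dim)
y2[i_star] = rng.uniform(-1, 1)

idx_seq = rng.integers(0, n, size=T)      # shared randomization A
theta_S = sgd_single_layer(G, y, idx_seq)
theta_S2 = sgd_single_layer(G2, y2, idx_seq)
delta = np.linalg.norm(theta_S - theta_S2, axis=1)  # ||theta_t - theta'_t||
```

Because both runs share the same initial parameters and the same index sequence, `delta` is exactly zero until the perturbed sample is drawn for the first time; after that, it tracks how far the two parameter trajectories drift apart.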
Since S and S ′ are identical except for the i * -th sample, if t * is the first iteration at which (v i * , y i * ) and (v ′ i * , y ′ i * ) are sampled, we have ∆θ t = 0 for all t < t * .Note that

Main Results
Following the approach of [22] and [23], we first establish the uniform stability of SGD and then obtain the generalization guarantees.

Uniform Stability
Recall that $S$ and $S'$ are two training sets that differ only by the $i^*$-th sample. In Lemma 1 and Lemma 2, we bound the gradient difference when the same sample and different samples are used, respectively. Then, we bound the difference between the SGD updates for $S$ and $S'$ in Lemma 3. The stability then follows in Theorem 4. Note that $\sigma'$ is the derivative of $\sigma$. Other variables with a prime ($'$) denote quantities derived from the perturbed training set $S'$.
Lemma 1 At the t-th SGD iteration, we have By (11) and ( 12), we have f = σ(d), f ′ = σ(d ′ ), and Therefore, by ( 14), we have Hence, □ Lemma 2 At the t-th SGD iteration, we have . By ( 8) and ( 15), we have □ Lemma 3 Let S and S ′ be two training sets that differ by one sample.Let θ = θ(S, A) and θ ′ = θ ′ (S ′ , A) be the graph filter parameters of the HCoN models trained using SGD for T iterations on S and S ′ , respectively.Then, the expected difference in the filter parameters is bounded by, where Note that the probabilities of the two scenarios considered in Lemma 1 and Lemma 2 are n−1 n and 1 n , respectively.By Lemma 1 and Lemma 2, we have where C := (α ℓ νσ + ν ℓ α 2 σ )g 2 max and C ′ := 2α ℓ ασgmax.Hence, we have Solving the above recursion with the initial condition ∆θ 0 = 0 yields

□
Next, we prove the uniform stability of the single-layer HCoN model trained using the SGD algorithm. Uniform stability was introduced in [19] for the study of nonrandomized algorithms, where the algorithms are assumed to be insensitive to the order of the training set. However, SGD is a randomized algorithm that depends on the order of the random samples. In [22], the notion of uniform stability was extended to randomized algorithms. A randomized algorithm is said to be uniformly stable if there exists a constant $\beta$, possibly depending on $n$, that controls the expected difference between the loss values $\ell(f(v|\theta), y)$ and $\ell(f(v|\theta'), y)$, uniformly over $S$, $S'$ and the sample $(v, y)$. Here, $\theta = \theta(S, A)$ and $\theta' = \theta'(S', A)$ are the parameters learned with the datasets $S$ and $S'$, respectively. Thus the difference in the loss function values is averaged over all randomizations. In this paper, we adopt this notion of uniform stability to establish a bound of the generalization gap.

Generalization Gap
Consider an HCoN trained on $S$ with a randomization $A$. Denote the output of the trained network by $f(v|\theta(S, A))$. Let $z = (v, y)$ be a random sample from $\mathcal{D}$. We denote the loss with respect to $z$ by $\ell(S, A, z) := \ell(f(v|\theta(S, A)), y)$.
The generalization error is defined by
$$R(S, A) := \mathbb{E}_{z \sim \mathcal{D}}[\ell(S, A, z)].$$
The empirical risk (a.k.a. training error) is defined by
$$R_{\mathrm{emp}}(S, A) := \frac{1}{n} \sum_{i=1}^{n} \ell(S, A, z_i),$$
where $z_i = (v_i, y_i)$ is the $i$-th sample in $S$. The generalization gap is given by
$$G(S) := \mathbb{E}_A[R(S, A) - R_{\mathrm{emp}}(S, A)].$$
Next, we present a perturbation result for $G$. It is a general result for stable algorithms. The proof can be found in the Appendix of [23]. However, we also include the proof here for completeness, and we have fixed a few minor typos of [23]. Recall that uniform stability holds by Theorem 4.
Lemma 5 (Gap perturbation) Let $S$ and $S'$ be two training sets that differ by their $i^*$-th samples. Then,
$$|G(S) - G(S')| \le \frac{2\kappa + \gamma_\ell}{n}.$$
Here, $\kappa$ is given by (19) and $\gamma_\ell$ is the upper bound of $\ell$.
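In practice, the generalization gap can only be estimated. The sketch below shows one plausible Monte Carlo estimate, in which the expectation over randomizations is replaced by an average over several trained models and the generalization error is replaced by the mean loss on held-out samples; the function name `generalization_gap` and the loss values are hypothetical.

```python
import numpy as np

def generalization_gap(train_losses, test_losses):
    """Monte Carlo estimate of the expected generalization gap.

    Both inputs have shape (num_randomizations, num_samples); each row holds
    the per-sample losses of one trained model (one SGD randomization).
    """
    train_losses = np.asarray(train_losses, dtype=float)
    test_losses = np.asarray(test_losses, dtype=float)
    empirical_risk = train_losses.mean(axis=1)  # training error per run
    test_risk = test_losses.mean(axis=1)        # proxy for the generalization error
    return float(np.mean(test_risk - empirical_risk))

# Hypothetical per-sample losses from 2 randomizations.
gap = generalization_gap([[0.1, 0.3], [0.2, 0.2]],
                         [[0.4, 0.4, 0.4], [0.3, 0.5, 0.4]])
```

For the toy numbers above, both runs have a training error of 0.2 and a test error of 0.4, so the estimated gap is 0.2.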
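Proof First, consider the perturbation of $R(S, A)$. Second, consider the perturbation of $R_{\mathrm{emp}}(S, A)$. Finally, combining the two estimates yields the result. □
For reference, we state the classical McDiarmid's concentration inequality [25], which bounds the probability of a deviation with respect to a random training set.
Proposition 6 (McDiarmid's inequality) Let $F$ be a function of $n$ independent samples. If there is a constant $c$ such that $|F(S) - F(S')| \le c$ for all $S$ and $S'$ that differ by one coordinate, then, for any $\epsilon > 0$, we have
$$P\big(F(S) - \mathbb{E}_S[F(S)] \ge \epsilon\big) \le \exp\left(-\frac{2\epsilon^2}{n c^2}\right).$$
We are now in a position to present our main result for the generalization gap.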
Theorem 7 (Generalization Gap) Consider a single-layer HCoN model trained on a dataset $S$ using the SGD algorithm for $T$ iterations. The following bound on the expected generalization gap holds for all $0 < \delta < 1$, with probability at least $1 - \delta$:
$$G(S) \le \frac{\kappa}{n} + (2\kappa + \gamma_\ell)\sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Here, $\kappa$ is given by (19) and $\gamma_\ell$ is the upper bound of $\ell$.
Proof Let $F(S) = G(S)$ be the generalization gap. By Lemma 5 and McDiarmid's concentration inequality, we have
$$P\big(G(S) - \mathbb{E}_S[G(S)] \ge \epsilon\big) \le \exp\left(-\frac{2\epsilon^2}{n c^2}\right),$$
where $c := \frac{2\kappa + \gamma_\ell}{n}$. Let $\delta = e^{-2\epsilon^2/(n c^2)}$. Note that
$$\epsilon = c\sqrt{\frac{n \ln(1/\delta)}{2}} = (2\kappa + \gamma_\ell)\sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Hence, the following holds with a probability of at least $1 - \delta$:
$$G(S) \le \mathbb{E}_S[G(S)] + (2\kappa + \gamma_\ell)\sqrt{\frac{\ln(1/\delta)}{2n}}.$$
It remains to be shown that $\mathbb{E}_S[G(S)] \le \kappa/n$, which follows from the uniform stability established in Theorem 4. □
Theorem 7 states that as $n \to \infty$, the gap converges to zero at the rate of $O(1/\sqrt{n})$, provided that $\kappa$ in (19) does not grow with $n$ and the network size ($M$ and $N$). It boils down to the requirement that the $g_{\max}$ given in (13) does not grow with $n$, $M$ and $N$. In general, $g_{\max}$ can grow with the size of the network. For example, if the entries of $X_V$ and $X_E$ are uniformly distributed on the interval $[0, 1]$, then $\|X_V\|_2$ and $\|X_E\|_2$ grow like $\sqrt{N F_V}$ and $\sqrt{M F_E}$, respectively. We therefore normalize each column of $X_V$ to a unit vector. Thus, we have
$$\|X_V\|_2 \le \|X_V\|_F = \sqrt{F_V},$$
which is a constant independent of the graph size. Here, $\|\cdot\|_F$ denotes the Frobenius norm. Likewise, we also normalize $X_E$ so that $\|X_E\|_2 \le \sqrt{F_E}$. Another factor that may affect the growth of $g_{\max}$ is the incidence matrix $H$. When $H$ is normalized, it can be shown that the dominant singular value of the normalized incidence matrix is bounded above by 1 (see [18]). In contrast, we have $\mu(H) = O(\sqrt{NM})$. With both kinds of normalization in place, $g_{\max}$ in (13) is bounded by a constant independent of the graph size. Likewise, the $g_{\max}$ in (16) is bounded by a constant independent of the graph size.
It is worthwhile to compare our results with some related works. In the study of stable algorithms in [19], the learnt model $\theta$ is assumed to be a global optimizer of the loss function, which is assumed to be convex. There are no SGD iterations. As such, the corresponding constant $\kappa$ is independent of $T$. However, the assumption of global optimality is impractical. The work in [22] considers regression models with $\theta_1, \ldots, \theta_T$ generated with SGD. The constant $\kappa$ is shown to be $O(T)$. However, there are also some rather strong convexity assumptions in the model that are not applicable to neural networks with nonlinearities. The bound in [23] for GCNs has the same kind of exponential dependence on $T$ as ours. This is due to the lack of convexity in the model. Moreover, SGD does not guarantee a monotonic reduction of the loss function even when the learning rate is coupled with a line search. Thus the situation encountered in [23] and in this paper is more involved but more practical. Nevertheless, we manage to establish the convergence with respect to the size $n$ of the training set.
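The rate in $n$ can be made concrete with a short calculation. Combining $c = (2\kappa + \gamma_\ell)/n$ with $\delta = e^{-2\epsilon^2/(nc^2)}$ from the proof gives $\epsilon = (2\kappa + \gamma_\ell)\sqrt{\ln(1/\delta)/(2n)}$, so one plausible reading of the bound is $\kappa/n + (2\kappa + \gamma_\ell)\sqrt{\ln(1/\delta)/(2n)}$. The sketch below evaluates this expression; the function name `gap_bound` and the numerical constants are hypothetical and carry no meaning beyond showing the scaling.

```python
import math

def gap_bound(n, kappa, gamma_l, delta):
    """Evaluate kappa/n + (2*kappa + gamma_l) * sqrt(ln(1/delta) / (2n))."""
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    deviation = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return kappa / n + (2.0 * kappa + gamma_l) * deviation

# Hypothetical constants, chosen only to show the scaling in n.
kappa, gamma_l, delta = 5.0, 1.0, 0.05
bounds = {n: gap_bound(n, kappa, gamma_l, delta) for n in (100, 400, 1600, 6400)}
ratios = [bounds[n] / bounds[4 * n] for n in (100, 400, 1600)]
```

Quadrupling $n$ halves the square-root term and quarters the $\kappa/n$ term, so each ratio lies between 2 and 4 and approaches 2 as $n$ grows, which is the $O(1/\sqrt{n})$ behavior.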
The result indicates the consistency between training and testing errors, which is important for the model to be useful. Of course, it relies on the well-known fundamental assumption that the training and testing sets are drawn from the same distribution. In Section 5, we will also numerically study the generalization gap with respect to different parameters.
In [17] and [18], vertex and hyperedge classification is studied. Experimental results on several benchmark datasets have shown that the performance of the hypergraph collaborative network is better than that of the baseline methods. Our theoretical results on the stability and generalization of hypergraph collaborative networks further support their usefulness.
Finally, we would like to remark that our theoretical analysis is also valid for hypergraph convolution networks (HCNs) in (1), where propagation of embeddings is done on vertices only. The HCoN model reduces to an HCN when the hyperedge feature matrix $X_E$ is set to the zero matrix and the hyperedge feature dimension $F_E$ is set to zero.
Corollary 8 (Generalization Gap) Consider a single-layer HCN model trained on a dataset $S$ using the SGD algorithm for $T$ iterations. The following bound on the expected generalization gap holds for all $0 < \delta < 1$, with probability at least $1 - \delta$:
$$G(S) \le \frac{\kappa}{n} + (2\kappa + \gamma_\ell)\sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Here, $\kappa$ is given by (19) and $\gamma_\ell$ is the upper bound of $\ell$.

Experiments
The purpose of this section is to numerically study the behavior of the generalization gap of HCoNs. We refer the reader to [18] for the accuracy of the model.

Datasets
We use the widely used benchmark datasets Citeseer [26], Cora [27] and PubMed [28] for evaluation. The three networks consist of 1498, 16313, and 3840 vertices, respectively. Citeseer and PubMed are cocitation datasets. In the hypergraph, each vertex represents a document. The set of citations of a document forms a hyperedge. The vertex features for Citeseer and PubMed are the bag-of-words vector representations and the term frequency-inverse document frequency (TF-IDF) vectors, respectively. Cora is a coauthorship dataset. Each vertex represents a document and each hyperedge represents an author, connecting the documents of that author. The vertex features are the TF-IDF vectors of the documents. A more detailed description of the feature vector generation process can be found in [18]. In each of the three networks, 30% of the vertices are used as the test set; 30%-70% of the vertices are used as the training set.

Effect of Incidence Matrix Normalization
In this experiment, we demonstrate the effect of normalizing the incidence matrix $H$. The normalized model is given in (4) and (5). The unnormalized model is obtained by replacing the normalized incidence matrix in (4) and (5) with $H$. Fig. 4 shows the generalization gap as SGD progresses. The size of the training set is fixed to 70%. The weight $\alpha$ is fixed to 0.9 because it yields the lowest gap and highest accuracy (see [18]). Each epoch represents a cycle of iterations over the whole training set. The gap values are obtained from a single randomization. First, the results show that when $H$ is normalized, the gap reaches a smaller level and converges faster as the number of iterations increases. This is predicted by our theoretical bound of the generalization gap, since the largest singular value of the normalized incidence matrix is at most 1, whereas $\mu(H)$ can grow like $\sqrt{NM}$. The theory thus explains why normalization is important. Second, regarding the results with normalized incidence matrices, although the theoretical bound grows with $T$ (the number of iterations), the gap stabilizes quickly in practice. This is because the bound represents the worst-case scenario, which is pessimistic.
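The contrast between the two largest singular values can be checked on a random hypergraph. The sketch assumes that the normalized incidence matrix has the form $D_V^{-1/2} H D_E^{-1/2}$, which is a common choice but is an assumption here; the helper `normalize_incidence` and the random hypergraph are hypothetical.

```python
import numpy as np

def normalize_incidence(H):
    """Return D_V^{-1/2} H D_E^{-1/2} (ASSUMED form of the normalization)."""
    d_v = H.sum(axis=1)   # vertex degrees
    d_e = H.sum(axis=0)   # hyperedge degrees
    return H / np.sqrt(d_v)[:, None] / np.sqrt(d_e)[None, :]

rng = np.random.default_rng(1)
N, M = 200, 80
H = (rng.random((N, M)) < 0.1).astype(float)
# Guarantee positive degrees: vertex i joins hyperedge i mod M.
H[np.arange(N), np.arange(N) % M] = 1.0

mu_raw = np.linalg.norm(H, 2)                              # largest singular value of H
mu_normalized = np.linalg.norm(normalize_incidence(H), 2)  # same for the normalized matrix
```

Under this normalization the largest singular value does not exceed 1 (up to rounding), while that of the raw matrix grows with the size and density of the hypergraph.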

Conclusion
In this paper, we have established uniform stability and generalization guarantees for single-layer hypergraph collaborative networks. The results show the importance of the normalization of the vertex and hyperedge features and of the hypergraph incidence matrix $H$. With these normalizations, we have a bound of the generalization gap independent of the graph size, and therefore, the generalization gap converges to zero. Some numerical experiments are presented to examine the convergence of the generalization gap in practice. Several future research directions will be considered. 1) Extend the analysis to multilayer HCoNs, which are more useful in practice. The main challenge is that the gradients of the network become highly nonlinear and are more difficult to estimate. 2) Consider more general first-order stochastic optimization algorithms to include other commonly used algorithms such as SGD with momentum and ADAM. The problem is to devise estimates that can precisely reflect the potential acceleration delivered by these algorithms and to study how the acceleration affects the stability and generalization gap. 3) While the present result has an advantage in that it does not assume any specific data distribution, it would also be useful to analyze the generalization gap in the presence of a data distribution. The idea is to improve the estimation of the expectations $\mathbb{E}_{z \sim \mathcal{D}}$ and $\mathbb{E}_S$ in Lemma 5, Proposition 6, and Theorem 7. Specifically, determine the distribution of the gap from the assumed distribution of data, and then devise a special case of McDiarmid's concentration inequality with an improved bound.

Fig. 1 shows the generalization gap as a function of the size of the training set. The weight $\alpha$ in (4) and (5) is varied to see its effects. The gap values are the average of 10 different randomizations to serve as a proxy for the expectation $\mathbb{E}_A[\cdot]$. As the size of the training set increases, the generalization gap decreases. This agrees with our theory, which predicts that the gap decreases as the size of the training set increases. For a fixed size of the training set, the gap is smaller as $\alpha$ increases; this indicates that for these datasets, the vertex features are more important for prediction than the hyperedge features.

Fig. 2 shows the generalization gap as a function of the size of the training set for different learning rates. The gap values are the average of 10 different randomizations. The generalization gap reduces as the size of the training set $n$ increases. Moreover, the gap decreases with the learning rate for each fixed $n$. Such phenomena are consistent with the bound devised in Theorem 7. To illustrate the convergence of the training process, we show the training and test loss function values at each epoch in Fig. 3. To avoid overwhelming the reader with too many figures, we show only the result for the 70% training sample, $\alpha = 0.9$ and $\eta = 0.01$. The loss function values are the average of 10 different randomizations. The graph shows that 1) the training loss function value is monotonically decreasing, and hence, the network fits the training data better; 2) the test loss function value is also monotonically decreasing, and hence, the generalization error is improving. The starting training and test loss function values are similar because the network parameters are initialized randomly. The difference between the training and the test loss function values at the final epoch constitutes the generalization gap reported in Fig. 1 and Fig. 2. Our theory predicts that the gap at a fixed $T$ decreases as the size $n$ of the training set increases; see Fig. 1 and Fig. 2 for the convergence of the gap as a function of $n$.

Fig. 1: Convergence of generalization gap as the training size n increases.

Fig. 2: Convergence of generalization gap as the training size n increases.

Fig. 3: Convergence of the training process as the number of iterations T increases.

Fig. 4: Convergence of the generalization gap as the number of iterations T increases.