A Distribution Dependent and Independent Complexity Analysis of Manifold Regularization

Manifold regularization is a commonly used technique in semi-supervised learning. It guides the learning process by enforcing that the classification rule we find is smooth with respect to the data-manifold. In this paper we present sample and Rademacher complexity bounds for this method. We first derive distribution \emph{independent} sample complexity bounds by analyzing the general framework of adding a data-dependent regularization term to a supervised learning process. We conclude that for these types of methods one can expect the sample complexity to improve by at most a constant, which depends on the hypothesis class. We then derive Rademacher complexity bounds which allow for a distribution \emph{dependent} complexity analysis. We illustrate how our bounds can be used for choosing an appropriate manifold regularization parameter. With our proposed procedure there is no need to use an additional labeled validation set.


Introduction
In many applications, as for example image or text classification, gathering unlabeled data is easier than gathering labeled data. Semi-supervised methods try to extract information from the unlabeled data to get improved classification results over purely supervised methods. A well-known technique to incorporate unlabeled data into a learning process is manifold regularization (MR) [6,16]. This procedure adds a data-dependent penalty term to the loss function that penalizes classification rules that behave non-smoothly with respect to the data distribution. This paper presents a sample complexity and a Rademacher complexity analysis for this procedure and illustrates how our Rademacher complexity bounds can be used for choosing a suitable manifold regularization parameter. We organize our paper as follows. In Sections 2 and 3 we discuss related work and introduce the semi-supervised setting.
In Section 4 we formalize the idea of adding a distribution-dependent penalty term to a loss function. Algorithms such as manifold, entropy or co-regularization [6,13,18] follow this idea. Our formalization of this idea is inspired by Balcan and Blum [3] and allows for a similar sample complexity analysis. Section 5 reviews the work of Balcan and Blum [3] and generalizes a sample complexity bound from their paper. We then show how this bound can be used to derive sample complexity bounds for the proposed framework, and thus in particular for MR. In the same section we discuss what our findings mean in terms of the performance gap between supervised and semi-supervised methods. We conclude that if our hypothesis set has finite pseudo-dimension, any semi-supervised learner (SSL) that falls in our framework can have at most a constant improvement in terms of sample complexity when compared to all supervised learners. This relates to the work of Darnstädt et al. [11] and Ben-David et al. [7], and we argue that we generalize their results.
While our results from the previous sections hold uniformly over all distributions and are thus distribution independent, we show in Section 6 how one can obtain distribution dependent complexity bounds for MR. We review a kernel formulation of MR [19] and show how this can be used to estimate Rademacher complexities for specific datasets.
In Section 7 we illustrate on an artificial dataset how the distribution dependent bounds can be used for choosing the regularization parameter of MR. This is particularly useful as this analysis does not need an additional labeled validation set, and in semi-supervised settings labeled samples are often assumed to be scarce.
In Section 8 we discuss our results.
We summarize the three core contributions of this paper.
1. We formalize the concept of distribution-dependent regularization for semi-supervised learning and derive distribution independent sample complexity bounds for it. This includes in particular sample complexity bounds for MR.
2. We compare the derived upper bounds with the lower bounds of purely supervised learners. We show that for all SSLs that fall in our framework, the performance gap, when compared to their supervised counterpart, can be at most a constant. The constant depends on the hypothesis class used.
3. We derive a computationally feasible method to compute distribution dependent Rademacher complexity bounds for MR. We illustrate how these complexity bounds can be used to find a suitable regularization parameter for MR, removing the need for a labeled validation set.

Related Work
There are currently two related analyses of MR that show, to some extent, that a SSL can learn efficiently if it knows the true underlying manifold, while a fully supervised learner may not. Globerson et al. [12] investigate a setting where they essentially restrict themselves to distributions over the input space X that correspond to unions of irreducible algebraic sets of a fixed size k ∈ N, and each algebraic set is either labeled 0 or 1. A SSL that knows the true distribution on X can identify the algebraic sets and reduce the hypothesis space to all $2^k$ possible label combinations on those sets. As we are left with finitely many hypotheses we can learn them efficiently, while they show that every supervised learner is left with a hypothesis space of infinite VC dimension.
A similar type of analysis was done by Niyogi [16]. Niyogi considers manifolds that arise as embeddings from a circle, where the labeling over the circle is (up to the decision boundary) smooth. Niyogi then shows that a learner that has knowledge of the manifold can learn efficiently while for every fully supervised learner one can find an embedding and a distribution for which this is not possible.
The relation to our paper is as follows.
They provide examples where the gap in sample complexity between a semi-supervised and a supervised learner can be arbitrarily large, while we explore the general difference in sample complexity between a supervised method and MR.

The Semi-Supervised Setting
We work in the statistical learning framework. That means we assume we are given a feature domain $X$ and a label space $Y$ together with an unknown probability distribution $P$ over $X \times Y$. Additionally we are given a loss function $\phi : Y \times Y \to \mathbb{R}$, which is convex in the first argument and in practice often a surrogate loss function for the 0-1 loss, as for example the hinge loss. A hypothesis $f$ is a function $f : X \to Y$. We set $(X, Y)$ to be a random variable distributed according to $P$, while lowercase $x$ and $y$ are elements of $X$ and $Y$ respectively. Our goal is to find a hypothesis $f$, within a restricted class $F$, such that the expected loss $Q(f) := E[\phi(f(X), Y)]$ is small. Throughout this paper we call models that minimize $E[\phi(f(X), Y)]$ discriminative models, as the loss only depends on the output of the model $f(x)$ and not on the input $x$. In the standard supervised setting we choose a hypothesis $f$ based on an i.i.d. sample $S_n = \{(x_i, y_i)\}_{i \in \{1,\dots,n\}}$ drawn from $P$. With that we define the empirical risk of a model $f \in F$ with respect to $\phi$, measured on the sample $S_n$, as $\hat{Q}(f, S_n) = \frac{1}{n} \sum_{i=1}^n \phi(f(x_i), y_i)$. For ease of notation we sometimes omit $S_n$ and just write $\hat{Q}(f) = \hat{Q}(f, S_n)$. Given a learning problem defined by $(P, F, \phi)$ and a labeled sample $S_n$, one way to choose a hypothesis is by the empirical risk minimization principle
$$f_{\text{sup}} = \arg\min_{f \in F} \hat{Q}(f, S_n). \quad (1)$$
We refer to $f_{\text{sup}}$ as the supervised solution as it does not depend on the unlabeled sample. In SSL we additionally have samples with unknown labels. So we assume to have $n+m$ samples $(x_i, y_i)_{i \in \{1,\dots,n+m\}}$ independently drawn according to $P$, where $y_i$ has not been observed for the last $m$ samples. We call the first $n$ samples $S_n$ the labeled samples. The last $m$ samples $\{x_i\}_{i \in \{n+1,\dots,n+m\}}$ are referred to as the unlabeled samples. We furthermore set $U = \{x_1, \dots, x_{n+m}\}$, so $U$ is the set that contains all our available information about the feature distribution.

Proposed Framework for Semi-Supervised Learning
We propose to include the unlabeled sample in the following manner. We first introduce a second convex loss function $\psi : F \times X \to [0, 1]$ that only depends on the input feature and a hypothesis. We refer to $\psi$ as the unsupervised loss as it does not depend on any labels. We propose to add the unlabeled data through the loss function $\psi$, added as a penalty term to the supervised loss, to obtain the semi-supervised solution
$$f_{\text{semi}} = \arg\min_{f \in F} \hat{Q}(f, S_n) + \lambda \hat{R}(f, U), \quad (2)$$
where $\lambda > 0$ controls the trade-off between the supervised and the unsupervised loss. For ease of notation we set $\hat{R}(f, U) = \frac{1}{n+m} \sum_{i=1}^{n+m} \psi(f, x_i)$. We do not claim any novelty for the idea of adding an unsupervised loss for regularization; a different framework can be found in [10, Chapter 10]. We are, however, not aware of this particular formulation of the framework and a sample complexity analysis as presented here. As we are in particular interested in the class of MR schemes, we first remark that this method indeed fits in our framework.
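The objective in Equation (2) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation, under assumptions not fixed by the paper: a linear model $f(x) = w \cdot x$, the squared loss as the supervised loss $\phi$, and a caller-supplied unsupervised loss `psi` mapping to $[0, 1]$.

```python
import numpy as np

# Minimal sketch of the semi-supervised objective in Equation (2).
# Assumptions (ours, for illustration): linear model f(x) = w @ x,
# squared loss as phi, and psi(w, x) in [0, 1] supplied by the caller.

def semi_supervised_objective(w, X_lab, y_lab, X_all, psi, lam):
    """Q_hat(f, S_n) + lam * R_hat(f, U) for f(x) = w @ x."""
    sup_loss = np.mean((X_lab @ w - y_lab) ** 2)       # empirical risk Q_hat
    unsup_loss = np.mean([psi(w, x) for x in X_all])   # penalty R_hat over U
    return sup_loss + lam * unsup_loss
```

Minimizing this objective over `w` with any convex solver yields a semi-supervised solution; setting `lam = 0` recovers the supervised empirical risk minimization of Equation (1).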
Example: Manifold Regularization. Overloading the notation, we now write $P(X)$ for the distribution $P$ restricted to $X$. In MR one assumes that the input distribution $P(X)$ has support on a compact manifold $M \subset X$. Our assumption is that the predictor $f \in F$ varies smoothly in the geometry of $M$ [6]. There are several regularization terms that can enforce this smoothness, one of which is $\int_M \|\nabla_M f(x)\|^2 \, dP(x)$. Belkin and Niyogi [5] show that $\int_M \|\nabla_M f(x)\|^2 \, dP(x)$ may be approximated with a finite sample of $X$ drawn from $P(X)$. Given such a sample $U = \{x_1, \dots, x_{n+m}\}$ one first defines a weight matrix $W$, where $W_{ij} = e^{-\|x_i - x_j\|^2 / \sigma}$. We then set $L$ as the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. Thus we can set the unsupervised loss as $\psi(f, (x_i, x_j)) = (f(x_i) - f(x_j))^2 W_{ij}$, and this is indeed a convex function in $f$.
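As a concrete sketch of the MR penalty (an illustration of the construction above, not the authors' code, assuming the unnormalized graph Laplacian $L = D - W$), the following builds the Gaussian weight matrix and the Laplacian, and evaluates the penalty $f_U^t L f_U$. The pairwise form $\sum_{i,j}(f(x_i)-f(x_j))^2 W_{ij}$ equals $2 f_U^t L f_U$, which the test below verifies.

```python
import numpy as np

# Sketch of the manifold regularization penalty, assuming a Gaussian
# weight matrix and the unnormalized graph Laplacian L = D - W.

def graph_laplacian(X, sigma):
    """Return (L, W) for the sample X, with W_ij = exp(-||x_i - x_j||^2 / sigma)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma)
    D = np.diag(W.sum(axis=1))  # diagonal degree matrix
    return D - W, W

def manifold_penalty(f_vals, L):
    """f_U^t L f_U, measuring non-smoothness of f over the data graph."""
    return float(f_vals @ L @ f_vals)
```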

Analysis of the Framework
In this section we analyze the properties of the semi-supervised solution $f_{\text{semi}}$ found in Equation (2). We give sample complexity bounds for this procedure and compare them to sample complexities for the supervised case. We derive our results from Balcan and Blum [3], who introduced the concept of unsupervised loss functions. The use of the unsupervised loss function in their work differs, however, from ours. While they use it to restrict the hypothesis space directly, we use it as a regularization tool in the empirical risk minimization, as usually done in practice. To switch between the constrained optimization view and our formulation (2) we use the following classical result from convex optimization [14, Theorem 1].

Lemma 1. Let $\phi(f(x), y)$ and $\psi(f, x)$ be functions convex in $f$ for all $x, y$. Then the following two optimization problems are equivalent:
$$\min_{f \in F} \hat{Q}(f, S_n) + \lambda \hat{R}(f, U) \qquad \text{and} \qquad \min_{f \in F} \hat{Q}(f, S_n) \ \text{ subject to } \ \hat{R}(f, U) \leq \tau,$$
where equivalence means that for each $\lambda$ we can find a $\tau$ such that both problems have the same solution and vice versa.
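A toy one-dimensional instance makes the $\lambda \leftrightarrow \tau$ correspondence of Lemma 1 concrete. Take $\hat{Q}(f) = (f-1)^2$ and $\hat{R}(f) = f^2$ (illustrative choices of ours, not from the paper); both the penalized and the constrained problem then have closed-form solutions.

```python
import numpy as np

# Toy check of Lemma 1 with Q_hat(f) = (f - 1)^2 and R_hat(f) = f^2.
# The penalized minimizer is f = 1 / (1 + lam); choosing tau = f^2 makes
# the constrained problem return the same minimizer.

def penalized_solution(lam):
    """argmin_f (f - 1)^2 + lam * f^2, from setting the derivative to zero."""
    return 1.0 / (1.0 + lam)

def constrained_solution(tau):
    """argmin_f (f - 1)^2 subject to f^2 <= tau. The unconstrained optimum
    is f = 1, so for tau < 1 the constraint is active and f = sqrt(tau)."""
    return min(1.0, np.sqrt(tau))
```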
The next subsection introduces the sample complexity bound and shows how it can be used to give theoretical guarantees for the presented framework.

Sample Complexity Bounds
Sample complexity bounds for supervised learning typically use a notion of complexity of the hypothesis space to bound the worst-case difference between the estimated and the true risk. One well-known capacity notion is the VC-dimension [20], which allows for a distribution-free, worst-case analysis of the complexity of the hypothesis space. Balcan and Blum [3] use this notion to derive sample complexity bounds for SSL. While their bound was given for the classification error and boolean unsupervised loss functions, we adapt it to be valid for all bounded supervised loss functions and all unsupervised loss functions bounded by $[0, 1]$. To do that we use the notion of pseudo-dimension, an extension of the VC-dimension to real-valued loss functions and hypotheses [20,15]. We briefly review the concept of pseudo-dimension. For a threshold $\beta \in \mathbb{R}$ let $L^\beta_\phi(f(x), y) := \mathbf{1}\{\phi(f(x), y) > \beta\}$. Ignoring the label $y$ we can similarly define $L^\beta_\psi(f(x))$ for the unsupervised loss $\psi$. The function $L^\beta_\phi(f(x), y) : X \times Y \to \{0, 1\}$ is the 0-1 error we observe if we try to classify the point $(x, y)$ with the hypothesis $f$ by thresholding the loss function $\phi$ at $\beta$. For a given sample $\{(x_1, y_1), \dots, (x_l, y_l)\}$ the following set then captures all possible 0-1 errors (and thus all possible labellings) when using the functions $L^\beta_\phi(f(x), y)$ for classification:
$$\left\{ \left( L^\beta_\phi(f(x_1), y_1), \dots, L^\beta_\phi(f(x_l), y_l) \right) \;\middle|\; f \in F, \ \beta \in \mathbb{R} \right\}.$$
For brevity we substitute in the following $z = (x, y)$ and $Z = X \times Y$. The pseudo-dimension $\mathrm{Pdim}(F, \phi)$ of a hypothesis class $F$ and a loss function $\phi$ is then defined as the size of the largest sample for which the above set contains all $2^l$ possible labellings. These definitions let us state our first main theorem, which is a generalization of Balcan and Blum [3, Theorem 10] to bounded loss functions.
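For intuition, pseudo-shattering can be checked numerically for a finite pool of hypotheses: collect the 0-1 patterns produced by thresholding the losses at various $\beta$ and test whether all $2^l$ patterns occur. A minimal sketch (the function names and the finite-pool restriction are ours, for illustration only):

```python
import numpy as np

# Enumerate the 0-1 patterns {1[phi(f(x_i), y_i) > beta]}_{i=1..l} over a
# finite pool of hypotheses (rows of `losses`) and a grid of thresholds.

def zero_one_patterns(losses, betas):
    """losses: array of shape (num_hypotheses, l) holding phi(f(x_i), y_i)."""
    patterns = set()
    for row in losses:
        for beta in betas:
            patterns.add(tuple((row > beta).astype(int)))
    return patterns

def is_shattered(losses, betas):
    """True if the sample of size l realizes all 2^l labellings."""
    l = losses.shape[1]
    return len(zero_one_patterns(losses, betas)) == 2 ** l
```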
Theorem 1. Assume $\phi$ is a measurable loss function such that there exists a $B > 0$ with $\phi(f(x), y) \leq B$ for all $x, y$ and $f \in F$, and let $P$ be a distribution. Furthermore let $f^*_\tau = \arg\min_{\{f \in F \,:\, R(f) \leq \tau\}} Q(f)$. Then an unlabeled sample $U$ and a labeled sample $S_n$ of sufficiently large size (polynomial in the pseudo-dimensions of the unsupervised and supervised losses respectively, as well as in $\frac{1}{\epsilon}$ and $\ln \frac{1}{\delta}$) are sufficient to ensure that with probability at least $1 - \delta$ the classifier $g \in F$ that minimizes $\hat{Q}(\cdot, S_n)$ subject to $\hat{R}(\cdot, U) \leq \tau$ satisfies
$$Q(g) \leq Q(f^*_\tau) + \epsilon. \quad (6)$$
Proof. The proof can be found in the supplementary material.
In the next subsection we show how to use this theorem to derive sample complexity bounds for MR. But before that we want to make one remark about the assumption that the loss function $\phi$ is globally bounded. If we assume that $F$ is a reproducing kernel Hilbert space, there exists an $M > 0$ such that for all $f \in F$ and $x \in X$ it holds that $|f(x)| \leq M \|f\|_F$. If we restrict the norm of $f$ by introducing a regularization term with respect to the norm $\|\cdot\|_F$, and assume that $\phi$ is continuous, we can conclude that $\phi$ is globally bounded in $f$. In a way this can also be seen as a justification to use an intrinsic regularization for the norm of $f$ in addition to the regularization by the unsupervised loss, as only then do the guarantees of Theorem 1 apply. Using this bound together with Lemma 1 we can state the following corollary, giving a PAC-style guarantee for our proposed framework.
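The bound $|f(x)| \leq M \|f\|_F$ follows from the reproducing property: $|f(x)| = |\langle f, k_x \rangle| \leq \sqrt{k(x,x)}\, \|f\|$, so for a bounded kernel $M = \sup_x \sqrt{k(x,x)}$ works. A quick numerical sanity check for a Gaussian kernel (an illustration of ours, not from the paper):

```python
import numpy as np

# Check |f(x)| <= sqrt(k(x, x)) * ||f||_H for f = sum_i alpha_i k(x_i, .),
# using ||f||_H^2 = alpha^t K alpha and f(x) = alpha^t k_x (Gaussian kernel).

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
Xc = rng.normal(size=(5, 2))          # expansion points of f
alpha = rng.normal(size=5)
K = np.array([[rbf(a, b) for b in Xc] for a in Xc])
norm_f = np.sqrt(alpha @ K @ alpha)   # RKHS norm of f

for _ in range(100):                  # the bound holds at random test points
    x = rng.normal(size=2)
    k_x = np.array([rbf(c, x) for c in Xc])
    assert abs(alpha @ k_x) <= np.sqrt(rbf(x, x)) * norm_f + 1e-9
```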

Corollary 1.
Let $\phi$ and $\psi$ be a convex supervised and a convex unsupervised loss function that fulfill the assumptions of Theorem 1. The semi-supervised solution found by Equation (2) then satisfies the guarantees given in Theorem 1, i.e. $f_{\text{semi}}$ fulfills Inequality (6) when we plug it in for $g$.
As our paper focuses mostly on MR there are two notes to make. First, recall that in this setting $\psi$ is defined on pairs of points, so we gather unlabeled samples from $X \times X$ instead of $X$, and thus we need $2m$ instead of $m$ unlabeled samples for the same bound. Second, the unsupervised loss $\psi$ does not necessarily map to $[0, 1]$. But it is clear that we can adapt the bounds for bounded unsupervised loss functions the same way we did for bounded supervised loss functions.

Comparison to the Supervised Solution
In the SSL community it is well known that using SSL does not come without a risk [10, Chapter 4]. Thus it is of particular interest how semi-supervised methods compare to purely supervised schemes. There are, however, many potential supervised methods we can think of. In the work of Ben-David et al. [7], Darnstädt et al. [11] and Globerson et al. [12] this problem is avoided by comparing to all possible supervised schemes. The framework introduced in this paper allows for a more fine-grained analysis, as the semi-supervision happens on top of an already existing supervised method. Thus for our framework it is natural to compare the sample complexity of $f_{\text{sup}}$ with the sample complexity of $f_{\text{semi}}$. To compare the supervised and semi-supervised solutions we use again the language of sample complexities and state a simple lower bound on the sample complexity of the semi-supervised solution.
Theorem 2. Assume $\phi$ is a measurable loss function such that there exists a $B > 0$ with $\phi(f(x), y) \leq B$ for all $x, y$ and $f \in F$. Furthermore let $f^*_\tau$ be defined as in Theorem 1. Then there exists a constant $C_1$ such that a labeled sample $S_n$ of size at least $C_1 \frac{\mathrm{Pdim}(F^\psi_\tau, \phi)}{\epsilon^2}$ is necessary to ensure that for all distributions $P$, with probability at least $1 - \delta$, the classifier $g \in F$ that minimizes $\hat{Q}(\cdot, S_n)$ subject to $\hat{R}(\cdot, U) \leq \tau$ satisfies $Q(g) \leq Q(f^*_\tau) + \epsilon$.
We want to point out two things about this lower bound. First note that in Theorem 1 we provided an upper bound on the sample complexity, i.e. a sample size which is sufficient for the guarantees. The lower bound of Theorem 2 is the necessary amount of labeled samples for the given guarantees. Second, we note that we do not have any requirement on the unlabeled sample size. This is because we stated Theorem 2 under the assumption that we have infinitely many unlabeled samples; as the theorem holds in that case, it certainly holds when we only have a finite unlabeled sample. The following corollary captures the comparison to the supervised solution.

Corollary 2. Assume the setting of Theorem 2 and assume that $\phi$ and $\psi$ are convex. There exists a universal constant $C_2$ such that the semi-supervised solution found in (2) needs a labeled sample of size at least $C_1 \frac{\mathrm{Pdim}(F^\psi_\tau, \phi)}{\epsilon^2}$ to guarantee the same performance as the supervised solution found in (1) with access to a labeled sample of size $C_2 \frac{\mathrm{Pdim}(F, \phi)}{\epsilon^2}$.

Proof. The lower bound on the semi-supervised sample complexity is exactly Theorem 2, while the upper bound for the supervised solution is a classical result as found in Shalev-Shwartz and Ben-David [17, p. 73].
We can also interpret the corollary as follows. The worst-case difference in labeled sample consumption between the supervised and the semi-supervised solution depends only on the two universal constants $C_1$ and $C_2$ and on the pseudo-dimensions $\mathrm{Pdim}(F, \phi)$ and $\mathrm{Pdim}(F^\psi_\tau, \phi)$. In particular, the sample complexity of the supervised method is at most $\frac{C_2 \, \mathrm{Pdim}(F, \phi)}{C_1 \, \mathrm{Pdim}(F^\psi_\tau, \phi)}$ times that of the semi-supervised method. This factor is finite whenever $\mathrm{Pdim}(F, \phi)$ is finite.

The Limits of Manifold Regularization
We now relate our result to a conjecture published by Shalev-Shwartz and Ben-David [17]: a SSL cannot learn faster by more than a constant (which may depend on the hypothesis class F and the loss φ) than the supervised learner. Darnstädt et al. [11, Theorems 2 and 3] showed that this conjecture is essentially true for classes with finite VC-dimension and SSLs that do not make any distributional assumptions. Corollary 2 and the text thereafter show that this statement also holds for every SSL that falls in our proposed framework, and thus extends their result. Our result holds explicitly for SSLs that do make assumptions about the distribution: MR makes the assumption that the labeling function behaves smoothly with respect to the underlying manifold.

Rademacher Complexity for Manifold Regularization
The analysis of the previous sections was distribution independent. We saw that the sample complexity difference between manifold regularized methods and purely supervised methods is at most a constant, which depends on the hypothesis class and the regularization parameter. We believe that in order to find out in which scenarios semi-supervised learning can help, it is useful to also look at distribution dependent complexity measures. For this we derive in this section computationally feasible upper and lower bounds on the Rademacher complexity of MR. We first review the work of Sindhwani et al. [19], who construct a kernel that corresponds to MR. Given this kernel we can use standard upper and lower bounds on the Rademacher complexity of an RKHS, as found for example in [9]. The analysis is thus similar to [18], where a co-regularization setting is considered. In particular, [19] show the following, here informally stated, theorem.

Theorem 3 ([19, Propositions 2.1, 2.2]). Let $H$ be a RKHS with inner product $\langle \cdot, \cdot \rangle_H$. As before let $U = \{x_1, \dots, x_{n+m}\}$, $f, g \in H$ and $f_U = (f(x_1), \dots, f(x_{n+m}))^t$. Furthermore let $\langle \cdot, \cdot \rangle_{\mathbb{R}^{n+m}}$ be any inner product in $\mathbb{R}^{n+m}$. Let $\tilde{H}$ be the same space of functions as $H$, but with the newly defined inner product $\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + \langle f_U, g_U \rangle_{\mathbb{R}^{n+m}}$. Then $\tilde{H}$ is a RKHS.
Assume now that $L$ is a positive definite $(n+m)$-dimensional matrix and set the inner product $\langle f_U, g_U \rangle_{\mathbb{R}^{n+m}} = f_U^t L g_U$. By setting $L$ as the Laplacian matrix explained in Section 4, the norm of $\tilde{H}$ now automatically regularizes with respect to the data manifold given by $\{x_1, \dots, x_{n+m}\}$. We furthermore know the exact form of the kernel of $\tilde{H}$.

Theorem 4 ([19, Proposition 2.2]). Let $k(x, y)$ be the kernel of $H$, let $K$ be the gram matrix given by $K_{ij} = k(x_i, x_j)$ and $k_x = (k(x_1, x), \dots, k(x_{n+m}, x))^t$. Finally let $I$ be the $(n+m)$-dimensional identity matrix. The kernel of $\tilde{H}$ is then given by $\tilde{k}(x, y) = k(x, y) - k_x^t (I + LK)^{-1} L \, k_y$.

This interpretation of MR is useful to derive computationally feasible upper and lower bounds on the empirical Rademacher complexity, i.e. distribution dependent complexity bounds. First, recall that the empirical Rademacher complexity of the hypothesis class $H$, measured on the labeled input features $\{x_1, \dots, x_n\}$, is defined as
$$\mathrm{Rad}_n(H) = \mathbb{E}_\sigma \left[ \sup_{f \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i) \right],$$
where $\sigma = (\sigma_1, \dots, \sigma_n)$ are i.i.d. Rademacher random variables, i.e. $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$.

Theorem 5 ([9, p. 333]). Let $H$ be a RKHS with kernel $k$ and $H_r = \{f \in H \mid \|f\|_H \leq r\}$. Given an $n$-sample $\{x_1, \dots, x_n\}$ we can bound the empirical Rademacher complexity of $H_r$ by
$$\frac{r}{\sqrt{2}\, n} \sqrt{\sum_{i=1}^n k(x_i, x_i)} \;\leq\; \mathrm{Rad}_n(H_r) \;\leq\; \frac{r}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)}. \quad (7)$$

With the previous two theorems we can give upper bounds on the Rademacher complexity of MR; in particular we can also derive a bound on the maximal complexity reduction over supervised learning.

Corollary 3. Let $H$ be a RKHS and for $f, g \in H$ define the inner product $\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + f_U^t (\mu L) g_U$, where $L$ is a positive definite matrix and $\mu \in \mathbb{R}$ is a regularization parameter. Let $\tilde{H}_r$ be defined as before. Then
$$\mathrm{Rad}_n(\tilde{H}_r) \leq \frac{r}{n} \sqrt{\sum_{i=1}^n \left( k(x_i, x_i) - k_{x_i}^t \left( \tfrac{1}{\mu} I_{n+m} + LK \right)^{-1} L \, k_{x_i} \right)}. \quad (8)$$
Similarly we can also obtain a lower bound according to Inequality (7).
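Theorems 4 and 5 combine into a short computation. The sketch below (our own illustration, not the authors' code) builds the Gram matrix of $\tilde{k}$ for the inner product $\mu f_U^t L g_U$, using the identity $(I + \mu LK)^{-1}\mu L = (\frac{1}{\mu} I + LK)^{-1} L$, and evaluates the trace-based Rademacher upper bound.

```python
import numpy as np

# Gram matrix of the deformed MR kernel k~ (Theorem 4) on the points that
# define K, with inner product <f_U, g_U> = mu * f_U^t L f_U, and the
# trace-based empirical Rademacher upper bound of Theorem 5.

def deformed_kernel_matrix(K, L, mu):
    """K~ = K - K ((1/mu) I + L K)^{-1} L K."""
    n = K.shape[0]
    return K - K @ np.linalg.solve(np.eye(n) / mu + L @ K, L @ K)

def rademacher_upper_bound(k_diag, r, n):
    """(r / n) * sqrt(sum_i k(x_i, x_i)) for the norm ball of radius r."""
    return r / n * np.sqrt(np.sum(k_diag))
```

Since the new norm is never smaller than the original one, the diagonal of the deformed Gram matrix is entrywise no larger than that of `K`, which is exactly the complexity reduction the corollary quantifies.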
The corollary allows us to compute upper bounds on the Rademacher complexity for MR and shows in particular that the difference in Rademacher complexity between the supervised and the semi-supervised method is controlled by the terms $k_{x_i}^t (\frac{1}{\mu} I_{n+m} + LK)^{-1} L \, k_{x_i}$. This can be used, for example, to compute generalization bounds [15, Chapter 3]. We could also use the kernel to compute local Rademacher complexities, which may yield tighter generalization bounds [4]. We, however, use the Rademacher complexity bounds instead for choosing the regularization parameter $\mu$ without the need for an additional labeled validation set.

Experiment: Concentric circles
We now illustrate how one can use the upper bound (8) on the Rademacher complexity for model selection; in particular we can get an initial idea of how to choose the regularization parameter $\mu$. The idea is to plot the Rademacher complexity bound against the parameter $\mu$, as in Figure 1. For increasing $\mu$ the Rademacher complexity decreases, so how should we choose the optimal $\mu$? We propose a heuristic which is often used in clustering, the so-called 'elbow criterion' [8]: we look for a $\mu$ such that increasing it further does not result in much reduction of the complexity anymore. We test this idea on a dataset which consists of two concentric circles with 500 datapoints in $\mathbb{R}^2$, 250 per circle, see also Figure 2. We use a Gaussian base kernel with bandwidth set to 0.5, and for the MR matrix $L$ we use the Laplacian matrix as before, where the weights are also computed with a Gaussian kernel, with bandwidth set to 0.2. Note that those parameters have to be set carefully in order to capture the structure of the dataset; this is, however, not the concern of this paper, and we assume we already found a reasonable choice for those parameters. We furthermore add a small L2-regularization which ensures that the radius $r$ from Inequality (8) is finite. The precise value of $r$ plays a secondary role here, as the behavior of the curve in Figure 1 remains the same. Details on the data generation process can be found in our code. We are now interested in how to choose the regularization parameter $\mu$. Looking at Figure 1 we observe that for values smaller than 0.1 the curve still drops steeply, while after 0.2 it starts to flatten out. We thus plot the resulting kernels for $\mu = 0.02$ and $\mu = 0.2$ in Figure 2, showing the isolines of the kernel around a point of class one (the red dot in the figure). We indeed observe that for $\mu = 0.02$ we do not capture much structure yet, while for $\mu = 0.2$ the two concentric circles are almost completely separated by the kernel.
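A sketch of the experiment follows (our reconstruction from the description above; the exact data generation, radii and scaling may differ from the authors' code). It generates the two circles, computes the Rademacher upper bound of (8) for a grid of $\mu$ values, and returns the curve whose elbow guides the choice of $\mu$.

```python
import numpy as np

# Reconstruction of the concentric-circles experiment: trace the
# Rademacher upper bound (8) as a function of mu. Bandwidths follow the
# text (0.5 for the base kernel, 0.2 for the Laplacian weights); the
# circle radii are our guess.

def make_circles(n_per_circle=250, radii=(1.0, 2.0), seed=0):
    rng = np.random.default_rng(seed)
    pts = []
    for r in radii:
        angles = rng.uniform(0, 2 * np.pi, n_per_circle)
        pts.append(np.c_[r * np.cos(angles), r * np.sin(angles)])
    return np.vstack(pts)

def gaussian_gram(X, bandwidth):
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    return np.exp(-sq / bandwidth)

def complexity_curve(X, mus, k_bw=0.5, w_bw=0.2, r=1.0):
    """Rademacher upper bound (8) for each mu, via the deformed kernel."""
    K = gaussian_gram(X, k_bw)
    W = gaussian_gram(X, w_bw)
    L = np.diag(W.sum(axis=1)) - W
    n = len(X)
    bounds = []
    for mu in mus:
        Kt = K - K @ np.linalg.solve(np.eye(n) / mu + L @ K, L @ K)
        bounds.append(r / n * np.sqrt(max(np.trace(Kt), 0.0)))
    return bounds
```

Plotting `complexity_curve(X, mus)` against `mus` reproduces the shape of Figure 1; the elbow of that curve is the proposed choice of $\mu$.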

Discussion
We presented a distribution dependent and a distribution independent complexity analysis for manifold regularization. We note that improvements in terms of sample or Rademacher complexity do not immediately imply that we will also observe an improvement in performance. The reason is that the difference in performance between supervised and semi-supervised learning depends on two things. First, on how the approximation error of the class $F^\psi_\tau$ compares to the approximation error of $F$. Second, on how much we reduce the complexity by switching from the hypothesis class $F$ to the hypothesis class $F^\psi_\tau$. In our analysis we discussed the second part, and we argue that our proposed framework is suited to address this problem, as the resulting SSL schemes have a natural supervised counterpart. The first part depends on a notion the literature often refers to as a semi-supervised assumption. This assumption essentially states that we can learn with $F^\psi_\tau$ as well as with $F$. Regarding our example of the two concentric circles, this would mean that each circle actually corresponds to a class. The problem is that it is typically not known whether one can efficiently test if the assumption is true, without prior knowledge. The issue becomes clearer when one realizes that the assumption is always a property of the distribution $P$, so one would need labeled samples to verify it. We speculate that this verification process would negate the sample savings one can achieve with SSL.
We are not aware of much research in this direction, besides the work of Balcan et al. [2], which discusses the sample consumption needed to test the so-called cluster assumption, and the work of Azizyan et al. [1], which analyzes the overhead of cross-validating the extra hyperparameter of their proposed semi-supervised learner.
Furthermore we note that one condition for the main finding in Corollary 1 is that the unsupervised loss function is convex. This is not the case, for example, for entropy regularization, which in fact has a concave unsupervised loss function.
It is an open question whether we can give similar bounds for those functions. We are not aware of any simple way to adapt the presented theory for non-convex functions.