# Generalization bounds for learning with linear, polygonal, quadratic and conic side knowledge


DOI: 10.1007/s10994-014-5478-4

- Cite this article as:
- Tulabandhula, T. & Rudin, C. Mach Learn (2015) 100: 183. doi:10.1007/s10994-014-5478-4

## Abstract

In this paper, we consider a supervised learning setting where side knowledge is provided about the labels of unlabeled examples. The side knowledge has the effect of reducing the hypothesis space, leading to tighter generalization bounds, and thus possibly better generalization. We consider several types of side knowledge, the first leading to linear and polygonal constraints on the hypothesis space, the second leading to quadratic constraints, and the last leading to conic constraints. We show how different types of domain knowledge can lead directly to these kinds of side knowledge. We prove bounds on complexity measures of the hypothesis space for quadratic and conic side knowledge, and show that these bounds are tight in a specific sense for the quadratic case.

### Keywords

Statistical learning theory, Generalization bounds, Rademacher complexity, Covering numbers, Constrained linear function classes, Side knowledge

## 1 Introduction

Surely, for many applications the amount of domain knowledge we could potentially use within our learning processes is vastly larger than the amount of domain knowledge we actually use. One reason for this is that domain knowledge may be nontrivial to incorporate into algorithms or analysis. A few types of domain knowledge that do permit analysis have been explored quite in depth in the past few years and used very successfully in a variety of learning tasks; this includes knowledge about the sparsity properties of linear models (\(\ell _{1}\)-norm constraints, minimum description length) or smoothness properties (\(\ell _{2}\)-norm constraints, maximum entropy). A reason that domain knowledge is not usually incorporated in theoretical analysis is that it can be very problem specific; it may be too specific to the domain to have an overarching theory of interest. For example, researchers in NLP (Natural Language Processing) have long figured out various exotic domain specific knowledge that one can use while performing a learning task (Chang et al. 2008a, b). The present work aims to provide theoretical guarantees for a large class of problems with a general type of domain knowledge that goes beyond sparsity and smoothness.

Our main contributions are:

- To show that linear, polygonal, quadratic and conic constraints on a linear hypothesis space can arise naturally in many circumstances, from constraints on a set of unlabeled examples. This is in Sect. 2. We connect these with relevant semi-supervised learning settings.
- To provide upper bounds on covering number and empirical Rademacher complexity for linearly constrained linear function classes. Bounds for the case of linear and polygonal constraints are found in Sects. 3.3 and 3.4 respectively. Two of the three bounds in these sections are not original to this paper, but their application to general side knowledge with linear constraints is novel.
- To provide two upper bounds on the complexity of the hypothesis space for the quadratic constraint case. These can be used directly in generalization bounds. The use of a certain family of circumscribing ellipsoids and the quadratic bounds of Sect. 3.5 are novel to this paper.
- To show that one of the upper bounds on the quadratically constrained hypothesis space we provided has a matching lower bound, also in Sect. 3.5. This is novel to this paper.
- To provide a bound on the complexity of the hypothesis space for the conic constraint case. These bounds are in Sect. 3.7 and are novel to this paper.

We develop a novel proof technique for upper bounding linear, quadratic and conic constraint cases based on convex duality.

Side knowledge can be particularly helpful in cases where data are scarce; these are precisely circumstances when data themselves cannot fully define the predictive model, and thus domain knowledge can make an impact in predictive accuracy. That said, for any type of side knowledge (sparsity, smoothness, and the side knowledge considered here), the examples and hypothesis space may not conform in reality to the side knowledge. (Similarly, the training data may not be truly random in practice.) However, if they do, we can claim lower sample complexities, and potentially improve our model selection efforts. Thus, we cannot claim that our side knowledge is always true knowledge, but we can claim that if it is true, we are able to gain some benefit in learning.

### 1.1 Motivating examples

Fung et al. (2002) added multiple linear constraints (polygonal constraints) to a specific ERM algorithm, the linear SVM, as a way to incorporate prior knowledge. They investigated the effect of using this type of prior knowledge for classification on a DNA promoter recognition dataset (Towell et al. 1990). In this classification task, the linear constraints result from precomputed rules that are separate from the training data (this is similar to our polygonal setting where constraints are generated from knowledge about the unlabeled examples). The “leave-one-out” error from the 1-norm SVM with the additional constraints was less than that of the plain 1-norm SVM and other training-data-based classifiers such as decision trees and neural networks. This and other types of knowledge incorporation in SVMs are reviewed by Lauer and Bloch (2008) and also Le et al. (2006).

James et al. (2014) motivated the use of linear constraints with LASSO, which is also an ERM procedure. In their experiment, they estimated a demand probability function using an on-line auto lending dataset. They ensured monotonicity of the demand function by applying a set of linear constraints (similar to the poset constraints in Sect. 2.1) and compared the output to two other methods: logistic regression and the unconstrained LASSO, both of which output non-monotonic demand probability curves.

Nguyen and Caruana (2008a) considered additional unlabeled examples whose labels are partially known. In particular, they worked on a type of multi-class classification task where they know that the label of each unlabeled example belongs to a known subset of the set of all class labels. This knowledge about the unlabeled examples translates into multiple linear constraints (polygonal constraints). They provided experimental results on five datasets showing improvements over multi-class SVMs.

Gómez-Chova et al. (2008) implemented a technique (known as LapSVMs) that uses Laplacian regularization augmented with standard SVMs for two image classification tasks related to urban monitoring and cloud screening (which are both remote sensing tasks). Laplacian regularization means that the regularization term is a quadratic function of the model, derived from a set of unlabeled examples, like our quadratic setting (see Sect. 2.2). In both tasks, the Laplacian-regularized linear SVMs outperformed the standard SVMs in terms of overall accuracy (these improvements are of the order of 2–3 % in both cases).

Shivaswamy et al. (2006) formulated robust classification and regression problems as described in Sect. 2.3 leading to conic constraints on the model class. For classification, they used the OCR, Heart, Ionosphere and Sonar datasets from the UCI repository to illustrate the effect of missing values and how robust SVM classification (which introduces second order conic constraints) provides better classification accuracy than the standard SVM classifier after imputation. For regression, they showed improvements in prediction accuracy of a robust version of SVR (again introducing conic constraints on the hypothesis space) as compared to a standard SVR trained after imputation on the Boston housing dataset (also from the UCI repository).

Finally, the Appendix also provides experimental results showing the advantage of using side knowledge in a ridge regression problem.

## 2 Linear, polygonal, quadratic and conic constraints

We are given a training sample \(S\) of \(n\) examples \(\{(x_{i},y_{i})\}_{i=1}^{n}\), with each observation \(x_{i}\) belonging to a set \({\mathcal {X}}\) in \(\mathbb {R}^{p}\). Let the label \(y_{i}\) belong to a set \({\mathcal {Y}}\) in \(\mathbb {R}\). In addition, we are given a set of \(m\) unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\). We are not given the true labels \(\{\tilde{y}_{i}\}_{i=1}^{m}\) for these observations. Let \({\mathcal {F}}\) be the function class (set of hypotheses) of interest, from which we want to choose a function \(f\) to predict the label of future unseen observations. We take \({\mathcal {F}}\) to be linear, parameterized by a coefficient vector \(\beta \); its precise description will change based on the constraints we place on \(\beta \).

### 2.1 Assumptions leading to linear and polygonal constraints

We will provide three settings to demonstrate that linear constraints arise in a variety of natural settings: poset, must-link, and sparsity on \(\{\tilde{y}_{i}\}_{i=1}^{m}\). In all three, we will include standard regularization of the form \(\Vert \beta \Vert _q\le c_1\) by default.

#### 2.1.1 Poset

Partial order information about the labels \(\{\tilde{y}_{i}\}_{i=1}^{m}\) can be captured via the following constraints: \(f(\tilde{x}_{i}) \le f(\tilde{x}_{j}) + c_{i,j}\) for any collection of pairs \((i,j) \in \{1,...,m\}\times \{1,...,m\}\). This gives us up to \(m^2\) constraints of the form \(\beta ^{T}(\tilde{x}_{i} - \tilde{x}_{j}) \le c_{i,j}.\) The function class \({\mathcal {F}}\) can be described as: \({\mathcal {F}}:=\{ f | f(x) = \beta ^{T}x, \Vert \beta \Vert _{q} \le c_{1}, \beta ^{T}(\tilde{x}_{i} - \tilde{x}_{j}) \le c_{i,j}, \forall (i,j) \in E\}\), where \(E\) is the set of pairs of indices of unlabeled data that are constrained.
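As a minimal sketch of how poset side knowledge turns into linear constraints on \(\beta \), the following Python snippet builds one row \(\tilde{x}_{i} - \tilde{x}_{j}\) per constrained pair and tests a candidate \(\beta \). The helper names (`poset_constraints`, `satisfies`), the toy data and the choice \(q = 2\) are illustrative assumptions and do not come from the paper.

```python
import math

def poset_constraints(x_tilde, pairs, c):
    """Return (row, bound) tuples encoding beta^T (x_i - x_j) <= c[(i, j)]."""
    rows = []
    for (i, j) in pairs:
        diff = [a - b for a, b in zip(x_tilde[i], x_tilde[j])]
        rows.append((diff, c[(i, j)]))
    return rows

def satisfies(beta, rows, c1, tol=1e-12):
    """Check the l2-norm bound (q = 2 assumed) and every linear side constraint."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm > c1 + tol:
        return False
    return all(sum(b * d for b, d in zip(beta, diff)) <= bound + tol
               for diff, bound in rows)

# Hypothetical unlabeled examples and constrained index pairs E.
x_tilde = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E = [(0, 1), (1, 2)]
c = {(0, 1): 0.0, (1, 2): 0.0}
rows = poset_constraints(x_tilde, E, c)

ok_beta = satisfies([0.2, 0.5], rows, c1=1.0)    # respects both orderings
bad_beta = satisfies([0.5, 0.2], rows, c1=1.0)   # violates f(x_0) <= f(x_1)
print(ok_beta, bad_beta)
```

Must-link constraints can be encoded with the same helper by adding each pair in both orders.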

#### 2.1.2 Must-link

Here we bound the absolute difference of labels between pairs of unlabeled examples: \( |f(\tilde{x}_{i}) - f(\tilde{x}_{j})| \le c_{i,j}\). This captures knowledge about the nearness of the labels. This leads to two linear constraints: \(-c_{i,j} \le \beta ^{T}(\tilde{x}_{i}-\tilde{x}_{j}) \le c_{i,j}.\) These constraints have been used extensively within the semi-supervised (Zhu 2005) and constrained clustering settings (Lu and Leen 2004, Basu et al. 2006) as must-link or ‘in equivalence’ constraints. For must-link constraints, \({\mathcal {F}}\) is defined as: \( {\mathcal {F}}:=\{ f | f(x) = \beta ^{T}x, \Vert \beta \Vert _{q} \le c_{1}, -c_{i,j} \le \beta ^{T}(\tilde{x}_{i}-\tilde{x}_{j}) \le c_{i,j}, \forall (i,j) \in E\}\), where \(E\) is again the set of pairs of indices of unlabeled data that are constrained.

#### 2.1.3 Sparsity and its variants on a subset of \(\{\tilde{y}_{i}\}_{i=1}^{m}\)

Similar to sparsity assumptions on \(\beta \), here we want that only a small set of labels is nonzero among a set of unlabeled examples. In particular, we want to bound the cardinality of the support of the vector \([\tilde{y}_{{1}} \ldots \tilde{y}_{{|{\mathcal {I}}|}}]\) for some index set \({\mathcal {I}} \subset \{1,...,m\}\). Such a constraint is nonlinear. Nonetheless, a convex constraint of the form \(\Vert [\tilde{y}_{{1}} \ldots \tilde{y}_{{|{\mathcal {I}}|}}]\Vert _{1} \le c_{{\mathcal {I}}} \) (\(2^{|{\mathcal {I}}|}\) linear constraints) can be used as a proxy to encourage sparsity. The function class is defined as: \( {\mathcal {F}}:=\{ f | f(x) = \beta ^{T}x, \Vert \beta \Vert _{q} \le c_{1}, \Vert [\beta ^T\tilde{x}_{{1}} \ldots \beta ^T\tilde{x}_{{|{{\mathcal {I}}}|}}]\Vert _{1} \le c_{{\mathcal {I}}}\}\). A similar constraint can be obtained if we instead had partial information with respect to the dual norm: \(\Vert [\tilde{y}_{{1}} \ldots \tilde{y}_{{ |{\mathcal {I}}| }}]\Vert _{\infty } \le c_{{\mathcal {I}}}\).
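The count of \(2^{|{\mathcal {I}}|}\) linear constraints comes from the fact that \(\Vert v\Vert _{1} \le c\) holds exactly when \(s^{T}v \le c\) for every sign vector \(s \in \{-1,+1\}^{|{\mathcal {I}}|}\). A small sketch that enumerates these sign vectors is given below; the function names and the prediction values are hypothetical.

```python
from itertools import product

def l1_as_linear_constraints(k):
    """All 2**k sign vectors s; ||v||_1 <= c iff s^T v <= c for every s."""
    return list(product((-1.0, 1.0), repeat=k))

def max_signed_sum(v):
    """Largest value of s^T v over all sign vectors s."""
    signs = l1_as_linear_constraints(len(v))
    return max(sum(s_i * v_i for s_i, v_i in zip(s, v)) for s in signs)

predictions = [0.5, -1.25, 0.0]   # hypothetical values of beta^T x_tilde_i
l1_norm = sum(abs(v) for v in predictions)
print(len(l1_as_linear_constraints(3)), l1_norm, max_signed_sum(predictions))
```

The tightest of the linear constraints coincides with the \(\ell _{1}\) norm, so imposing all of them is equivalent to the single convex constraint.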

### 2.2 Assumptions leading to quadratic constraints

We will provide several settings to show that quadratic constraints arise naturally.

#### 2.2.1 Must-link

A constraint of the form \((f(\tilde{x}_{i}) - f(\tilde{x}_{j}))^{2} \le c_{i,j}\) can be written as \( 0 \le \beta ^{T}A \beta \le c_{i,j}\) with \(A = (\tilde{x}_{i}-\tilde{x}_{j})(\tilde{x}_{i}-\tilde{x}_{j})^T\). Here \(A\) is rank-deficient as it is an outer product, which leads to an unbounded ellipse; however, its intersection with a full ellipsoid (for instance, an \(\ell _{2}\)-norm ball) is not unbounded and indeed can be a restricted hypothesis set. Set \({\mathcal {F}}\) is defined by: \({\mathcal {F}}= \{\beta : \beta ^{T}\beta \le c_{1}, \beta ^{T} (\tilde{x}_{i}-\tilde{x}_{j})(\tilde{x}_{i}-\tilde{x}_{j})^T \beta \le c_{i,j}; (i,j) \in E\}\), where \(E\) is again the set of pairs of indices of unlabeled data that are constrained.
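The following sketch, on made-up two-dimensional data, checks that the outer-product matrix reproduces the squared prediction difference and that a direction orthogonal to \(\tilde{x}_{i}-\tilde{x}_{j}\) is left unconstrained by \(A\) alone, which is why the norm ball is needed. All names and numbers are illustrative assumptions.

```python
def outer(u, v):
    """Outer product u v^T as nested lists."""
    return [[ui * vj for vj in v] for ui in u]

def quad_form(beta, A):
    """Compute beta^T A beta for a square matrix given as nested lists."""
    n = len(beta)
    return sum(beta[i] * A[i][j] * beta[j] for i in range(n) for j in range(n))

x_i, x_j = [2.0, 1.0], [1.0, 3.0]
d = [a - b for a, b in zip(x_i, x_j)]            # x_i - x_j = [1, -2]
A = outer(d, d)                                  # rank-one matrix

beta = [0.5, -0.25]
lhs = quad_form(beta, A)
rhs = sum(b * di for b, di in zip(beta, d)) ** 2  # (f(x_i) - f(x_j))^2

# A vector orthogonal to d gives a zero quadratic form however long it is.
orth = [2.0, 1.0]
far_away = quad_form([1000.0 * t for t in orth], A)
print(lhs, rhs, far_away)
```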

#### 2.2.2 Constraining label values for a pair of examples

We can define the following relationship between the labels of two unlabeled examples using quadratic constraints: if one of them is large in magnitude, the other is necessarily small. This can be encoded using the inequality: \(f(\tilde{x}_{i})\cdot f(\tilde{x}_{j}) \le c_{i,j}\). If \(f(x) \in {\mathcal {Y}}\subset \mathbb {R}_{+}\), then \(f(\tilde{x}_{i})\cdot f(\tilde{x}_{j}) \le c_{i,j}\) gives the following quadratic constraint on \(\beta \) with the associated rank 1 matrix being \(A = \tilde{x}_{i}\tilde{x}_{j}^{T}\): \(\beta ^T A\beta \le c_{i,j}.\) This is not quite an ellipsoidal constraint yet because matrices associated with ellipsoids are symmetric positive semidefinite. Matrix \(A\) on the other hand is not symmetric. Nonetheless, the quadratic constraint remains intact when we replace matrix \(A\) with the symmetric matrix \(\frac{1}{2}(A + A^{T})\). If in addition, the symmetric matrix is also positive-definite (which can be verified easily), then this leads to an ellipsoidal constraint. The hypothesis space \({\mathcal {F}}\) becomes: \({\mathcal {F}}= \left\{ \beta : \beta ^{T}\beta \le c_{1}, \beta ^{T} \tilde{x}_{i}\tilde{x}_{j}^{T}\beta \le c_{i,j}; (i,j) \in E \right\} .\)
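The symmetrization step can be checked numerically. The sketch below uses hypothetical vectors and confirms that \(A\) and \(\frac{1}{2}(A + A^{T})\) give the same quadratic form, which equals the product of the two predictions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_form(beta, A):
    """Compute beta^T A beta for a square matrix given as nested lists."""
    n = len(beta)
    return sum(beta[i] * A[i][j] * beta[j] for i in range(n) for j in range(n))

x_i, x_j = [1.0, 2.0], [3.0, 1.0]
A = [[a * b for b in x_j] for a in x_i]          # x_i x_j^T, not symmetric
A_sym = [[0.5 * (A[r][c] + A[c][r]) for c in range(2)] for r in range(2)]

beta = [0.5, -1.0]
product_of_predictions = dot(beta, x_i) * dot(beta, x_j)
print(quad_form(beta, A), quad_form(beta, A_sym), product_of_predictions)
```

In this toy instance the quadratic form is negative, so the symmetrized matrix is not positive semidefinite; this is why the definiteness check mentioned above matters before treating the constraint as ellipsoidal.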

#### 2.2.3 Energy of estimated labels

We can place an upper bound constraint on the sum of squares (the “energy”) of the predictions, which is: \(||{X}_{U}^{T}\beta ||_{2}^{2} = \sum _{i}(\beta ^{T}\tilde{x}_{i})^{2} = \beta ^T(\sum _{i}\tilde{x}_{i}\tilde{x}_{i}^{T})\beta \) where \(X_{U}\) is a \(p \times m\) dimensional matrix with \(\tilde{x}_i\)’s as its columns. The set \({\mathcal {F}}\) is \({\mathcal {F}}= \left\{ \beta : \beta ^{T}\beta \le c_{1}, ||{X}_{U}^{T}\beta ||_{2}^{2} \le c \right\} \). Extensions like the use of Mahalanobis distance or having the norm act on only a subset of the estimates of \(\{\tilde{y}_{i}\}_{i=1}^{m}\) follow accordingly.

#### 2.2.4 Smoothness and other constraints on \(\{\tilde{y}_{i}\}_{i=1}^{m}\)

Consider the general ellipsoid constraint \(\Vert \Gamma {X}_{U}^{T}\beta \Vert _{2}^{2} \le c\) where we have added an additional transformation matrix \(\Gamma \) in front of \({X}_{U}^{T}\beta \). If \(\Gamma \) is set to the identity matrix, we get the energy constraint previously discussed. If \(\Gamma \) is a banded matrix with \(\Gamma _{i,i} = 1\) for all \(i=1,...,m\), \(\Gamma _{i,i+1} = -1\) for all \(i=1,...,m-1\) and remaining entries zero, then we are encoding the side knowledge that the labels of the unlabeled examples vary smoothly: we are encouraging the unlabeled examples with neighboring indices to have similar predicted values. This matrix \(\Gamma \) is an instance of a difference operator in the numerical analysis literature. In this context, banded matrices like \(\Gamma \) model discrete derivatives. By including this type of constraint, problems with identifiability and ill-posedness of an optimal solution \(\beta \) are alleviated. That is, as with the Tikhonov regularization on \(\beta \) in least squares regression, constraints derived from matrices like \(\Gamma \) reduce the condition number. The set \({\mathcal {F}}\) is defined as: \({\mathcal {F}}= \left\{ \beta : \beta ^{T}\beta \le c_{1}, \Vert \Gamma {X}_{U}^{T}\beta \Vert _{2}^{2} \le c \right\} .\)
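To make the smoothness constraint concrete, here is a small sketch that applies a difference operator to a vector of predictions and evaluates \(\Vert \Gamma {X}_{U}^{T}\beta \Vert _{2}^{2}\). As a simplifying assumption it uses the \((m-1)\times m\) forward-difference variant, so that every row is a pure difference; the prediction vectors are invented for illustration.

```python
def difference_operator(m):
    """(m-1) x m forward-difference matrix: row i has +1 at i and -1 at i+1."""
    gamma = [[0.0] * m for _ in range(m - 1)]
    for i in range(m - 1):
        gamma[i][i] = 1.0
        gamma[i][i + 1] = -1.0
    return gamma

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def smoothness_energy(preds):
    """Squared l2 norm of Gamma applied to the prediction vector."""
    diffs = matvec(difference_operator(len(preds)), preds)
    return sum(d * d for d in diffs)

smooth = [1.0, 1.5, 2.0, 2.5]     # hypothetical predictions beta^T x_tilde_i
rough = [1.0, -1.0, 1.0, -1.0]
print(smoothness_energy(smooth), smoothness_energy(rough))
```

A slowly varying prediction vector yields a much smaller value than an oscillating one, so an upper bound \(c\) on this quantity excludes the oscillating hypotheses.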

#### 2.2.5 Graph based methods

Some graph regularization methods such as manifold regularization (Belkin and Niyogi 2004) also encode information about the labels of the unlabeled data. They also lead to convex quadratic constraints on \(\beta \). Here, along with the unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\), our side knowledge consists of an \(m\)-node weighted graph \(G = (V,E)\) with the Laplacian matrix \(L_{G} = D - A\). Here, \(D\) is an \(m\times m\)-dimensional diagonal matrix with the diagonal entry for each node equal to the sum of weights of the edges connecting it. Further, \(A\) is the adjacency matrix containing the edge weights \(a_{ij}\), where \(a_{ij} = 0\) if \((i,j) \notin E\) and \(a_{ij} = e^{-c\Vert \tilde{x}_{i}-\tilde{x}_{j}\Vert _{q}}\) if \((i,j) \in E\) (other choices for the weights are also possible). The quadratic function \(({X}_{U}^{T}\beta )^{T} L_{G}({X}_{U}^{T}\beta )\) is then the sum over all edges (each undirected edge counted once) of the weighted squared difference between the two node labels corresponding to the edge: \(\sum _{(i,j) \in E}a_{ij}\left( f(\tilde{x}_{i}) - f(\tilde{x}_{j})\right) ^{2}.\) Intuitively, if we have the side knowledge that this quantity is small, it means that a node should have similar labels to its neighbors. For classification, this typically encourages the decision boundary to avoid dense regions of the graph. The set \({\mathcal {F}}\) is defined as: \({\mathcal {F}}= \{\beta : \beta ^{T}\beta \le c_{1}, \beta ^{T}{X}_{U}L_{G}{X}_{U}^{T}\beta \le c\}\).
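The link between the Laplacian quadratic form and the edge-wise squared differences can be verified on a toy graph. The sketch below lists each undirected edge once and uses arbitrary weights rather than the exponential weights above; the constant in front of the edge sum depends on this counting convention. All names and values are hypothetical.

```python
def laplacian(m, weighted_edges):
    """L = D - A for an undirected graph given as {(i, j): weight} with i < j."""
    L = [[0.0] * m for _ in range(m)]
    for (i, j), w in weighted_edges.items():
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    return L

def quad_form(v, M):
    """Compute v^T M v for a square matrix given as nested lists."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

edges = {(0, 1): 1.0, (1, 2): 0.5}    # hypothetical edge weights a_ij
preds = [1.0, 2.0, 4.0]               # hypothetical predictions f(x_tilde_i)
L_G = laplacian(3, edges)
edge_sum = sum(w * (preds[i] - preds[j]) ** 2 for (i, j), w in edges.items())
print(quad_form(preds, L_G), edge_sum)
```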

### 2.3 Assumptions leading to conic constraints

We provide two scenarios that naturally lead to conic constraints on the model class: robustness against uncertainty and stochastic constraints.

#### 2.3.1 Robustness against uncertainty in linear constraints

Consider any of the linear constraints considered in Sect. 2.1. All of these can be generically represented as: \(\{a_k^T \beta \le 1\;\; \forall k=1,...,K\}\) where for each \(k\), \(a_k\) is a function of the unlabeled sample \(\{\tilde{x}_j\}_{j=1}^{m}\) (for instance, \(a_k = \tilde{x}_i - \tilde{x}_j\) for some pair \((i,j)\) in the case of poset constraints). Further assume that each \(a_k\) is only known to belong to an ellipsoid \(\varXi _{k} = \{\overline{a}_k + A_ku: u^Tu \le 1\}\) with both parameters \(\overline{a}_k\) and \(A_k\) known. This can happen due to measurement limitations, noise and other factors. We want to guarantee that, irrespective of the true value of \(a_{k} \in \varXi _k\), we still have \(a_k^T \beta \le 1\).
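The reason such a requirement yields a conic constraint is the standard robust-optimization identity \(\sup _{u^Tu \le 1}(\overline{a}_k + A_ku)^T\beta = \overline{a}_k^T\beta + \Vert A_k^T\beta \Vert _2\), which follows from the Cauchy-Schwarz inequality. Requiring the worst case to be at most 1 therefore gives the second order cone constraint \(\overline{a}_k^T\beta + \Vert A_k^T\beta \Vert _2 \le 1\). The sketch below checks the identity numerically; the function name and all numeric values are assumptions chosen for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def worst_case_value(a_bar, A, beta):
    """Closed form of sup over ||u||_2 <= 1 of (a_bar + A u)^T beta."""
    At_beta = [dot(col, beta) for col in zip(*A)]   # A^T beta
    return dot(a_bar, beta) + math.sqrt(dot(At_beta, At_beta))

a_bar = [1.0, 0.0]
A = [[0.3, 0.0], [0.0, 0.4]]
beta = [0.5, 0.5]
bound = worst_case_value(a_bar, A, beta)

# Sample points on the unit sphere; a linear function attains its sup there.
random.seed(0)
worst_sampled = -math.inf
for _ in range(2000):
    u = [random.gauss(0, 1), random.gauss(0, 1)]
    scale = math.sqrt(dot(u, u))
    u = [ui / scale for ui in u]
    a = [ab + dot(row, u) for ab, row in zip(a_bar, A)]
    worst_sampled = max(worst_sampled, dot(a, beta))

print(bound, worst_sampled)
```

No sampled perturbation exceeds the closed-form value, and the best samples come close to it.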

#### 2.3.2 Stochastic Programming

*Remark 1*

A question of practical interest would be about ways to impose constraints seen in Sects. 2.1, 2.2 and 2.3 in a computationally efficient manner. Fortunately, for all the cases we have considered thus far, the side knowledge can be encoded as a set of convex constraints leading to efficient algorithms (if the original empirical risk minimization problem is convex). Further, note that unlike must-link and similarity side knowledge that lead to convex constraints, cannot-link and dissimilarity knowledge is relatively harder to impose and is typically non-convex.

## 3 Generalization bounds

In each of the scenarios considered in Sect. 2, we are given \(m\) unlabeled examples \(\tilde{x}\) whose subsets satisfy various properties or side knowledge (for instance, linear ordering, quadratic neighborhood similarity, etc.). This side knowledge is also shown to constrain the hypothesis space in various ways. In this section, we will attempt to answer the following statistical question: what effect do these constraints have on the generalization ability of the learned model? We will compute bounds on the complexity of the hypothesis space when the types of constraints seen in Sect. 2 are included.

### 3.1 Definition of complexity measures

We will look at two complexity measures: the covering number of a hypothesis set and the Rademacher complexity of a hypothesis set. Their definitions are as follows:

**Definition 1**

*Covering Number* (Kolmogorov and Tikhomirov 1959): Let \(A \subseteq \varOmega \) be an arbitrary set and \((\varOmega , \rho )\) a (pseudo-)metric space. Let \(|\cdot |\) denote set size. For any \(\epsilon > 0\), an \(\epsilon \)**-cover** for \(A\) is a finite set \(U \subseteq \varOmega \) (not necessarily \( \subseteq A\)) s.t. \( \forall \omega \in A, \exists u \in U\) with \(d_{\rho }(\omega , u) \le \epsilon \). The **covering number** of \(A\) is \(N(\epsilon ,A,\rho ) := \inf _{U} |U|\) where \(U\) is an \(\epsilon \)-cover for \(A\).
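As an illustration of the definition, the sketch below builds an \(\epsilon \)-cover greedily for a finite grid of points under the Euclidean metric. The size of any such cover is an upper bound on the covering number of that finite set; the greedy rule, the grid and the value of \(\epsilon \) are assumptions made for this example and are not the construction used in the paper.

```python
import math

def greedy_cover(points, eps):
    """Add a point as a new center whenever no existing center is within eps."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# Grid sample of the unit square [0, 1]^2, a finite stand-in for the set A.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
eps = 0.5
cover = greedy_cover(grid, eps)

# Verify the defining property: every point has a center within eps.
covered = all(any(math.dist(p, c) <= eps for c in cover) for p in grid)
print(len(cover), covered)
```

Shrinking `eps` forces more centers, which mirrors how the covering number grows as \(\epsilon \) decreases.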

**Definition 2**

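*Rademacher Complexity* (Bartlett and Mendelson 2002): Given a training sample \(S = \{x_{1},...,x_{n}\}\), with each \(x_i\) drawn i.i.d. from \(\mu _{\mathcal {X}}\), and hypothesis space \({\mathcal {F}}\), \({\mathcal {F}}_{|S}\) is defined as the restriction of \({\mathcal {F}}\) with respect to \(S\). The *empirical Rademacher complexity of* \({\mathcal {F}}_{|S}\) is

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S}) = \frac{1}{n}\mathbb {E}_{\sigma }\left[ \sup _{f \in {\mathcal {F}}} \sum _{i=1}^{n}\sigma _{i}f(x_{i})\right] , \end{aligned}$$

where the \(\sigma _{i}\) are i.i.d. Rademacher random variables (\(\sigma _{i} = 1\) with probability 1/2 and \(\sigma _{i} = -1\) with probability 1/2). The *Rademacher complexity of* \({\mathcal {F}}\) is its expectation:

$$\begin{aligned} \mathcal {R}({\mathcal {F}}) = \mathbb {E}_{S \sim (\mu _{\mathcal {X}})^{n}}\left[ \mathcal {\bar{R}}({\mathcal {F}}_{|S})\right] . \end{aligned}$$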

If instead we let \(\sigma _{i} \sim \mathcal {N}(0,1)\) in the definition, this is the Gaussian complexity of the function class. Generalization bounds often use both these quantities in their statements (Bartlett and Mendelson 2002). Unless otherwise specified, the feature vectors in feature space \({\mathcal {X}}\) are bounded in norm by constant \(X_b > 0\) and the coefficient vectors of the linear function class \({\mathcal {F}}\) are bounded in norm with constant \(B_b > 0\).
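For a norm-bounded linear class the empirical Rademacher complexity can be computed exactly on small samples. The sketch below assumes the \(\frac{1}{n}\) normalization (consistent with the \(\frac{X_bB_b}{\sqrt{n}}\) bounds quoted later in this section) and an \(\ell _2\) ball of radius \(B_b\), for which the supremum over \(\beta \) equals \(B_b\Vert \sum _i \sigma _i x_i\Vert _2\). The data and the function name are hypothetical.

```python
import math
from itertools import product

def empirical_rademacher_l2_ball(X, B):
    """Exact (1/n) E_sigma sup_{||beta||_2 <= B} sum_i sigma_i beta^T x_i.

    For the l2 ball the supremum equals B * ||sum_i sigma_i x_i||_2, so the
    expectation is an average over all 2**n sign vectors.
    """
    n, p = len(X), len(X[0])
    total = 0.0
    for sigma in product((-1.0, 1.0), repeat=n):
        s = [sum(sig * x[k] for sig, x in zip(sigma, X)) for k in range(p)]
        total += B * math.sqrt(sum(v * v for v in s))
    return total / (2 ** n) / n

# Rows are the feature vectors x_i; each has unit norm, so X_b = 1.
X = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-0.8, 0.6]]
B_b, X_b = 2.0, 1.0
value = empirical_rademacher_l2_ball(X, B_b)
standard_bound = B_b * X_b / math.sqrt(len(X))
print(value, standard_bound)
```

The exact value stays below the standard bound, as Jensen's inequality guarantees; the side-knowledge constraints studied below shrink the supremum further.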

### 3.2 Complexity measures within generalization bounds

### 3.3 Complexity results with a single linear constraint

We state two results: the first is based on volumetric arguments and bounds the covering number and the second is based on convex duality and bounds the empirical Rademacher complexity. The first is a result from Tulabandhula and Rudin (2014) while the second is new to this paper.

**Volumetric upper bound on the covering number:** Tulabandhula and Rudin (2014) analyzed the setting where a bounded linear function class is further constrained by a half space. The motivation there was to study a specific type of side knowledge, namely knowledge about the cost to solve a decision problem associated with the learning problem. The result there extends well beyond operational costs and is applicable to our setting where we have an \(\ell _2\) bounded linear function class with a single half space constraint.

**Theorem 1**

*Intuition:* The function \(\alpha (p,a,\epsilon )\) can be considered to be the normalized volume of the ball (which is 1) minus the portion that is the spherical cap cut off by the linear constraint. It comes directly from formulae for the volume of spherical caps. We are integrating over the volume of a \(p-1\) dimensional sphere of radius \(r\sin \theta \) and the height term is \(d(r\cos \theta )\).

This bound shows that the covering number bound can depend on \(a\), which is a direct function of the unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\). As the norm \(\Vert a\Vert _2\) increases, \(\Vert a\Vert _2^{-1}\) decreases, thus \(\alpha (p,a,\epsilon )\) decreases, and the whole bound decreases. This is a mechanism by which side information on the labels of the unlabeled examples influences the complexity measure of the hypothesis set, potentially improving generalization.

*Relation to standard results:* It is known (Kolmogorov and Tikhomirov 1959) that set \(\mathcal {B} = \{\beta : \Vert \beta \Vert _{2} \le B_{b}\}\) (with \(B_b > 0\) being a fixed constant as before) has a bound on its covering number of the form \(N(\epsilon ,\mathcal {B},\Vert \cdot \Vert _{2}) \le \left( \frac{2B_{b}}{\epsilon } + 1\right) ^{p}\). Since in Theorem 1 the same term appears, multiplied by a factor that is at most one and that can be substantially less than one, the bound in Theorem 1 can be tighter.

The above result bounds the covering number complexity for the hypothesis set. Next, we will bound the empirical Rademacher complexity for the same hypothesis set as above.

#### 3.3.1 Convex duality based upper bound on empirical Rademacher complexity

Consider the setting in Theorem 1. Let \(x_i \in {\mathcal {X}}= \{x: \Vert x\Vert _2 \le X_b\}\) for \(i=1,...,n\) as before. Our attempt to use convex duality to upper bound empirical Rademacher complexity yields the following bound.

**Proposition 1**

*Intuition:* We can understand the effect of the linear constraint on the upper bound through the magnitude of vector \(a\). Without loss of generality, let the expectation of the optimal value of the first minimization problem be higher (both minimization problems are structurally similar to each other except for a sign change within the norm term). For a fixed value of \(\sigma \), this minimization problem involves the distance of vector \({X}_{L}\sigma \) to the scaled vector \(a\) in the first term and the scaling factor \(\eta \) itself as the second term.

Thus, generally, if \(\Vert a\Vert _2\) is large, the scaling factor \(\eta \) can be small, resulting in a lower optimal value. We also know that larger \(\Vert a\Vert _2\) corresponds to a tighter half space constraint. Thus, as the linear constraint on the hypothesis space becomes tighter, it makes the optimal solution \(\eta \) and the optimal value smaller for each \(\sigma \) vector. As a result, it tightens the upper bound on the empirical Rademacher complexity.

*Relation to standard results:* An upper bound on each term of the \(\max \) operation above can be found by setting \(\eta = 0\), which recovers the standard upper bound of \(\frac{B_b\sqrt{\mathrm{trace }(X_L^TX_L)}}{\sqrt{n}}\) or \(\frac{B_bX_b}{\sqrt{n}}\) without capturing the effect of the linear constraint \(a^T\beta \le 1\).

### 3.4 Complexity results with polygonal/multiple linear constraints and general norm constraints

The following result is from Tulabandhula and Rudin (2013), where the authors analyze the effect of decision making bias on generalization of learning. Again, as in the single linear constraint case, the result extends beyond the setting considered in that paper. In particular, it covers all the motivating scenarios described in Sect. 2.1.

**Theorem 2**

*Intuition:* The linear assumptions on the labels of the unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\) determine the parameters \(\{\tilde{c}_{j\nu }\}_{j,\nu }\) that in turn influence the complexity measure bound. In particular, as the linear constraints given by the \(c_{j\nu }\)’s force the hypothesis space to be smaller, they force \(|P_{c}^{K}|\) to be smaller. This leads to a tighter upper bound on the covering number.

*Relation to standard results:* We recover the covering number bound for linear function classes given in (Zhang 2002) when there are no linear constraints. In this case, the polytope \(P^{K}\) is well structured and the number of integer points in it can be upper bounded in an explicit way combinatorially.

It is possible to use convex duality to upper bound the empirical Rademacher complexity as we did in Proposition 1. However, the intuition is less clear, and thus, we omit the bound here.

### 3.5 Complexity results with quadratic constraints

Consider the set \({\mathcal {F}}= \{f: f=\beta ^{T} x, \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1 \}.\) Assume that at least one of the matrices is positive definite and both are positive-semidefinite, symmetric. Let \(\varXi _{1} = \{\beta : \beta ^{T}A_{1}\beta \le 1\}\) and \(\varXi _{2} = \{\beta : \beta ^{T}A_{2}\beta \le 1\}\) be the corresponding ellipsoid sets.

#### 3.5.1 Upper bound on empirical Rademacher complexity

We first find an ellipsoid \(\varXi _{\mathrm{int }\gamma }\) (with matrix \(A_{\mathrm{int }\gamma }\)) circumscribing the intersection of the two ellipsoids \(\varXi _{1}\) and \(\varXi _{2}\) and then find a bound on the Rademacher complexity of a corresponding function class leading to our result for the quadratic constraint case. We will pick matrix \(A_{\mathrm{int }\gamma }\) to have a particularly desirable property, namely that it is *tight*. We will call a circumscribing ellipsoid *tight* when no other ellipsoidal boundary comes between its boundary and the intersection (\(\varXi _{1}\cap \varXi _{2}\)). If we thus choose this property as our criterion for picking the ellipsoid, then according to the following result, we can do so by a convex combination of the original ellipsoids:

**Theorem 3**

(Circumscribing ellipsoids, Kahan 1968) There is a family of circumscribing ellipsoids that contains every tight ellipsoid. Every ellipsoid \(\varXi _{\mathrm{int }\gamma }\) in this family has \(\varXi _{\mathrm{int }\gamma } \supseteq (\varXi _{1}\cap \varXi _{2})\) and is generated by matrix \(A_{\mathrm{int }\gamma } = \gamma A_{1} + (1-\gamma ) A_{2}\), \(\gamma \in [0,1]\).

Using the above theorem, we can find a tight ellipsoid \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\) that contains the set \(\{ \beta : \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1\}\) easily. Note that the right hand sides of the quadratic constraints defining these ellipsoids can be equal to one without loss of generality.
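The containment \(\varXi _{\mathrm{int }\gamma } \supseteq (\varXi _{1}\cap \varXi _{2})\) is easy to see: if \(\beta ^{T}A_{1}\beta \le 1\) and \(\beta ^{T}A_{2}\beta \le 1\), then any convex combination of the two left hand sides is also at most 1. The sketch below checks this on random points for two hypothetical axis-aligned ellipsoids; the matrices, the sampling box and the helper names are assumptions.

```python
import random

def quad_form(beta, A):
    """Compute beta^T A beta for a square matrix given as nested lists."""
    n = len(beta)
    return sum(beta[i] * A[i][j] * beta[j] for i in range(n) for j in range(n))

def convex_combination(A1, A2, gamma):
    """Entrywise gamma * A1 + (1 - gamma) * A2."""
    return [[gamma * a + (1.0 - gamma) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(A1, A2)]

A1 = [[1.0, 0.0], [0.0, 0.25]]    # semi-axes 1 and 2
A2 = [[0.25, 0.0], [0.0, 1.0]]    # semi-axes 2 and 1

random.seed(1)
checked, violations = 0, 0
for _ in range(5000):
    beta = [random.uniform(-2, 2), random.uniform(-2, 2)]
    if quad_form(beta, A1) <= 1 and quad_form(beta, A2) <= 1:
        for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
            checked += 1
            A_gamma = convex_combination(A1, A2, gamma)
            if quad_form(beta, A_gamma) > 1 + 1e-12:
                violations += 1
print(checked, violations)
```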

**Theorem 4**

*Intuition:* If the quadratic constraints are such that they correspond to small ellipsoids, then the circumscribing ellipsoid will also be small. Correspondingly, the eigenvalues of \(A_{\mathrm{int}\gamma }\) will be large. Since the upper bound depends inversely on the magnitude of these eigenvalues (through \(A_{\mathrm{int}\gamma }^{-1}\)), it becomes tighter. Also, in the setting where the original ellipsoids are large and elongated but their intersection region is small and can be bounded by a small circumscribing ellipsoid, the upper bound is again tighter.

*Relation to standard results:* If \(A_{\mathrm{int }\gamma }\) is diagonal (or axis-aligned), then we can write the empirical complexity \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})\) in terms of the eigenvalues \(\{\lambda _{i}\}_{i=1}^{p}\) as \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \frac{1}{n}\sqrt{\sum _{j=1}^{n}\sum _{i=1}^{p}\frac{x_{ji}^{2}}{\lambda _{i}}}\) and this can be bounded by \(\frac{X_{b}B_{b}}{\sqrt{n}}\) (Kakade et al. 2008) when \(A_{2} = \mathbf {0}\). In that case, all of the \(\lambda _i\) are \(\frac{1}{B_{b}^{2}}\).
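The diagonal-case expression can be evaluated directly. The sketch below computes \(\frac{1}{n}\sqrt{\sum _{j}\sum _{i}x_{ji}^{2}/\lambda _{i}}\) for a small made-up sample, first with all \(\lambda _i = \frac{1}{B_b^2}\) and then with one eigenvalue increased, as a stand-in for the effect of a second quadratic constraint. The eigenvalues, data and function name are illustrative assumptions.

```python
import math

def diagonal_bound(X, lambdas):
    """(1/n) * sqrt(sum_j sum_i x_ji^2 / lambda_i) for a diagonal A_int."""
    n = len(X)
    total = sum(x_ji ** 2 / lam for row in X for x_ji, lam in zip(row, lambdas))
    return math.sqrt(total) / n

X = [[0.6, 0.8], [1.0, 0.0], [0.0, -0.5]]   # rows are x_j, norms at most 1
B_b, X_b = 2.0, 1.0
n = len(X)

ball_only = diagonal_bound(X, [1.0 / B_b ** 2, 1.0 / B_b ** 2])
with_side_knowledge = diagonal_bound(X, [1.0 / B_b ** 2, 4.0])
standard = X_b * B_b / math.sqrt(n)
print(with_side_knowledge, ball_only, standard)
```

Larger eigenvalues shrink the expression, which matches the intuition that a smaller circumscribing ellipsoid gives a tighter bound.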

*Remark 2*

#### 3.5.2 Lower bound on empirical Rademacher complexity

We will now show that the dependence of the complexity on \(A_{\mathrm{int }\gamma }^{-1}\) is near optimal.

Since \(A_{\mathrm{int}\gamma }\) is a real symmetric matrix, let us decompose \(A_{\mathrm{int}\gamma }\) into a product \(P^T DP\) where \(D\) is a diagonal matrix with the eigenvalues of \(A_{\mathrm{int}\gamma }\) as its entries and \(P\) is an orthogonal matrix (i.e., \(P^T P=I\)). Our result, which is similar in form to the upper bound of Theorem 4, is as follows.

**Theorem 5**

*Intuition:* The lower bound is showing that the dependence on \(\sqrt{\mathrm{trace }({X}_{L}^TA_{\mathrm{int}\gamma }^{-1}{X}_{L})}\) is tight modulo a \(\log n\) factor and a factor (\(\kappa \)). The \(\log n\) factor is essentially due to the use of the relation between Gaussian and Rademacher complexities in our proof technique. On the other hand, \(\kappa \) depends on the interaction between the side knowledge about the unlabeled examples (captured through matrix \(P\)) and the feature matrix \({X}_{L}\). If there is no interaction, that is, \(P{X}_{L}\) has zero valued rows for all \(j=1,...,p\), then the lower bound on empirical Rademacher complexity becomes equal to 0. On the other hand, when there is higher interaction between \(A_{\mathrm{int}\gamma }\) (or equivalently, \(P\)) and \({X}_{L}\), then the factor \(\kappa \) grows larger, tightening the lower bound on the empirical Rademacher complexity.

The dependence of the lower bound on the strength of the additional convex quadratic constraint is captured via \(A_{\mathrm{int}\gamma }\) and behaves in a similar way to the upper bound. That is, when the constraint leads to a small circumscribing ellipsoid, the eigenvalues of \(A_{\mathrm{int}\gamma }^{-1}\) are small and the lower bound is small (just like the upper bound). On the other hand, if the constraint leads to a larger circumscribing ellipsoid, the eigenvalues of \(A_{\mathrm{int}\gamma }^{-1}\) are large, leading to higher values of the lower bound (the upper bound also increases similarly).

*Relation to standard results:* As with the upper bound, when there is no second quadratic constraint, \(A_{\mathrm{int}\gamma }= \frac{1}{B_b^2}\mathbb {I}\). The lower bound depends on the training data through the term \(\sqrt{\mathrm{trace }({X}_{L}^T{X}_{L})}\) in this case.
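As a small numerical illustration, the sketch below computes the empirical Rademacher complexity of an ellipsoid-constrained linear class exactly, by enumerating all sign vectors, and compares it with a quantity of the form \(\frac{1}{n}\sqrt{\mathrm{trace }({X}_{L}^TA^{-1}{X}_{L})}\). It assumes a diagonal positive definite \(A\), uses the closed form \(\sup _{\beta ^TA\beta \le 1}|g^T\beta | = \sqrt{g^TA^{-1}g}\), and treats the examples as the columns of \({X}_{L}\). The data, the value of \(B_b\) and the second ellipsoid are made up for illustration.

```python
import itertools
import math

def exact_rademacher_ellipsoid(X_cols, a_diag):
    """Exact (1/n) E_sigma sup_{beta^T A beta <= 1} |g^T beta|, g = sum_i sigma_i x_i.

    X_cols: list of n examples (the columns of X_L), each a list of length p.
    a_diag: diagonal of A (assumed diagonal and positive for this sketch).
    The supremum has the closed form sqrt(g^T A^{-1} g).
    """
    n, p = len(X_cols), len(a_diag)
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=n):
        g = [sum(s * x[j] for s, x in zip(sigma, X_cols)) for j in range(p)]
        total += math.sqrt(sum(gj * gj / aj for gj, aj in zip(g, a_diag)))
    return total / (2 ** n) / n

def trace_bound(X_cols, a_diag):
    """(1/n) sqrt(trace(X_L^T A^{-1} X_L)) for diagonal A."""
    n = len(X_cols)
    tr = sum(xj * xj / aj for x in X_cols for xj, aj in zip(x, a_diag))
    return math.sqrt(tr) / n

X_cols = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.5]]  # hypothetical data
B_b = 2.0
ball = [1.0 / B_b ** 2] * 2        # A = I / B_b^2, i.e. no second constraint
tight = [1.0 / B_b ** 2, 4.0]      # a hypothetical tighter ellipsoid

print(exact_rademacher_ellipsoid(X_cols, ball), trace_bound(X_cols, ball))
print(exact_rademacher_ellipsoid(X_cols, tight), trace_bound(X_cols, tight))
```

With \(A = \frac{1}{B_b^2}\mathbb {I}\) the trace term reduces to \(B_b\sqrt{\mathrm{trace }({X}_{L}^T{X}_{L})}\), and shrinking the ellipsoid along one axis lowers both the exact value and the trace term.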

*Comparison to the upper bound:* For comparison, we see that the upper bound in Theorem 4 is of the form \(\frac{1}{n}\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})}\) while the lower bound of Theorem 5 is of the form

The proof for the lower bound is similar to what one would do for estimating the complexity of an ellipsoid itself (without regard to a corresponding linear function class). See also the work of Wainwright (2011) for handling single ellipsoids.

#### 3.5.3 Comparison of empirical Rademacher complexity upper bound with a covering number based bound

First, \(A_{1}\) and \(A_{2}\) are simultaneously diagonalized by congruence (say with a non-singular matrix called \(C\)) to obtain diagonal matrices \(\mathrm{Diag }(a_{1i})\) and \(\mathrm{Diag }(a_{2i})\). We can guarantee that the set of ratios \(\{\frac{a_{1i}}{a_{2i}}\}\) obtained will be unique.

- The desired ellipsoid \(A_{\mathrm{int }\gamma ^*}\) can then be obtained by computing $$\begin{aligned} \gamma ^* \in \arg \max _{\gamma \in [0,1]} \prod _{i=1}^{p}(\gamma a_{1i} + (1-\gamma )a_{2i}) \end{aligned}$$ and then multiplying the optimal diagonal matrix \(\mathrm{Diag }(\gamma ^* a_{1i} + (1-\gamma ^*)a_{2i})\) with the congruence matrix \(C\) appropriately. The optimal \(\gamma ^*\) can be found in polynomial time (for example, using Newton-Raphson).
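The one-dimensional search for \(\gamma ^*\) can be sketched as follows. This is only an illustration under the assumption that all \(a_{1i}\) and \(a_{2i}\) are strictly positive, in which case the logarithm of the product is concave in \(\gamma \) and its derivative changes sign at most once. The sketch uses plain bisection on that derivative rather than the Newton-Raphson iteration mentioned above, and the function name and inputs are hypothetical.

```python
def optimal_gamma(a1, a2, tol=1e-12):
    """Maximize prod_i (gamma*a1[i] + (1-gamma)*a2[i]) over gamma in [0, 1].

    Assumes all entries of a1 and a2 are strictly positive, so the log of the
    product is concave in gamma and its derivative is non-increasing.
    """
    def dlog(gamma):
        # Derivative of sum_i log(gamma*a1[i] + (1-gamma)*a2[i]).
        return sum((u - v) / (gamma * u + (1.0 - gamma) * v)
                   for u, v in zip(a1, a2))

    if dlog(0.0) <= 0.0:   # objective already decreasing at the left endpoint
        return 0.0
    if dlog(1.0) >= 0.0:   # objective still increasing at the right endpoint
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dlog(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Two hypothetical diagonals obtained after simultaneous diagonalization.
a1, a2 = [1.0, 4.0], [4.0, 1.0]
gamma_star = optimal_gamma(a1, a2)
diag_star = [gamma_star * u + (1.0 - gamma_star) * v for u, v in zip(a1, a2)]
```

For this symmetric example the two ellipsoids are mirror images of each other, so the search settles at the midpoint and the resulting diagonal is constant.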

**Comparison with the duality approach to upper bounding empirical Rademacher complexity:** A convex duality based upper bound can be derived as shown below.

**Theorem 6**

This upper bound looks similar to the result in Eq. (1). Note that \(A_{\mathrm{int }\eta }\) is different from \(A_{\mathrm{int}\gamma }\) in Theorem 4. \(A_{\mathrm{int}\gamma }\) comes from a circumscribing ellipsoid, whereas \(A_{\mathrm{int }\eta }\) does not.

Instead, the matrix \(A_{\mathrm{int }\eta }\) is picked such that \(\eta \) minimizes the right hand side of the bound in Eq. (3). Qualitatively, we can see that if the matrix \(A_2\) corresponding to the second ellipsoid constraint has large eigenvalues (for instance, when the second ellipsoid is a smaller sphere, or is an elongated thin ellipsoid), then \(A_{\mathrm{int }\eta }^{-1}\) is ‘small’ (the eigenvalues are small) leading to a tighter upper bound on the empirical Rademacher complexity.

#### 3.5.4 Extension to multiple convex quadratic constraints

Although Sect. 3.5 deals with only two convex quadratic constraints, the same strategy can be used to upper bound the complexity of a hypothesis class constrained by multiple convex quadratic constraints. In particular, let \({\mathcal {F}}= \{f: f=\beta ^{T} x, \beta ^{T}A_{k}\beta \le 1 \;\;\forall k=1,...,K \}\). Again, assume one of the matrices \(A_k\) is positive definite. We can approach this problem in two stages. In the first step, we find an ellipsoid \(\varXi _{\mathrm{int }\gamma }\) (with matrix \(A_{\mathrm{int}\gamma }\)) circumscribing the intersection of the \(K\) original ellipsoids, and in the second step, we reuse Theorem 4 to obtain an upper bound on \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})\).
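One simple way to carry out the first step, sketched below, is to take a convex combination of the \(A_k\): if \(\beta ^TA_k\beta \le 1\) for every \(k\) and the weights are nonnegative and sum to one, then \(\beta ^T(\sum _k \gamma _kA_k)\beta \le 1\) as well, so the combined ellipsoid contains the intersection. This is an illustrative choice rather than necessarily the construction intended here; the matrices and weights are made up, and the weights are not optimized for volume.

```python
import random

def quad_form(A, beta):
    """Return beta^T A beta for a square matrix A given as nested lists."""
    p = len(beta)
    return sum(beta[i] * A[i][j] * beta[j] for i in range(p) for j in range(p))

def convex_combination(mats, weights):
    """Return sum_k weights[k] * mats[k]; weights are assumed to lie on the simplex."""
    p = len(mats[0])
    return [[sum(w * A[i][j] for w, A in zip(weights, mats)) for j in range(p)]
            for i in range(p)]

# Three hypothetical positive definite matrices (K = 3, p = 2).
A_list = [
    [[1.0, 0.0], [0.0, 0.25]],
    [[0.25, 0.0], [0.0, 1.0]],
    [[0.5, 0.3], [0.3, 0.5]],
]
weights = [0.5, 0.3, 0.2]
A_int = convex_combination(A_list, weights)

# Sample points and check that every point in the intersection also lies
# in the ellipsoid defined by A_int.
rng = random.Random(0)
inside = 0
violations = 0
for _ in range(20000):
    beta = [rng.uniform(-2.5, 2.5), rng.uniform(-2.5, 2.5)]
    if all(quad_form(A, beta) <= 1.0 for A in A_list):
        inside += 1
        if quad_form(A_int, beta) > 1.0 + 1e-12:
            violations += 1
```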

### 3.6 Complexity results with linear and quadratic constraints

Let matrix \(A_{\mathrm{int }\gamma }\) be such that \(\{ \beta : \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1\}\) is circumscribed by \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\). Defining \(\{\tilde{c}_{j\nu }\}\) and \({X_{sL}}\) in the same way as in Sect. 3.3, we get the following corollary.

**Corollary 1**

The corollary holds for any \(A_{\mathrm{int }\gamma }\) that satisfies the circumscribing requirement. In particular, we can construct the ellipsoid \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\) such that it ‘tightly’ circumscribes the set \(\{ \beta : \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1\}\) using Theorem 3 in the same way as we did in Sect. 3.5. The intuition for how the parameters of our side knowledge, namely, the linear inequality coefficients and the matrices corresponding to the ellipsoids, affect the bound is the same as in Sects. 3.4 and 3.5. The relation to standard results has also been discussed in these sections.

#### 3.6.1 Extension to arbitrary convex constraints

There are at least three ways to reuse the results we have with linear, polygonal, quadratic and conic constraints to give upper bounds on covering number or empirical Rademacher complexity of function classes with arbitrary convex constraints. Such arbitrary convex constraints can arise in many settings. For instance, when the convex quadratic constraints in Sect. 2.2 are not symmetric around the origin, we cannot use the results of Sect. 3.5 directly, but the following techniques apply. Other typical convex constraints include those arising from likelihood models, entropy biases and so on.

The first approach involves constructing an outer polyhedral approximation of the convex constraint set. For instance, if we are given a separation oracle for the convex constraint, constructing an outer polyhedral approximation is relatively straightforward. We can also optimize for properties like the number of facets or vertices of the polyhedron during such a construction. Given such an outer approximation, we can apply Theorem 2 to get an upper bound on the covering number of the hypothesis space with the given convex constraint.
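As a toy instance of the outer polyhedral approximation, the sketch below circumscribes the unit disc in two dimensions by \(m\) tangent half-planes, which play the role of the cuts a separation oracle would return. The disc, the evenly spaced normals and the function names are assumptions chosen only for illustration. Every point of the disc satisfies all \(m\) linear constraints, and the polygon's vertices sit at distance \(1/\cos (\pi /m)\) from the origin, so adding facets tightens the relaxation.

```python
import math
import random

def tangent_halfplanes(m):
    """Unit normals u_j of m half-planes {beta : u_j . beta <= 1} tangent to the unit disc."""
    return [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
            for j in range(m)]

def in_polygon(normals, beta, tol=1e-12):
    """True if beta satisfies every linear constraint u . beta <= 1."""
    return all(u[0] * beta[0] + u[1] * beta[1] <= 1.0 + tol for u in normals)

def circumradius(m):
    """Distance from the origin to each vertex of the tangent polygon (m >= 3)."""
    return 1.0 / math.cos(math.pi / m)

normals = tangent_halfplanes(8)
rng = random.Random(1)
all_inside = True
for _ in range(5000):
    # Draw a point uniformly from the unit disc.
    r, t = math.sqrt(rng.random()), 2 * math.pi * rng.random()
    if not in_polygon(normals, (r * math.cos(t), r * math.sin(t))):
        all_inside = False
```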

The second approach involves constructing a circumscribing ellipsoid for the constraint set. This is possible for any convex set in general (John 1948). In addition, if the convex set is symmetric around the origin, the ‘tightness’ of the circumscribing ellipsoid improves by a factor \(\sqrt{p}\), where \(p\) is the dimension of the linear coefficient vector \(\beta \). Given such a circumscribing ellipsoid, we can apply Theorem 4 to get an upper bound on the empirical Rademacher complexity of the original function class with the convex constraint. The quality of both of these outer relaxation approaches depends on the structure and form of the convex constraint we are given.

The third approach is to analyze the empirical Rademacher complexity directly using convex duality as we have done for the linear and quadratic cases, and as we will do for the conic case next.

### 3.7 Complexity results with multiple conic constraints

**Theorem 7**

*Intuition:* When \(\Vert a_k\Vert _2\) and \(d_k\) are \(o(\lambda _{\min }(A_k))\), the conic constraints can influence the upper bound on the empirical Rademacher complexity and make the corresponding generalization bounds tighter. From a geometric point of view, we can infer the following: if the cones are sharp, then the values \(\lambda _{\min }(A_k)\) are high, implying a smaller empirical Rademacher complexity. Figure 2 illustrates this in two dimensions.

*Relation to standard results:* The looser unconstrained version of the upper bound \(\frac{X_bB_b}{\sqrt{n}}\) is recovered when there are no conic constraints or when the conic constraints are ineffective (for instance, when \(\Vert a_k\Vert _2\) is high, \(d_k\) is a large offset or \(\lambda _{\min }(A_k)\) is small).

*Remark 3*

There have been some recent attempts to obtain bounds on a related measure, similar to the empirical Gaussian complexity defined here, in the compressed sensing literature that also involves conic constraints (Stojnic 2009). Their objective (minimum number of measurements for signal recovery assuming sparsity) is very different from our objective (function class complexity and generalization). In the former context, there are a few results (Chandrasekaran et al. 2012) dealing with the intersection of a single generic cone with a sphere (\(\mathbb {S}^{p-1}\)) whereas in this context, we look at the intersection of multiple second order cones (explicitly parameterized by \(\{A_k,a_k,d_k\}_{k=1}^{K}\)) with balls (\(\{\beta ^T\beta \le B_b^2\}\)).

## 4 Related work

It is well-known that having additional unlabeled examples can aid in learning (Shental et al. 2004; Nguyen and Caruana 2008b; Gómez-Chova et al. 2008), and this has been the subject of research in semi-supervised learning (Zhu 2005). The present work is fundamentally different from semi-supervised learning, because semi-supervised learning exploits the distributional properties of the set of unlabeled examples. In this work, we do not necessarily have enough unlabeled examples to study these distributional properties, but these unlabeled examples do provide us information about the hypothesis space. Distributional properties used in semi-supervised learning include cluster assumptions (Singh et al. 2008; Rigollet 2007) and manifold assumptions (Belkin and Niyogi 2004; Belkin et al. 2004). In our work, the information we get from the unlabeled examples allows us to restrict the hypothesis space, which lets us be in the framework of empirical risk minimization and give theoretical generalization bounds via complexity measures of the restricted hypothesis spaces (Bartlett and Mendelson 2002; Vapnik 1998). While the focus of many works (e.g., Zhang 2002; Maurer 2006) is on complexity measures for ball-like function classes, our hypothesis spaces are more complicated, and arise here from constraints on the data.

Researchers have also attempted to incorporate domain knowledge directly into learning algorithms, where this domain knowledge does not necessarily arise from unlabeled examples. For instance, the framework of knowledge based SVMs (Fung et al. 2002; Le et al. 2006) motivates the use of various constraints or modifications in the learning procedure to incorporate specific kinds of knowledge (without using unlabeled examples). The focus of Fung et al. (2002) is algorithmic and they consider linear constraints. Le et al. (2006) incorporate knowledge by modifying the function class itself, for instance, from linear functions to non-linear functions.

In a different framework, that of Valiant’s PAC learning, there are concentration statements about the risks in the presence of unlabeled examples (Balcan and Blum 2005; Kääriäinen 2005), though in these results, the unlabeled points are used in a very different way than in our work. Specifically, in the work of Balcan and Blum (2005), the authors introduce the notion of incompatibility \(\mathbb {E}_{x\sim D}[1 - \chi (h,x)]\) between a function \(h\) and the input distribution \(D\). The unlabeled examples are used to estimate the distribution dependent quantity \(\mathbb {E}_{x\sim D}[1 - \chi (h,x)]\). By imposing the constraint that models have their incompatibility with the distribution of the data source \(D\) below a desired level, we restrict the hypothesis space. Their result for a finite hypothesis space is as follows:

**Theorem 8**

Here \(C\) is the finite hypothesis space of which \(h\) is an element and \(C_{D,\chi }(\epsilon ) = \{h \in C: \mathbb {E}_{x\sim D}[1-\chi (h,x)] \le \epsilon \}\). In the work of Kääriäinen (2005), the author obtains a generalization bound by approximating the disagreement probability of pairs of classifiers using unlabeled data. Again, here the unlabeled data is used to estimate a distribution dependent quantity, namely, the true disagreement probability between consistent models. In particular, the disagreement between two models \(h\) and \(g\) is defined to be \(d(h,g) = \frac{1}{m}\sum _{i=1}^{m}1_{[h(\tilde{x}_i) \ne g(\tilde{x}_i)]}\). The following theorem about generalization is proposed.

**Theorem 9**

Note that the randomization in both Theorems 8 and 9 is also over unlabeled data. In our theorems, we do not randomize with respect to the unlabeled data. For us, they serve a different purpose and do not need to be chosen randomly. While their results focus on exploiting unlabeled data to estimate distribution dependent quantities, our technology focuses on exploiting unlabeled data to restrict the hypothesis space directly.
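For concreteness, the empirical disagreement \(d(h,g)\) used by Kääriäinen (2005), as defined above, is simply the fraction of unlabeled points on which two classifiers differ. The sketch below computes it for two hypothetical threshold classifiers on made-up points; none of these names come from the cited work.

```python
def disagreement(h, g, unlabeled):
    """Fraction of unlabeled points on which classifiers h and g predict differently."""
    m = len(unlabeled)
    return sum(1 for x in unlabeled if h(x) != g(x)) / m

def h(x):
    """Hypothetical threshold classifier with threshold 0."""
    return 1 if x > 0.0 else -1

def g(x):
    """Hypothetical threshold classifier with threshold 0.5."""
    return 1 if x > 0.5 else -1

points = [-1.0, 0.2, 0.4, 0.7, 1.5]   # made-up unlabeled sample
d_hg = disagreement(h, g, points)     # the classifiers differ on 0.2 and 0.4
```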

## 5 Proofs

### 5.1 Proof of Proposition 1

*Proof*

Instead of working with the maximization problem in the definition of empirical Rademacher complexity, we will work with a couple of related maximization problems, due to the following lemma.

**Lemma 1**

*Proof*

*Continuing with the proof of Proposition 1:* Let \(g = \sum _{i=1}^{n}\sigma _ix_i = X_L\sigma \) so that \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})= \frac{1}{n}\mathbb {E}[\sup _{\beta \in {\mathcal {F}}} |g^T\beta |]\). We will attempt to dualize the two maximization problems in the upper bound provided by Lemma 1 to get a bound on the empirical Rademacher complexity. Both maximization problems are very similar except for the objective. Let \(\omega (g,{\mathcal {F}})\) be the optimal value of the following optimization problem:

### 5.2 Proof of Theorem 4

*Proof*

### 5.3 Proof of Theorem 5

*Proof*

We will analyze the Gaussian function \(F(\omega (\sigma )) := \sup _{\alpha ^T D \alpha \le 1} \alpha ^{T}\omega (\sigma )\) and show it is Lipschitz in \(\sigma \). This is proved in Lemma 2.

Then we apply Lemma 3, which is about Gaussian function concentration, to the above function. In particular, we will upper bound the variance of the Gaussian function \(F(\omega (\sigma ))\) in terms of its parameters (Lipschitz constant, matrix \(D\), etc).

We then generate a candidate lower bound for the empirical Gaussian complexity.

The upper bound on the variance of \(F(\omega (\sigma ))\) we found earlier is used to make this bound proportional to \(\sqrt{\mathrm{trace }({X}_{L}^TA_{\mathrm{int}\gamma }^{-1}{X}_{L})}\).

Finally, we use a relation between empirical Rademacher complexity and empirical Gaussian complexity to obtain the desired result.
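The function \(F\) in the outline above has a simple closed form when \(D\) is diagonal with strictly positive entries: \(\sup _{\alpha ^TD\alpha \le 1}\alpha ^T\omega = \sqrt{\omega ^TD^{-1}\omega }\), attained at \(\alpha ^* = D^{-1}\omega /\sqrt{\omega ^TD^{-1}\omega }\). The sketch below checks this numerically on made-up values of \(\omega \) and \(D\); the positivity of \(D\) and all names are assumptions of the illustration. Since this expression is a norm of \(\omega \), it is Lipschitz in \(\omega \) with constant \(1/\sqrt{\lambda _{\min }(D)}\), which is the kind of dependence on \(\lambda _{\min }(D)\) that appears in Lemma 2.

```python
import math
import random

def F_closed_form(omega, d):
    """sup over {alpha : sum_j d[j]*alpha[j]^2 <= 1} of alpha . omega, for d[j] > 0."""
    return math.sqrt(sum(w * w / dj for w, dj in zip(omega, d)))

def maximizer(omega, d):
    """The feasible alpha attaining the supremum (omega assumed nonzero)."""
    scale = F_closed_form(omega, d)
    return [w / (dj * scale) for w, dj in zip(omega, d)]

omega, d = [3.0, -1.0, 2.0], [1.0, 4.0, 0.5]   # hypothetical values
alpha_star = maximizer(omega, d)
value_at_star = sum(a * w for a, w in zip(alpha_star, omega))
constraint_at_star = sum(dj * a * a for a, dj in zip(alpha_star, d))

# No random feasible point should beat the closed form.
rng = random.Random(2)
best_random = 0.0
for _ in range(5000):
    a = [rng.gauss(0.0, 1.0) for _ in d]
    norm = math.sqrt(sum(dj * x * x for x, dj in zip(a, d)))
    a = [x / norm for x in a]                  # rescale onto the boundary
    best_random = max(best_random, sum(x * w for x, w in zip(a, omega)))
```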

#### 5.3.1 Computing a Lipschitz constant for \(F(\omega (\sigma ))\)

The following lemma gives an upper bound on the Lipschitz constant of \(F(\omega (\sigma ))\).

**Lemma 2**

The function \(F(\omega (\sigma )):= \sup _{\alpha ^T D \alpha \le 1} \alpha ^{T}\omega (\sigma )\) is Lipschitz in \(\sigma \) with a Lipschitz constant \(\mathcal {L}\) bounded by \(X_b\sqrt{\frac{p\cdot n}{\lambda _{\min }(D)}}\).

*Proof*

**Upper bounding the variance of** \(F(\omega (\sigma ))\) **using Gaussian concentration**: The following lemma describes concentration for Lipschitz functions of Gaussian random variables.

**Lemma 3**

Let \(Y=|F(\omega )-\mathbb {E}_{\sigma }[F(\omega )]|\). Then from the above tail bound, \(P(Y^{2} \ge s) \le 2 e^{-\frac{s}{2\mathcal {L}^2}}\) holds. Now we can bound the variance of \(F(\omega )\) using the above inequality and the following lemma.

**Lemma 4**

For a random variable \(Y^2\), \(\mathbb {E}[Y^2]=\int ^{+\infty }_{0}P(Y^2\ge s)ds\).

*Proof*
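With Lemma 4 in hand, the variance bound follows from a short calculation. As a sketch, assuming only the tail inequality \(P(Y^{2} \ge s) \le 2 e^{-\frac{s}{2\mathcal {L}^2}}\) stated above and the fact that \(\mathbb {E}[Y^2]\) is by definition the variance of \(F(\omega )\):

$$\begin{aligned} \mathrm{Var }(F(\omega )) = \mathbb {E}[Y^2] = \int ^{+\infty }_{0}P(Y^2\ge s)ds \le \int ^{+\infty }_{0} 2 e^{-\frac{s}{2\mathcal {L}^2}}ds = 4\mathcal {L}^2. \end{aligned}$$

The constant could be sharpened slightly by using \(P(Y^2\ge s)\le 1\) for small \(s\), but the scaling in \(\mathcal {L}^2\) is what matters for the argument.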

#### 5.3.2 Lower bounding the empirical Gaussian complexity

Now we will lower bound the empirical Gaussian complexity by constructing a feasible candidate \(\alpha '\) to substitute for the \(\sup \) operation in Eq. (6). Later, we will use the variance upper bound on \(F(\omega )\) we found in the earlier section to make the bound more specific.

#### 5.3.3 Making the lower bound more specific using variance of \(F(\omega (\sigma ))\)

#### 5.3.4 Using the relation between Rademacher and Gaussian complexities

The empirical Gaussian complexity is related to the empirical Rademacher complexity as follows.

**Lemma 5**

### 5.4 Proof of Corollary 1

*Proof*

### 5.5 Proof of Theorem 6

*Proof*

### 5.6 Proof of Theorem 7

*Proof*

In the second step, we will minimize \( \sup _{\beta }\mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K})\) over variable \(\gamma \) (one of the dual variables) to get an upper bound on \(\omega (g,{\mathcal {F}})\) in terms of \(\{z_k,\theta _k\}_{k=1}^{K}\). These two steps are shown below:

*First step:* After rearranging terms and completing squares, we get the following dual objective to be minimized over dual variables \(\gamma \) and \(\{z_k,\theta _k\}_{k=1}^{K}\).

*Second step:* Since \(\min _{x,y}f(x,y) = \min _x(\min _y f(x,y))\) when \(f(x,y)\) is convex and the feasible set is convex, we now minimize with respect to \(\gamma \) to get the following upper bound:

## 6 Conclusion

In this paper, we have outlined how various side information about a learning problem can effectively help in generalization. We focused our attention on several types of side information, leading to linear, polygonal, quadratic and conic constraints, giving motivating examples and deriving complexity measure bounds. This work goes beyond the traditional paradigm of ball-like hypothesis spaces to study more exotic, yet realistic, hypothesis spaces, and is a starting point for more work on other interesting hypothesis spaces.

Note that this notation is not the usual notation where observations \(\tilde{x}_i\)’s are stacked as rows.