1 Introduction

Surely, for many applications, the amount of domain knowledge we could potentially use within our learning processes is vastly larger than the amount we actually use. One reason for this is that domain knowledge may be nontrivial to incorporate into algorithms or analysis. A few types of domain knowledge that do permit analysis have been explored in depth over the past few years and used very successfully in a variety of learning tasks; these include knowledge about the sparsity properties of linear models (\(\ell _{1}\)-norm constraints, minimum description length) and smoothness properties (\(\ell _{2}\)-norm constraints, maximum entropy). Another reason that domain knowledge is not usually incorporated into theoretical analysis is that it can be very problem specific; it may be too specific to the domain to support an overarching theory of interest. For example, researchers in NLP (Natural Language Processing) have long identified exotic, domain-specific knowledge that can be exploited while performing a learning task (Chang et al. 2008a, b). The present work aims to provide theoretical guarantees for a large class of problems with a general type of domain knowledge that goes beyond sparsity and smoothness.

To define this large class of problems, we keep the usual supervised learning assumption that the training examples are drawn i.i.d. Additionally, in our setting we have a different set of examples without labels, not necessarily chosen randomly. For this set of unlabeled examples, we have some prior knowledge about the relationships between their labels, which restricts the space of hypotheses we search over within our learning algorithms. We motivate this knowledge as being obtained from domain experts. These assumptions can, for example, take into account our partial knowledge about how any learned model should predict on the unlabeled examples if they were encountered. We consider several types of side knowledge, namely constraints on the unlabeled examples leading to (1) linear constraints on a linear function class, (2) quadratic constraints on a linear function class, and (3) conic constraints on a linear function class. Our main contributions are:

  • To show that linear, polygonal, quadratic and conic constraints on a linear hypothesis space can arise naturally in many circumstances, from constraints on a set of unlabeled examples. This is in Sect. 2. We connect these with relevant semi-supervised learning settings.

  • To provide upper bounds on covering number and empirical Rademacher complexity for linearly constrained linear function classes. Bounds for the case of linear and polygonal constraints are found in Sects. 3.3 and 3.4 respectively. Two of the three bounds in these sections are not original to this paper, but their application to general side knowledge with linear constraints is novel.

  • To provide two upper bounds on the complexity of the hypothesis space for the quadratic constraint case. These can be used directly in generalization bounds. The use of a certain family of circumscribing ellipsoids and the quadratic bounds of Sect. 3.5 are novel to this paper.

  • To show that one of the upper bounds on the quadratically constrained hypothesis space we provided has a matching lower bound, also in Sect. 3.5. This is novel to this paper.

  • To provide a bound on the complexity of the hypothesis space for the conic constraint case. This bound is in Sect. 3.7 and is novel to this paper.

  • To develop a novel proof technique, based on convex duality, for upper bounding the complexity of the hypothesis space in the linear, quadratic and conic constraint cases.

Figure 1 illustrates the various types of side knowledge.

Fig. 1

This figure illustrates constraints on our hypothesis space. These constraints arise from side knowledge available about a set of unlabeled examples. The \(\ell _2\) balls in (a), (b), (c) and (d) represent coefficients of linear functions in two dimensions. (a) and (b) represent the intersection of a ball with one or several half spaces. Theorems 1 and 2 and Proposition 1 analyze these situations. (c) shows the intersection of a ball and an ellipsoid. Theorems 4, 5 and 6 correspond to this setting. (d) shows the intersection of a ball with a second order cone. Theorem 7 corresponds to this setting.

Side knowledge can be particularly helpful in cases where data are scarce; these are precisely circumstances when data themselves cannot fully define the predictive model, and thus domain knowledge can make an impact in predictive accuracy. That said, for any type of side knowledge (sparsity, smoothness, and the side knowledge considered here), the examples and hypothesis space may not conform in reality to the side knowledge. (Similarly, the training data may not be truly random in practice.) However, if they do, we can claim lower sample complexities, and potentially improve our model selection efforts. Thus, we cannot claim that our side knowledge is always true knowledge, but we can claim that if it is true, we are able to gain some benefit in learning.

1.1 Motivating examples

Fung et al. (2002) added multiple linear constraints (polygonal constraints) to a specific ERM algorithm, the linear SVM, as a way to incorporate prior knowledge. They investigated the effect of using this type of prior knowledge for classification on a DNA promoter recognition dataset (Towell et al. 1990). In this classification task, the linear constraints result from precomputed rules that are separate from the training data (this is similar to our polygonal setting where constraints are generated from knowledge about the unlabeled examples). The “leave-one-out” error from the 1-norm SVM with the additional constraints was less than that of the plain 1-norm SVM and other training-data-based classifiers such as decision trees and neural networks. This and other types of knowledge incorporation in SVMs are reviewed by Lauer and Bloch (2008) and also Le et al. (2006).

James et al. (2014) motivated the use of linear constraints with LASSO, which is also an ERM procedure. In their experiment, they estimated a demand probability function using an on-line auto lending dataset. They ensured monotonicity of the demand function by applying a set of linear constraints (similar to the poset constraints in Sect. 2.1) and compared the output to two other methods: logistic regression and the unconstrained LASSO, both of which output non-monotonic demand probability curves.

Nguyen and Caruana (2008a) considered additional unlabeled examples whose labels are partially known. In particular, they worked on a type of multi-class classification task where they know that the label of each unlabeled example belongs to a known subset of the set of all class labels. This knowledge about the unlabeled examples translates into multiple linear constraints (polygonal constraints). They provided experimental results on five datasets showing improvements over multi-class SVMs.

Gómez-Chova et al. (2008) implemented a technique (known as LapSVMs) that uses Laplacian regularization augmented with standard SVMs for two image classification tasks related to urban monitoring and cloud screening (which are both remote sensing tasks). Laplacian regularization means that the regularization term is a quadratic function of the model, derived from a set of unlabeled examples, like our quadratic setting (see Sect. 2.2). In both tasks, the Laplacian-regularized linear SVMs outperformed the standard SVMs in terms of overall accuracy (these improvements are of the order of 2–3 % in both cases).

Shivaswamy et al. (2006) formulated robust classification and regression problems as described in Sect. 2.3 leading to conic constraints on the model class. For classification, they used the OCR, Heart, Ionosphere and Sonar datasets from the UCI repository to illustrate the effect of missing values and how robust SVM classification (which introduces second order conic constraints) provides better classification accuracy than the standard SVM classifier after imputation. For regression, they showed improvements in prediction accuracy of a robust version of SVR (again introducing conic constraints on the hypothesis space) as compared to a standard SVR trained after imputation on the Boston housing dataset (also from the UCI repository).

Finally, “Appendix” also provides experimental results showing the advantage of using side knowledge in a ridge regression problem.

2 Linear, polygonal, quadratic and conic constraints

We are given a training sample \(S\) of \(n\) examples \(\{(x_{i},y_{i})\}_{i=1}^{n}\), with each observation \(x_{i}\) belonging to a set \({\mathcal {X}}\) in \(\mathbb {R}^{p}\). Let the label \(y_{i}\) belong to a set \({\mathcal {Y}}\) in \(\mathbb {R}\). In addition, we are given a set of \(m\) unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\). We are not given the true labels \(\{\tilde{y}_{i}\}_{i=1}^{m}\) for these observations. Let \({\mathcal {F}}\) be the function class (set of hypotheses) of interest, from which we want to choose a function \(f\) to predict the labels of future unseen observations. We take \({\mathcal {F}}\) to be linear, parameterized by a coefficient vector \(\beta \); its description will change based on the constraints we place on \(\beta \).

Consider the empirical risk minimization problem: \(\min _{f \in {\mathcal {F}}} \frac{1}{n}\sum _{i=1}^{n}{l(f(x_{i}),y_{i})}\). Here the loss function is a Lipschitz continuous function such as the squared, exponential or hinge loss among others. This supervised learning setup encompasses both supervised classification (\({\mathcal {Y}}\) is a discrete set) and regression (\({\mathcal {Y}}\) is equal to \(\mathbb {R}\)). Regularization on \(f\) acts to enforce assumptions that the true model comes from a restricted class, so that \({\mathcal {F}}\) is now defined as

$$\begin{aligned} \{ f | f:{\mathcal {X}}\mapsto {\mathcal {Y}}, f(x) = \beta ^{T}x, R_{l}(f) \le c_{l} \,\mathrm{ for } \, l=1,...,L \}, \end{aligned}$$

where \(()^{T}\) represents the transpose operation. Here we have appended \(L\) additional constraints for regularization to the description of the hypothesis set \({\mathcal {F}}\). Side knowledge can be very powerful in reducing the size of \({\mathcal {F}}\), especially when the training set is small; in particular, if the constants \(\{c_{l}\}_{l=1}^{L}\) are small, the size of \({\mathcal {F}}\) can be reduced substantially.

2.1 Assumptions leading to linear and polygonal constraints

We will provide three settings to demonstrate that linear constraints arise in a variety of natural settings: poset, must-link, and sparsity on \(\{\tilde{y}_{i}\}_{i=1}^{m}\). In all three, we will include standard regularization of the form \(\Vert \beta \Vert _q\le c_1\) by default.

2.1.1 Poset

Partial order information about the labels \(\{\tilde{y}_{i}\}_{i=1}^{m}\) can be captured via the following constraints: \(f(\tilde{x}_{i}) \le f(\tilde{x}_{j}) + c_{i,j}\) for any collection of pairs \((i,j) \in [1,...,m]\times [1,...,m]\). This gives us up to \(m^2\) constraints of the form \(\beta ^{T}(\tilde{x}_{i} - \tilde{x}_{j}) \le c_{i,j}.\) \({\mathcal {F}}\) can be described as: \({\mathcal {F}}:=\{ f | f(x) = \beta ^{T}x, \Vert \beta \Vert _{q} \le c_{1}, \beta ^{T}(\tilde{x}_{i} - \tilde{x}_{j}) \le c_{i,j}, \forall (i,j) \in E\}\), where \(E\) is the set of pairs of indices of unlabeled data that are constrained.
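As a concrete illustration, the following sketch (with hypothetical unlabeled data and slack constants, and numpy as the assumed tooling) builds the rows of the linear constraint system \(G\beta \le h\) implied by a set of poset pairs.

```python
import numpy as np

# Minimal sketch, assuming hypothetical unlabeled examples and slack constants:
# encode f(x~_i) <= f(x~_j) + c_ij as rows of G beta <= h.
rng = np.random.default_rng(0)
p, m = 5, 4
X_unlabeled = rng.normal(size=(m, p))     # m unlabeled examples x~_i as rows
E = [(0, 1), (2, 3)]                      # ordered pairs (i, j) supplied by a domain expert
c = {(0, 1): 0.0, (2, 3): 0.5}            # slack constants c_ij

G = np.array([X_unlabeled[i] - X_unlabeled[j] for (i, j) in E])
h = np.array([c[e] for e in E])
# The constrained hypothesis set is {beta : ||beta||_q <= c_1, G @ beta <= h}.
```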

2.1.2 Must-link

Here we bound the absolute difference of labels between pairs of unlabeled examples: \( |f(\tilde{x}_{i}) - f(\tilde{x}_{j})| \le c_{i,j}\). This captures knowledge about the nearness of the labels. This leads to two linear constraints: \(-c_{i,j} \le \beta ^{T}(\tilde{x}_{i}-\tilde{x}_{j}) \le c_{i,j}.\) These constraints have been used extensively within the semi-supervised (Zhu 2005) and constrained clustering settings (Lu and Leen 2004, Basu et al. 2006) as must-link or ‘in equivalence’ constraints. For must-link constraints, \({\mathcal {F}}\) is defined as: \( {\mathcal {F}}:=\{ f | f(x) = \beta ^{T}x, \Vert \beta \Vert _{q} \le c_{1}, -c_{i,j} \le \beta ^{T}(\tilde{x}_{i}-\tilde{x}_{j}) \le c_{i,j}, \forall (i,j) \in E\}\), where \(E\) is again the set of pairs of indices of unlabeled data that are constrained.

2.1.3 Sparsity and its variants on a subset of \(\{\tilde{y}_{i}\}_{i=1}^{m}\)

Similar to sparsity assumptions on \(\beta \), here we want only a small number of labels to be nonzero among a set of unlabeled examples. In particular, we want to bound the cardinality of the support of the vector \([\tilde{y}_{{1}} \ldots \tilde{y}_{{|{\mathcal {I}}|}}]\) for some index set \({\mathcal {I}} \subset \{1,...,m\}\). Such a constraint is nonlinear. Nonetheless, a convex constraint of the form \(\Vert [\tilde{y}_{{1}} \ldots \tilde{y}_{{|{\mathcal {I}}|}}]\Vert _{1} \le c_{{\mathcal {I}}} \) (equivalent to \(2^{|{\mathcal {I}}|}\) linear constraints) can be used as a proxy to encourage sparsity. The function class is defined as: \( {\mathcal {F}}:=\{ f | f(x) = \beta ^{T}x, \Vert \beta \Vert _{q} \le c_{1}, \Vert [\beta ^T\tilde{x}_{{1}} \ldots \beta ^T\tilde{x}_{{|{{\mathcal {I}}}|}}]\Vert _{1} \le c_{{\mathcal {I}}}\}\). A similar constraint can be obtained if we instead had partial information with respect to the dual norm: \(\Vert [\tilde{y}_{{1}} \ldots \tilde{y}_{{ |{\mathcal {I}}| }}]\Vert _{\infty } \le c_{{\mathcal {I}}}\).

2.2 Assumptions leading to quadratic constraints

We will provide several settings to show that quadratic constraints arise naturally.

2.2.1 Must-link

A constraint of the form \((f(\tilde{x}_{i}) - f(\tilde{x}_{j}))^{2} \le c_{i,j}\) can be written as \( 0 \le \beta ^{T}A \beta \le c_{i,j}\) with \(A = (\tilde{x}_{i}-\tilde{x}_{j})(\tilde{x}_{i}-\tilde{x}_{j})^T\). Here \(A\) is rank-deficient as it is an outer product, which leads to an unbounded (degenerate) ellipsoid; however, its intersection with a full ellipsoid (for instance, an \(\ell _{2}\)-norm ball) is bounded and indeed can be a restricted hypothesis set. Set \({\mathcal {F}}\) is defined by: \({\mathcal {F}}= \{\beta : \beta ^{T}\beta \le c_{1}, \beta ^{T} (\tilde{x}_{i}-\tilde{x}_{j})(\tilde{x}_{i}-\tilde{x}_{j})^T \beta \le c_{i,j}; (i,j) \in E\}\), where \(E\) is again the set of pairs of indices of unlabeled data that are constrained.

2.2.2 Constraining label values for a pair of examples

We can define the following relationship between the labels of two unlabeled examples using quadratic constraints: if one of them is large in magnitude, the other is necessarily small. This can be encoded using the inequality: \(f(\tilde{x}_{i})\cdot f(\tilde{x}_{j}) \le c_{i,j}\). If \(f(x) \in {\mathcal {Y}}\subset \mathbb {R}_{+}\), then \(f(\tilde{x}_{i})\cdot f(\tilde{x}_{j}) \le c_{i,j}\) gives the following quadratic constraint on \(\beta \) with the associated rank 1 matrix being \(A = \tilde{x}_{i}\tilde{x}_{j}^{T}\): \(\beta ^T A\beta \le c_{i,j}.\) This is not quite an ellipsoidal constraint yet because matrices associated with ellipsoids are symmetric positive semidefinite. Matrix \(A\) on the other hand is not symmetric. Nonetheless, the quadratic constraint remains intact when we replace matrix \(A\) with the symmetric matrix \(\frac{1}{2}(A + A^{T})\). If in addition, the symmetric matrix is also positive-definite (which can be verified easily), then this leads to an ellipsoidal constraint. The hypothesis space \({\mathcal {F}}\) becomes: \({\mathcal {F}}= \left\{ \beta : \beta ^{T}\beta \le c_{1}, \beta ^{T} \tilde{x}_{i}\tilde{x}_{j}^{T}\beta \le c_{i,j}; (i,j) \in E \right\} .\)

2.2.3 Energy of estimated labels

We can place an upper bound constraint on the sum of squares (the “energy”) of the predictions, which is: \(||{X}_{U}^{T}\beta ||_{2}^{2} = \sum _{i}(\beta ^{T}\tilde{x}_{i})^{2} = \beta ^T(\sum _{i}\tilde{x}_{i}\tilde{x}_{i}^{T})\beta \), where \(X_{U}\) is a \(p \times m\) dimensional matrix with the \(\tilde{x}_i\)’s as its columns. The set \({\mathcal {F}}\) is \({\mathcal {F}}= \left\{ \beta : \beta ^{T}\beta \le c_{1}, ||{X}_{U}^{T}\beta ||_{2}^{2} \le c \right\} \). Extensions like the use of a Mahalanobis distance or having the norm act on only a subset of the estimates of \(\{\tilde{y}_{i}\}_{i=1}^{m}\) follow accordingly.

2.2.4 Smoothness and other constraints on \(\{\tilde{y}_{i}\}_{i=1}^{m}\)

Consider the general ellipsoid constraint \(\Vert \Gamma {X}_{U}^{T}\beta \Vert _{2}^{2} \le c\), where we have added an additional transformation matrix \(\Gamma \) in front of \({X}_{U}^{T}\beta \). If \(\Gamma \) is set to the identity matrix, we get the energy constraint previously discussed. If \(\Gamma \) is a banded matrix with \(\Gamma _{i,i} = 1\) and \(\Gamma _{i,i+1} = -1\) for \(i=1,...,m-1\) and remaining entries zero, then we are encoding the side knowledge that the labels of the unlabeled examples vary smoothly: we are encouraging unlabeled examples with neighboring indices to have similar predicted values. This matrix \(\Gamma \) is an instance of a difference operator in the numerical analysis literature. In this context, banded matrices like \(\Gamma \) model discrete derivatives. By including this type of constraint, problems with identifiability and ill-posedness of an optimal solution \(\beta \) are alleviated. That is, as with Tikhonov regularization on \(\beta \) in least squares regression, constraints derived from matrices like \(\Gamma \) reduce the condition number. The set \({\mathcal {F}}\) is defined as: \({\mathcal {F}}= \left\{ \beta : \beta ^{T}\beta \le c_{1}, \Vert \Gamma {X}_{U}^{T}\beta \Vert _{2}^{2} \le c \right\} .\)
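As a small illustration of the difference-operator construction above, the following sketch (assuming numpy and an \((m-1)\times m\) shape for \(\Gamma \), with hypothetical unlabeled data) builds the banded matrix and the resulting quadratic constraint matrix.

```python
import numpy as np

# Minimal sketch: a first-difference operator Gamma encoding the smoothness side
# knowledge on the predicted labels X_U^T beta (shapes and data are illustrative assumptions).
m = 6
Gamma = np.zeros((m - 1, m))
for i in range(m - 1):
    Gamma[i, i] = 1.0        # Gamma_{i,i} = 1
    Gamma[i, i + 1] = -1.0   # Gamma_{i,i+1} = -1

rng = np.random.default_rng(0)
X_U = rng.normal(size=(5, m))             # p x m matrix of unlabeled examples (hypothetical)
Q = X_U @ Gamma.T @ Gamma @ X_U.T         # ||Gamma X_U^T beta||_2^2 = beta^T Q beta <= c
```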

2.2.5 Graph based methods

Some graph regularization methods such as manifold regularization (Belkin and Niyogi 2004) also encode information about the labels of the unlabeled data. They also lead to convex quadratic constraints on \(\beta \). Here, along with the unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\), our side knowledge consists of an \(m\)-node weighted graph \(G = (V,E)\) with the Laplacian matrix \(L_{G} = D - A\). Here, \(D\) is an \(m\times m\)-dimensional diagonal matrix with the diagonal entry for each node equal to the sum of weights of the edges connecting it. Further, \(A\) is the adjacency matrix containing the edge weights \(a_{ij}\), where \(a_{ij} = 0\) if \((i,j) \notin E\) and \(a_{ij} = e^{-c\Vert \tilde{x}_{i}-\tilde{x}_{j}\Vert _{q}}\) if \((i,j) \in E\) (other choices for the weights are also possible). The quadratic function \(({X}_{U}^{T}\beta )^{T} L_{G}({X}_{U}^{T}\beta )\) is then twice the sum over all edges, of the weighted squared difference between the two node labels corresponding to the edge: \(2\sum _{(i,j) \in E}a_{ij}\left( f(\tilde{x}_{i}) - f(\tilde{x}_{j})\right) ^{2}.\) Intuitively, if we have the side knowledge that this quantity is small, it means that a node should have similar labels to its neighbors. For classification, this typically encourages the decision boundary to avoid dense regions of the graph. The set \({\mathcal {F}}\) is defined as: \({\mathcal {F}}= \{\beta : \beta ^{T}\beta \le c_{1}, \beta ^{T}{X}_{U}L_{G}{X}_{U}^{T}\beta \le c\}\).
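The sketch below (hypothetical unlabeled examples, edge set and weight constant; the \(\ell _2\) norm is assumed for the weights) shows how the Laplacian quadratic constraint matrix can be assembled.

```python
import numpy as np

# Minimal sketch, assuming hypothetical data, q = 2 and a weight constant cw:
# build the graph Laplacian side-knowledge constraint beta^T (X_U L_G X_U^T) beta <= c.
rng = np.random.default_rng(0)
p, m = 5, 4
X_U = rng.normal(size=(p, m))             # unlabeled examples as columns
E = [(0, 1), (1, 2), (2, 3)]              # edges of the side-knowledge graph
cw = 1.0

A = np.zeros((m, m))
for i, j in E:
    A[i, j] = A[j, i] = np.exp(-cw * np.linalg.norm(X_U[:, i] - X_U[:, j]))
D = np.diag(A.sum(axis=1))                # degree matrix
L_G = D - A                               # graph Laplacian
Q = X_U @ L_G @ X_U.T                     # quadratic constraint matrix on beta
```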

2.3 Assumptions leading to conic constraints

We provide two scenarios that naturally lead to conic constraints on the model class: robustness against uncertainty and stochastic constraints.

2.3.1 Robustness against uncertainty in linear constraints

Consider any of the linear constraints considered in Sect. 2.1. All of these can be generically represented as \(\{a_k^T \beta \le 1\;\; \forall k=1,...,K\}\), where for each \(k\), \(a_k\) is a function of the unlabeled sample \(\{\tilde{x}_j\}_{j=1}^{m}\) (for instance, \(a_k \propto \tilde{x}_i - \tilde{x}_j\) for poset constraints). Further assume that each \(a_k\) is only known to belong to an ellipsoid \(\varXi _{k} = \{\overline{a}_k + A_ku: u^Tu \le 1\}\) with both parameters \(\overline{a}_k\) and \(A_k\) known. This can happen due to measurement limitations, noise and other factors. We want to guarantee that, irrespective of the true value of \(a_{k} \in \varXi _k\), we still have \(a_k^T \beta \le 1\).

Borrowing a trick used in the robust linear programming literature (Lanckriet et al. 2003), we can encode the above requirement succinctly by noting that \(\sup _{u: u^Tu \le 1}(\overline{a}_k + A_ku)^T\beta = \overline{a}_k^T\beta + \Vert A_k^T\beta \Vert _2\), which gives:

$$\begin{aligned} \overline{a}_k^T \beta + \Vert A_k^T \beta \Vert _2 \le 1, \forall k=1,\ldots ,K \end{aligned}$$

which is a set of second-order cone constraints. The feasible set becomes smaller when the linear constraints \(\{a_k^T \beta \le 1\; \forall k=1,\ldots ,K\}\) are replaced with the conic constraints above.
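The sketch below (hypothetical \(\overline{a}_k\), \(A_k\) and \(\beta \); numpy assumed) numerically checks the identity behind this reformulation: the worst case of \(a_k^T\beta \) over the ellipsoid equals \(\overline{a}_k^T\beta + \Vert A_k^T\beta \Vert _2\).

```python
import numpy as np

# Minimal sketch with hypothetical parameters: the sampled worst case of a_k^T beta over
# the ellipsoid {abar_k + A_k u : ||u||_2 <= 1} approaches abar_k^T beta + ||A_k^T beta||_2.
rng = np.random.default_rng(0)
p = 4
abar_k, A_k = rng.normal(size=p), rng.normal(size=(p, p))
beta = rng.normal(size=p)

u = rng.normal(size=(100000, p))
u /= np.linalg.norm(u, axis=1, keepdims=True)      # directions on the unit sphere
worst_sampled = np.max((abar_k + u @ A_k.T) @ beta)
closed_form = abar_k @ beta + np.linalg.norm(A_k.T @ beta)
print(worst_sampled, closed_form)                  # sampled maximum approaches the closed form from below
```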

2.3.2 Stochastic programming

Consider a probabilistic constraint of the form \(\mathbb {P}_{a_k}(a_k^T\beta \le 1) \ge \eta _k\), where \(a_k\) is now considered a random vector. The motivation for \(a_k\) is the same as before (see Sect. 2.1). If we know that \(a_k\) is normally distributed (for instance, due to additive noise) with mean \(\overline{a}_k\) and covariance matrix \(B_k\), then the probabilistic constraint is the same as: \(\overline{a}_k^T\beta + \varPhi ^{-1}(\eta _k)\Vert B_k^{1/2}\beta \Vert _2 \le 1\), where \(\varPhi ^{-1}()\) is the inverse of the standard normal cumulative distribution function. To see this, let \(u_k = a_k^T\beta \) be a scalar random variable with mean \(\overline{u}_k\) and variance \(\sigma _k^2\) (this is equal to \(\beta ^T B_k\beta \)). Then, our original constraint can be written as \(\mathbb {P}\left( \frac{u_k-\overline{u}_k}{\sigma _k} \le \frac{1-\overline{u}_k}{\sigma _k}\right) \ge \eta _k\). Since \(\frac{u_k-\overline{u}_k}{\sigma _k}\sim \mathcal {N}(0,1)\), we can rewrite our constraint as: \(\varPhi \left( \frac{1-\overline{u}_k}{\sigma _k}\right) \ge \eta _k\), where \(\varPhi (z)\) is the cumulative distribution function for the standard normal. Further, \(\varPhi \Big (\frac{1-\overline{u}_k}{\sigma _k}\Big ) \ge \eta _k\) implies \( \frac{1-\overline{u}_k}{\sigma _k} \ge \varPhi ^{-1}(\eta _k)\). Rearranging terms, we get \(\overline{u}_k + \varPhi ^{-1}(\eta _k)\sigma _k \le 1\). Finally, substituting the values for \(\overline{u}_k\) and \(\sigma _k\) gives us the following constraint:

$$\begin{aligned} \overline{a}_k^T\beta + \varPhi ^{-1}(\eta _k)\Vert B_k^{1/2}\beta \Vert _2 \le 1, \end{aligned}$$

which is a second order conic constraint (Lobo et al. 1998).
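As a sanity check on this reformulation, the sketch below (hypothetical mean, covariance and candidate \(\beta \); scipy assumed) compares the second order cone condition with a Monte Carlo estimate of the chance constraint.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch with hypothetical numbers: P(a_k^T beta <= 1) >= eta_k holds exactly when
# abar_k^T beta + Phi^{-1}(eta_k) * ||B_k^{1/2} beta||_2 <= 1 (Phi^{-1} is norm.ppf).
rng = np.random.default_rng(0)
p, eta_k = 3, 0.95
abar_k = np.array([0.2, 0.1, 0.3])
B_k = 0.01 * np.eye(p)                      # covariance of the random vector a_k
beta = np.array([1.0, 0.5, 0.5])            # a candidate model

lhs = abar_k @ beta + norm.ppf(eta_k) * np.sqrt(beta @ B_k @ beta)
samples = rng.multivariate_normal(abar_k, B_k, size=200000)
print(lhs <= 1, np.mean(samples @ beta <= 1) >= eta_k)   # both checks should agree
```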

Remark 1

A question of practical interest is how to impose the constraints seen in Sects. 2.1, 2.2 and 2.3 in a computationally efficient manner. Fortunately, for all the cases we have considered thus far, the side knowledge can be encoded as a set of convex constraints, leading to efficient algorithms (if the original empirical risk minimization problem is convex). Further, note that unlike must-link and similarity side knowledge, which lead to convex constraints, cannot-link and dissimilarity knowledge is relatively harder to impose and typically leads to non-convex constraints.
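To make the computational point concrete, here is a minimal sketch (synthetic data, squared loss, and cvxpy as an assumed off-the-shelf convex solver interface) of ERM with a few of the convex side-knowledge constraints from Sects. 2.1 and 2.2 appended.

```python
import cvxpy as cp
import numpy as np

# Minimal sketch with synthetic data: convex ERM over a linear class with side-knowledge
# constraints appended (standard l2 regularization, one poset constraint, one energy constraint).
rng = np.random.default_rng(0)
n, m, p = 30, 10, 5
X_L, y = rng.normal(size=(n, p)), rng.normal(size=n)   # labeled sample
X_U = rng.normal(size=(m, p))                          # unlabeled examples (rows)

beta = cp.Variable(p)
constraints = [
    cp.norm(beta, 2) <= 1.0,                           # ||beta||_2 <= c_1
    (X_U[0] - X_U[1]) @ beta <= 0.0,                   # poset: f(x~_0) <= f(x~_1)
    cp.sum_squares(X_U @ beta) <= 5.0,                 # energy of estimated labels
]
risk = cp.sum_squares(X_L @ beta - y) / n              # squared-loss empirical risk
cp.Problem(cp.Minimize(risk), constraints).solve()
print(beta.value)
```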

3 Generalization bounds

In each of the scenarios considered in Sect. 2, we are essentially given \(m\) unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\), subsets of which satisfy various properties or side knowledge (for instance, linear ordering, quadratic neighborhood similarity, etc.). This side knowledge, as shown, constrains the hypothesis space in various ways. In this section, we will attempt to answer the following statistical question: what effect do these constraints have on the generalization ability of the learned model? We will compute bounds on the complexity of the hypothesis space when the types of constraints seen in Sect. 2 are included.

3.1 Definition of complexity measures

We will look at two complexity measures: the covering number of a hypothesis set and the Rademacher complexity of a hypothesis set. Their definitions are as follows:

Definition 1

Covering Number (Kolmogorov and Tikhomirov 1959): Let \((\varOmega , \rho )\) be a (pseudo-)metric space and \(A \subseteq \varOmega \) an arbitrary set. Let \(|\cdot |\) denote set size. For any \(\epsilon > 0\), an \(\epsilon \)-cover for \(A\) is a finite set \(U \subseteq \varOmega \) (not necessarily \( \subseteq A\)) s.t. \( \forall \omega \in A, \exists u \in U\) with \(\rho (\omega , u) \le \epsilon \). The covering number of \(A\) is \(N(\epsilon ,A,\rho ) := \inf _{U} |U|\), where the infimum is over all \(\epsilon \)-covers \(U\) of \(A\).

Definition 2

Rademacher Complexity (Bartlett and Mendelson 2002): Given a training sample \(S = \{x_{1},...,x_{n}\}\), with each \(x_i\) drawn i.i.d. from \(\mu _{\mathcal {X}}\), and a hypothesis space \({\mathcal {F}}\), let \({\mathcal {F}}_{|S}\) be defined as the restriction of \({\mathcal {F}}\) with respect to \(S\). The empirical Rademacher complexity of \({\mathcal {F}}_{|S}\) is

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})= \mathbb {E}_{\sigma }\left[ \sup _{f \in {\mathcal {F}}} \frac{1}{n}{\left| \sum _{i=1}^{n}\sigma _{i}f(x_{i}) \right| } \right] \end{aligned}$$

where \(\{\sigma _i\}\) are Rademacher random variables (\(\sigma _i = 1\) with probability \(1/2\) and \(\sigma _i =-1\) with probability \(1/2\)). The Rademacher complexity of \({\mathcal {F}}\) is its expectation:

$$\begin{aligned} {\mathcal {R}}({\mathcal {F}}) = \mathbb {E}_{S\sim (\mu _{\mathcal {X}})^{n}}[\mathcal {\bar{R}}({\mathcal {F}}_{|S})]. \end{aligned}$$

If instead we let \(\sigma _{i} \sim \mathcal {N}(0,1)\) in the definition, we obtain the Gaussian complexity of the function class. Generalization bounds often use both these quantities in their statements (Bartlett and Mendelson 2002). Unless otherwise specified, the feature vectors in the feature space \({\mathcal {X}}\) are bounded in norm by a constant \(X_b > 0\), and the coefficient vectors of the linear function class \({\mathcal {F}}\) are bounded in norm by a constant \(B_b > 0\).

3.2 Complexity measures within generalization bounds

Given these definitions, a generalization bound statement can be written as follows (Bartlett and Mendelson 2002): With probability at least \(1-\delta \) over the training sample \(S\),

$$\begin{aligned} \forall f \in {\mathcal {F}},\;\; \mathbb {E}_{x,y}[l(f(x),y)] \le \frac{1}{n}\sum _{i=1}^{n}l(f(x_i),y_i) + 4\mathcal {L}\mathcal {\bar{R}}({\mathcal {F}}_{|S})+ O\left( \sqrt{\frac{\log \frac{1}{\delta }}{2n}}\right) , \end{aligned}$$

where \(\mathcal {L}\) is the Lipschitz constant of the loss function \(l\). A relation between the empirical Rademacher complexity and covering number can be used to state the above uniform convergence statement in terms of the covering number. The relation (also known as Dudley’s entropy integral) is (Talagrand 2005):

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le c\int _{0}^{\infty }\sqrt{\frac{\log N(\sqrt{n}\epsilon ,{\mathcal {F}}_{|S},\Vert \cdot \Vert _2)}{n}}d\epsilon , \end{aligned}$$

where \({\mathcal {F}}_{|S} = \{(f(x_{1}),\ldots ,f(x_{n})): f \in {\mathcal {F}}\}\) and \(c\) is a constant. Thus, we study upper bounds for covering numbers and empirical Rademacher complexities interchangeably through the rest of the paper.

3.3 Complexity results with a single linear constraint

We state two results: the first is based on volumetric arguments and bounds the covering number and the second is based on convex duality and bounds the empirical Rademacher complexity. The first is a result from Tulabandhula and Rudin (2014) while the second is new to this paper.

Volumetric upper bound on the covering number: Tulabandhula and Rudin (2014) analyzed the setting where a bounded linear function class is further constrained by a half space. The motivation there was to study a specific type of side knowledge, namely knowledge about the cost to solve a decision problem associated with the learning problem. The result there extends well beyond operational costs and is applicable to our setting, where we have an \(\ell _2\)-bounded linear function class with a single half space constraint.

Theorem 1

(Theorem 2 of Tulabandhula and Rudin 2014) Let \({\mathcal {X}}= \{x \in \mathbb {R}^{p}: \Vert x\Vert _{2} \le X_{b}\}\) with \(X_b > 0\), and let \(\mu _{{\mathcal {X}}}\) be the marginal probability measure on \({\mathcal {X}}\). Let

$$\begin{aligned} {\mathcal {F}}= \left\{ f | f:{\mathcal {X}}\mapsto {\mathcal {Y}}, f(x) = \beta ^{T}x, \Vert \beta \Vert _{2} \le B_{b},\; a^{T}\beta \le 1 \right\} , \end{aligned}$$

with \(B_b > 0\). Let \({\mathcal {F}}_{|S} = \{(f(x_{1}),\ldots ,f(x_{n})): f \in {\mathcal {F}}\}\). Then for all \(\epsilon > 0\), for any sample \(S\),

$$\begin{aligned} {N(\sqrt{n}\epsilon ,{\mathcal {F}}_{|S}, \Vert \cdot \Vert _{2}) \le \alpha (p,a,\epsilon )\left( \frac{2B_{b}X_b}{\epsilon } + 1\right) ^{p} .} \end{aligned}$$

Also, defining \(r = B_{b} + \frac{\epsilon }{2X_b}\) and \(V_{p}(r) = \frac{\pi ^{p/2}}{\Gamma (p/2 + 1)}r^{p}\), the function \(\alpha \) above is:

$$\begin{aligned}&\alpha (p,a,\epsilon ) \\&\quad =1 - \frac{1}{V_{p}(r)} \int _{\theta = \cos ^{-1}\left( \frac{\Vert a\Vert _{2}^{-1} + \frac{\epsilon }{2X_b}}{r}\right) }^{0}V_{p-1}(r\sin \theta )d(r\cos \theta ). \end{aligned}$$

Intuition: The function \(\alpha (p,a,\epsilon )\) can be considered to be the normalized volume of the ball (which is 1) minus the portion that is the spherical cap cut off by the linear constraint. It comes directly from formulae for the volume of spherical caps. We are integrating over the volume of a \(p-1\) dimensional sphere of radius \(r\sin \theta \) and the height term is \(d(r\cos \theta )\).

This bound shows that the covering number bound can depend on \(a\), which is a direct function of the unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\). As the norm \(\Vert a\Vert _2\) increases, \(\Vert a\Vert _2^{-1}\) decreases, thus \(\alpha (p,a,\epsilon )\) decreases, and the whole bound decreases. This is a mechanism by which side information on the labels of the unlabeled examples influences the complexity measure of the hypothesis set, potentially improving generalization.

Relation to standard results: It is known (Kolmogorov and Tikhomirov 1959) that set \(\mathcal {B} = \{\beta : \Vert \beta \Vert _{2} \le B_{b}\}\) (with \(B_b > 0\) being a fixed constant as before) has a bound on its covering number of the form \(N(\epsilon ,\mathcal {B},\Vert \cdot \Vert _{2}) \le \left( \frac{2B_{b}}{\epsilon } + 1\right) ^{p}\). Since in Theorem 1 the same term appears, multiplied by a factor that is at most one and that can be substantially less than one, the bound in Theorem 1 can be tighter.
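To see this effect numerically, the sketch below (scipy assumed; illustrative values of \(p\), \(\epsilon \), \(B_b\) and \(X_b\)) evaluates \(\alpha (p,a,\epsilon )\) via the spherical cap integral, rewritten in terms of the cap height \(h = r\cos \theta \), and shows it shrinking as \(\Vert a\Vert _2\) grows.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

# Minimal sketch with illustrative constants: alpha(p, a, eps) from Theorem 1, computed by
# integrating the spherical cap volume over heights h in [1/||a||_2 + eps/(2 X_b), r].
def ball_volume(p, r):
    return np.pi ** (p / 2) / gamma(p / 2 + 1) * r ** p

def alpha(p, a_norm, eps, B_b=1.0, X_b=1.0):
    r = B_b + eps / (2 * X_b)
    h0 = 1.0 / a_norm + eps / (2 * X_b)
    if h0 >= r:                               # the half space does not cut the inflated ball
        return 1.0
    cap, _ = quad(lambda h: ball_volume(p - 1, np.sqrt(r * r - h * h)), h0, r)
    return 1.0 - cap / ball_volume(p, r)

for a_norm in [1.0, 2.0, 10.0]:               # tighter constraint as ||a||_2 grows
    print(a_norm, alpha(p=5, a_norm=a_norm, eps=0.1))
```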

The above result bounds the covering number complexity for the hypothesis set. Next, we will bound the empirical Rademacher complexity for the same hypothesis set as above.

3.3.1 Convex duality based upper bound on empirical Rademacher complexity

Consider the setting in Theorem 1. Let \(x_i \in {\mathcal {X}}= \{x: \Vert x\Vert _2 \le X_b\}\) for \(i=1,...,n\) as before. Our attempt to use convex duality to upper bound empirical Rademacher complexity yields the following bound.

Proposition 1

Let \({\mathcal {X}}= \{x \in \mathbb {R}^{p}: \Vert x\Vert _{2} \le X_{b}\}\) with \(X_b > 0\) and

$$\begin{aligned} {\mathcal {F}}= \left\{ f | f:{\mathcal {X}}\mapsto {\mathcal {Y}}, f(x) = \beta ^{T}x, \Vert \beta \Vert _{2} \le B_{b}, a^{T}\beta \le 1 \right\} , \end{aligned}$$

with \(B_b > 0\). Then,

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \max \left( \mathbb {E}_{\sigma }\left[ \min _{\eta \ge 0} (B_b\Vert X_L\sigma -\eta a\Vert _2 + \eta ) \right] ,\mathbb {E}_{\sigma }\left[ \min _{\eta \ge 0} (B_b\Vert X_L\sigma +\eta a\Vert _2 + \eta )\right] \right) , \end{aligned}$$

where \(X_L = [x_1\; \ldots \;x_n]\) is a \(p\times n\) dimensional feature matrix and \(\sigma \) is an \(n\times 1\) dimensional vector of Rademacher random variables taking values in \(\{-1,1\}\).

Intuition: We can understand the effect of the linear constraint on the upper bound through the magnitude of vector \(a\). Without loss of generality, let the expectation of the optimal value of the first minimization problem be higher (both minimization problems are structurally similar to each other except for a sign change within the norm term). For a fixed value of \(\sigma \), this minimization problem involves the distance of vector \({X}_{L}\sigma \) to the scaled vector \(a\) in the first term and the scaling factor \(\eta \) itself as the second term.

Thus, generally, if \(\Vert a\Vert _2\) is large, the scaling factor \(\eta \) can be small, resulting in a lower optimal value. We also know that larger \(\Vert a\Vert _2\) corresponds to a tighter half space constraint. Thus, as the linear constraint on the hypothesis space becomes tighter, it makes the optimal solution \(\eta \) and the optimal value smaller for each \(\sigma \) vector. As a result, it tightens the upper bound on the empirical Rademacher complexity.

Relation to standard results: An upper bound on each term of the \(\max \) operation above can be found by setting \(\eta = 0\), which recovers the standard upper bound of \(\frac{B_b\sqrt{\mathrm{trace }(X_L^TX_L)}}{\sqrt{n}}\) or \(\frac{B_bX_b}{\sqrt{n}}\), without capturing the effect of the linear constraint \(a^T\beta \le 1\).
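The following sketch (synthetic data; scipy assumed) estimates by Monte Carlo the inner expectation \(\mathbb {E}_{\sigma }\left[ \min _{\eta \ge 0} (B_b\Vert X_L\sigma -\eta a\Vert _2 + \eta ) \right] \) appearing in Proposition 1 for half space constraints of increasing tightness; it illustrates, but does not prove, the qualitative behavior described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch with synthetic data: the inner expectation of Proposition 1 shrinks
# (weakly) as ||a||_2 grows, i.e., as the half space constraint a^T beta <= 1 tightens.
rng = np.random.default_rng(0)
n, p, B_b = 50, 5, 1.0
X_L = rng.normal(size=(p, n)) / np.sqrt(p)     # columns are the labeled x_i (hypothetical)

def expected_min(a, trials=1000):
    vals = []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        v = X_L @ sigma
        obj = lambda eta: B_b * np.linalg.norm(v - eta * a) + eta
        vals.append(minimize_scalar(obj, bounds=(0.0, 1e4), method="bounded").fun)
    return np.mean(vals)

for scale in [0.1, 1.0, 10.0]:                 # larger scale = larger ||a||_2
    print(scale, expected_min(scale * np.ones(p) / np.sqrt(p)))
```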

3.4 Complexity results with polygonal/multiple linear constraints and general norm constraints

The following result is from Tulabandhula and Rudin (2013), where the authors analyze the effect of decision making bias on generalization of learning. Again, as in the single linear constraint case, the result extends beyond the setting considered in that paper. In particular, it covers all the motivating scenarios described in Sect. 2.1.

Let us define the matrix \([x_{1}\;\ldots \; x_{n}]\) as matrix \({X}_{L}\), where \(x_i \in {\mathcal {X}}= \{x: \Vert x\Vert _r \le X_b\}\) and \(X_b > 0\). Then, \({X}_{L}^T\) can be written as \([h_{1}\; \cdots \; h_{p} ]\) with \(h_{j} \in \mathbb {R}^{n}, j=1,...,p\). Define function class \({\mathcal {F}}\) as

$$\begin{aligned} {\mathcal {F}}&= \Big \{f | f(x) = \beta ^{T}x, \beta \in \mathbb {R}^{p}, \Vert \beta \Vert _q \le B_{b},\\&\sum _{j=1}^{p}c_{j\nu }\beta _{j} +\delta _{\nu } \le 1, \delta _{\nu } > 0, \nu =1,...,V\Big \}, \end{aligned}$$

where \(1/r + 1/q = 1\) and \(\{c_{j\nu }\}_{j,\nu }\), \(\{\delta _{\nu }\}_{\nu }\) and \(B_{b} > 0\) are known constants. In other words, we have \(V\) linear constraints in addition to a \(\ell _q\) norm constraint. As before, let \({\mathcal {F}}_{|S}\) be the restriction of \({\mathcal {F}}\) with respect to \(S\).

Let \(\{\tilde{c}_{j\nu }\}_{j,\nu }\) be proportional to \(\{c_{j\nu }\}_{j,\nu }\) in the following manner:

$$\begin{aligned} \tilde{c}_{j\nu }&:= \frac{c_{j\nu }n^{1/r}X_{b}B_{b}}{\Vert h_{j}\Vert _{r}} \;\;\,\, \forall j=1,...,p \, \mathrm{ and }\, \nu =1,...,V.\\ \end{aligned}$$

Let \(K\) be a positive number. Further, let the sets \(P^{K}\) parameterized by \(K\) and \(P_{c}^{K}\) parameterized by \(K\) and \(\{\tilde{c}_{j\nu }\}_{j,\nu }\) be: \(P^{K} := \left\{ (k_{1},...,k_{p}) \in \mathbb {Z}^{p}: \sum _{j=1}^{p}|k_{j}| \le K\right\} \) and \(P_{c}^{K} := \left\{ (k_{1},...,k_{p}) \in P^{K}: \sum _{j=1}^{p}\tilde{c}_{j\nu }k_{j} \le K \; \forall \nu = 1,...,V\right\} .\) Let \(|P^{K}|\) and \(|P_{c}^{K}|\) be the sizes of the sets \(P^{K}\) and \(P_{c}^{K}\) respectively. The subscript \(c\) in \(P_{c}^{K}\) denotes that this polyhedron is a constrained version of \(P^{K}\). Define \({X_{sL}}\) to be equal to the product of a diagonal matrix (whose \(j^{th}\) diagonal element is \(\frac{n^{1/r}X_{b}B_{b}}{\Vert h_{j}\Vert _{r}}\)) and \({X}_{L}\). Define \(\lambda _{\min }({X_{sL}}{X_{sL}}^{T})\) to be the smallest eigenvalue of the matrix \({X_{sL}}{X_{sL}}^{T}\).

Theorem 2

(Theorem 6 of Tulabandhula and Rudin 2013)

$$\begin{aligned} N(\sqrt{n}\epsilon ,{\mathcal {F}}_{|S},\Vert \cdot \Vert _{2}) \le {\left\{ \begin{array}{ll} \min \{|P^{K_{0}}|,|P_{c}^{K}|\} &{} \mathrm{if }\,\, \epsilon < X_{b}B_{b} \\ 1 &{} \,\,\mathrm{ otherwise } \end{array}\right. }, \end{aligned}$$

where \( K_{0} = \left\lceil \frac{X_{b}^{2}B_{b}^{2}}{\epsilon ^{2}}\right\rceil \) and \(K\) is the maximum of \(K_{0}\) and

$$\begin{aligned} \left\lceil \frac{nX_{b}^{2}B_{b}^{2}}{\lambda _{\min }({X_{sL}}{X_{sL}}^{T})\Big [\min _{\nu =1,...,V} \frac{\delta _{\nu }}{\sum _{j=1}^{p}|\tilde{c}_{j\nu }|}\Big ]^{2}}\right\rceil . \end{aligned}$$

Intuition: The linear assumptions on the labels of the unlabeled examples \(\{\tilde{x}_{i}\}_{i=1}^{m}\) determine the parameters \(\{\tilde{c}_{j\nu }\}_{j,\nu }\) that in turn influence the complexity measure bound. In particular, as the linear constraints given by the \(c_{j\nu }\)’s force the hypothesis space to be smaller, they force \(|P_{c}^{K}|\) to be smaller. This leads to a tighter upper bound on the covering number.

Relation to standard results: We recover the covering number bound for linear function classes given in (Zhang 2002) when there are no linear constraints. In this case, the polytope \(P^{K}\) is well structured and the number of integer points in it can be upper bounded in an explicit way combinatorially.

It is possible to use convex duality to upper bound the empirical Rademacher complexity as we did in Proposition 1. However, the intuition is less clear, and thus, we omit the bound here.

3.5 Complexity results with quadratic constraints

Consider the set \({\mathcal {F}}= \{f: f(x)=\beta ^{T} x, \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1 \}.\) Assume that both matrices are symmetric positive semidefinite and that at least one of them is positive definite. Let \(\varXi _{1} = \{\beta : \beta ^{T}A_{1}\beta \le 1\}\) and \(\varXi _{2} = \{\beta : \beta ^{T}A_{2}\beta \le 1\}\) be the corresponding ellipsoid sets.

3.5.1 Upper bound on empirical Rademacher complexity

We first find an ellipsoid \(\varXi _{\mathrm{int }\gamma }\) (with matrix \(A_{\mathrm{int }\gamma }\)) circumscribing the intersection of the two ellipsoids \(\varXi _{1}\) and \(\varXi _{2}\), and then find a bound on the Rademacher complexity of a corresponding function class, leading to our result for the quadratic constraint case. We will pick matrix \(A_{\mathrm{int }\gamma }\) to have a particularly desirable property, namely that it is tight. We will call a circumscribing ellipsoid tight when no other ellipsoidal boundary comes between its boundary and the intersection (\(\varXi _{1}\cap \varXi _{2}\)). If we choose this property as our criterion for picking the ellipsoid, then according to the following result, we can do so by taking a convex combination of the matrices defining the original ellipsoids:

Theorem 3

(Circumscribing ellipsoids, Kahan 1968) There is a family of circumscribing ellipsoids that contains every tight ellipsoid. Every ellipsoid \(\varXi _{\mathrm{int }\gamma }\) in this family has \(\varXi _{\mathrm{int }\gamma } \supseteq (\varXi _{1}\cap \varXi _{2})\) and is generated by matrix \(A_{\mathrm{int }\gamma } = \gamma A_{1} + (1-\gamma ) A_{2}\), \(\gamma \in [0,1]\).

Using the above theorem, we can find a tight ellipsoid \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\) that contains the set \(\{ \beta : \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1\}\) easily. Note that the right hand sides of the quadratic constraints defining these ellipsoids can be equal to one without loss of generality.

Theorem 4

(Rademacher complexity of linear function class with two quadratic constraints) Let

$$\begin{aligned} {\mathcal {F}}= \{f: f(x)=\beta ^{T} x, \beta ^{T}\mathbb {I}\beta \le B_{b}^{2}, \beta ^{T}A_{2} \beta \le 1\} \end{aligned}$$

with \(A_{2}\) symmetric positive-semidefinite and \(B_b > 0\). Then,

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \frac{1}{{n}}\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1}{X}_{L})}, \end{aligned}$$
(1)

where \(A_{\mathrm{int }\gamma }\) is the matrix of a circumscribing ellipsoid \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\) of the set \(\{ \beta : \beta ^{T}\mathbb {I}\beta \le B_{b}^{2}, \beta ^{T}A_{2} \beta \le 1\}\) and \({X}_{L}\) is the matrix \([x_1\; \ldots \; x_n]\) with examples \(x_i\)’s as its columns.

Intuition: If the quadratic constraints are such that they correspond to small ellipsoids, then the circumscribing ellipsoid will also be small. Correspondingly, the eigenvalues of \(A_{\mathrm{int}\gamma }\) will be large. Since the upper bound depends inversely on the magnitude of these eigenvalues (it depends on \(A_{\mathrm{int}\gamma }^{-1}\)), it becomes tighter. Also, in the setting where the original ellipsoids are large and elongated but their intersection region is small and can be bounded by a small circumscribing ellipsoid, the upper bound is again tighter.

Relation to standard results: If \(A_{\mathrm{int }\gamma }\) is diagonal (or axis-aligned), then we can write the empirical complexity \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})\) in terms of the eigenvalues \(\{\lambda _{i}\}_{i=1}^{p}\) as \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \frac{1}{n}\sqrt{\sum _{j=1}^{n}\sum _{i=1}^{p}\frac{x_{ji}^{2}}{\lambda _{i}}}\) and this can be bounded by \(\frac{X_{b}B_{b}}{\sqrt{n}}\) (Kakade et al. 2008) when \(A_{2} = \mathbf {0}\). In that case, all of the \(\lambda _i\) are \(\frac{1}{B_{b}^{2}}\).

Remark 2

Since we can choose any circumscribing matrix \(A_{\mathrm{int }\gamma }\) in this theorem, we can perform the following optimization to get a circumscribing ellipsoid that minimizes the bound:

$$\begin{aligned} \min _{\gamma \in [0,1]} \mathrm{trace } ({X}_{L}^{T}(\gamma A_{1} + (1-\gamma )A_{2})^{-1}{X}_{L}). \end{aligned}$$
(2)

This optimization problem is a univariate non-linear program.
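A direct way to carry out this univariate optimization is a bounded scalar search, as in the sketch below (synthetic \(X_L\) and \(A_2\); scipy assumed), which also reports the resulting bound of Eq. (1).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch with synthetic matrices: choose gamma in [0, 1] to minimize
# trace(X_L^T (gamma*A_1 + (1-gamma)*A_2)^{-1} X_L), as in Eq. (2).
rng = np.random.default_rng(0)
p, n, B_b = 4, 20, 1.0
X_L = rng.normal(size=(p, n))
A_1 = np.eye(p) / B_b ** 2                     # the ||beta||_2 <= B_b ball, normalized to <= 1
M = rng.normal(size=(p, p))
A_2 = M @ M.T + 0.1 * np.eye(p)                # a second (hypothetical) ellipsoid constraint

def objective(gamma):
    A_int = gamma * A_1 + (1 - gamma) * A_2
    return np.trace(X_L.T @ np.linalg.inv(A_int) @ X_L)

res = minimize_scalar(objective, bounds=(0.0, 1.0), method="bounded")
print(res.x, np.sqrt(res.fun) / n)             # gamma* and the bound of Eq. (1)
```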

3.5.2 Lower bound on empirical Rademacher complexity

We will now show that the dependence of the complexity on \(A_{\mathrm{int }\gamma }^{-1}\) is near optimal.

Since \(A_{\mathrm{int}\gamma }\) is a real symmetric matrix, let us decompose \(A_{\mathrm{int}\gamma }\) into a product \(P^T DP\) where \(D\) is a diagonal matrix with the eigenvalues of \(A_{\mathrm{int}\gamma }\) as its entries and \(P\) is an orthogonal matrix (i.e., \(P^T P=I\)). Our result, which is similar in form to the upper bound of Theorem 4, is as follows.

Theorem 5

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\ge \frac{\kappa }{n\log n}\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})} \end{aligned}$$

where

$$\begin{aligned} \kappa = \frac{{1}}{C\sqrt{1 + \frac{2\pi pnX_b^2}{(\min _{j=1,...,p}\Vert (P{X}_{L})_{j}\Vert _{2})^{2}}}}, \end{aligned}$$

\(C\) is the constant in Lemma 5, \(P\) is the orthogonal matrix from the decomposition of matrix \(A_{\mathrm{int}\gamma }\) defined in Theorem 4, \(p\) and \(X_b > 0\) are problem constants, \({X}_{L}\) is the matrix \([x_1\; \ldots \; x_n]\) with examples \(x_i\)’s as its columns, and \(n\) is the number of training examples.

Intuition: The lower bound is showing that the dependence on \(\sqrt{\mathrm{trace }({X}_{L}^TA_{\mathrm{int}\gamma }^{-1}{X}_{L})}\) is tight modulo a \(\log n\) factor and a factor (\(\kappa \)). The \(\log n\) factor is essentially due to the use of the relation between Gaussian and Rademacher complexities in our proof technique. On the other hand, \(\kappa \) depends on the interaction between the side knowledge about the unlabeled examples (captured through matrix \(P\)) and the feature matrix \({X}_{L}\). If there is no interaction, that is, \(P{X}_{L}\) has zero valued rows for all \(j=1,...,p\), then the lower bound on empirical Rademacher complexity becomes equal to 0. On the other hand, when there is higher interaction between \(A_{\mathrm{int}\gamma }\) (or equivalently, \(P\)) and \({X}_{L}\), then the factor \(\kappa \) grows larger, tightening the lower bound on the empirical Rademacher complexity.

The dependence of the lower bound on the strength of the additional convex quadratic constraint is captured via \(A_{\mathrm{int}\gamma }\) and behaves in a similar way to the upper bound. That is, when the constraint leads to a small circumscribing ellipsoid, the eigenvalues of \(A_{\mathrm{int}\gamma }^{-1}\) are small and the lower bound is small (just like the upper bound). On the other hand, if the constraint leads to a larger circumscribing ellipsoid, the eigenvalues of \(A_{\mathrm{int}\gamma }^{-1}\) are large, leading to a higher values of the lower bound (the upper bound also increases similarly).

Relation to standard results: As with the upper bound, when there is no second quadratic constraint, \(A_{\mathrm{int}\gamma }= \frac{1}{B_b^2}\mathbb {I}\). The lower bound depends on the training data through the term \(\sqrt{\mathrm{trace }({X}_{L}^T{X}_{L})}\) in this case.

Comparison to the upper bound: For comparison, we see that the upper bound in Theorem 4 is of the form \(\frac{1}{n}\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})}\) while the lower bound of Theorem 5 is of the form

$$\begin{aligned} \frac{\kappa }{n\log n} \sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})}, \end{aligned}$$

where \(\kappa \) depends on \(A_{\mathrm{int }\gamma }\) and \({X}_{L}\).

The proof for the lower bound is similar to what one would do to estimate the complexity of an ellipsoid itself (without regard to a corresponding linear function class). See also the work of Wainwright (2011) for handling single ellipsoids.

3.5.3 Comparison of empirical Rademacher complexity upper bound with a covering number based bound

When matrix \(A_{\mathrm{int }\gamma }\) describing a circumscribing ellipsoid has eigenvalues \(\{\lambda _{i}\}_{i=1}^{p}\), then the covering number can be bounded as:

$$\begin{aligned} {N(\sqrt{n}\epsilon ,{\mathcal {F}}_{|S}, \Vert \cdot \Vert _{2}) \le \Pi _{i=1}^{p}\left( \frac{2X_b}{\epsilon \sqrt{\lambda _{i}}} + 1\right) .} \end{aligned}$$

To get a tight bound, among all circumscribing ellipsoids, we should pick one that minimizes the right hand side of the bound. To do this, we solve an optimization problem involving volume minimization that is different from the one in Eq. (2). For instance, this volume minimization can be done using the following steps if at least one of the matrices among \(A_1\) and \(A_2\) is positive-definite:

  • First, \(A_{1}\) and \(A_{2}\) are simultaneously diagonalized by congruence (say with a non-singular matrix called \(C\)) to obtain diagonal matrices \(\mathrm{Diag }(a_{1i})\) and \(\mathrm{Diag }(a_{2i})\). We can guarantee that the set of ratios \(\{\frac{a_{1i}}{a_{2i}}\}\) obtained will be unique.

  • The desired ellipsoid \(A_{\mathrm{int }\gamma ^*}\) can then be obtained by computing

    $$\begin{aligned} \gamma ^* \in \arg \max _{\gamma \in [0,1]} \Pi _{i=1}^{p}(\gamma a_{1i} + (1-\gamma )a_{2i}) \end{aligned}$$

    and then multiplying the optimal diagonal matrix \(\mathrm{Diag }(\gamma ^* a_{1i} + (1-\gamma ^*)a_{2i})\) with the congruence matrix \(C\) appropriately. The optimal \(\gamma ^*\) can be found in polynomial time (for example, using Newton-Raphson); a numerical sketch of both steps is given after this list.
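The sketch below (synthetic positive definite matrices; scipy assumed) illustrates both steps, using the convention that scipy.linalg.eigh returns a congruence matrix \(C\) with \(C^TA_1C = \mathrm{Diag}(a_{1i})\) and \(C^TA_2C = \mathbb {I}\).

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import minimize_scalar

# Minimal sketch with synthetic matrices: volume minimization over the circumscribing family.
rng = np.random.default_rng(0)
p = 4
M1, M2 = rng.normal(size=(p, p)), rng.normal(size=(p, p))
A_1 = M1 @ M1.T + 0.5 * np.eye(p)
A_2 = M2 @ M2.T + 0.5 * np.eye(p)

# Step 1: simultaneous diagonalization by congruence; here C^T A_1 C = Diag(w) and
# C^T A_2 C = I, so a_1i = w_i and a_2i = 1.
w, C = eigh(A_1, A_2)

# Step 2: maximize prod_i (gamma*a_1i + (1-gamma)*a_2i) over gamma in [0, 1]
# (equivalently, minimize the negative log of the product).
neg_log_prod = lambda g: -np.sum(np.log(g * w + (1 - g)))
gamma_star = minimize_scalar(neg_log_prod, bounds=(0.0, 1.0), method="bounded").x

C_inv = np.linalg.inv(C)
A_int = C_inv.T @ np.diag(gamma_star * w + (1 - gamma_star)) @ C_inv
print(gamma_star, np.linalg.det(A_int))        # larger det(A_int) = smaller ellipsoid volume
```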

Comparison with the duality approach to upper bounding empirical Rademacher complexity: A convex duality based upper bound can be derived as shown below.

Theorem 6

Consider the setting of Theorem 4. Then,

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \inf _{\eta \in [0,1]}\left\{ \frac{1}{4n} \mathrm{trace }(X_L^TA_{\mathrm{int }\eta }^{-1}X_L) + \frac{1}{n}(B_b^2 + \eta (1-B_b^2))\right\} , \end{aligned}$$
(3)

where \(A_{\mathrm{int }\eta } = \mathbb {I} + \eta (A_2 - \mathbb {I})\).

This upper bound looks similar to the result in Eq. (1). Note that \(A_{\mathrm{int }\eta }\) is different from \(A_{\mathrm{int}\gamma }\) in Theorem 4. \(A_{\mathrm{int}\gamma }\) comes from a circumscribing ellipsoid, whereas \(A_{\mathrm{int }\eta }\) does not.

Instead, the matrix \(A_{\mathrm{int }\eta }\) is picked such that \(\eta \) minimizes the right hand side of the bound in Eq. 3. Qualitatively, we can see that if the matrix \(A_2\) corresponding to the second ellipsoid constraint has large eigenvalues (for instance, when the second ellipsoid is a smaller sphere, or is an elongated thin ellipsoid), then \(A_{\mathrm{int }\eta }^{-1}\) is ‘small’ (the eigenvalues are small) leading to a tighter upper bound on the empirical Rademacher complexity.

3.5.4 Extension to multiple convex quadratic constraints

Although Sect. 3.5 deals with only two convex quadratic constraints, the same strategy can be used to upper bound the complexity of a hypothesis class constrained by multiple convex quadratic constraints. In particular, let \({\mathcal {F}}= \{f: f=\beta ^{T} x, \beta ^{T}A_{k}\beta \le 1 \;\;\forall k=1,...,K \}\). Again, assume one of the matrices \(A_k\) is positive definite. We can approach this problem in two stages. In the first step, we find an ellipsoid \(\varXi _{\mathrm{int }\gamma }\) (with matrix \(A_{\mathrm{int}\gamma }\)) circumscribing the intersection of the \(K\) original ellipsoids, and in the second step, we reuse Theorem 4 to obtain an upper bound on \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})\).

We will generalize Eq. (2) to look for a circumscribing ellipsoid from the family of ellipsoids parameterized by a \(K\) dimensional vector \(\gamma \) constrained to the \(K-1\) simplex. In other words, the family of circumscribing ellipsoids is given by \(\{\beta ^TA_{\mathrm{int}\gamma }\beta \le 1: A_{\mathrm{int}\gamma }= \sum _{k=1}^{K}\gamma _kA_k, \sum _{k=1}^{K}\gamma _k = 1, \gamma _k \ge 0 \;\;\forall k=1,...,K\}\). We can pick one circumscribing ellipsoid from this family by minimizing the right hand side of Eq. 1 over the \(K-1\) simplex similar to Eq. (2):

$$\begin{aligned} \min _{\gamma \in \left\{ \gamma : \sum _{k=1}^{K}\gamma _k = 1, \gamma _k \ge 0 \;\;\forall k=1,...,K\right\} } \mathrm{trace } \left( {X}_{L}^{T}\left( \sum _{k=1}^{K}\gamma _kA_k\right) ^{-1}{X}_{L}\right) . \end{aligned}$$

The above optimization problem is a \(K-1\) dimensional polynomial optimization problem.

3.6 Complexity results with linear and quadratic constraints

Consider now the setting where we have both linear and quadratic constraints. In particular, we can have the assumptions leading to linear constraints and those leading to quadratic constraints hold simultaneously. In such a setting, based on Theorems 2 and 3, we can get a potentially tighter covering number result as follows. Let \(x_i \in {\mathcal {X}}= \{x: \Vert x\Vert _2 \le X_b\}\). Let the function class \({\mathcal {F}}\) be

$$\begin{aligned} {\mathcal {F}}=\Big \{f | f(x)&= \beta ^{T}x, \beta \in \mathbb {R}^{p}, \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1,\\&\sum _{j=1}^{p}c_{j\nu }\beta _{j} +\delta _{\nu } \le 1, \delta _{\nu } > 0, \nu =1,...,V\Big \}, \end{aligned}$$

where \(\{c_{j\nu }\}_{j,\nu }\), \(\{\delta _{\nu }\}_{\nu }\), \(A_1\) and \(A_2\) are known beforehand.

Let matrix \(A_{\mathrm{int }\gamma }\) be such that \(\{ \beta : \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1\}\) is circumscribed by \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\). Defining \(\{\tilde{c}_{j\nu }\}\) and \({X_{sL}}\) in the same way as in Sect. 3.4, we get the following corollary.

Corollary 1

(of Theorem 2)

$$\begin{aligned} N(\sqrt{n}\epsilon ,{\mathcal {F}}_{|S},\Vert \cdot \Vert _{2}) \le {\left\{ \begin{array}{ll} \min \{|P^{K_{0}}|,|P_{c}^{K}|\} &{} \mathrm{if }\,\, \epsilon < X_{b}\sqrt{\lambda _{\max }(A_{\mathrm{int }\gamma }^{-1})}\\ 1 &{} \mathrm{ otherwise } \end{array}\right. }. \end{aligned}$$

Here, \( K_{0} = \left\lceil \frac{X_{b}^{2}\lambda _{\max }(A_{\mathrm{int }\gamma }^{-1})}{\epsilon ^{2}}\right\rceil \) and \(K\) is the maximum of \(K_{0}\) and

$$\begin{aligned} \left\lceil \frac{nX_{b}^{2}\lambda _{\max }(A_{\mathrm{int }\gamma }^{-1})}{\lambda _{\min }({X_{sL}}{X_{sL}}^{T})\Big [\min _{\nu =1,...,V} \frac{\delta _{\nu }}{\sum _{j=1}^{p}|\tilde{c}_{j\nu }|}\Big ]^{2}}\right\rceil . \end{aligned}$$

The corollary holds for any \(A_{\mathrm{int }\gamma }\) that satisfies the circumscribing requirement. In particular, we can construct the ellipsoid \(\{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\}\) such that it ‘tightly’ circumscribes the set \(\{ \beta : \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1\}\) using Theorem 3, in the same way as we did in Sect. 3.5. The intuition for how the parameters of our side knowledge, namely the linear inequality coefficients and the matrices corresponding to the ellipsoids, influence the bound is the same as in Sects. 3.4 and 3.5. Relations to standard results have also been discussed in those sections.

3.6.1 Extension to arbitrary convex constraints

There are at least three ways to reuse the results we have with linear, polygonal, quadratic and conic constraints to give upper bounds on covering number or empirical Rademacher complexity of function classes with arbitrary convex constraints. Such arbitrary convex constraints can arise in many settings. For instance, when the convex quadratic constraints in Sect. 2.2 are not symmetric around the origin, we cannot use the results of Sect. 3.5 directly, but the following techniques apply. Other typical convex constraints include those arising from likelihood models, entropy biases and so on.

The first approach involves constructing an outer polyhedral approximation of the convex constraint set. For instance, if we are given a separation oracle for the convex constraint, constructing an outer polyhedral approximation is relatively straightforward. We can also optimize for properties like the number of facets or vertices of the polyhedron during such a construction. Given such an outer approximation, we can apply Theorem 2 to get an upper bound on the covering number of the hypothesis space with the given convex constraint.

The second approach involves constructing a circumscribing ellipsoid for the constraint set. This is possible for any convex set in general (John 1948). In addition if the convex set is symmetric around the origin, the ‘tightness’ of the circumscribing ellipsoid improves by a factor \(\sqrt{p}\), where \(p\) is the dimension of the linear coefficient vector \(\beta \). Given such a circumscribing ellipsoid, we can apply Theorem 4 to get an upper bound on the empirical Rademacher complexity of the original function class with the convex constraint. The quality of both of these outer relaxation approaches depends on the structure and form of the convex constraint we are given.

The third approach is to analyze the empirical Rademacher complexity directly using convex duality as we have done for the linear and quadratic cases, and as we will do for the conic case next.

3.7 Complexity results with multiple conic constraints

Consider the function class

$$\begin{aligned} {\mathcal {F}}= \{f: f = \beta ^Tx, \beta ^T\beta \le B_b^2, \Vert A_k\beta \Vert _2 \le a_k^T\beta + d_k\;\; \forall k=1,...,K \}, \end{aligned}$$

where we have one convex quadratic constraint and \(K\) conic constraints. We can find an upper bound on the Rademacher complexity as shown below.

Theorem 7

(Rademacher complexity of bounded linear function class with conic constraints) Let \({\mathcal {X}}= \{x:\Vert x\Vert _2 \le X_b\}\) with \(X_b >0\) and let

$$\begin{aligned} {\mathcal {F}}= \{f: f = \beta ^Tx, \beta ^T\beta \le B_b^2, \Vert A_k\beta \Vert _2 \le a_k^T\beta + d_k\;\; \forall k=1,...,K \}, \end{aligned}$$

where \(B_b >0\) and \(\{A_k,a_k,d_k\}_{k=1}^{K}\) are the parameters. Assume \(A_k \succ 0\) and let \(\lambda _{\min }(A_k)\) denote its minimum eigenvalue for \(k=1,...,K\). Also let \(\sup _{x \in {\mathcal {X}}}\Vert x\Vert _2 \le X_b\). Then,

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \frac{X_b}{\sqrt{n}}\cdot \min \left\{ B_b,\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}\right\} . \end{aligned}$$

Intuition: When \(\Vert a_k\Vert _2\) and \(d_k\) are \(o(\lambda _{\min }(A_k))\), the conic constraints can tighten the upper bound on the empirical Rademacher complexity and thereby make the corresponding generalization bounds tighter. From a geometric point of view, we can infer the following: if the cones are sharp, then the \(\lambda _{\min }(A_k)\) are large, implying a smaller empirical Rademacher complexity. Figure 2 illustrates this in two dimensions.

Fig. 2

Here we illustrate the effect of a single conic constraint \(\{\beta : \sqrt{4\mu \beta _1^2 + \mu \beta _2^2} \le \delta (2\beta _1 + 3\beta _2 + 4)\}\) on our hypothesis space \(\{\beta \in \mathbb {R}^2: \beta ^T\beta \le 9\}\) for different scaling values of parameters \(\mu \) and \(\delta \). In our notation, matrix \(A = [2\sqrt{\mu } \;\;0; 0 \;\;\sqrt{\mu }]\), vector \(a = \delta [2\;\;3]^T\) and scalar \(d = 4\delta \). Left: Parameter set \((\mu ,\delta )\) is equal to \((1,1)\). The region covered by the conic constraint is the convex set in the upper part of the circle. Center: Changing the parameters \((\mu ,\delta )\) to \((10,1)\) makes the eigenvalue \(\lambda _{\min }(A)\) larger, thus reducing the intersection region further. Right: Changing the parameters \((\mu ,\delta )\) to \((1,10)\) increases the magnitude of \(\Vert a\Vert _2\) and \(d\) relative to the value of \(\lambda _{\min }(A)\), which increases the intersection region between the conic constraint and the ball. This leads to a larger empirical Rademacher complexity bound value.

Relation to standard results: The looser unconstrained version of the upper bound, \(\frac{X_bB_b}{\sqrt{n}}\), is recovered when there are no conic constraints or when the conic constraints are ineffective (for instance, when \(\Vert a_k\Vert _2\) is large, \(d_k\) is a large offset, or \(\lambda _{\min }(A_k)\) is small).
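To make the bound concrete, the following short numpy sketch (ours; the values of \(X_b\), \(B_b\) and \(n\) are arbitrary) evaluates the right-hand side of Theorem 7 for the single cone of Fig. 2 with \(\delta = 1\) and shows how increasing the sharpness parameter \(\mu \), and hence \(\lambda _{\min }(A)\), shrinks the bound relative to the unconstrained value \(X_bB_b/\sqrt{n}\).

    import numpy as np

    def rademacher_bound(X_b, B_b, n, cones):
        """Right-hand side of Theorem 7; cones is a list of (A_k, a_k, d_k) triples."""
        K = len(cones)
        conic_term = sum((B_b * np.linalg.norm(a) + d) / (K * np.linalg.eigvalsh(A).min())
                         for A, a, d in cones)
        return X_b / np.sqrt(n) * min(B_b, conic_term)

    X_b, B_b, n = 1.0, 3.0, 200
    a, d = np.array([2.0, 3.0]), 4.0              # delta = 1 in the notation of Fig. 2
    for mu in [1.0, 10.0, 100.0]:                 # cone sharpness
        A = np.diag([2.0 * np.sqrt(mu), np.sqrt(mu)])
        print(mu, rademacher_bound(X_b, B_b, n, [(A, a, d)]))
    print("unconstrained bound:", X_b * B_b / np.sqrt(n))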

Remark 3

There have been some recent attempts in the compressed sensing literature, which also involves conic constraints, to obtain bounds on a measure similar to the empirical Gaussian complexity defined here (Stojnic 2009). Their objective (the minimum number of measurements for signal recovery assuming sparsity) is very different from our objective (function class complexity and generalization). In the former context, there are a few results (Chandrasekaran et al. 2012) dealing with the intersection of a single generic cone with a sphere (\(\mathbb {S}^{p-1}\)), whereas in our context, we look at the intersection of multiple second order cones (explicitly parameterized by \(\{A_k,a_k,d_k\}_{k=1}^{K}\)) with balls (\(\{\beta ^T\beta \le B_b^2\}\)).

4 Related work

It is well-known that having additional unlabeled examples can aid in learning (Shental et al. 2004; Nguyen and Caruana 2008b; Gómez-Chova et al. 2008), and this has been the subject of research in semi-supervised learning (Zhu 2005). The present work is fundamentally different from semi-supervised learning, because semi-supervised learning exploits the distributional properties of the set of unlabeled examples. In this work, we do not necessarily have enough unlabeled examples to study these distributional properties, but these unlabeled examples do provide us information about the hypothesis space. Distributional properties used in semi-supervised learning include cluster assumptions (Singh et al. 2008; Rigollet 2007) and manifold assumptions (Belkin and Niyogi 2004; Belkin et al. 2004). In our work, the information we get from the unlabeled examples allows us to restrict the hypothesis space, which places us in the framework of empirical risk minimization and lets us give theoretical generalization bounds via complexity measures of the restricted hypothesis spaces (Bartlett and Mendelson 2002; Vapnik 1998). While the focus of many works [e.g., Zhang 2002; Maurer 2006] is on complexity measures for ball-like function classes, our hypothesis spaces are more complicated, and arise here from constraints on the data.

Researchers have also attempted to incorporate domain knowledge directly into learning algorithms, where this domain knowledge does not necessarily arise from unlabeled examples. For instance, the framework of knowledge-based SVMs (Fung et al. 2002; Le et al. 2006) motivates the use of various constraints or modifications in the learning procedure to incorporate specific kinds of knowledge (without using unlabeled examples). The focus of Fung et al. (2002) is algorithmic and they consider linear constraints. Le et al. (2006) incorporate knowledge by modifying the function class itself, for instance, from linear functions to non-linear functions.

In a different framework, that of Valiant’s PAC learning, there are concentration statements about the risks in the presence of unlabeled examples (Balcan and Blum 2005; Kääriäinen 2005), though in these results, the unlabeled points are used in a very different way than in our work. Specifically, in the work of Balcan and Blum (2005), the authors introduce the notion of incompatibility \(\mathbb {E}_{x\sim D}[1 - \chi (h,x)]\) between a function \(h\) and the input distribution \(D\). The unlabeled examples are used to estimate the distribution dependent quantity \(\mathbb {E}_{x\sim D}[1 - \chi (h,x)]\). Imposing the constraint that a model's incompatibility with the data source's distribution \(D\) be below a desired level restricts the hypothesis space. Their result for a finite hypothesis space is as follows:

Theorem 8

(Theorem 1 of Balcan and Blum 2005) If we see \(m\) unlabeled examples and \(n\) labeled examples, where

$$\begin{aligned} m \ge \frac{1}{\epsilon }\left[ \ln |C| + \ln \frac{2}{\delta }\right] \,\,\mathrm{ and } \,\, n \ge \frac{1}{\epsilon }\left[ \ln |C_{D,\chi }(\epsilon )| + \ln \frac{2}{\delta }\right] , \end{aligned}$$

then with probability \(1-\delta \), all \(h \in C\) with zero training error and zero incompatibility \(\frac{1}{m}\sum _{i=1}^{m}(1-\chi (h,\tilde{x}_i)) = 0\) satisfy \(\mathbb {E}[l(h(x),y)] \le \epsilon \).

Here \(C\) is the finite hypothesis space of which \(h\) is an element and \(C_{D,\chi }(\epsilon ) = \{h \in C: \mathbb {E}_{x\sim D}[1-\chi (h,x)] \le \epsilon \}\). In the work of Kääriäinen (2005), the author obtains a generalization bound by approximating the disagreement probability of pairs of classifiers using unlabeled data. Again, here the unlabeled data is used to estimate a distribution dependent quantity, namely, the true disagreement probability between consistent models. In particular, the disagreement between two models \(h\) and \(g\) is defined to be \(d(h,g) = \frac{1}{m}\sum _{i=1}^{m}1_{[h(\tilde{x}_i) \ne g(\tilde{x}_i)]}\). The author proposes the following generalization theorem.

Theorem 9

Let \({\mathcal {F}}\) be the class of consistent models, that is, the set of models with zero training error. Assume the true model belongs to this class. Let \(\hat{f} \in {\mathcal {F}}\) be the function whose distance to the farthest function in \({\mathcal {F}}\) is minimal (via metric \(d\)). Then, for all \(S\), with probability \(1-\delta \) over the choice of unlabeled sample \(S^\mathrm{unlabeled }\),

$$\begin{aligned}&\mathbb {E}_{S^\mathrm{unlabeled }}[l(\hat{f}(x),y)] \le \inf _{f \in {\mathcal {F}}}\sup _{g \in {\mathcal {F}}}d(f,g) \\&\qquad + \mathcal {\bar{R}}(\{1_{[g\ne g']} | g,g' \in {\mathcal {F}}\}_{|S^\mathrm{unlabeled }}) + O\left( \sqrt{\frac{\ln (2/\delta )}{m}}\right) . \end{aligned}$$

Note that the randomization in both Theorems 8 and 9 is also over unlabeled data. In our theorems, we do not randomize with respect to the unlabeled data. For us, they serve a different purpose and do not need to be chosen randomly. While their results focus on exploiting unlabeled data to estimate distribution dependent quantities, our technology focuses on exploiting unlabeled data to restrict the hypothesis space directly.

5 Proofs

5.1 Proof of Proposition 1

Proof

Instead of working with the maximization problem in the definition of empirical Rademacher complexity, we will work with a couple of related maximization problems, due to the following lemma.

Lemma 1

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \mathbb {E}\left[ \max \left( \sup _{f \in {\mathcal {F}}} \frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i}),\sup _{f \in {\mathcal {F}}} -\frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i})\right) \right] . \end{aligned}$$
(4)

Proof

Since the empirical Rademacher complexity is defined as \(\mathbb {E}_{\sigma }[\sup _{f \in {\mathcal {F}}} \frac{1}{n}| \sum _{i=1}^{n}\sigma _{i}f(x_{i})| ]\), we will show that for any fixed \(\sigma \) vector,

$$\begin{aligned} \sup _{f \in {\mathcal {F}}} \frac{1}{n}\left| \sum _{i=1}^{n}\sigma _{i}f(x_{i}) \right| \le \max \left( \sup _{f \in {\mathcal {F}}} \frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i}),\sup _{f \in {\mathcal {F}}} -\frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i})\right) . \end{aligned}$$
(5)

The inequality above is straightforward to prove. Let \(f^*\) be the optimal solution to the maximization problem on the left. Then, \(f^*\) is a feasible point for each of the maximization problems on the right. We will look at two cases: In the first case, let \(\frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f^*(x_{i}) \ge 0\). Then, clearly the first maximization problem on the right, namely, \(\sup _{f \in {\mathcal {F}}} \frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i})\) will have an optimal value greater than or equal to the left side of Eq. (5). In the second case, let \(\frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f^*(x_{i}) < 0\). Then, the second maximization problem on the right, namely, \(\sup _{f\in {\mathcal {F}}} -\frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i})\) will have an optimal value greater than or equal to the left side of Eq. (5). That is, in this case:

$$\begin{aligned} 0 \le \left| \frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f^*(x_{i}) \right| = - \frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f^*(x_{i}) \le \sup _{f\in {\mathcal {F}}} -\frac{1}{n}\sum _{i=1}^{n}\sigma _{i}f(x_{i}). \end{aligned}$$

Combining the two cases, we get Eq. (5). Taking expectations over \(\sigma \) gives the desired inequality. \(\square \)
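As a quick numerical sanity check of inequality (5) (our own illustration; the toy constraint set, data and sample sizes are arbitrary), the sketch below approximates the suprema by sampling feasible coefficient vectors from \(\{\beta : \Vert \beta \Vert _2 \le B_b, a^T\beta \le 1\}\) and verifies that the left-hand side never exceeds the right-hand side.

    import numpy as np

    rng = np.random.default_rng(1)
    B_b, a = 2.0, np.array([1.0, -0.5, 0.3])
    X_L = rng.normal(size=(3, 10))                       # columns are the labeled points x_i
    betas = rng.normal(size=(50000, 3))
    betas = betas[(np.linalg.norm(betas, axis=1) <= B_b) & (betas @ a <= 1)]

    for _ in range(20):
        sigma = rng.choice([-1.0, 1.0], size=10)         # a Rademacher vector
        vals = betas @ (X_L @ sigma)                     # sum_i sigma_i beta^T x_i for each sampled beta
        lhs = np.abs(vals).max()
        rhs = max(vals.max(), (-vals).max())
        assert lhs <= rhs + 1e-12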

Continuing with the proof of Proposition 1: Let \(g = \sum _{i=1}^{n}\sigma _ix_i = X_L\sigma \) so that \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})= \frac{1}{n}\mathbb {E}[\sup _{\beta \in {\mathcal {F}}} |g^T\beta |]\). We will attempt to dualize the two maximization problems in the upper bound provided by Lemma 1 to get a bound on the empirical Rademacher complexity. Both maximization problems are very similar except for the objective. Let \(\omega (g,{\mathcal {F}})\) be the optimal value of the following optimization problem:

$$\begin{aligned} \max _{\beta } g^T\beta \;\;\; \mathrm{s.t. }\\ \beta ^T\beta \le B_b^2\\ a^T\beta \le 1. \end{aligned}$$

Thus \(\omega (g,{\mathcal {F}})\) represents the optimal value of the maximization problem inside the expectation operation in the first term of Eq. (4). We will now write a dual program to the above and use weak duality to upper bound \(\omega (g,{\mathcal {F}})\). The Lagrangian is:

$$\begin{aligned} \mathcal {L}(\beta ,\gamma ,\eta ) = g^T\beta + \gamma (B_b^2 - \beta ^T\beta ) + \eta (1 - a^T\beta ), \end{aligned}$$

where \(\beta \in \mathbb {R}^p, \gamma \in \mathbb {R}_{+}, \eta \in \mathbb {R}_{+}\). Maximizing the Lagrangian with respect to \(\beta \) gives us:

$$\begin{aligned}&\max _{\beta }\;\mathcal {L}(\beta ,\gamma ,\eta ) \\&\quad = \max _{\beta }\left[ (g - \eta a)^T\beta -\gamma \beta ^T\beta + \gamma B_b^2 + \eta \right] \\&\quad = \max _{\beta }\left[ -\gamma \left[ \beta ^T\beta - \frac{2(g - \eta a)^T\beta }{2\gamma } + \frac{\Vert g - \eta a\Vert _2^2}{4\gamma ^2}\right] + \frac{\Vert g - \eta a\Vert _2^2}{4\gamma } + \gamma B_b^2 + \eta \right] \\&\quad = \max _{\beta }\left[ -\gamma \left\| \beta - \frac{g - \eta a}{2\gamma }\right\| _2^2 + \frac{\Vert g - \eta a\Vert _2^2}{4\gamma } + \gamma B_b^2 + \eta \right] \\&\quad = \frac{\Vert g - \eta a\Vert _2^2}{4\gamma } + \gamma B_b^2 + \eta . \end{aligned}$$

The dual problem is thus

$$\begin{aligned} \min _{\gamma \ge 0, \eta \ge 0} \frac{\Vert g - \eta a\Vert _2^2}{4\gamma } + \gamma B_b^2 + \eta . \end{aligned}$$

Minimizing with respect to one of the decision variables, \(\gamma \), gives the following dual problem

$$\begin{aligned} \min _{\eta \ge 0} B_b\Vert g - \eta a\Vert _2 + \eta . \end{aligned}$$

Thus, \(\omega (g,{\mathcal {F}}) \le \min _{\eta \ge 0} (B_b\Vert g - \eta a\Vert _2 + \eta )\). Similarly we can prove an upper bound on the maximization problem appearing in the second term in the max operation in Eq. (4), which will be \(\min _{\eta \ge 0} (B_b\Vert g + \eta a\Vert _2 + \eta )\). Thus, the empirical Rademacher complexity is upper bounded as:

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})&\le \frac{1}{n}\max \left( \mathbb {E}\left[ \min _{\eta \ge 0} (B_b\Vert g - \eta a\Vert _2 + \eta )\right] , \mathbb {E}\left[ \min _{\eta \ge 0} (B_b\Vert g + \eta a\Vert _2 + \eta )\right] \right) \\&= \frac{1}{n}\max \left( \mathbb {E}_{\sigma }\left[ \min _{\eta \ge 0} (B_b\Vert X_L\sigma - \eta a\Vert _2 + \eta )\right] , \mathbb {E}_{\sigma }\left[ \min _{\eta \ge 0} (B_b\Vert X_L\sigma + \eta a\Vert _2 + \eta )\right] \right) . \end{aligned}$$

\(\square \)
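The bound of Proposition 1 is straightforward to evaluate numerically. The sketch below (an illustration with arbitrary data, constraint vector and sample sizes) estimates it by Monte Carlo over \(\sigma \), performing the inner minimization over \(\eta \ge 0\) with a bounded scalar search, and compares it with the unconstrained value \(X_bB_b/\sqrt{n}\).

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    p, n, B_b = 5, 100, 2.0
    X_L = rng.normal(size=(p, n))
    X_L /= np.linalg.norm(X_L, axis=0)                   # unit-norm columns, so X_b = 1
    a = rng.normal(size=p)                               # linear side-constraint direction

    def one_sided(sign, sigma):
        g = X_L @ sigma
        obj = lambda eta: B_b * np.linalg.norm(g + sign * eta * a) + eta
        return minimize_scalar(obj, bounds=(0.0, 1e3), method="bounded").fun

    sigmas = rng.choice([-1.0, 1.0], size=(500, n))
    bound = max(np.mean([one_sided(-1.0, s) for s in sigmas]),
                np.mean([one_sided(+1.0, s) for s in sigmas])) / n
    print("Proposition 1 bound estimate:", bound)
    print("unconstrained bound X_b*B_b/sqrt(n):", B_b / np.sqrt(n))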

5.2 Proof of Theorem 4

Proof

Consider the set \({\mathcal {F}}_{|S} = \{(\beta ^{T} x_1,..., \beta ^{T} x_n) \in \mathbb {R}^n : \beta ^{T}\mathbb {I}\beta \le B_{b}^{2}, \beta ^{T}A_{2}\beta \le 1 \} \subset \mathbb {R}^n\). Let \(\sigma =[\sigma _{1},...,\sigma _{n}]^{T}\). Also, let \(\alpha = A_{\mathrm{int }\gamma }^{1/2}\beta \). Then,

$$\begin{aligned} n\cdot \mathcal {\bar{R}}({\mathcal {F}}_{|S})&= \mathbb {E}_{\sigma }\left[ \sup _{\beta : \beta ^{T}\mathbb {I}\beta \le B_{b}^{2},\, \beta ^{T}A_{2}\beta \le 1} \left| \beta ^{T}{X}_{L}\sigma \right| \right] \overset{(a)}{\le } \mathbb {E}_{\sigma }\left[ \sup _{\beta : \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1} \left| \beta ^{T}{X}_{L}\sigma \right| \right] \\&\overset{(b)}{=} \mathbb {E}_{\sigma }\left[ \sup _{\alpha : \alpha ^{T}\alpha \le 1} \left| \alpha ^{T}A_{\mathrm{int }\gamma }^{-1/2}{X}_{L}\sigma \right| \right] \overset{(c)}{=} \mathbb {E}_{\sigma }\left[ \Vert A_{\mathrm{int }\gamma }^{-1/2}{X}_{L}\sigma \Vert _{2}\right] \\&\overset{(d)}{\le } \sqrt{\mathbb {E}_{\sigma }\left[ \Vert A_{\mathrm{int }\gamma }^{-1/2}{X}_{L}\sigma \Vert _{2}^{2}\right] } \overset{(e)}{=} \sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1}{X}_{L})}, \end{aligned}$$

where \((a)\) follows because we are taking the supremum over the circumscribing ellipsoid; \((b)\) follows because \(A_{\mathrm{int }\gamma }\) is positive definite, hence invertible; (c) is by Cauchy-Schwarz (equality case); (d) uses Jensen’s inequality and (e) uses the linearity of trace and expectation to commute them along with the fact that \(\mathbb {E}[\sigma \sigma ^{T}] = I\). \(\square \)
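As a numerical sanity check on steps (c)–(e) (our illustration; the data matrix and the positive definite matrix standing in for \(A_{\mathrm{int }\gamma }\) are arbitrary), the sketch below compares a Monte Carlo estimate of \(\mathbb {E}_{\sigma }\Vert A_{\mathrm{int }\gamma }^{-1/2}{X}_{L}\sigma \Vert _2\) with \(\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1}{X}_{L})}\); by Jensen's inequality the former should not exceed the latter.

    import numpy as np

    rng = np.random.default_rng(3)
    p, n = 4, 50
    X_L = rng.normal(size=(p, n))
    M = rng.normal(size=(p, p))
    A = M @ M.T + np.eye(p)                              # stands in for A_int_gamma (positive definite)
    L = np.linalg.cholesky(np.linalg.inv(A))             # A^{-1} = L L^T, so ||L^T v||^2 = v^T A^{-1} v

    sigmas = rng.choice([-1.0, 1.0], size=(20000, n))
    mc = np.mean([np.linalg.norm(L.T @ (X_L @ s)) for s in sigmas])
    exact = np.sqrt(np.trace(X_L.T @ np.linalg.inv(A) @ X_L))
    print(mc, "<=", exact)                               # Jensen's inequality, step (d)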

5.3 Proof of Theorem 5

Proof

Recall that we can decompose \(A_{\mathrm{int}\gamma }\) into a product \(P^T DP\) where \(D\) is a diagonal matrix with the eigenvalues of \(A_{\mathrm{int}\gamma }\) as its entries and \(P\) is an orthogonal matrix (i.e., \(P^T P=I\)). Let us define a new variable: \(\alpha :=P\beta \), which is a linear transformation of linear model parameter \(\beta \). Then, the empirical Gaussian complexity of our function class can be written as:

$$\begin{aligned} \mathcal {\bar{G}}({\mathcal {F}}_{|S})= \mathbb {E}_{\sigma }\left[ \sup _{\alpha ^{T}D\alpha \le 1} \frac{1}{n}\left| \sum _{i=1}^{n}\sigma _i \alpha ^{T}P x_i\right| \right] , \end{aligned}$$

where \(\{\sigma _{i}\}_{i=1}^{n}\) are i.i.d. standard normal random variables. We now define a new vector \(\omega \) to be a transformed version of the random vector \(\sum _{i=1}^{n}\sigma _i x_i\). That is, let \(\omega (\sigma ) := P\sum _{i=1}^{n}\sigma _i x_i\). We will drop the dependence of \(\omega \) on \(\sigma \) from the notation when it is clear from the context. The expression now becomes

$$\begin{aligned} n\cdot \mathcal {\bar{G}}({\mathcal {F}}_{|S})\; {\ge }\; \mathbb {E}_{\sigma }\left[ \sup _{\alpha ^{T}D\alpha \le 1} \alpha ^T \omega \right] , \end{aligned}$$
(6)

where the inequality is because we removed the absolute sign in the right hand side expression before substituting for \(\omega \).

The following are the major steps in our proof:

  • We will analyze the Gaussian function \(F(\omega (\sigma )) := \sup _{\alpha ^T D \alpha \le 1} \alpha ^{T}\omega (\sigma )\) and show it is Lipschitz in \(\sigma \). This is proved in Lemma 2.

  • Then we apply Lemma 3, which is about Gaussian function concentration, to the above function. In particular, we will upper bound the variance of the Gaussian function \(F(\omega (\sigma ))\) in terms of its parameters (Lipschitz constant, matrix \(D\), etc).

  • We then generate a candidate lower bound for the empirical Gaussian complexity.

  • The upper bound on the variance of \(F(\omega (\sigma ))\) we found earlier is used to make this bound proportional to \(\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int}\gamma }^{-1}{X}_{L})}\).

  • Finally, we use a relation between empirical Rademacher complexity and empirical Gaussian complexity to obtain the desired result.

5.3.1 Computing a Lipschitz constant for \(F(\omega (\sigma ))\)

The following lemma gives an upper bound on the Lipschitz constant of \(F(\omega (\sigma ))\).

Lemma 2

The function \(F(\omega (\sigma )):= \sup _{\alpha ^T D \alpha \le 1} \alpha ^{T}\omega (\sigma )\) is Lipschitz in \(\sigma \) with a Lipschitz constant \(\mathcal {L}\) bounded by \(X_b\sqrt{\frac{p\cdot n}{\lambda _{min}(D)}}\).

Proof

We have

$$\begin{aligned} F(\omega )=\sup _{\alpha ^T D \alpha \le 1} \alpha ^{T}\omega = \sup _{(D^{1/2}\alpha )^T (D^{1/2}\alpha ) \le 1} \alpha ^{T}\omega . \end{aligned}$$

Using a new dummy variable \(\rho =D^{1/2}\alpha \) we have:

$$\begin{aligned} F(\omega )=\sup _{\rho ^T \rho \le 1} (D^{-1/2}\rho )^{T}\omega =\sup _{\rho ^T \rho \le 1} \rho ^{T}(D^{-1/2})^{T}\omega =\Vert D^{-1/2}\omega \Vert _{2} . \end{aligned}$$

Thus,

$$\begin{aligned} |F(\omega _1)-F(\omega _2)|&= \left| \Vert D^{-1/2}\omega _{1}\Vert _{2} - \Vert D^{-1/2}\omega _{2}\Vert _{2}\right| \le \Vert D^{-1/2}(\omega _{1} - \omega _{2})\Vert _{2}\\&\overset{(a)}{\le } {\left\| \frac{1}{\sqrt{\lambda _{min}(D)}}I(\omega _{1} - \omega _{2})\right\| _2} = \frac{1}{\sqrt{\lambda _{min}(D)}} \Vert \omega _1-\omega _2\Vert _{2}. \end{aligned}$$

At (a), we used the fact that \(D^{-1} \preceq \frac{1}{\lambda _{min}(D)}I\).

Now, we will upper bound \(\Vert \omega _1-\omega _2\Vert _{2}\) in terms of \(\sigma _1\) and \(\sigma _2\) as follows. Using the definition \(\omega = P{X}_{L}\sigma \) we get

$$\begin{aligned} \Vert \omega _1-\omega _2\Vert _{2} = \Vert P{X}_{L}(\sigma _1-\sigma _2)\Vert _{2} \overset{(b)}{=} \Vert {X}_{L}(\sigma _1-\sigma _2)\Vert _{2} \overset{(c)}{\le } \sqrt{\lambda _{max}({X}_{L}^T {X}_{L})}\,\Vert \sigma _1-\sigma _2\Vert _{2} \overset{(d)}{\le } \sqrt{\mathrm{trace }({X}_{L}^T {X}_{L})}\,\Vert \sigma _1-\sigma _2\Vert _{2}. \end{aligned}$$

Here, (b) follows because \(P\) is an orthonormal matrix, (c) because \( {X}_{L}^T {X}_{L}\preceq \lambda _{max}({X}_{L}^T {X}_{L})I \) and (d) because \( \lambda _{max}({X}_{L}^T {X}_{L}) \le \mathrm{trace }({X}_{L}^T {X}_{L}) = \sum _{i=1}^{n}({X}_{L}^T {X}_{L})_{ii}\). Since each diagonal element of \({X}_{L}^T {X}_{L}\) is a sum of \(p\) terms, each upper bounded by \(X_b^2\), we have \( \lambda _{max}({X}_{L}^T {X}_{L}) \le n\cdot p \cdot X_b^2\). Combining this with the earlier display yields the claimed Lipschitz constant \(\mathcal {L} \le X_b\sqrt{\frac{p\cdot n}{\lambda _{min}(D)}}\). \(\square \)

Upper bounding the variance of \(F(\omega (\sigma ))\) using Gaussian concentration: The following lemma describes concentration for Lipschitz functions of Gaussian random variables.

Lemma 3

(Concentration, Tsirelson et al. 1976) If \(\sigma \) is a vector with i.i.d. standard normal entries and \(G\) is any function with Lipschitz constant \(\mathcal {L}\) (with respect to the Euclidean norm), then

$$\begin{aligned} \mathbb {P}[\left| G(\sigma )-\mathbb {E}[G(\sigma )]\right| \ge t] \le 2 e^{-\frac{t^2}{2\mathcal {L}^{2}}}. \end{aligned}$$

The proof of Lemma 3 is omitted here. Using Lemmas 2 and 3 with \(G(\sigma ) = F(\omega )\), we have

$$\begin{aligned} \mathbb {P}[\left| F(\omega )-\mathbb {E}_{\sigma }[F(\omega )]\right| \ge t] \le 2 e^{-\frac{t^2}{2\mathcal {L}^2}}, \end{aligned}$$
(7)

where \(\mathcal {L} = X_b\sqrt{\frac{p\cdot n}{\lambda _{min}(D)}}\).

Let \(Y=\left| F(\omega )-\mathbb {E}_{\sigma }[F(\omega )]\right| \). Then, from the above tail bound, \(P(Y^{2} \ge s) \le 2 e^{-\frac{s}{2\mathcal {L}^2}}\). Now we can bound the variance of \(F(\omega )\) using this inequality and the following lemma.

Lemma 4

For a random variable \(Y^2\), \(\mathbb {E}[Y^2]=\int ^{+\infty }_{0}P(Y^2\ge s)ds\).

Proof

This is an alternate expression for the expectation of a non-negative univariate random variable in terms of its distribution function. To show this, let us assume that the density function of \(Y^2\) is \(\mu _{Y^2}\). We then have \(P(Y^2 \ge s)=1-P(Y^2\le s)=1-\int ^{s}_{0}\mu _{Y^2}(s')ds'\) and thus: \(\mu _{Y^2}(s)=-\frac{dP(Y^2 \ge s)}{ds}.\) So,

$$\begin{aligned} \mathbb {E}[Y^2]&= \int ^{+\infty }_{0}s \mu _{Y^2}(s)ds=-\int ^{+\infty }_{0}s \frac{dP(Y^2\ge s)}{ds}ds\\&= -[sP(Y^2 \ge s)]^{+\infty }_{0}+\int ^{+\infty }_{0}P(Y^2\ge s)ds. \end{aligned}$$

The first term is zero and we obtain our expression. \(\square \)

The variance of \(F(\omega )\), which is the same as the expectation of \(Y^{2}\), can thus be upper bounded as follows:

$$\begin{aligned} \mathrm{Var }(F(\omega )) = \mathbb {E}[Y^{2}] \overset{(a)}{=} \int ^{+\infty }_{0}P(Y^2\ge s)ds \overset{(b)}{\le } \int ^{+\infty }_{0}2 e^{-\frac{s}{2\mathcal {L}^2}}ds = 4\mathcal {L}^2 = 4X_b^2\frac{p\cdot n}{\lambda _{min}(D)}, \end{aligned}$$
(8)

where we used Lemma 4 for step (a) and Eq. (7) for step (b), and finally substituted \(X_b\sqrt{\frac{p\cdot n}{\lambda _{min}(D)}}\) for \(\mathcal {L}\).
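A quick Monte Carlo check of the variance bound in (8) (our illustration; \(P\), \(D\) and the data matrix are arbitrary, with the entries of \({X}_{L}\) bounded by \(X_b\) as in Lemma 2):

    import numpy as np

    rng = np.random.default_rng(4)
    p, n, X_b = 3, 40, 1.0
    X_L = rng.uniform(-X_b, X_b, size=(p, n))            # |x_ij| <= X_b
    P = np.linalg.qr(rng.normal(size=(p, p)))[0]         # an orthogonal matrix
    D = np.diag([0.5, 2.0, 5.0])                         # eigenvalues of A_int_gamma
    D_inv_half = np.diag(1.0 / np.sqrt(np.diag(D)))

    def F(sigma):
        return np.linalg.norm(D_inv_half @ (P @ (X_L @ sigma)))   # F(omega) = ||D^{-1/2} omega||

    vals = np.array([F(s) for s in rng.normal(size=(50000, n))])  # standard normal sigma
    print("Var(F):", vals.var(),
          "  bound 4*X_b^2*p*n/lambda_min(D):", 4 * X_b**2 * p * n / np.diag(D).min())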

5.3.2 Lower bounding the empirical Gaussian complexity

Now we will lower bound the empirical Gaussian complexity by constructing a feasible candidate \(\alpha '\) to substitute for the \(\sup \) operation in Eq. (6). Later, we will use the variance upper bound on \(F(\omega )\) we found in the earlier section to make the bound more specific.

Let \(j^{*} \in \{1,...,p\}\) be the index at which the diagonal element \(D(j^{*},j^{*}) = \lambda _{min}(D)\). For each realization of \(\sigma \) (or equivalently \(\omega \)) let \(\alpha ' = \left[ 0\ldots \frac{|\omega _{j^{*}}|}{\omega _{j^{*}}\sqrt{\lambda _{min}(D)}}\ldots 0\right] \) with the non-zero entry at coordinate \(j^{*}\). Clearly \(\alpha '\) is a feasible vector in the ellipsoidal constraint \(\{\alpha : \alpha ^{T}D\alpha \le 1\}\) seen in the complexity expression, Eq. (6). Substituting it and using the definition of \(F(\omega )\), we get a lower bound on the empirical Gaussian complexity:

$$\begin{aligned} n\cdot \mathcal {\bar{G}}({\mathcal {F}}_{|S})\ge \mathbb {E}_{\sigma }[F(\omega )] = \mathbb {E}_{\sigma }\left[ \sup _{\alpha ^{T}D\alpha \le 1} \alpha ^T \omega \right] \overset{(a)}{\ge } \mathbb {E}_{\sigma }\left[ \alpha '^{T}\omega \right] \overset{(b)}{=} \mathbb {E}_{\sigma }\left[ \frac{|\omega _{j^{*}}|}{\sqrt{\lambda _{min}(D)}}\right] = \frac{\mathbb {E}_{\sigma }[|\omega _{j^{*}}|]}{\sqrt{\lambda _{min}(D)}}. \end{aligned}$$

Step (a) comes from the fact that \(\alpha '\) is feasible in \(\{\alpha : \alpha ^{T}D\alpha \le 1\}\) but not necessarily the maximum, and step (b) comes from the definition of \(\alpha '\).

5.3.3 Making the lower bound more specific using variance of \(F(\omega (\sigma ))\)

Note that, compared to the upper bound on the related Rademacher complexity obtained in Theorem 4, the lower bound on the empirical Gaussian complexity obtained so far depends on \(A_{\mathrm{int }\gamma }\) only weakly (via \(\lambda _{min}(D)\) alone). We will use the variance of \(F(\omega )\) to obtain a lower bound very similar to the upper bound in Eq. (1). Rearranging the terms in the previous inequality, we get:

$$\begin{aligned} \frac{(\mathbb {E}_{\sigma }[F(\omega )])^2}{ (\mathbb {E}_{\sigma }|\omega _{j^{*}}|)^2} \ge \frac{1}{\lambda _{min}(D)}. \end{aligned}$$
(9)

By rewriting the variance in terms of the second and first moments, using expression (8) and then using (9) we get

$$\begin{aligned} \mathrm{Var }(F(\omega ))&= \mathbb {E}_{\sigma }[F^{2}(\omega )]-(\mathbb {E}_{\sigma }[F(\omega )])^2\\&\le 4X_b^2{\frac{p\cdot n}{\lambda _{min}(D)}} \le 4p n X_b^2\frac{(\mathbb {E}_{\sigma }[F(\omega )])^2}{(\mathbb {E}_{\sigma }|\omega _{j^{*}}|)^2}. \end{aligned}$$

Using expression (6) again, and then rearranging the terms in the previous expression, we obtain another lower bound on the scaled Gaussian complexity, which is:

$$\begin{aligned} \left( n\cdot \mathcal {\bar{G}}({\mathcal {F}}_{|S})\right) ^{2}&\ge (\mathbb {E}_{\sigma }[F(\omega )])^{2} \ge \frac{\mathbb {E}_{\sigma }[(F(\omega ))^2]}{1+\frac{4pnX_b^2}{(\mathbb {E}_{\sigma }|\omega _{j^{*}}|)^2}}\nonumber \\&= \frac{\mathbb {E}_{\sigma }[(\sup _{\alpha ^{T}D\alpha \le 1}\omega ^T \alpha )^2]}{1+\frac{4pnX_b^2}{(\mathbb {E}_{\sigma }|\omega _{j^{*}}|)^2}}. \end{aligned}$$
(10)

We can now bound the two easier quantities \(\mathbb {E}_{\sigma }[(\sup _{\alpha ^{T}D\alpha \le 1}\omega ^T \alpha )^2]\) and \(\mathbb {E}_{\sigma }|\omega _{j^{*}}|\) to get a lower bound on the scaled Gaussian complexity and, consequently, on the empirical Rademacher complexity.

Let us start first with \(\mathbb {E}|\omega _{j^{*}}|\). By definition \(\omega \) equals \(P{X}_{L}\sigma \). Thus, the \(j^{*}\)th coordinate of \(\omega \) will be \(\sum _{i}\sigma _{i}(Px_{i})_{j^{*}}\), where \((\cdot )_{j^*}\) represents the \(j^{*}\)th coordinate of the vector. Since the \(\sigma _{i}\) are independent standard normal, their weighted sum \(\omega _{j^{*}}\) is also a zero-mean normal random variable with variance \(\sum _{i}(Px_{i})_{j^{*}}^{2}\). Since for any normal random variable \(z\) with mean zero and variance \(d\) it is true that \(\mathbb {E}[|z|] = \sqrt{\frac{2d}{\pi }}\), we have

$$\begin{aligned} \mathbb {E}_{\sigma }[|\omega _{j^{*}}|]&= \sqrt{\frac{2}{\pi }}\left( \sum _{i}(Px_{i})_{j^{*}}^{2}\right) ^{\frac{1}{2}}\nonumber \\&\ge \sqrt{\frac{2}{\pi }}\min _{j=1,...,p}\Vert (P{X}_{L})_{j}\Vert _{2} \end{aligned}$$
(11)

where \((P{X}_{L})_{j}\) represents the \(j^{th}\) row of the matrix \(P{X}_{L}\). For the second moment term of (10) that we need to bound, \(\mathbb {E}_{\sigma }[(\sup _{\alpha ^{T}D\alpha \le 1}\omega ^T \alpha )^2]\), we can see that

$$\begin{aligned} \sup _{\alpha ^{T}D\alpha \le 1}\omega ^T \alpha&= \sup _{\tilde{\alpha }^{T}\tilde{\alpha } \le 1}(P{X}_{L}\sigma )^{T}D^{-1/2}\tilde{\alpha } \\&= \Vert D^{-1/2}P{X}_{L}\sigma \Vert _{2}. \end{aligned}$$

Thus,

$$\begin{aligned} \mathbb {E}_{\sigma }\Big [\Big (\sup _{\alpha ^{T}D\alpha \le 1}\omega ^T \alpha \Big )^2 \Big ]&= \mathbb {E}_{\sigma }[\Vert D^{-1/2}P{X}_{L}\sigma \Vert _{2}^{2}] \nonumber \\&= \mathbb {E}_{\sigma }[(D^{-1/2}P{X}_{L}\sigma )^{T}D^{-1/2}P{X}_{L}\sigma ]\nonumber \\&= \mathbb {E}_{\sigma }[\sigma ^{T}{X}_{L}^{T} A_{\mathrm{int }\gamma }^{-1}{X}_{L}\sigma ] \nonumber \\&= \mathbb {E}_{\sigma }[ \mathrm{trace }(\sigma ^{T}{X}_{L}^{T} A_{\mathrm{int }\gamma }^{-1}{X}_{L}\sigma )] \nonumber \\&= \mathbb {E}_{\sigma }[ \mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L}\sigma \sigma ^{T} )]\nonumber \\&= \mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L}). \end{aligned}$$
(12)

Substituting the two bounds we just derived, (11) and (12), into (10) gives us a lower bound on the scaled Gaussian complexity:

$$\begin{aligned} \left( n\cdot \mathcal {\bar{G}}({\mathcal {F}}_{|S})\right) ^{2}&\ge \frac{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})}{ 1 + \frac{4pnX_b^2}{(\sqrt{\frac{2}{\pi }}\min _{j=1,...,p}\Vert (P{X}_{L})_{j}\Vert _{2})^{2}}}\\ n\cdot \mathcal {\bar{G}}({\mathcal {F}}_{|S})&\ge \sqrt{\frac{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})}{ 1 + \frac{4pnX_b^2}{(\sqrt{\frac{2}{\pi }}\min _{j=1,...,p}\Vert (P{X}_{L})_{j}\Vert _{2})^{2}}}}. \end{aligned}$$

5.3.4 Using the relation between Rademacher and Gaussian complexities

The empirical Gaussian complexity is related to the empirical Rademacher complexity as follows.

Lemma 5

(Lemma 4 of Bartlett and Mendelson 2002) There are absolute constants \(C\) and \(C'\) such that for every \({\mathcal {F}}_{|S}\) with \(|S| = n\),

$$\begin{aligned} C'\mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \mathcal {\bar{G}}({\mathcal {F}}_{|S})\le C\log (n) \mathcal {\bar{R}}({\mathcal {F}}_{|S}). \end{aligned}$$

Using the above result gives:

$$\begin{aligned} {n}C\log (n) \mathcal {\bar{R}}({\mathcal {F}}_{|S})\ge \sqrt{\frac{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})}{ 1 + \frac{4pnX_b^2}{(\sqrt{\frac{2}{\pi }}\min _{j=1,...,p}\Vert (P{X}_{L})_{j}\Vert _{2})^{2}}}} \end{aligned}$$

Thus, we get our desired result:

$$\begin{aligned}&\mathcal {\bar{R}}({\mathcal {F}}_{|S})\ge \frac{\kappa }{n\log n}\sqrt{\mathrm{trace }({X}_{L}^{T}A_{\mathrm{int }\gamma }^{-1} {X}_{L})},\\&\mathrm{where }\\&\kappa = \frac{{1}}{C\sqrt{1 + \frac{2\pi pnX_b^2}{(\min _{j=1,...,p}\Vert (P{X}_{L})_{j}\Vert _{2})^{2}}}}. \end{aligned}$$

\(\square \)

5.4 Proof of Corollary 1

Proof

Since the ellipsoid defined using \(A_{\mathrm{int }\gamma }\) circumscribes the region of intersection of ellipsoids determined by \(A_1\) and \(A_2\), we have

$$\begin{aligned} {\mathcal {F}}=\Big \{f | f(x)&= \beta ^{T}x, \beta \in \mathbb {R}^{p}, \beta ^{T}A_{1}\beta \le 1, \beta ^{T}A_{2} \beta \le 1,\\&\sum _{j=1}^{p}c_{j\nu }\beta _{j} +\delta _{\nu } \le 1, \delta _{\nu } > 0, \nu =1,...,V\Big \}\\ \subseteq \\ \Big \{f | f(x)&= \beta ^{T}x, \beta \in \mathbb {R}^{p}, \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1,\\&\sum _{j=1}^{p}c_{j\nu }\beta _{j} +\delta _{\nu } \le 1, \delta _{\nu } > 0, \nu =1,...,V\Big \} =: {\mathcal {F}}'. \end{aligned}$$

Further, \(\beta ^{T}\lambda _{\min }(A_{\mathrm{int }\gamma })I\beta \le \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1\) since \(\lambda _{\min }(A_{\mathrm{int }\gamma })I \preceq A_{\mathrm{int }\gamma }\). That is, the set \(\{\beta : \beta ^{T}\lambda _{\min }(A_{\mathrm{int }\gamma })I\beta \le 1\}\) contains the ellipsoid defined using \( A_{\mathrm{int }\gamma }\). Thus,

$$\begin{aligned} {\mathcal {F}}'=\Big \{f | f(x)&= \beta ^{T}x, \beta \in \mathbb {R}^{p}, \beta ^{T}A_{\mathrm{int }\gamma }\beta \le 1,\\&\sum _{j=1}^{p}c_{j\nu }\beta _{j} +\delta _{\nu } \le 1, \delta _{\nu } > 0, \nu =1,...,V\Big \}\\ \subseteq \\ \Big \{f | f(x)&= \beta ^{T}x, \beta \in \mathbb {R}^{p}, \beta ^{T}\beta \le \frac{1}{\lambda _{\min }(A_{\mathrm{int }\gamma })},\\&\sum _{j=1}^{p}c_{j\nu }\beta _{j} +\delta _{\nu } \le 1, \delta _{\nu } > 0, \nu =1,...,V\Big \} =: {\mathcal {F}}''. \end{aligned}$$

Noting that \(\beta ^{T}\beta \le \frac{1}{\lambda _{\min }(A_{\mathrm{int }\gamma })}\) is the same as \(\Vert \beta \Vert _2 \le \sqrt{\lambda _{\max }(A_{\mathrm{int }\gamma }^{-1})}\), we can use Theorem 2 on \({\mathcal {F}}''\) with \(r=2,q=2\) and \({\mathcal {B}}_b := \sqrt{\lambda _{\max }(A_{\mathrm{int }\gamma }^{-1})}\) to get a bound on \(N(\sqrt{n}\epsilon ,{\mathcal {F}}''_{|S},\Vert \cdot \Vert _{2}) \ge N(\sqrt{n}\epsilon ,{\mathcal {F}}_{|S},\Vert \cdot \Vert _{2})\) giving us the stated result. \(\square \)

5.5 Proof of Theorem 6

Proof

Let \(g = \sum _{i=1}^{n}\sigma _ix_i = X_L\sigma \) so that \(\mathcal {\bar{R}}({\mathcal {F}}_{|S})= \frac{1}{n}\mathbb {E}[\sup _{\beta \in {\mathcal {F}}} |g^T\beta |]\). Instead of directly working with the empirical Rademacher complexity, we will dualize the two maximization problems in the upper bound given by Eq. (4) of Lemma 1. Both maximization problems are very similar except for the objective. Let \(\omega (g,{\mathcal {F}})\) be the optimal value of the following optimization problem:

$$\begin{aligned} \max _{\beta } g^T\beta \;\;\; \mathrm{s.t. }\\ \beta ^T\beta \le B_b^2\\ \beta ^TA_2\beta \le 1. \end{aligned}$$

Thus \(\omega (g,{\mathcal {F}})\) is proportional to the first term inside the max operation in Eq. (4), which gives an upper bound on the empirical Rademacher complexity. We will now write a dual program to the above and use weak duality to upper bound \(\omega (g,{\mathcal {F}})\). The Lagrangian is:

$$\begin{aligned} \mathcal {L}(\beta ,\gamma ,\eta ) = g^T\beta + \gamma (B_b^2 - \beta ^T\beta ) + \eta (1 - \beta ^TA_2\beta ), \end{aligned}$$

where \(\beta \in \mathbb {R}^p, \gamma \in \mathbb {R}_{+}, \eta \in \mathbb {R}_{+}\). Maximizing the Lagrangian with respect to \(\beta \) gives us:

$$\begin{aligned}&\max _{\beta }\;\mathcal {L}(\beta ,\gamma ,\eta ) \\&\quad = \max _{\beta }\left[ g^T\beta -\gamma \beta ^T\beta -\eta \beta ^TA_2\beta + \gamma B_b^2 + \eta \right] \\&\quad = \max _{\beta }\left[ -\left( -g^T\beta +\beta ^T(\gamma \mathbb {I} +\eta A_2)\beta \right) + \gamma B_b^2 + \eta \right] \\&\quad = \max _{\beta }\left[ -\left( -g^T(\gamma \mathbb {I} +\eta A_2)^{-1/2}(\gamma \mathbb {I} +\eta A_2)^{1/2}\beta \right. \right. \\&\qquad \left. \left. +\beta ^T(\gamma \mathbb {I} +\eta A_2)^{1/2}(\gamma \mathbb {I} +\eta A_2)^{1/2}\beta \right) + \gamma B_b^2 + \eta \right] \\&\quad = \max _{\beta }\left[ -\left\| (\gamma \mathbb {I} +\eta A_2)^{1/2}\beta - \frac{(\gamma \mathbb {I} +\eta A_2)^{-1/2}g}{2}\right\| _2^2 \right. \\&\qquad \left. + \frac{\Vert (\gamma \mathbb {I} +\eta A_2)^{-1/2}g\Vert _2^2}{4} + \gamma B_b^2 + \eta \right] \\&\quad = \frac{\Vert (\gamma \mathbb {I} +\eta A_2)^{-1/2}g\Vert _2^2}{4} + \gamma B_b^2 + \eta , \end{aligned}$$

where in the last step we set \(\beta = \frac{(\gamma \mathbb {I} +\eta A_2)^{-1}g}{2}\). The dual problem is thus:

$$\begin{aligned} \min _{\gamma \ge 0, \eta \ge 0} \frac{\Vert (\gamma \mathbb {I} +\eta A_2)^{-1/2}g\Vert _2^2}{4} + \gamma B_b^2 + \eta&\hbox {, or equivalently,}\\ \min _{\gamma \ge 0, \eta \ge 0} \frac{1}{4}g^T(\gamma \mathbb {I} +\eta A_2)^{-1}g + \gamma B_b^2 + \eta .&\\ \end{aligned}$$

If we let \(\gamma = 1-\eta \), we are further constraining the minimization problem, yielding another upper bound of the form:

$$\begin{aligned} \omega (g,{\mathcal {F}}) \le \min _{\eta \in [0,1]} \frac{1}{4}g^T(\mathbb {I} +\eta (A_2 -\mathbb {I}))^{-1}g + B_b^2 + \eta (1-B_b^2). \end{aligned}$$

If we consider the second maximization problem \(\sup _{\beta \in {\mathcal {F}}} -g^T\beta \) that appears in Eq. (4), we can similarly upper bound its optimal value with the same minimization problem as \(\omega (g,{\mathcal {F}})\). One intuitive reason why the same minimization problem serves as an upper bound is because the hypothesis class \({\mathcal {F}}\) is closed under negation. Thus, we get an upper bound on the empirical Rademacher complexity as:

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})&\le \mathbb {E}\left[ \frac{1}{n}\max \left( \omega (g,{\mathcal {F}}), \sup _{\beta \in {\mathcal {F}}} -g^T\beta \right) \right] \\&\le \mathbb {E}\left[ \frac{1}{n}\min _{\eta \in [0,1]} \frac{1}{4}g^T(\mathbb {I} +\eta (A_2 -\mathbb {I}))^{-1}g + B_b^2 + \eta (1-B_b^2)\right] , \end{aligned}$$

where recall that \(g = \sum _{i=1}^{n}\sigma _ix_i\). Fix any feasible \(\eta \). Let \(A_{\mathrm{int }\eta } := (\mathbb {I} +\eta (A_2 -\mathbb {I}))\) (it corresponds to an ellipsoid as well since \(\eta \in [0,1]\)). Then,

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})&\le \mathbb {E}\left[ \frac{1}{4n} \sigma ^T X_L^TA_{\mathrm{int }\eta }^{-1}X_L\sigma + \frac{1}{n}(B_b^2 + \eta (1-B_b^2))\right] \\&= \frac{1}{4n} \mathrm{trace }(X_L^TA_{\mathrm{int }\eta }^{-1}X_L) + \frac{1}{n}(B_b^2 + \eta (1-B_b^2)). \end{aligned}$$

We can minimize the right hand side over \(\eta \in [0,1]\) to get the desired result. \(\square \)
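The resulting bound involves only a one-dimensional minimization over \(\eta \in [0,1]\) and is easy to evaluate. The sketch below (illustrative data only; the matrix \(A_2\) is chosen here so that the quadratic side constraint is fairly tight) evaluates the bound on a grid of \(\eta \) values and compares it with the unconstrained value \(X_bB_b/\sqrt{n}\).

    import numpy as np

    rng = np.random.default_rng(5)
    p, n, B_b = 4, 100, 2.0
    X_L = rng.normal(size=(p, n))
    X_L /= np.linalg.norm(X_L, axis=0)                   # unit-norm columns, so X_b = 1
    M = rng.normal(size=(p, p))
    A2 = 10.0 * M @ M.T / p + 5.0 * np.eye(p)            # a tight quadratic side constraint

    def bound(eta):
        A_int = np.eye(p) + eta * (A2 - np.eye(p))
        return (np.trace(X_L.T @ np.linalg.solve(A_int, X_L)) / (4.0 * n)
                + (B_b**2 + eta * (1.0 - B_b**2)) / n)

    best = min(bound(eta) for eta in np.linspace(0.0, 1.0, 101))
    print("Theorem 6 bound:", best, "  unconstrained X_b*B_b/sqrt(n):", B_b / np.sqrt(n))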

5.6 Proof of Theorem 7

Proof

The core idea of the proof is to come up with an intuitive upper bound on the empirical Rademacher complexity of \({\mathcal {F}}\) using convex duality. We have already seen the use of convex duality in Proposition 1 and Theorem 6. Recall the definition of the empirical Rademacher complexity of a function class \({\mathcal {F}}\):

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})= \frac{1}{n}\mathbb {E}_{\sigma }\left[ \sup _{\beta \in {\mathcal {F}}} \left| \sum _{i=1}^{n}\sigma _i(\beta ^Tx_i)\right| \right] , \end{aligned}$$

where \(\{\sigma _{i}\}_{i=1}^{n}\) are i.i.d. Bernoulli random variables taking values in \(\{\pm 1\}\) with equal probability. Now define a new vector \(g\) to be the random vector \(\sum _{i=1}^{n}\sigma _i x_i\). As in the previous proofs, instead of directly working with the empirical Rademacher complexity, we will dualize the two maximization problems in the upper bound given by Eq. (4) of Lemma 1. Let \(\omega (g,{\mathcal {F}}) = \sup _{\beta \in {\mathcal {F}}}g^T\beta \). That is, \(\omega (g,{\mathcal {F}})\) is the optimal value of the first maximization problem (ignoring factor \(1/n\)) appearing on the right hand side of Eq. (4):

$$\begin{aligned} \max _{\beta }\;\;&g^T\beta \;\;\; \,\,\mathrm{ s.t. }\nonumber \\&\beta ^T\beta \le B_b^2\nonumber \\&\Vert A_k\beta \Vert _2 \le a_k^T\beta + d_k\;\; \forall k=1,...,K. \end{aligned}$$
(13)

The Lagrangian of the problem can be written as (Boyd and Vandenberghe 2004):

$$\begin{aligned} \mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K}) = g^T\beta + \gamma (B_b^2 - \beta ^T\beta ) + \sum _{k=1}^{K}\Big [z_k^TA_k\beta + \theta _k\cdot ( a_k^T\beta + d_k)\Big ], \end{aligned}$$

where \(\beta \in \mathbb {R}^p, \gamma \in \mathbb {R}_{+}\) and for \(k=1,...,K\) we have \(\Vert z_k\Vert _2 \le \theta _k\). For any primal-feasible \(\beta \) and any dual-feasible \((\gamma ,\{z_k,\theta _k\}_{k=1}^{K})\), the objective of the SOCP in Eq. (13) is upper bounded by \(\mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K})\). Thus, \(\omega (g,{\mathcal {F}}) \le \sup _{\beta }\mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K})\). We will analyze this maximization problem as the first step towards a tractable bound on \(\omega (g,{\mathcal {F}})\).

In the second step, we will minimize \( \sup _{\beta }\mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K})\) over variable \(\gamma \) (one of the dual variables) to get an upper bound on \(\omega (g,{\mathcal {F}})\) in terms of \(\{z_k,\theta _k\}_{k=1}^{K}\). These two steps are shown below:

First step: After rearranging terms and completing squares, we get the following dual objective to be minimized over dual variables \(\gamma \) and \(\{z_k,\theta _k\}_{k=1}^{K}\).

$$\begin{aligned}&\sup _{\beta \in \mathbb {R}^p}\mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K}) \\&\quad = \sup _{\beta \in \mathbb {R}^p} \left[ \left( g + \sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)\right) ^T\beta + \gamma B_b^2 + \sum _{k=1}^{K}\theta _kd_k - \gamma \beta ^T\beta \right] \\&\quad = \sup _{\beta \in \mathbb {R}^p}\left[ -\gamma \left\| \beta - \frac{g +\sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)}{2\gamma }\right\| _2^2 \right. \\&\qquad \left. + \frac{\Vert g + \sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)\Vert _2^2}{4\gamma } + \left( \gamma B_b^2 + \sum _{k=1}^{K}\theta _kd_k\right) \right] \\&\quad = \frac{\Vert g + \sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)\Vert _2^2}{4\gamma } + \gamma B_b^2 + \sum _{k=1}^{K}\theta _kd_k. \end{aligned}$$

The second to last equality above is obtained by completing the squares (in terms of \(\beta \)) and the last equality is due to the fact that the optimal value is obtained when \(\beta = \frac{g + \sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)}{2\gamma }\). The resulting term is now a function of the remaining variables (\(\gamma \) and \(\{z_k,\theta _k\}_{k=1}^{K}\)) and serves as an upper bound to \(\omega (g,{\mathcal {F}})\) for any feasible values of \(\gamma \) and \(\{z_k,\theta _k\}_{k=1}^{K}\).

Second step: Since \(\min _{x,y}f(x,y) = \min _x(\min _y f(x,y))\), we now minimize with respect to \(\gamma \) first to get the following upper bound:

$$\begin{aligned}&\inf _{\gamma \in \mathbb {R}_+}\sup _{\beta \in \mathbb {R}^p}\mathcal {L}(\beta ,\gamma ,\{z_k,\theta _k\}_{k=1}^{K})\\&\quad = B_b\left\| g + \sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)\right\| _2 + \sum _{k=1}^{K}\theta _kd_k, \end{aligned}$$

where the above statement follows because for a problem of the form \(\min _{\gamma \in \mathbb {R}_+} \frac{a}{\gamma } + b\gamma +c\) with \(a>0, b>0\), the optimal solution is \(\gamma ^* = +\sqrt{\frac{a}{b}}\).

Continuing, we now optimize over the remaining variables \(\{z_k,\theta _k\}_{k=1}^{K}\) as follows:

$$\begin{aligned} \omega (g,{\mathcal {F}})&= \sup _{\beta \in {\mathcal {F}}}g^T\beta \nonumber \\&\le \inf _{\{(z_k,\theta _k): \Vert z_k\Vert _2 \le \theta _k, k=1,..,K\}} B_b\left\| g + \sum _{k=1}^{K}(A_k^Tz_k + \theta _ka_k)\right\| _2 + \sum _{k=1}^{K}\theta _kd_k. \end{aligned}$$
(14)

An upper bound on \(\omega (g,{\mathcal {F}})\) can be obtained by finding a set of optimal or feasible values for \(\{z_k,\theta _k\}_{k=1}^{K}\). Note that since \(A_k \succ 0\), \(A_k^T = A_k\) and \(A_k^{-1}\) exists. Obtaining the optimal value of the minimization in Eq. (14) is difficult analytically. Instead, we will pick a suitable feasible value for \(\{z_k,\theta _k\}_{k=1}^{K}\). Plugging this feasible value will give us an upper bound on \(\omega (g,{\mathcal {F}})\). In particular, let \(z_k = - \frac{1}{K}A_k^{-1}g\). Then, setting \(\theta _k = \frac{1}{K}\Vert A_k^{-1}g\Vert _2\) gives us a feasible value for each \(\{z_k,\theta _k\}\). Thus,

$$\begin{aligned} \omega (g,{\mathcal {F}})&\le B_b\left\| g + \sum _{k=1}^{K}A_k^T\left( -\frac{1}{K}A_k^{-1}g\right) + \sum _{k=1}^{K}\frac{1}{K}\Vert A_k^{-1}g\Vert _2a_k\right\| _2 + \sum _{k=1}^{K} \frac{1}{K}\Vert A_k^{-1}g\Vert _2d_k\\&= B_b\left\| g -g + \sum _{k=1}^{K}\frac{\Vert A_k^{-1}g\Vert _2}{K}a_k\right\| _2 + \sum _{k=1}^{K}\frac{\Vert A_k^{-1}g\Vert _2}{K}d_k\\&= B_b\left\| \sum _{k=1}^{K}\frac{\Vert A_k^{-1}g\Vert _2}{K}a_k\right\| _2 + \sum _{k=1}^{K}\frac{\Vert A_k^{-1}g\Vert _2}{K}d_k\\&\le \sum _{k=1}^{K}\frac{\Vert A_k^{-1}g\Vert _2}{K}(B_b\Vert a_k\Vert _2 + d_k)\\&\le \Vert g\Vert _2\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}. \end{aligned}$$

Dualizing the second maximization problem in Eq. (4) also gives us the same upper bound as obtained above for \(\omega (g,{\mathcal {F}})\). That is, if \(\omega '(g,{\mathcal {F}}) := \sup _{\beta \in {\mathcal {F}}} -g^T\beta \), then the same analysis as above (replacing \(g\) with \(-g\)) gives:

$$\begin{aligned} \omega '(g,{\mathcal {F}}) \le \Vert g\Vert _2\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}. \end{aligned}$$

We can now come up with the desired upper bound for the empirical Rademacher complexity using Eq. (4):

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})&\le \mathbb {E}\left[ \max \left( \frac{1}{n}\omega (g,{\mathcal {F}}),\frac{1}{n}\omega '(g,{\mathcal {F}})\right) \right] \\&\le \frac{1}{n}\mathbb {E}\left[ \Vert g\Vert _2\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)} \right] \;\;\; \mathrm{(since upper bounds are the same) }\\&= \frac{1}{n} \mathbb {E}_{\sigma }\left[ \Big \Vert \sum _{i=1}^{n}\sigma _i x_i\Big \Vert _2\right] \sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}\\&\le \frac{1}{n} \sqrt{\mathbb {E}_{\sigma }\Big [ \Big \Vert \sum _{i=1}^{n}\sigma _i x_i\Big \Vert _2^2\Big ]} \sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)} \;\;\;\mathrm{ (by Jensen's inequality) }\\&\le \frac{X_b}{\sqrt{n}}\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}. \end{aligned}$$

In the case when there are no active conic constraints, we cannot use this bound. Instead, we recover the well-known standard bound \(\frac{X_bB_b}{\sqrt{n}}\) by removing the terms related to the conic constraints in Eq. (14). Combining both bounds, we get

$$\begin{aligned} \mathcal {\bar{R}}({\mathcal {F}}_{|S})\le \frac{X_b}{\sqrt{n}}\cdot \min \left\{ B_b,\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}\right\} . \end{aligned}$$

\(\square \)
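As a final numerical sanity check (ours; the vector \(g\) and the cone parameters are randomly generated for illustration), the sketch below verifies the last two inequalities in the chain used above, i.e. that the value obtained from the feasible dual choice \(z_k = -\frac{1}{K}A_k^{-1}g\), \(\theta _k = \frac{1}{K}\Vert A_k^{-1}g\Vert _2\) is indeed dominated by \(\Vert g\Vert _2\sum _{k=1}^{K}\frac{B_b\Vert a_k\Vert _2 + d_k}{K\cdot \lambda _{\min }(A_k)}\).

    import numpy as np

    rng = np.random.default_rng(6)
    p, K, B_b = 3, 2, 2.0
    g = rng.normal(size=p)
    cones = []
    for _ in range(K):
        M = rng.normal(size=(p, p))
        cones.append((M @ M.T + np.eye(p),                # A_k (positive definite)
                      rng.normal(size=p),                 # a_k
                      abs(rng.normal()) + 1.0))           # d_k > 0

    lhs = (B_b * np.linalg.norm(sum(np.linalg.norm(np.linalg.solve(A, g)) / K * a
                                    for A, a, d in cones))
           + sum(np.linalg.norm(np.linalg.solve(A, g)) / K * d for A, a, d in cones))
    rhs = np.linalg.norm(g) * sum((B_b * np.linalg.norm(a) + d)
                                  / (K * np.linalg.eigvalsh(A).min())
                                  for A, a, d in cones)
    assert lhs <= rhs + 1e-9
    print(lhs, "<=", rhs)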

6 Conclusion

In this paper, we have outlined how various types of side information about a learning problem can effectively help generalization. We focused our attention on several types of side knowledge, leading to linear, polygonal, quadratic and conic constraints, giving motivating examples and deriving complexity measure bounds. This work goes beyond the traditional paradigm of ball-like hypothesis spaces to study more exotic, yet realistic, hypothesis spaces, and is a starting point for more work on other interesting hypothesis spaces.