1 Introduction

The structural risk minimization (SRM) principle, formulated by Vapnik in the Statistical Learning Theory (SLT) framework (Vapnik 1998, 2000), requires a learning procedure to search for the hypothesis space that guarantees the best trade-off between its complexity and its fitting capability on the training samples (Vapnik 1998; Anguita et al. 2012). Consequently, according to the SRM principle, the control of the hypothesis space size assumes a central role in learning (Guyon et al. 2010; Anguita et al. 2011a). Vapnik’s original approach to the derivation of an actual learning algorithm, like the support vector machine (SVM) (Cortes and Vapnik 1995; Vapnik 1998; Pontil and Verri 1998), consisted in implementing the SRM principle through an Ivanov regularization scheme (Vapnik 1998; Ivanov 1976). This is a logical approach, because the Ivanov regularization framework directly handles the two main forces guiding SRM-based learning: on one hand, the minimization of the empirical risk and, on the other hand, the control of the hypothesis space from which the hypothesis minimizing the risk is chosen. For the sake of brevity, when addressing the SVM learning algorithm, we will refer to this formulation as the Ivanov-based SVM (I-SVM).

However, in his seminal works, Vapnik resorted to an alternative formulation, based on Tikhonov regularization, which quickly became very successful, due to its excellent performance on real-world problems, and which is commonly referred to as the SVM algorithm (to avoid any confusion we will refer to this formulation as T-SVM) (Vapnik 1998; Tikhonov et al. 1977). The main argument in favor of this option is that the T-SVM learning problem is easier to solve than the I-SVM one: in fact, the number of effective solvers for T-SVM that appeared in the literature in the following decades supports this claim (Platt 1998, 1999; Keerthi et al. 2001; Shawe-Taylor and Sun 2011).

In this paper we will show that the Ivanov regularization approach is directly linked to one of the most powerful measures of the generalization ability of a learning algorithm: the Rademacher Complexity (Bartlett and Mendelson 2003; Koltchinskii 2006; Bartlett et al. 2005). In particular, we will show that it is possible to bound these quantities even in the case of Tikhonov and Morozov regularization but, in order to directly control the generalization ability of the learning algorithm, we have to resort to the Ivanov formulation. Since the Tikhonov regularization scheme for SRM learning does not allow one to directly control the size of the hypothesis space, this produces a soft mismatch in the theory (Shawe-Taylor et al. 1998; Bartlett 1998). This is usually considered an acceptable price to pay in order to foster the applicability of the SVM learning algorithm to practical problems. However, being able to carefully fine-tune the complexity of the hypothesis space can lead, in the data-dependent SRM framework (Bartlett and Mendelson 2003; Koltchinskii 2006), to remarkable improvements in the quality of the identified SVM solution. This is especially true when dealing with difficult classification problems where, for example, only a few high-dimensional samples are available to train a reliable and effective classifier (Anguita et al. 2011b, 2012). In this case, the requirement of the SRM principle of precisely considering a series of hypothesis spaces of increasing size, in order to identify the optimal class of functions, becomes of paramount importance (Vapnik 1998; Duan et al. 2003). Therefore, a desirable objective would be to achieve the best of both worlds: addressing the I-SVM learning problem by exploiting the efficiency of T-SVM solvers.

Furthermore, as shown by Pelckmans et al. (2004) for the particular case of the Least Squares SVM (LS-SVM) (Suykens and Vandewalle 1999), a third approach can be taken into account, based on Morozov regularization. The Morozov regularization scheme, despite being seldom used in practice, has been shown to be effective when reliable estimates of the noise affecting the data are available, so it is worth considering as a further, alternative learning formulation (Morozov et al. 1984).

In pursuing the above-mentioned objectives, we propose in this work some more general results, which are valid for any convex loss function, including the SVM hinge loss as a particular case, and prove the equivalence between the Tikhonov, Ivanov and Morozov regularization schemes in a general setting. Then, we apply our findings to the particular case of SVM classifiers and propose several ways to solve the I-SVM and M-SVM learning problems through the use of efficient T-SVM solvers.

The paper is organized as follows: after introducing the supervised learning framework in Sect. 2, we review the formulations of the Tikhonov, Ivanov and Morozov regularization approaches in Sect. 3. Then, in Sect. 4, we prove the equivalence of the regularization paths of the three approaches and derive some general properties relating the corresponding optimal solutions. Section 5 discusses the implications of these results on learning. In Sect. 6 we specialize our findings to the particular case of SVM training and, in Sect. 7, we show experimentally the advantages and disadvantages of using the well-known T-SVM Sequential Minimal Optimization (SMO) solver (Platt 1998; Keerthi et al. 2001; Keerthi and Gilbert 2002; Fan et al. 2005) for addressing both I-SVM and M-SVM problems. Finally, Sect. 8 summarizes some concluding remarks.

2 The supervised learning framework

We recall the standard framework of supervised learning, where the goal is to approximate a relationship between inputs from a set \({\mathcal {X}} \subseteq {\mathbb {R}}^d\) and outputs from a set \({\mathcal {Y}} \subseteq {\mathbb {R}}\). A special case of interest is the discrete one, where \({\mathcal {Y}} \equiv \left\{ -1,+1\right\} \) (i.e. the binary classification problem). The relationship between inputs and outputs is encoded by a fixed, but unknown, probability distribution \(\mu \) on \({\mathcal {Z}} = {\mathcal {X}} \times {\mathcal {Y}}\). Each element \(({\varvec{x}}, y) = {\varvec{z}} \in {\mathcal {Z}}\) is defined as a labeled example: the training phase consists in applying a learning algorithm, which exploits a sequence \({\mathcal {D}}_n = \{ {\varvec{z}}_1,\ldots ,{\varvec{z}}_n \} \in {\mathcal {Z}}^n\) of labeled examples and returns a function \(h: {\mathcal {X}} \rightarrow {\mathbb {R}}\) chosen from a fixed set \({\mathcal {H}}\) of possible hypotheses. The learning algorithm thus maps \(({\varvec{z}}_1,\ldots ,{\varvec{z}}_n)\) to an element of \({\mathcal {H}}\), and the accuracy in representing the hidden relationship \(\mu \) is measured with reference to a loss function \(\ell : {\mathbb {R}} \times {\mathbb {R}} \rightarrow [0, \infty )\).

For any \(h \in {\mathcal {H}}\), we define the generalization error L(h) as the expectation of \(\ell (h({\varvec{x}}),y)\) with respect to \(\mu \), \(L(h) = {\mathbb {E}}_{\mu } \ell (h({\varvec{x}}),y)\), where we assume that each labeled sample is generated according to \(\mu \). Our goal is to find the \(h \in {\mathcal {H}}\) for which L(h) is minimum. Unfortunately, L(h) cannot be computed, since \(\mu \) is unknown, but we can easily compute its empirical version \(\hat{L}(h)\):

$$\begin{aligned} \hat{L}(h) = \frac{1}{n}\sum _{i = 1}^n \ell \left( h({\varvec{x}}_i),y_i\right) . \end{aligned}$$
(1)

We focus in this paper on convex (Boyd and Vandenberghe 2004; Bauschke and Combettes 2011) and Lipschitz continuous (Goldstein 1977) loss functions only, as they are quite common and allow one to solve complex learning tasks with effective and efficient approaches (Cortes and Vapnik 1995; Bartlett et al. 2006; Lee et al. 1998; Shawe-Taylor and Cristianini 2004; Suykens and Vandewalle 1999): examples are the well-known hinge (Cortes and Vapnik 1995) and logistic (Collins et al. 2002) loss functions. On the contrary, the use of a non-convex loss leads to NP-hard problems, which cannot be solved exactly for sample sets whose cardinality exceeds a few tens of data (e.g. \(n > 30\)) (Anthony 2001; Feldman et al. 2009), but for which approximate solutions can eventually be found (Lawler and Wood 1966; Yuille and Rangarajan 2003). As a matter of fact, if one has to cope with a non-convex loss, a convex relaxation is often used in order to reformulate the problem so as to make it computationally tractable.
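For concreteness, the following minimal sketch (Python with NumPy is assumed; the data and model parameters are made up for the example) computes the empirical error of Eq. (1) for a linear hypothesis under the hinge and logistic losses mentioned above.

```python
import numpy as np

def hinge_loss(margin):
    # hinge loss: max(0, 1 - y * h(x)), computed from the margin y * h(x)
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    # logistic loss: log(1 + exp(-y * h(x)))
    return np.log1p(np.exp(-margin))

def empirical_risk(w, b, X, y, loss=hinge_loss):
    # Eq. (1): average loss of h(x) = w . x + b over the n training samples
    margins = y * (X @ w + b)
    return loss(margins).mean()

# toy data: 4 samples in R^2 with labels in {-1, +1} (illustrative only)
X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5], [-2.0, 1.0]])
y = np.array([+1, -1, +1, -1])
w, b = np.array([0.5, -0.2]), 0.1
print(empirical_risk(w, b, X, y))                 # hinge loss
print(empirical_risk(w, b, X, y, logistic_loss))  # logistic loss
```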

3 Tikhonov, Ivanov and Morozov regularization problems

We address a supervised learning framework, where the class of functions is parameterized as follows:

$$\begin{aligned} h({\varvec{x}}) = {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}) + b, \end{aligned}$$
(2)

where \({\varvec{\phi }}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^D\) is a mapping function, \({\varvec{w}} \in {\mathbb {R}}^D\), and \(b \in {\mathbb {R}}\). The naïve approach to learning, namely Empirical Risk Minimization (ERM) (Vapnik 1998; Bousquet et al. 2004), consists in searching for the function h that minimizes the empirical error:

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b}\ \hat{L}(h). \end{aligned}$$
(3)

As Problem (3) is convex, local minima are avoided (Boyd and Vandenberghe 2004), even though the solution \(\left\{ {\varvec{w}},b\right\} \) is not unique, in general. Unfortunately, ERM is well known to lead to severe overfitting and, consequently, to poor performance in classifying new data, generated by the same distribution \(\mu \) but previously unseen.

Alternatively, in order to avoid the overfitting issue that afflicts the ERM procedure, the Tikhonov regularization technique (Tikhonov et al. 1977) can be exploited, which was proposed to solve ill-posed problems (Bishop 1995):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b}\ \hat{L}(h) + \frac{\lambda }{2} \left\| {\varvec{w}} \right\| ^2 \quad \text {or} \quad \arg \min _{{\varvec{w}},b}\ \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C \hat{L}(h), \end{aligned}$$
(4)

where \(\left\| {\varvec{w}} \right\| \) is the Euclidean norm of \({\varvec{w}}\); this regularization term implements an underfitting tendency, so that the regularization parameter \(\lambda \in [0, \infty )\), or equivalently \(C = \frac{1}{\lambda } \in (0, \infty ]\), balances the influence of the underfitting and overfitting terms (Vorontsov 2010; Bousquet et al. 2004).

A consequence of this formulation is that \(\lambda \) implicitly defines the class of functions \({\mathcal {H}}\), from which the models \(h({\varvec{x}})\) are selected by the optimization procedure (Tikhonov et al. 1977; Vapnik 1998), but the relation between the regularization parameter and the size of the hypothesis space is not evident at all.
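The equivalence of the two forms of Eq. (4) for \(C = 1/\lambda \) can be checked numerically: the sketch below minimizes both objectives for a linear model with the (smooth) logistic loss and verifies that they return the same minimizer. The optimizer, the data and the value of \(\lambda \) are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40))

def emp_risk(theta):
    # empirical risk of Eq. (1) with the logistic loss, theta = (w, b)
    w, b = theta[:-1], theta[-1]
    return np.mean(np.log1p(np.exp(-y * (X @ w + b))))

lam = 0.1
C = 1.0 / lam
# first form of Eq. (4): empirical risk + (lambda/2) ||w||^2
f_lambda = lambda th: emp_risk(th) + 0.5 * lam * np.dot(th[:-1], th[:-1])
# second form of Eq. (4): (1/2) ||w||^2 + C * empirical risk
f_C = lambda th: 0.5 * np.dot(th[:-1], th[:-1]) + C * emp_risk(th)

th0 = np.zeros(4)
sol_lambda = minimize(f_lambda, th0).x
sol_C = minimize(f_C, th0).x
# close to zero: same minimizer, the two objectives differ only by the factor C
print(np.max(np.abs(sol_lambda - sol_C)))
```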

Differently from the Tikhonov scheme, the method of quasi-solutions, originally proposed by Ivanov and also known as Ivanov regularization (Ivanov 1976), allows one to explicitly control the size of \({\mathcal {H}}\) by upper bounding the squared norm of the admissible hypotheses (Pelckmans et al. 2004; Vapnik 1998; Anguita et al. 2012):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b} \quad&\hat{L}(h) \nonumber \\ s.t. \quad&\left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}}, \end{aligned}$$
(5)

by means of the regularization parameter \(w^2_{\text {MAX}} \in [0, \infty )\).

It is worthwhile noting that the solution \(\left\{ {\varvec{w}}^*,b^*\right\} \) of Problem (5) is not unique, in general (Boyd and Vandenberghe 2004). In order to eliminate such potential ambiguity, we can simply opt for the function \(h({\varvec{x}})\) characterized by the minimum \(\left\| {\varvec{w}} \right\| \), namely the simplest (smoothest) possible solution. In order to highlight this, without modifying the nature of the regularization procedure, we propose a formulation equivalent to Problem (5):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b} \quad&\left\| {\varvec{w}} \right\| \nonumber \\ s.t. \quad&h \in {\mathcal {S}} \nonumber \\&{\mathcal {S}} = \left\{ h: \hat{L}(h) = \min _{{\varvec{w}},b} \hat{L}(h)\ s.t.\ \left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}} \right\} . \end{aligned}$$
(6)

In order to simplify the notation of Problem (6) we simply add \(\left\| {\varvec{w}} \right\| \) to the argument of the minimum in Problem (5):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b,\left\| {\varvec{w}} \right\| } \quad&\hat{L}(h) \nonumber \\ s.t. \quad&\left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}}. \end{aligned}$$
(7)

A third way to write our regularization problem is the less-known approach proposed by Morozov (Morozov et al. 1984; Pelckmans et al. 2004):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b} \quad&\frac{1}{2} \left\| {\varvec{w}} \right\| ^2 \nonumber \\ s.t. \quad&\hat{L}(h) \le \hat{L}_{\text {MAX}}. \end{aligned}$$
(8)

In this case, the size of the hypothesis space is implicitly controlled by imposing an upper bound \(\hat{L}_{\text {MAX}} \in [0, \infty )\) on the empirical error.

It is worthwhile noting that the solution \(\left\{ {\varvec{w}}^*,b^*\right\} \) of Problem (8) is also not unique, in general (Boyd and Vandenberghe 2004). In order to eliminate such potential ambiguity, we can simply opt for the function \(h({\varvec{x}})\) characterized by the minimum \(\hat{L}(h)\), namely the solution with minimum error. In order to highlight this, without modifying the nature of the regularization procedure, we propose a formulation equivalent to Problem (8):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b} \quad&\hat{L}(h) \nonumber \\ s.t. \quad&h \in {\mathcal {S}} \nonumber \\&{\mathcal {S}} = \left\{ h: \left\| {\varvec{w}} \right\| ^2 = \min _{{\varvec{w}},b} \left\| {\varvec{w}} \right\| ^2\ s.t.\ \hat{L}(h) \le \hat{L}_{\text {MAX}} \right\} . \end{aligned}$$
(9)

In order to simplify the notation of Problem (9) we simply add \(\hat{L}(h)\) to the argument of the minimum in Problem (8):

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b,\hat{L}(h)} \quad&\frac{1}{2} \left\| {\varvec{w}} \right\| ^2 \nonumber \\ s.t. \quad&\hat{L}(h) \le \hat{L}_{\text {MAX}}. \end{aligned}$$
(10)

The philosophy underlying the Morozov regularization approach consists in choosing the simplest function, by minimizing \(\left\| {\varvec{w}} \right\| ^2\), among those whose error on the training set does not exceed a pre-determined threshold. Clearly, if the threshold \(\hat{L}_{\text {MAX}}\) is too small, a solution might not exist: therefore, for the sake of simplicity, we will assume in the rest of the paper that \(\hat{L}_{\text {MAX}}\) is large enough so that a solution can be found. This hypothesis does not modify the nature of Morozov regularization, while it helps simplify the subsequent analysis.

It is important to note that the Representer Theorem holds for all the previous regularization approaches (Aronszajn 1951; Schölkopf et al. 2001; Dinuzzo and Schölkopf 2012). Consequently, the solution to the Tikhonov, Ivanov and Morozov optimization problems can be expressed as:

$$\begin{aligned} {\varvec{w}} = \sum _{i = 1}^n p_i {\varvec{\phi }}({\varvec{x}}_i), \end{aligned}$$
(11)

where \(p_i \in {\mathbb {R}}\ \forall i \in \left\{ 1,\ldots ,n\right\} \). Because of this property, the solution function \(h({\varvec{x}})\) can be written as:

$$\begin{aligned} h({\varvec{x}}) = {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}) + b = \sum _{i = 1}^n p_i {\varvec{\phi }}({\varvec{x}}_i) \cdot {\varvec{\phi }}({\varvec{x}}) + b = \sum _{i = 1}^n p_i K({\varvec{x}}_i,{\varvec{x}}) + b \end{aligned}$$
(12)

where we made use of the well-known kernel trick (Berlinet and Thomas-Agnan 2004; Vapnik 1998; Schölkopf 2001; Shawe-Taylor and Cristianini 2004) and \(K(\cdot ,\cdot )\) is the kernel function.
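As an illustration of Eqs. (11) and (12), the sketch below evaluates the kernel expansion with a Gaussian (RBF) kernel; the expansion coefficients \(p_i\) and the bias b are assumed to be given, e.g. returned by one of the solvers discussed in the next sections.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    # Gaussian kernel K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def h(x, X_train, p, b, kernel=rbf_kernel):
    # Eq. (12): h(x) = sum_i p_i K(x_i, x) + b
    return sum(p_i * kernel(x_i, x) for p_i, x_i in zip(p, X_train)) + b

# illustrative values only
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
p = np.array([0.5, -1.0, 0.75])   # expansion coefficients of Eq. (11)
b = 0.2
print(h(np.array([0.5, 0.5]), X_train, p, b))
```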

4 A general approach for solving Ivanov and Morozov problems through the Tikhonov formulation

Although, according to the SRM framework, learning can be easily implemented by an Ivanov regularization approach, a Tikhonov formulation has usually been preferred, as it is easier to solve, and several effective methods have been developed throughout the years for this purpose (Platt 1998, 1999; Keerthi et al. 2001; Shawe-Taylor and Sun 2011). In this section, we show that the Tikhonov, Ivanov and Morozov regularization approaches are three faces of the same problem: in particular, we will show how the Ivanov and Morozov problems can be solved through the procedures originally designed for the Tikhonov-based formulation.

4.1 Equivalence of Tikhonov, Ivanov and Morozov formulations

At first, we show that a value of the Tikhonov regularization parameter exists such that the three problems are equivalent.

Theorem 1

Let us consider an Ivanov (or Morozov) regularization problem, as formulated in Eqs. (7) and (10); then, there exists a value of \(C = \frac{1}{\lambda }\) for the Tikhonov regularization Problem (4) such that the formulations are equivalent.

Proof

As a first step, let us consider the Ivanov Problem (7). Because of its convexity, we can compute the Lagrange dual function and solve the associated optimization problem (Boyd and Vandenberghe 2004):

$$\begin{aligned} \left( {\varvec{w}}^*_I,b^*_I,\left\| {\varvec{w}}^*_I \right\| ,\lambda ^*_I \right) : \quad \arg \min _{{\varvec{w}},b,\left\| {\varvec{w}} \right\| } \max _{\lambda \ge 0} \quad&\hat{L}(h) + \frac{\lambda }{2} \left( \left\| {\varvec{w}} \right\| ^2 - w^2_{\text {MAX}} \right) \nonumber \\ s.t. \quad&\lambda \left( \left\| {\varvec{w}} \right\| ^2 - w^2_{\text {MAX}} \right) = 0, \end{aligned}$$
(13)

where \(\lambda \) is the Lagrange multiplier of the constraint on the class of functions. Then, if we plug into the Tikhonov problem of Eq. (4) the value \(\lambda ^*_I\), obtained from the minimization of the dual function of the Ivanov regularization problem shown above, we obtain:

$$\begin{aligned} \left( {\varvec{w}}^*_T,b^*_T\right) : \quad&\arg \min _{{\varvec{w}},b} \quad \hat{L}(h) + \frac{\lambda ^*_I}{2} \left\| {\varvec{w}} \right\| ^2 = \nonumber \\&\arg \min _{{\varvec{w}},b} \quad \hat{L}(h) + \frac{\lambda ^*_I}{2} \left( \left\| {\varvec{w}} \right\| ^2 - w^2_{\text {MAX}} \right) , \end{aligned}$$
(14)

since \(w^2_{\text {MAX}}\) is constant with respect to the minimization problem. As Problems (13) and (14) share the same objective, they also have the same solution; hence a value of \(\lambda =\frac{1}{C}\) exists such that the two formulations are equivalent.

Concerning the Morozov approach, the proof is analogous. We can compute the Lagrange dual function of Problem (10):

$$\begin{aligned} \left( {\varvec{w}}^*_M,b^*_M,C^*_M,\hat{L}^*_M \right) : \quad \arg \min _{{\varvec{w}},b,\hat{L}(h)} \max _{C \ge 0} \quad&\frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C \left( \hat{L}(h) - \hat{L}_{\text {MAX}}\right) \nonumber \\ s.t. \quad&\hat{L}(h) \le \hat{L}_{\text {MAX}} \nonumber \\&C \left( \hat{L}(h) - \hat{L}_{\text {MAX}}\right) = 0. \end{aligned}$$
(15)

Then, if we exploit the optimal value of the Lagrange multiplier \(C^*_M\), obtained through the previous problem, in the Tikhonov formulation of Eq. (4), we obtain:

$$\begin{aligned} \left( {\varvec{w}}^*_T,b^*_T\right) : \quad&\arg \min _{{\varvec{w}},b} \quad \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C^*_M \hat{L}(h) = \nonumber \\&\arg \min _{{\varvec{w}},b} \quad \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C^*_M \left( \hat{L}(h) - \hat{L}_{\text {MAX}}\right) . \end{aligned}$$
(16)

Problems (15) and (16) are consequently equal, thus they have the same solution, i.e. there exists a value of \(C=\frac{1}{\lambda }\) such that the two formulations are equivalent. \(\square \)

Note that it is possible to find the same results, in a more general framework, in Bauschke and Combettes (2011). The following theorems allow us to prove that the solutions of the Ivanov–Tikhonov and the Morozov–Tikhonov approaches coincide.

Theorem 2

Let us consider the Tikhonov and Ivanov formulations. Let \(\left( \left\| {\varvec{w}}^*_T \right\|, \hat{L}^*_T \right) \) and \(\left( \left\| {\varvec{w}}^*_I \right\|, \hat{L}^*_I \right) \) be the solutions of, respectively, the Tikhonov and the Ivanov problem. If \(\left\| {\varvec{w}}^*_T \right\| = \left\| {\varvec{w}}^*_I \right\| \) for a given \(C=\frac{1}{\lambda }\) and for a given \(w_{\text {MAX}}\), then \(\hat{L}^*_T = \hat{L}^*_I\), and vice-versa.

Proof

Based on the definition of minimum of the Tikhonov problem, from Eq. (4) we have that

$$\begin{aligned} \frac{1}{2} \left( \left\| {\varvec{w}}^*_I \right\| \right) ^2 + C \hat{L}^*_I \ge \frac{1}{2} \left( \left\| {\varvec{w}}^*_T \right\| \right) ^2 + C \hat{L}^*_T. \end{aligned}$$
(17)

As we supposed that C is such that \(\left\| {\varvec{w}}^*_T \right\| = \left\| {\varvec{w}}^*_I \right\| \), then:

$$\begin{aligned} \hat{L}^*_I \ge \hat{L}^*_T. \end{aligned}$$
(18)

However, as \(\hat{L}^*_I\) is the solution to Problem (7), we must have that

$$\begin{aligned} \hat{L}^*_I = \hat{L}^*_T. \end{aligned}$$
(19)

If, instead, we suppose that C is such that \(\hat{L}^*_I = \hat{L}^*_T\), then:

$$\begin{aligned} \left\| {\varvec{w}}^*_I \right\| \ge \left\| {\varvec{w}}^*_T \right\| . \end{aligned}$$
(20)

However, since in Problem (7) we supposed to search for the solution characterized by minimum \(\left\| {\varvec{w}} \right\| \), we have:

$$\begin{aligned} \left\| {\varvec{w}}^*_I \right\| = \left\| {\varvec{w}}^*_T \right\| . \end{aligned}$$
(21)

\(\square \)

Note that, if we did not opt for the smoothest solution, it would be impossible to prove this last property. In fact, a counterexample where different values of \(\left\| {\varvec{w}} \right\| \) achieve the same minimum \(\hat{L}^*_T\) can be found in Anguita et al. (2011b).

Theorem 3

Let us consider the Tikhonov and Morozov formulations. Let \(\left( \left\| {\varvec{w}}^*_T \right\|, \hat{L}^*_T \right) \) and \(\left( \left\| {\varvec{w}}^*_M \right\|, \hat{L}^*_M \right) \) be the solutions of, respectively, the Tikhonov and the Morozov problem. If \(\left\| {\varvec{w}}^*_T \right\| = \left\| {\varvec{w}}^*_M \right\| \) for a given \(C=\frac{1}{\lambda }\) and for a given \(\hat{L}_{\text {MAX}}\), then \(\hat{L}^*_T = \hat{L}^*_M\), and vice-versa.

Proof

Based on the definition of minimum of the Tikhonov problem, from Eq. (4) we have that

$$\begin{aligned} \frac{1}{2} \left( \left\| {\varvec{w}}^*_M \right\| \right) ^2 + C \hat{L}^*_M \ge \frac{1}{2} \left( \left\| {\varvec{w}}^*_T \right\| \right) ^2 + C \hat{L}^*_T. \end{aligned}$$
(22)

As we supposed that \(\lambda \) is such that \(\hat{L}^*_T = \hat{L}^*_M\), then:

$$\begin{aligned} \left\| {\varvec{w}}^*_M \right\| \ge \left\| {\varvec{w}}^*_T \right\| , \end{aligned}$$
(23)

However, as \(\left\| {\varvec{w}}^*_M \right\| \) is the minimum of Problem (10), we must have that

$$\begin{aligned} \left\| {\varvec{w}}^*_M \right\| = \left\| {\varvec{w}}^*_T \right\| . \end{aligned}$$
(24)

If instead we suppose that \(\lambda \) is such that \(\left\| {\varvec{w}}^*_M \right\| = \left\| {\varvec{w}}^*_T \right\| \), then:

$$\begin{aligned} \hat{L}^*_M \ge \hat{L}^*_T. \end{aligned}$$
(25)

In particular, since in Problem (10) we opt for the solution with minimum \(\hat{L}(h)\) (see Problem (9)), \(\hat{L}^*_M\) is forced to be as small as possible. Consequently, \(\hat{L}^*_M > \hat{L}^*_T\) is not possible and we have that:

$$\begin{aligned} \hat{L}^*_M = \hat{L}^*_T. \end{aligned}$$
(26)

\(\square \)

4.2 Solving Ivanov and Morozov problems with Tikhonov solvers

In the following, we prove some properties that allow us to define general procedures for solving an Ivanov or a Morozov problem through the techniques designed for the Tikhonov formulation. We start by describing the behavior of the Tikhonov problem solution as the regularization parameter C grows. The following theorem is preliminary to the derivation of such results.

Theorem 4

Let us consider the Tikhonov formulation. Let us solve Problem (4) for two given values of the regularization parameter \(C_1\) and \(C_2 > C_1\). In particular, let the solutions of the problem be, respectively, \(\left( \left\| {\varvec{w}}^*_{C_1} \right\| , \hat{L}^*_{C_1}\right) \) for \(C_1\) and \(\left( \left\| {\varvec{w}}^*_{C_2} \right\| , \hat{L}^*_{C_2}\right) \) for \(C_2\) so that the corresponding values of the objective functions are:

$$\begin{aligned} K^{C_1} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 + C_1 \hat{L}^*_{C_1}, \quad K^{C_2} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2 + C_2 \hat{L}^*_{C_2}. \end{aligned}$$
(27)

Then:

$$\begin{aligned} K^{C_2} \ge K^{C_1}. \end{aligned}$$
(28)

Proof

Since \(C_2 > C_1\), \(\forall ({\varvec{w}},b) \in {\mathbb {R}}^D \times {\mathbb {R}}\) we have that:

$$\begin{aligned} \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C_1 \hat{L}(h) \le \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C_2 \hat{L}(h). \end{aligned}$$
(29)

The statement follows from taking the infimum over \(({\varvec{w}},b)\) on both sides. \(\square \)

By exploiting the result of the previous theorem, we prove the following property, which will later be used to design actual learning algorithms.

Theorem 5

Let us consider the Tikhonov formulation. Given \(C_1,C_2 \in [0, + \infty ]\) such that \(C_2 > C_1\), let us solve Problem (4) and let \(K^{C_1}\) and \(K^{C_2}\) be the corresponding values of the objective functions, then:

$$\begin{aligned} \left( \left\| {\varvec{w}}^*_{C_2} \right\| > \left\| {\varvec{w}}^*_{C_1} \right\| \implies \hat{L}^*_{C_2} < \hat{L}^*_{C_1} \right) \ \vee \ \left( \left\| {\varvec{w}}^*_{C_2} \right\| = \left\| {\varvec{w}}^*_{C_1} \right\| \implies \hat{L}^*_{C_2} = \hat{L}^*_{C_1} \right) . \end{aligned}$$
(30)

Proof

In this proof, we proceed by considering all the possible cases, proving by contradiction that configurations other than the ones of the thesis are not admissible.

As a first step, suppose \(\left\| {\varvec{w}}^*_{C_2} \right\| < \left\| {\varvec{w}}^*_{C_1} \right\| \). If \(\hat{L}^*_{C_2} < \hat{L}^*_{C_1}\), then \(K'^{C_1} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2 + C_1 \hat{L}^*_{C_2} < K^{C_1}\), which is impossible, since \(K^{C_1}\) is the global minimum. If \(\hat{L}^*_{C_2} = \hat{L}^*_{C_1}\), then:

$$\begin{aligned} K'^{C_1} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2 + C_1 \hat{L}^*_{C_2} < K^{C_1}, \end{aligned}$$
(31)

which contradicts the hypothesis that \(K^{C_1}\) is the global minimum and, then, is not admissible.

If \(\hat{L}^*_{C_2} > \hat{L}^*_{C_1}\), then:

$$\begin{aligned} K'^{C_1} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2 + C_1 \hat{L}^*_{C_2} \ge \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 + C_1 \hat{L}^*_{C_1} = K^{C_1}. \end{aligned}$$
(32)

From Eq. (32), we get:

$$\begin{aligned} C_1 \ge \frac{\left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 - \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2}{2\left( \hat{L}^*_{C_2} - \hat{L}^*_{C_1} \right) }. \end{aligned}$$
(33)

Analogously, we have for \(K^{C_2}\):

$$\begin{aligned} \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 + C_2 \hat{L}^*_{C_1} \ge \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2 + C_2 \hat{L}^*_{C_2} = K^{C_2}, \end{aligned}$$
(34)

from which we obtain:

$$\begin{aligned} C_2 \le \frac{\left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 - \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2}{2\left( \hat{L}^*_{C_2} - \hat{L}^*_{C_1} \right) } \end{aligned}$$
(35)

By joining Eqs. (33) and (35), we obtain \(C_2 \le C_1\), which contradicts the hypothesis \(C_2 > C_1\).

Suppose now that \(\left\| {\varvec{w}}^*_{C_2} \right\| = \left\| {\varvec{w}}^*_{C_1} \right\| \). If \(\hat{L}^*_{C_2} < \hat{L}^*_{C_1}\), then

$$\begin{aligned} K'^{C_1} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 + C_1 \hat{L}^*_{C_2} < K^{C_1} \end{aligned}$$
(36)

which is impossible, as we supposed that \(K^{C_1}\) is the global minimum. Analogously, if \(\hat{L}^*_{C_2} > \hat{L}^*_{C_1}\), then:

$$\begin{aligned} K'^{C_2} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_2} \right\| \right) ^2 + C_2 \hat{L}^*_{C_1} < K^{C_2}, \end{aligned}$$
(37)

or, in other words, \(K^{C_2}\) is not the global minimum. This contradicts the hypothesis and, thus, this configuration is not admissible.

Finally, let us consider \(\left\| {\varvec{w}}^*_{C_2} \right\| > \left\| {\varvec{w}}^*_{C_1} \right\| \). If \(\hat{L}^*_{C_2} = \hat{L}^*_{C_1}\), then

$$\begin{aligned} K'^{C_2} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 + C_2 \hat{L}^*_{C_2} < K^{C_2} \end{aligned}$$
(38)

which is, again, impossible as \(K^{C_2}\) is supposed to be the global minimum. As a last step, let us consider the case \(\hat{L}^*_{C_2} > \hat{L}^*_{C_1}\), for which we have:

$$\begin{aligned} K'^{C_2} = \frac{1}{2} \left( \left\| {\varvec{w}}^*_{C_1} \right\| \right) ^2 + C_2 \hat{L}^*_{C_1} < K^{C_2}, \end{aligned}$$
(39)

which again contradicts the fact that \(K^{C_2}\) is the global minimum. Thus, all the configurations of \(\left\| {\varvec{w}}^*_{C_{1,2}} \right\| \) and \(\hat{L}^*_{C_{1,2}}\) other than the ones of the thesis are not admissible. \(\square \)

The following theorem proves that, if \(\left\| {\varvec{w}}^*_C \right\| \) stops increasing as C increases, it remains the same for any larger value of the regularization parameter.

Theorem 6

Let us consider the Tikhonov formulation. Let \(\left\| {\varvec{w}}^*_{C_\infty } \right\| \) be the solution to the regularization problem for a given value of \(C_\infty \). If \(\exists C>C_\infty \) such that \(\left\| {\varvec{w}}^*_{C} \right\| = \left\| {\varvec{w}}^*_{C_\infty } \right\| \), then \(\left\| {\varvec{w}}^*_{C} \right\| \) will not vary \(\forall C \ge C_\infty \).

Proof

Let us consider a value \(C_1\) of the regularization parameter such that \(C_1 < C_\infty < C < \infty \) and \(\left\| {\varvec{w}}^*_{C} \right\| = \left\| {\varvec{w}}^*_{C_\infty } \right\| > \left\| {\varvec{w}}^*_{C_1} \right\| \). Then, we have that \(\hat{L}^*_{C} = \hat{L}(h^*_{C_\infty }) < \hat{L}(h^*_{C_1})\). Moreover, let us consider \(C_2 \in \left[ C_1, C_\infty \right] \). Thanks to the convexity of the loss function, we have that \(\forall \alpha \in \left[ 0,1\right] \):

$$\begin{aligned} \hat{L}(h_\alpha ) = \hat{L}\left( h^*_{C_1} (1 - \alpha ) + h^*_{C_\infty } \alpha \right) \le (1 - \alpha ) \hat{L}\left( h^*_{C_1} \right) + \alpha \hat{L}\left( h^*_{C_\infty } \right) . \end{aligned}$$
(40)

Then, we can define another Tikhonov problem, whose solution is constrained to the line connecting \(h^*_{C_1}\) and \(h^*_{C_\infty }\) and which can be parametrized as \(h_\alpha = h^*_{C_1} (1 - \alpha ) + h^*_{C_\infty } \alpha \) (\({\varvec{w}}_\alpha = (1 - \alpha ) {\varvec{w}}^*_{C_1} + \alpha {\varvec{w}}^*_{C_\infty } \)) with \(\alpha \in [0,1]\). The restriction of Problem (4) to the segment joining \({\varvec{w}}^*_{C_1}\) and \({\varvec{w}}^*_{C_\infty }\) can then be formulated as:

$$\begin{aligned} \min _{\alpha }\quad \frac{1}{2} \left\| {\varvec{w}}_\alpha \right\| ^2 + C \hat{L}(h_\alpha ). \end{aligned}$$
(41)

Based on these considerations it is easy to show that (see also Fig. 1):

$$\begin{aligned}&\min _{{\varvec{w}},b} \quad \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C_2 \hat{L}(h) \end{aligned}$$
(42)
$$\begin{aligned}&\quad \le \min _{\alpha }\quad \frac{1}{2} \left\| {\varvec{w}}_\alpha \right\| ^2 + C_2 \hat{L}(h_\alpha ) \end{aligned}$$
(43)
$$\begin{aligned}&\quad \le \min _{\alpha }\quad \frac{1}{2} \left\| (1 - \alpha ) {\varvec{w}}^*_{C_1} + \alpha {\varvec{w}}^*_{C_\infty } \right\| ^2 \nonumber \\&\qquad + C_2 \left[ (1 - \alpha ) \hat{L}\left( h^*_{C_1} \right) + \alpha \hat{L}\left( h^*_{C_\infty } \right) \right] \end{aligned}$$
(44)
$$\begin{aligned}&\quad =\min _{\alpha }\quad \frac{1}{2} \left\| {\varvec{w}}^*_{C_1} + \alpha \left( {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right) \right\| ^2 \nonumber \\&\qquad + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) + \alpha \left( \hat{L}\left( h^*_{C_\infty } \right) - \hat{L}\left( h^*_{C_1} \right) \right) \right] \end{aligned}$$
(45)
$$\begin{aligned}&\quad = \min _{\alpha }\quad \frac{1}{2} \left[ \left\| {\varvec{w}}^*_{C_1} \right\| ^2 + \alpha ^2 \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 + 2 \alpha {\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right) \right] \nonumber \\&\qquad + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) + \alpha \left( \hat{L}\left( h^*_{C_\infty } \right) - \hat{L}\left( h^*_{C_1} \right) \right) \right] \end{aligned}$$
(46)

This last minimization problem (Eq. (46)) can be easily solved:

$$\begin{aligned}&\frac{d}{d \alpha } \left\{ \frac{1}{2} \left[ \left\| {\varvec{w}}^*_{C_1} \right\| ^2 + \alpha ^2 \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 + 2 \alpha {\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right) \right] \right. \nonumber \\&\qquad \left. + \,C_2 \left[ \hat{L}\left( h^*_{C_1} \right) + \alpha \left( \hat{L}\left( h^*_{C_\infty } \right) - \hat{L}\left( h^*_{C_1} \right) \right) \right] \right\} \end{aligned}$$
(47)
$$\begin{aligned}&\quad =\alpha \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 + {\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right) + C_2 \left[ \hat{L}\left( h^*_{C_\infty } \right) - \hat{L}\left( h^*_{C_1} \right) \right] = 0 \end{aligned}$$
(48)
$$\begin{aligned}&\alpha = \frac{{\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_1} - {\varvec{w}}^*_{C_\infty } \right) + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2} \end{aligned}$$
(49)

Then we obtain that:

$$\begin{aligned}&\min _{\alpha }\quad \frac{1}{2} \left[ \left\| {\varvec{w}}^*_{C_1} \right\| ^2 + \alpha ^2 \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 + 2 \alpha {\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right) \right] \nonumber \\&\qquad + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) + \alpha \left( \hat{L}\left( h^*_{C_\infty } \right) - \hat{L}\left( h^*_{C_1} \right) \right) \right] \end{aligned}$$
(50)
$$\begin{aligned}&\quad = \frac{1}{2} \left\| {\varvec{w}}^*_{C_1} \right\| ^2 + C_2 \hat{L}\left( h^*_{C_1} \right) - \frac{1}{2} \alpha ^2 \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 \end{aligned}$$
(51)

Based on these last results, we derive that:

$$\begin{aligned}&\min _{{\varvec{w}},b} \quad \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C_2 \hat{L}(h) \end{aligned}$$
(52)
$$\begin{aligned}&\quad \le \frac{1}{2} \left\| {\varvec{w}}^*_{C_1} \right\| ^2 + C_2 \hat{L}\left( h^*_{C_1} \right) - \frac{1}{2} \alpha ^2 \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 \end{aligned}$$
(53)

since \(\frac{1}{2} \alpha ^2 \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 \ge 0\). Now we can observe that:

$$\begin{aligned}&C_2 = C_1 \rightarrow \alpha = 0 \rightarrow \frac{{\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_1} - {\varvec{w}}^*_{C_\infty } \right) + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2} = 0 \nonumber \\&\quad {\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_1} - {\varvec{w}}^*_{C_\infty } \right) = -C_1 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] \end{aligned}$$
(54)
$$\begin{aligned}&C_2 = C_\infty \rightarrow \alpha = 1 \rightarrow \frac{{\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_1} - {\varvec{w}}^*_{C_\infty } \right) + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2} = 1 \nonumber \\&\quad {\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_1} - {\varvec{w}}^*_{C_\infty } \right) = -C_\infty \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] + \left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2 \end{aligned}$$
(55)

Based on these two limit cases, we can observe that \(\forall C_2 \in (C_1, C_\infty )\) \(\alpha \) falls, as expected, in (0, 1). In fact, let us take

$$\begin{aligned} \alpha = \frac{{\varvec{w}}^*_{C_1} \cdot \left( {\varvec{w}}^*_{C_1} - {\varvec{w}}^*_{C_\infty } \right) + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2}, \end{aligned}$$
(56)

by exploiting Eq. (54) we have that

$$\begin{aligned} \alpha = \frac{-C_1 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2} > 0. \end{aligned}$$
(57)

By exploiting, instead, Eq. (55) we have that

$$\begin{aligned}&\frac{-C_\infty \left[ \hat{L}\left( h^*_{C_1} \right) {-} \hat{L}\left( h^*_{C_\infty } \right) \right] {+} \left\| {\varvec{w}}^*_{C_\infty } {-} {\varvec{w}}^*_{C_1} \right\| ^2 + C_2 \left[ \hat{L}\left( h^*_{C_1} \right) {-} \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } {-} {\varvec{w}}^*_{C_1} \right\| ^2} = \nonumber \\&1 - \frac{ \left( C_\infty - C_1 \right) \left[ \hat{L}\left( h^*_{C_1} \right) - \hat{L}\left( h^*_{C_\infty } \right) \right] }{\left\| {\varvec{w}}^*_{C_\infty } - {\varvec{w}}^*_{C_1} \right\| ^2} < 1 \end{aligned}$$
(58)

From these last properties it is possible to state that \(\forall C_2 \in (C_1, C_\infty )\) we have that:

$$\begin{aligned} \left\| {\varvec{w}}^*_{C_1} \right\| < \left\| {\varvec{w}}^*_{C_2} \right\| < \left\| {\varvec{w}}^*_{C_\infty } \right\| . \end{aligned}$$
(59)

Consequently

$$\begin{aligned} \hat{L}(h^*_{C_1}) > \hat{L}(h^*_{C_2}) > \hat{L}(h^*_{C_\infty }). \end{aligned}$$
(60)

This means that, if \(\exists C_3 > C > C_\infty \) such that \(\left\| {\varvec{w}}^*_{C_3} \right\| > \left\| {\varvec{w}}^*_{C_\infty } \right\| \), it is not possible that \(\left\| {\varvec{w}}^*_{C} \right\| = \left\| {\varvec{w}}^*_{C_\infty } \right\| \), because in this case \(\forall C \in (C_\infty , C_3)\) we must have that:

$$\begin{aligned} \left\| {\varvec{w}}^*_{C_\infty } \right\| < \left\| {\varvec{w}}^*_{C} \right\| < \left\| {\varvec{w}}^*_{C_3} \right\| . \end{aligned}$$
(61)

Consequently the statement of the Theorem is proved.

A particular case is when there is no \(C_1\) which satisfies the property discussed in the proof (\(C_1 < C_\infty < C < \infty \) and \(\left\| {\varvec{w}}^*_{C} \right\| = \left\| {\varvec{w}}^*_{C_\infty } \right\| > \left\| {\varvec{w}}^*_{C_1} \right\| \)). In this case we have to suppose, by contradiction, that \(\exists C_2\) such that \(C_\infty< C< C_2 < \infty \) and \(\left\| {\varvec{w}}^*_{C} \right\| = \left\| {\varvec{w}}^*_{C_\infty } \right\| < \left\| {\varvec{w}}^*_{C_2} \right\| \). Then, by using the same argument exploited above, it is possible to show that \(\left\| {\varvec{w}}^*_{C} \right\|< \left\| {\varvec{w}}^*_{C_\infty } \right\| < \left\| {\varvec{w}}^*_{C_2} \right\| \), which contradicts the hypothesis. Therefore \(\left\| {\varvec{w}}^*_{C} \right\| = \left\| {\varvec{w}}^*_{C_\infty } \right\| = \left\| {\varvec{w}}^*_{C_2} \right\| \). We do not report the details because they are analogous to the ones reported above. \(\square \)

Fig. 1 Reformulation of the Tikhonov problem of Eq. (4) for proving Theorem 6

A very familiar case where the assumptions of Theorem 6 are met is the one of the SVM with the hinge loss function: in Hastie et al. (2004) the entire regularization path is studied. When \(C = 0\), the solution is \({\varvec{w}} = {\varvec{0}}\), and there exists a finite \(C_\infty \) such that the solution does not change \(\forall C \ge C_\infty \).

From the previous results, it can clearly be noted that, if the minimum value of the empirical error has not been reached yet, the empirical error \(\hat{L}\left( h^*_{C} \right) \) monotonically decreases (and the regularization term \(\left\| {\varvec{w}}^*_{C} \right\| \) monotonically increases) as C increases, towards its global minimum. If the minimum has been reached, instead, the value \(\left\| {\varvec{w}}^*_{C} \right\| \) does not change even if C is increased. Refer to Fig. 2 for a graphical representation of these results.

Fig. 2 Results of Theorem 6: the monotonicity of the solution path

The fundamental result of Theorem 6 allows us to derive an approach for solving the Ivanov and Morozov formulations by exploiting solvers designed for Tikhonov problems.

Concerning the Ivanov regularization formulation, the approach is graphically presented in Fig. 3 and detailed in Algorithm 1 (a sketch in code is also reported below). A starting value for the regularization parameter \(C_{\text {start}}\) is defined: if the solution computed with the Tikhonov formulation is such that \(\left\| {\varvec{w}}^*_{C_{\text {start}}}\right\| > w_{\text {MAX}}\), then the optimal solution for the Ivanov formulation lies on the boundary of the hypothesis space. Since the optimal solution \(\left\| {\varvec{w}}^*_{C}\right\| \) of the Tikhonov Problem (4) monotonically increases (and then \(\hat{L}^*_{C} \) monotonically decreases) as C increases, it is sufficient to search for \(\left\| {\varvec{w}}^*_{C} \right\| = w_{\text {MAX}}\) and obtain the solution \({\varvec{w}}^*_{w_{\text {MAX}}}\) and \(\hat{L}^*_{w_{\text {MAX}}}\). For this purpose, a simple bisection algorithm can be exploited. If, instead, \(\left\| {\varvec{w}}^*_{C_{\text {start}}} \right\| \le w_{\text {MAX}}\), it is possible to search for \(C=C_\infty \) such that the solution to the minimization problem does not change even if we keep increasing C, as proven in Theorem 6. If \(\left\| {\varvec{w}}^*_{C}\right\| < w_{\text {MAX}}\) and the value of \(\left\| {\varvec{w}}^*_{C}\right\| \) does not change with C, the solution has been found; otherwise, the value of C for which \(\left\| {\varvec{w}}^*_{C}\right\| = w_{\text {MAX}}\) shall be identified through a bisection procedure. Note that this technique finds the smoothest feasible solution to the Ivanov problem, in accordance with the further constraint added when introducing the Ivanov formulation.

Fig. 3 Graphical representation of the algorithm for solving the Ivanov Problem (7) by exploiting the equivalent Tikhonov formulation of Problem (4)

Algorithm 1
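A minimal sketch of the bisection strategy described above is reported below. It assumes a black-box function tikhonov_norm(C) returning \(\left\| {\varvec{w}}^*_{C}\right\| \) for the Tikhonov Problem (4) (for instance one of the SVM solvers of Sect. 6); the stopping tolerances and the geometric growth factor are illustrative choices, not prescribed by Algorithm 1.

```python
def solve_ivanov(tikhonov_norm, w_max, C_start=1.0,
                 tol=1e-4, growth=10.0, max_iter=100):
    """Return a value of C whose Tikhonov solution (approximately) solves the
    Ivanov problem with constraint ||w|| <= w_max, exploiting the monotonicity
    of ||w*_C|| in C (Theorems 5 and 6)."""
    C_lo = C_hi = C_start
    norm = tikhonov_norm(C_start)
    if norm <= w_max:
        # Grow C until either ||w*_C|| exceeds w_max (a bracket is found) or it
        # stops changing: in the latter case C_infty has been reached and the
        # unconstrained optimum is already feasible (Theorem 6).
        for _ in range(max_iter):
            C_hi *= growth
            new_norm = tikhonov_norm(C_hi)
            if new_norm > w_max:
                break
            if abs(new_norm - norm) <= tol:
                return C_hi
            C_lo, norm = C_hi, new_norm
    else:
        # Shrink C until the solution becomes feasible, so that [C_lo, C_hi]
        # brackets the value of C for which ||w*_C|| = w_max.
        for _ in range(max_iter):
            C_lo /= growth
            if tikhonov_norm(C_lo) <= w_max:
                break
    # Bisection: ||w*_C|| is monotone in C, so search for ||w*_C|| = w_max.
    for _ in range(max_iter):
        C_mid = 0.5 * (C_lo + C_hi)
        if tikhonov_norm(C_mid) <= w_max:
            C_lo = C_mid
        else:
            C_hi = C_mid
        if C_hi - C_lo <= tol * C_hi:
            break
    return C_lo   # largest explored C whose solution satisfies the constraint
```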

Concerning the Morozov regularization formulation, the approach is graphically presented in Fig. 4 and detailed in Algorithm 2 (a sketch in code follows). Analogously to the algorithm for Ivanov problems, the approach is initialized with the value \(C_{\text {start}}\). If the corresponding empirical error is larger than the threshold \(\hat{L}_{\text {MAX}}\), C is increased until the solution reaches the boundary of the feasible region (i.e. \(\hat{L}^*_{C} = \hat{L}_{\text {MAX}}\)), and a bisection algorithm can be exploited for this purpose. If the empirical error is smaller than the threshold, C must be decreased in order to identify \(C_\infty \), i.e. the value for which \(\left\| {\varvec{w}}^*_{C}\right\| \) starts varying as C decreases. However, we recommend performing a preliminary step, which consists in checking the error of the degenerate model \({\varvec{w}} = {\varvec{0}}\), i.e. the smoothest possible model: if the empirical error of this degenerate model is below the threshold \(\hat{L}_{\text {MAX}}\), then this is the solution to the Morozov regularization problem.
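The Morozov case is analogous: the sketch below assumes a black-box function tikhonov_error(C) returning \(\hat{L}^*_{C}\), and null_model_error denotes the empirical error of the degenerate model \({\varvec{w}} = {\varvec{0}}\) used in the preliminary check recommended above. Again, the tolerances and the growth factor are illustrative choices.

```python
def solve_morozov(tikhonov_error, L_max, null_model_error,
                  C_start=1.0, tol=1e-4, growth=10.0, max_iter=100):
    """Return a value of C whose Tikhonov solution (approximately) solves the
    Morozov problem with constraint L_hat <= L_max, exploiting the monotonic
    decrease of L_hat*_C as C increases."""
    if null_model_error <= L_max:
        return 0.0      # the smoothest model w = 0 already satisfies the constraint
    C_lo = C_hi = C_start
    if tikhonov_error(C_start) > L_max:
        # Increase C until a feasible C_hi is found (L_hat*_C decreases with C).
        for _ in range(max_iter):
            C_hi *= growth
            if tikhonov_error(C_hi) <= L_max:
                break
    else:
        # Decrease C until an infeasible C_lo brackets the threshold.
        for _ in range(max_iter):
            C_lo /= growth
            if tikhonov_error(C_lo) > L_max:
                break
    # Bisection for L_hat*_C = L_max: the smallest C keeping the error below
    # the threshold yields the smoothest admissible solution.
    for _ in range(max_iter):
        C_mid = 0.5 * (C_lo + C_hi)
        if tikhonov_error(C_mid) <= L_max:
            C_hi = C_mid
        else:
            C_lo = C_mid
        if C_hi - C_lo <= tol * C_hi:
            break
    return C_hi
```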

There is an underlying assumption in order to be sure that the algorithms will find the solution, namely that the empirical risk has a minimizer.

As a final remark, note that, in principle, one can compute the entire regularization path (Hastie et al. 2004; Gunter and Zhu 2005; Bach et al. 2005; Friedman et al. 2010; Park and Hastie 2007) (i.e. the solution for all values of C or \(\lambda \)) in place of solving several different problems, formulated in accordance with Tikhonov regularization (Algorithms 1 and 2): the final solution is chosen as the one that satisfies the properties discussed above. This approach is generally not convenient, from a computational point of view, when solving one Ivanov (or Morozov) problem for a particular configuration of \(w_{\text {MAX}}\) (or \(\hat{L}_{\text {MAX}}\)). However, when coping with model selection and/or error estimation in the SRM framework, several values of the hyperparameters must be explored: as a consequence, in this scenario, computing the entire regularization path can be beneficial from the computational perspective.

Fig. 4 Graphical representation of the algorithm for solving the Morozov Problem (10) by exploiting the equivalent Tikhonov formulation of Problem (4)

5 Implications on learning

We show in this section that the Tikhonov formulation is not directly linked to SRM-based learning: it is only used as a workaround in order to exploit its computational advantage over the Ivanov formulation (Vapnik 1998). In fact, the SRM framework always resorts to the Ivanov approach, even if implicitly. Other learning frameworks, instead, like Algorithmic Stability (Bousquet and Elisseeff 2002), can be linked to both formulations, therefore enabling the exploitation of whichever approach has the most favorable learning characteristics.

For this purpose, let us consider a 1-bounded l-Lipschitz loss function:

$$\begin{aligned} \ell (h({\varvec{x}}),y) \in [0, 1], \quad |\ell (h_1({\varvec{x}}),y) - \ell (h_2({\varvec{x}}),y) | \le l | h_1({\varvec{x}}) -h_2({\varvec{x}})|, \end{aligned}$$
(62)

and a kernel class such that \(K({\varvec{x}},{\varvec{x}}) \le k^2\), where \(k>0\) is a constant. In this case, the generalization error of a classifier can be bounded through the Expected Rademacher Complexity \(R({\mathcal {H}})\) (Koltchinskii 2001; Bartlett and Mendelson 2003; Anguita et al. 2012). In particular, the latter is an upper bound of the expected difference between L(h) and \(\hat{L}(h)\) (Koltchinskii 2001; Bartlett and Mendelson 2003; Anguita et al. 2012), i.e. the Expected Uniform Deviation U(h):

$$\begin{aligned} U(h) = {\mathbb {E}}_{{\mathcal {D}}_n} \sup _{h \in {\mathcal {H}}} \left[ L(h)-\hat{L}(h)\right] \le {\mathbb {E}}_{{\mathcal {D}}_n} {\mathbb {E}}_{\sigma } \sup _{h \in {\mathcal {H}}} \frac{2 l}{n} \sum _{i = 1}^n \sigma _i h({\varvec{x}}_i) = 2 l R({\mathcal {H}}). \end{aligned}$$
(63)

Since the Rademacher Complexity \(\hat{R}({\mathcal {H}}) = {\mathbb {E}}_{\sigma } \sup _{h \in {\mathcal {H}}} \frac{1}{n} \sum _{i = 1}^n \sigma _i h({\varvec{x}}_i)\) and the Uniform Deviation \(\hat{U}(h) = \sup _{h \in {\mathcal {H}}} [ L(h)-\hat{L}(h) ] \) are bounded difference functions (McDiarmid 1989; Bartlett and Mendelson 2003), the following bound holds with probability \((1 - \delta )\) (Bartlett and Mendelson 2003):

$$\begin{aligned} L(h) \le \hat{L}(h) + 2 l \hat{R}({\mathcal {H}}) +3 \sqrt{\frac{\log \left( \frac{2}{\delta }\right) }{2 n}} \end{aligned}$$
(64)

In order to compute \(\hat{R}({\mathcal {H}})\), according to SRM, we should fix \({\mathcal {H}}\) and, for this purpose, an Ivanov regularization approach is the natural choice. The Tikhonov and Morozov approaches, instead, cannot be exploited for computing \(\hat{R}({\mathcal {H}})\), since \({\mathcal {H}}\) varies depending on the observations: this problem is also underlined and addressed in Shawe-Taylor et al. (1998). The computation of \(\hat{R}({\mathcal {H}})\) can be performed by solving a Tikhonov (or Morozov) problem and, then, by using the obtained solution as the constraint in the Ivanov formulation (Anguita et al. 2012). In other words, the Ivanov formulation cannot be avoided.

Algorithm 2

A possible workaround consists in bounding \(\hat{R}({\mathcal {H}})\) by noting that, for kernel classes with \(b = 0\) and \(\Vert {\varvec{w}} \Vert \le w_{\text {MAX}}\), the following inequality holds (Bartlett and Mendelson 2003; Bousquet and Elisseeff 2002; Poggio et al. 2002):

$$\begin{aligned} \frac{w_{\text {MAX}}}{\sqrt{2} n} \sqrt{\sum _{i = 1}^n K({\varvec{x}}_i,{\varvec{x}}_i)} \le \hat{R}({\mathcal {H}}) \le \frac{w_{\text {MAX}}}{n} \sqrt{\sum _{i = 1}^n K({\varvec{x}}_i,{\varvec{x}}_i)} \le \frac{k\ w_{\text {MAX}}}{\sqrt{n}}. \end{aligned}$$
(65)

The proof of the upper bound is straightforward (Bartlett and Mendelson 2003), while the proof of the lower bound is based on the Khintchine inequality (Haagerup 1981). Note that Eq. (65) conveys exactly the same message as the results achieved in Bartlett (1998), where Vapnik’s approach is exploited (Vapnik 1998): the size of the weights in \({\varvec{w}}\) is more important than the dimensionality of \({\varvec{w}}\).
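As a simple numerical illustration of Eq. (65), the sketch below evaluates the lower and upper bounds on \(\hat{R}({\mathcal {H}})\) from the diagonal of a kernel matrix; the choice of a Gaussian kernel (for which \(K({\varvec{x}},{\varvec{x}}) = 1\), hence \(k = 1\)) and the numerical values are assumptions made for the example.

```python
import numpy as np

def rademacher_bounds(K_diag, w_max, k):
    """Lower/upper bounds of Eq. (65) on the empirical Rademacher complexity
    of the class {h: ||w|| <= w_max} for a kernel with K(x, x) <= k^2."""
    n = len(K_diag)
    trace_term = np.sqrt(np.sum(K_diag))
    lower = w_max / (np.sqrt(2) * n) * trace_term
    upper = w_max / n * trace_term
    worst_case = k * w_max / np.sqrt(n)   # distribution-free upper bound
    return lower, upper, worst_case

# Gaussian kernel: K(x, x) = 1 for every x, hence k = 1 (illustrative choice)
n, w_max = 100, 2.0
print(rademacher_bounds(np.ones(n), w_max, k=1.0))
```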

By using the above inequality, the learning process can be performed using the Tikhonov (or Morozov) formulation, without resorting to the Ivanov one. In fact, it is sufficient to set \(w_{\text {MAX}} = \Vert {\varvec{w}}^* \Vert \), where \({\varvec{w}}^*\) is the solution of the Tikhonov (or Morozov) problem. Note that a further indication of the direct link between SRM and the Ivanov formulation is that, to the best of the authors' knowledge, there is no explicit upper or lower bound of \(\hat{R}({\mathcal {H}})\) as a function of C (or \(\hat{L}_{\text {MAX}}\)).

The same conclusion can be drawn by exploiting the Local version of the Rademacher Complexity (Koltchinskii 2006; Bartlett et al. 2005) for which an equivalent version of Eq. (65) exists (Cortes et al. 2013; Bartlett et al. 2005; Mendelson 2003).

We can link, instead, both the Ivanov and Tikhonov formulations to the Algorithmic Stability framework (Bousquet and Elisseeff 2002; Poggio et al. 2004; Tomasi 2004; Oneto et al. 2014; Elisseeff et al. 2005; Evgeniou et al. 2004). Bousquet and Elisseeff (2002) showed that, under some mild conditions, it is possible to bound the generalization error of a learning algorithm without defining a set of hypotheses a priori: in particular, we consider the notion of Uniform Stability, which provides the sharpest bounds. Let us define the Uniform Stability as the constant \(\beta \) such that:

$$\begin{aligned} \forall {\mathcal {D}}_n, ({\varvec{x}},y): \quad |\ell (h_1({\varvec{x}}),y) - \ell (h_2({\varvec{x}}),y) | \le \beta \end{aligned}$$
(66)

where \(h_1\) is learned using the whole set \({\mathcal {D}}_n\), while \(h_2\) is trained on the set obtained by removing one sample from \({\mathcal {D}}_n\). Then, the following bound can be derived:

$$\begin{aligned} L(h) \le \hat{L}(h) + 2 \beta + (4 n \beta + 1) \sqrt{\frac{\log \left( \frac{1}{\delta }\right) }{2 n}} \end{aligned}$$
(67)

The Uniform Stability \(\beta \) can be upper bounded when the Tikhonov formulation is used for learning (Bousquet and Elisseeff 2002), in fact:

$$\begin{aligned} |\ell (h_1({\varvec{x}}),y) - \ell (h_2({\varvec{x}}),y) | \le l | h_1({\varvec{x}}) -h_2({\varvec{x}})| \le l^2 k^2 C \end{aligned}$$
(68)

Moreover, a similar bound can be found, when using the Ivanov regularization scheme:

$$\begin{aligned} |\ell (h_1({\varvec{x}}),y) - \ell (h_2({\varvec{x}}),y) | \le l | h_1({\varvec{x}}) -h_2({\varvec{x}})| \le 2 l k w_{\text {MAX}} \end{aligned}$$
(69)

Therefore, in the Algorithmic Stability framework, both C and \(w_{\text {MAX}}\) can be used for directly controlling the generalization ability of a learning algorithm.
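The two controls can be compared directly by plugging the bounds of Eqs. (68) and (69) on \(\beta \) into the risk bound of Eq. (67), as in the sketch below; the values of C, \(w_{\text {MAX}}\), k, l, n and \(\delta \) are purely illustrative.

```python
import numpy as np

def stability_bound(L_hat, beta, n, delta):
    # Eq. (67): L(h) <= L_hat + 2*beta + (4*n*beta + 1) * sqrt(log(1/delta) / (2n))
    return L_hat + 2 * beta + (4 * n * beta + 1) * np.sqrt(np.log(1 / delta) / (2 * n))

l, k, n, delta, L_hat = 1.0, 1.0, 1000, 0.05, 0.10
C, w_max = 0.01, 1.5

beta_tikhonov = l ** 2 * k ** 2 * C   # Eq. (68): stability bound for the Tikhonov formulation
beta_ivanov = 2 * l * k * w_max       # Eq. (69): stability bound for the Ivanov formulation

print(stability_bound(L_hat, beta_tikhonov, n, delta))
print(stability_bound(L_hat, beta_ivanov, n, delta))
# One can then keep the formulation whose associated risk bound is smaller.
```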

However, it is worthwhile noting that, given a particular training set, even if we were able to find C and \(w_{\text {MAX}}\) such that the solutions to the Tikhonov and Ivanov formulations are equivalent, the stability of the two learning procedures could be different. In other words, let us consider a training set \({\mathcal {D}}_n\) and let us solve the Tikhonov formulation for a particular value of C: the solution will be \({\varvec{w}}^*_{C}\). If we set \(w_{\text {MAX}} = \left\| {\varvec{w}}^*_{C} \right\| \) and solve the Ivanov formulation, then the same solution as the Tikhonov one will be found, given the results of Sect. 4. This means that the Tikhonov and the Ivanov formulations are apparently indistinguishable, since they give the same result when applied to the same data. Moreover, if we analyze the two procedures in the usual SRM framework, we can conclude that the two learning machines choose functions from the same set and therefore have the same associated risk. Algorithmic Stability, instead, gives more insight into the learning process. In fact, even if the two procedures give the same solution, the risks associated with them can be different. This reflects the fact that the two procedures are, indeed, different, despite choosing the same final model. The Tikhonov formulation, for a given C, chooses from a set of hypotheses which depends on \({\mathcal {D}}_n\); the Ivanov formulation, instead, always chooses from the same set of hypotheses, for a given \(w_{\text {MAX}}\), regardless of the available \({\mathcal {D}}_n\). Consequently, given a specific problem, it is possible to adopt the formulation characterized by the best learning properties.

Algorithmic Stability also opens another perspective on the Tikhonov and the Ivanov formulations. Given the solution of the Tikhonov formulation for a given C, one can adopt the Algorithmic Stability bound of Eq. (69) by taking \(w_{\text {MAX}} = \left\| {\varvec{w}}^*_{C} \right\| \), since the solutions of the two formulations coincide in this setting. In practice, we act as if we were using the Ivanov formulation, which is advantageous whenever the bound of Eq. (69) is tighter. Consequently, the two formulations may have different learning capabilities on different datasets, and the bounds of Eqs. (68) and (69) can tell us which one is best for the problem under examination: it is only necessary to choose the formulation with the smaller associated risk.

In order to better clarify the above analysis, we show here, through a simple example, that the class \({\mathcal {H}}\) varies based on the observations in the case of the Tikhonov formulation. Let us consider a one-dimensional linear classification problem, where \(h(x) = w \cdot x\), and the following datasets:

  • TOY-1—n samples in \(x = -1\) with \(y = -1\); n samples in \(x = 1\) with \(y = 1\); one sample in \(x = 1+\Delta \), with \(y = 1\) (Fig. 5).

  • TOY-2—n samples in \(x = -1\) with \(y = -1\); n samples in \(x = 1\) with \(y = 1\); one sample in \(x = 1+\Delta \), with \(y = -1\) (Fig. 6).

Fig. 5 TOY-1

Fig. 6 TOY-2

Figure 7 shows the trend of \(| w^* |\) as C is varied in \([10^{-6}, 10^4]\), where \(w^*\) is the solution of the Tikhonov formulation, when \(n = 5\), \(\Delta = 8\), and the hinge loss function (Vapnik 1998) is used. It clearly emerges that the class of functions \({\mathcal {H}}\) is not determined by C: in fact, for a given value of C, the solution \(w^*\) changes depending on the training set.

Fig. 7 Trend of \(| w^* |\), where \(w^*\) is the solution of the Tikhonov formulation, for TOY-1 and TOY-2
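A rough sketch of the toy experiment is reported below: it solves the one-dimensional Tikhonov problem \(\frac{1}{2} w^2 + C \sum _i \max (0, 1 - y_i w x_i)\) by direct numerical minimization (a choice made here for illustration; the solver used to produce Fig. 7 is not specified in the text) and tracks \(|w^*|\) as C varies for TOY-1 and TOY-2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def toy_dataset(n=5, delta=8.0, extra_label=+1):
    # n samples at x = -1 (y = -1), n samples at x = +1 (y = +1),
    # one extra sample at x = 1 + delta with label extra_label
    x = np.concatenate([-np.ones(n), np.ones(n), [1.0 + delta]])
    y = np.concatenate([-np.ones(n), np.ones(n), [float(extra_label)]])
    return x, y

def tikhonov_1d(x, y, C):
    # one-dimensional T-SVM without bias: min_w 0.5*w^2 + C * sum_i hinge(y_i * w * x_i)
    obj = lambda w: 0.5 * w ** 2 + C * np.sum(np.maximum(0.0, 1.0 - y * w * x))
    return minimize_scalar(obj, bounds=(-10.0, 10.0), method="bounded").x

x1, y1 = toy_dataset(extra_label=+1)   # TOY-1
x2, y2 = toy_dataset(extra_label=-1)   # TOY-2

for C in np.logspace(-6, 4, 11):
    w1, w2 = tikhonov_1d(x1, y1, C), tikhonov_1d(x2, y2, C)
    print(f"C = {C:1.0e}   |w*| TOY-1 = {abs(w1):.3f}   |w*| TOY-2 = {abs(w2):.3f}")
```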

The same toy problems can be exploited to show that, using the notion of Uniform Stability, either the Tikhonov or the Ivanov formulation can be adopted, based on the one that shows the best learning properties. Figure 8 shows the value of the Uniform Stability obtained by using Eqs. (68) and (69) as C is varied, with the same experimental setup as above (note that \(l=1\) when the hinge loss function is exploited). Obviously, if the bound of Eq. (68) is used, the Uniform Stability is the same for the two problems. Instead, if we use the bound of Eq. (69), the Uniform Stability of TOY-1 is larger than the one computed with Eq. (68), while the Uniform Stability of TOY-2 is smaller.

Fig. 8 Uniform Stability by using Eqs. (68) and (69)

As a final remark, it is worth noting that, to the best of our knowledge, a link between the Morozov formulation and the Uniform Stability has not been found yet.

6 Tikhonov, Ivanov and Morozov formulations for support vector machine classifiers learning

The approaches proposed in the previous sections are quite general and can be applied to any convex loss function. Here, instead, we focus on the SVM classifier and its hinge loss function, and show that the procedures described by Algorithms 1 and 2 can be further refined and improved.

The hinge loss function is defined as:

$$\begin{aligned} \ell \left( h({\varvec{x}}),y\right) = \xi = \max \left[ 0, 1 - y h({\varvec{x}})\right] = \left| 1 - y h({\varvec{x}}) \right| _+, \end{aligned}$$
(70)

where \(|\cdot |_+ = \max (0,\cdot )\). Then, the T-SVM learning problem can be formulated as follows:

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b,{\varvec{\xi }}}\quad&\sum _{i = 1}^n \xi _i + \frac{\lambda }{2} \left\| {\varvec{w}} \right\| ^2 \nonumber \\&y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}_i) + b\right) \ge 1 - \xi _i \nonumber \\&\xi _i \ge 0, \quad \quad i \in \left\{ 1,\ldots , n\right\} , \end{aligned}$$
(71)
$$\begin{aligned}&\nonumber \\ \text {or equivalently:}\nonumber \\ h: \quad \arg \min _{{\varvec{w}},b,{\varvec{\xi }}}\quad&\frac{1}{2} \left\| {\varvec{w}} \right\| ^2 + C \sum _{i = 1}^n \xi _i \nonumber \\&y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}_i) + b\right) \ge 1 - \xi _i \nonumber \\&\xi _i \ge 0, \quad \quad i \in \left\{ 1,\ldots , n\right\} . \end{aligned}$$
(72)

The I-SVM formulation is:

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b,{\varvec{\xi }},\left\| {\varvec{w}} \right\| } \quad&\sum _{i = 1}^n \xi _i \nonumber \\&\left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}}, \nonumber \\&y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}_i) + b\right) \ge 1 - \xi _i \nonumber \\&\xi _i \ge 0, \quad \quad i \in \left\{ 1,\ldots , n\right\} , \end{aligned}$$
(73)

Finally, the M-SVM formulation is:

$$\begin{aligned} h: \quad \arg \min _{{\varvec{w}},b,{\varvec{\xi }}} \quad&\frac{1}{2} \left\| {\varvec{w}} \right\| ^2 \nonumber \\&\sum _{i = 1}^n \xi _i \le \hat{L}_{\text {MAX}}, \nonumber \\&y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}_i) + b\right) \ge 1 - \xi _i \nonumber \\&\xi _i \ge 0, \quad \quad i \in \left\{ 1,\ldots , n\right\} . \end{aligned}$$
(74)

6.1 Training a classifier with T-SVM

The most common approach for training a T-SVM classifier relies on solving the dual formulation of Problem (71) (or, equivalently, of Problem (72)):

$$\begin{aligned} \min _{{\varvec{\alpha }}}\quad&\frac{1}{2} \sum _{i = 1}^n \sum _{j = 1}^n \alpha _i \alpha _j y_i y_j K\left( {\varvec{x}}_i, {\varvec{x}}_j \right) - \sum _{i = 1}^n \alpha _i \nonumber \\&\sum _{i = 1}^n \alpha _i y_i = 0 \nonumber \\&0 \le \alpha _i \le C, \quad \quad i \in \left\{ 1,\ldots , n\right\} , \end{aligned}$$
(75)

where the classification function is reformulated accordingly:

$$\begin{aligned} h\left( {\varvec{x}}\right) = \sum _{i = 1}^n \alpha _i y_i K({\varvec{x}}_i,{\varvec{x}}) + b. \end{aligned}$$
(76)

One of the most well-known approaches for solving Problem (75) is the Sequential Minimal Optimization (SMO) algorithm (Platt 1998; Keerthi et al. 2001; Keerthi and Gilbert 2002; Fan et al. 2005), though several alternatives are available in the literature.
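As a purely illustrative complement (not part of the original paper), the following sketch assumes scikit-learn is available: it trains a T-SVM with the library's SMO-type solver on synthetic data and re-evaluates the classification function of Eq. (76) directly from the stored dual variables, exploiting the fact that SVC's dual_coef_ stores the products \(\alpha _i y_i\) for the support vectors.

```python
# A hedged sketch (assuming scikit-learn): train a T-SVM with an SMO-type solver,
# then evaluate Eq. (76) by hand from the dual variables and compare with the
# library's decision_function.  Data and hyperparameters are arbitrary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

gamma = 0.5                                          # RBF parameter (arbitrary)
clf = SVC(C=10.0, kernel="rbf", gamma=gamma).fit(X, y)

def rbf(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# dual_coef_[0, k] = alpha_k * y_k for the k-th support vector, so Eq. (76) is a
# weighted sum of kernel evaluations plus the bias term b (intercept_).
h_manual = rbf(X, clf.support_vectors_, gamma) @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(h_manual, clf.decision_function(X)))   # expected: True
```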

6.2 Training a classifier with I-SVM

The first alternative for effectively solving I-SVM consists in exploiting Algorithm 1, where the T-SVM Problem (71) (or (75)) is used in place of Problem (4) and, analogously, I-SVM Problem (73) replaces Problem (7).

A second technique, tailored to I-SVM learning, consists in using the results of Hastie et al. (2004) to reduce the computational burden of Algorithm 1. In their work, Hastie et al. (2004) prove that the solution path of T-SVM, obtained by varying C, is piecewise linear: the optimal solution \(\left( {\varvec{\alpha }}^*_T, b^*_T \right) \) changes linearly as C increases, except at S change-points, which can be identified effectively, since the computational cost of finding them is comparable to that of finding a single solution to Problem (75) (Hastie et al. 2004). The authors also proved that \(\exists C_\infty < \infty \) such that \(\forall C \ge C_\infty \) the solution \(\left( {\varvec{w}}^*_T, b^*_T\right) \) does not vary with C, which is a particular case of the more general Theorem 6, presented in Sect. 4. Furthermore, note that, in the case of the SVM, the equivalence between I-SVM, T-SVM and M-SVM holds not only for the norm of the solutions, but also for the solutions themselves (Hastie et al. 2004). Consequently, we can exploit the information embedded in the regularization path to improve Algorithm 1 when addressing I-SVM learning.

Let \(RP_{[33]}\) denote the implementation of the method proposed in Hastie et al. (2004), which identifies the solution path of the T-SVM Problem (75):

$$\begin{aligned} \left\{ \left( {\varvec{\alpha }}^*_T(C_0), b^*_T(C_0), C_0\right) , \ldots , \left( {\varvec{\alpha }}^*_T(C_S), b^*_T(C_S), C_S\right) \right\} = RP_{[33]}, \end{aligned}$$
(77)

where, in particular, given Theorem 6 and the results of Hastie et al. (2004), \(C_S = C_\infty \). Since the solution varies linearly with respect to C between two consecutive change-points \(C_{i-1}\) and \(C_i\), with \(i \in \left\{ 1,\ldots ,S\right\} \), we have:

$$\begin{aligned} {\varvec{\alpha }}^*_T(C)&= {\varvec{m}}^*_T(C_i) C + {\varvec{o}}^*_T(C_i), \quad \text {for} \quad C \in \left[ C_{i-1}, C_i\right] \end{aligned}$$
(78)
$$\begin{aligned} {\varvec{m}}^*_T(C_i)&= \frac{{\varvec{\alpha }}^*_T(C_i) - {\varvec{\alpha }}^*_T(C_{i-1})}{C_{i} -C_{i-1}}, \quad {\varvec{o}}^*_T(C_i) = {\varvec{\alpha }}^*_T(C_{i-1}) - {\varvec{m}}^*_T(C_i) C_{i-1}. \end{aligned}$$
(79)

We can also compute \(\left\| {\varvec{w}}^*_C \right\| ^2\) for \(C \in \left[ C_{i-1}, C_i\right] \):

$$\begin{aligned}&\left\| {\varvec{w}}^*_C \right\| ^2 = \sum _{p = 1}^n \sum _{q = 1}^n \left[ \left( {\varvec{m}}^*_T(C_i)\right) _p C + \left( {\varvec{o}}^*_T(C_i)\right) _p\right] \nonumber \\&\qquad \qquad \quad \,\left[ \left( {\varvec{m}}^*_T(C_i)\right) _q C + \left( {\varvec{o}}^*_T(C_i)\right) _q\right] y_p y_q K\left( {\varvec{x}}_p,{\varvec{x}}_q\right) \end{aligned}$$
(80)
$$\begin{aligned}&\quad = C^2 \sum _{p = 1}^n \sum _{q = 1}^n \left( {\varvec{m}}^*_T(C_i)\right) _p \left( {\varvec{m}}^*_T(C_i)\right) _q y_p y_q K\left( {\varvec{x}}_p,{\varvec{x}}_q\right) \nonumber \\&\qquad + 2 C \sum _{p = 1}^n \sum _{q = 1}^n \left( {\varvec{m}}^*_T(C_i)\right) _p \left( {\varvec{o}}^*_T(C_i)\right) _q y_p y_q K\left( {\varvec{x}}_p,{\varvec{x}}_q\right) \nonumber \\&\qquad + \sum _{p = 1}^n \sum _{q = 1}^n \left( {\varvec{o}}^*_T(C_i)\right) _p \left( {\varvec{o}}^*_T(C_i)\right) _q y_p y_q K\left( {\varvec{x}}_p,{\varvec{x}}_q\right) \end{aligned}$$
(81)
$$\begin{aligned}&\quad = P2^*_T(C_i) C^2 + P1^*_T(C_i) C + P0^*_T(C_i) \quad C \in \left[ C_{i-1}, C_i\right] \end{aligned}$$
(82)

Given these results we can reformulate Eq. (77) as follows:

$$\begin{aligned}&\left\{ \left[ P2^*_T(C_1),P1^*_T(C_1),P0^*_T(C_1), \quad C \in \left[ C_0 = 0,C_1\right] \right] , \right. \nonumber \\&\left[ P2^*_T(C_2),P1^*_T(C_2),P0^*_T(C_2), \quad C \in \left[ C_1,C_2\right] \right] , \nonumber \\&\quad \ldots , \nonumber \\&\left. \left[ P2^*_T(C_S),P1^*_T(C_S),P0^*_T(C_S), \quad C \in \left[ C_{S-1},C_S = C_\infty \right] \right] \right\} \end{aligned}$$
(83)
$$\begin{aligned}&\quad =\left\{ P2^*_T(C),P1^*_T(C),P0^*_T(C), \quad C \in \left[ 0,C_\infty \right] \right\} \end{aligned}$$
(84)
$$\begin{aligned}&\quad =\left\{ \left\| {\varvec{w}}^*(C) \right\| , {\varvec{w}}^*(C), \quad C \in \left[ 0,C_\infty \right] \right\} = RP_{[33]}. \end{aligned}$$
(85)
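As an illustration of Eqs. (80)–(82), the following sketch (with made-up slope/offset vectors and kernel matrix, not values produced by the actual path solver) computes the coefficients \(P2\), \(P1\) and \(P0\) on a segment and shows how one could recover, on that segment, the value of C for which \(\Vert {\varvec{w}}^*_C\Vert \) equals a prescribed \(w_{\text {MAX}}\).

```python
# A hedged sketch of Eqs. (80)-(82): on a path segment [C_prev, C_next] with dual
# slope m and offset o, ||w_C||^2 is a quadratic polynomial in C.  m, o, y and K
# below are fictitious placeholders, not output of the actual path algorithm.
import numpy as np

def norm_poly(m, o, y, K):
    """Coefficients (P2, P1, P0) such that ||w_C||^2 = P2*C^2 + P1*C + P0."""
    Q = (y[:, None] * y[None, :]) * K        # Q_pq = y_p y_q K(x_p, x_q)
    return m @ Q @ m, 2.0 * (m @ Q @ o), o @ Q @ o

def c_for_target_norm(P2, P1, P0, w_max, C_prev, C_next):
    """Smallest C in [C_prev, C_next] with ||w_C|| = w_max, or None."""
    roots = np.roots([P2, P1, P0 - w_max ** 2])
    roots = roots[np.isreal(roots)].real
    roots = roots[(roots >= C_prev) & (roots <= C_next)]
    return roots.min() if roots.size else None

rng = np.random.default_rng(1)
n = 6
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)
A = rng.standard_normal((n, n)); K = A @ A.T + np.eye(n)   # some SPD kernel matrix
m, o = rng.random(n), rng.random(n)                        # fictitious slope / offset

P2, P1, P0 = norm_poly(m, o, y, K)
print(c_for_target_norm(P2, P1, P0, w_max=2.0, C_prev=0.1, C_next=1.0))
```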

As a consequence, we can reformulate Algorithm 1 by exploiting the results of Hastie et al. (2004): the resulting approach is presented in Algorithm 3.

[Algorithm 3]

A third approach is based on the ideas of Martein and Schaible (1987) and Anguita et al. (2010). It exploits conventional Linear Programming (LP) and Quadratic Programming (QP) optimization algorithms in turn, as shown in Algorithm 4. The first step consists in discarding the quadratic constraint \(\left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}}\) and using the Representer Theorem to reformulate Problem (73) as follows:

$$\begin{aligned} h: \quad \arg \min _{{\varvec{\alpha }},b,{\varvec{\xi }}} \quad&\sum _{i = 1}^n \xi _i \nonumber \\&y_i \left( \sum _{j = 1}^n \alpha _j y_j K({\varvec{x}}_j,{\varvec{x}}_i) + b\right) \ge 1 - \xi _i \nonumber \\&\xi _i \ge 0, \quad \quad i \in \left\{ 1,\ldots , n\right\} \end{aligned}$$
(86)

which is a standard LP problem. After the optimization procedure ends, the value of \(\Vert {\varvec{w}}\Vert ^2\) is computed and two alternatives arise: if the constraint is satisfied, then the solution found is also the optimal one and the routine ends; otherwise the optimal solution corresponds to \(\Vert {\varvec{w}}\Vert = w_{MAX}\) (Boyd and Vandenberghe 2004). In order to find \({\varvec{w}}\), we have to switch to the dual of Problem (73):

$$\begin{aligned} {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},{\varvec{\beta }},\gamma ,{\varvec{\mu }}\right) =&\sum _{i = 1}^n \xi _i - \frac{\gamma }{2} \left( w^2_{\text {MAX}} - \left\| {\varvec{w}} \right\| ^2 \right) \nonumber \\&- \sum _{i = 1}^n \beta _i \left[ y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}_i) + b\right) - 1 + \xi _i\right] - \sum _{i = 1}^n \mu _i \xi _i. \end{aligned}$$
(87)

Then we can compute the Karush–Kuhn–Tucker (KKT) and the complementary conditions:

$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},{\varvec{\beta }},\gamma ,{\varvec{\mu }}\right) }{\partial {\varvec{w}}} =&\; \gamma {\varvec{w}} - \sum _{i = 1}^n \beta _i y_i {\varvec{\phi }}({\varvec{x}}_i) = 0 \ \rightarrow \ {\varvec{w}} = \frac{1}{\gamma } \sum _{i = 1}^n \beta _i y_i {\varvec{\phi }}({\varvec{x}}_i), \quad \gamma \ne 0 \end{aligned}$$
(88)
$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},{\varvec{\beta }},\gamma ,{\varvec{\mu }}\right) }{\partial b} =&\; - \sum _{i = 1}^n \beta _i y_i = 0 \end{aligned}$$
(89)
$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},{\varvec{\beta }},\gamma ,{\varvec{\mu }}\right) }{\partial \xi _i} =&\; 1 - \beta _i - \mu _i = 0 \ \rightarrow \ \beta _i \le 1 \end{aligned}$$
(90)
$$\begin{aligned}&\gamma ,{\varvec{\beta }},{\varvec{\mu }},\xi _i \ge 0 \end{aligned}$$
(91)
$$\begin{aligned}&\left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}} \end{aligned}$$
(92)
$$\begin{aligned}&y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}) + b\right) \ge 1 - \xi _i \end{aligned}$$
(93)
$$\begin{aligned}&\beta _i \left[ y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}) + b\right) - 1 + \xi _i\right] = 0 \end{aligned}$$
(94)
$$\begin{aligned}&\mu _i \xi _i = 0 \end{aligned}$$
(95)
$$\begin{aligned}&\gamma \left( w^2_{\text {MAX}} - \left\| {\varvec{w}} \right\| ^2 \right) = 0, \quad \forall i \in \left\{ 1,\ldots ,n\right\} \end{aligned}$$
(96)

The dual problem is then formulated as follows:

$$\begin{aligned} \min _{{\varvec{\beta }},\gamma } \quad&\frac{1}{2 \gamma } \sum _{i = 1}^n \sum _{j = 1}^n \beta _i \beta _j y_i y_j K({\varvec{x}}_j,{\varvec{x}}_i) - \sum _{i = 1}^n \beta _i + \frac{\gamma w^2_{\text {MAX}}}{2} \nonumber \\&\sum _{i = 1}^n \beta _i y_i = 0 \nonumber \\&\gamma \ge 0 \nonumber \\&0 \le \beta _i \le 1, \quad \quad i \in \left\{ 1,\ldots , n\right\} . \end{aligned}$$
(97)

Note that, if the quadratic constraint \(\left\| {\varvec{w}} \right\| ^2 \le w^2_{\text {MAX}}\) were inactive at the optimum (i.e. satisfied as a strict inequality), \(\gamma \) would equal zero and the dual formulation would not be solvable: this is why, as a first step, we make use of LP routines for solving Problem (86), and we exploit the dual formulation only if the quadratic constraint turns out to be violated. We are interested in solving Problem (97) using conventional QP solvers for SVM (e.g. Platt 1998), therefore we adopt an iterative optimization technique. The first step consists in fixing the value of \(\gamma =\gamma _0 > 0\) and then optimizing the cost function with respect to the remaining dual variables \({\varvec{\beta }}\). At this stage the term \(\frac{\gamma _0 w_{MAX}^2}{2}\) is constant and can be removed from the expression; multiplying the remaining objective by \(\gamma _0 > 0\), which does not change the minimizer, the dual becomes:

$$\begin{aligned} \min _{{\varvec{\beta }}} \quad&\frac{1}{2} \sum _{i=1}^{n} \sum _{j=1}^{n} \beta _i \beta _j y_i y_j K({\varvec{x}}_j,{\varvec{x}}_i) - \gamma _0 \sum _{i = 1}^n \beta _i \nonumber \\&\sum _{i = 1}^n \beta _i y_i = 0 \nonumber \\&0\le \beta _i \le 1, \quad \quad i \in \left\{ 1,\ldots , n\right\} . \end{aligned}$$
(98)

which, up to the rescaling \({\varvec{\beta }} = \gamma _0 {\varvec{\alpha }}\) (i.e. \(C = 1/\gamma _0\)), is equivalent to the conventional SVM dual problem (75) and can therefore be solved with SMO. The next step consists in updating the value of \(\gamma _0\): to this end, we compute the Lagrangian of Problem (97):

$$\begin{aligned} {\mathcal {L}}\left( {\varvec{\beta }},\gamma ,\rho ,\kappa ,{\varvec{\eta }},{\varvec{\nu }} \right) =&\; \frac{1}{2 \gamma } \sum _{i = 1}^n \sum _{j = 1}^n \beta _i \beta _j y_i y_j K({\varvec{x}}_j,{\varvec{x}}_i) - \sum _{i = 1}^n \beta _i + \frac{\gamma w^2_{\text {MAX}}}{2} \nonumber \\&- \rho \left( \sum _{i = 1}^n \beta _i y_i\right) - \kappa \gamma - \sum _{i = 1}^n \eta _i \beta _i - \sum _{i = 1}^n \nu _i (1 - \beta _i) \end{aligned}$$
(99)

The following derivative of \({\mathcal {L}}\left( {\varvec{\beta }},\gamma ,\rho ,\kappa ,{\varvec{\eta }},{\varvec{\nu }} \right) \) is the only one of interest for our purposes:

$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{\beta }},\gamma ,\rho ,\kappa ,{\varvec{\eta }},{\varvec{\nu }} \right) }{\partial \gamma } = 0 = - \frac{1}{2\gamma ^2} \sum _{i = 1}^n \sum _{j = 1}^n \beta _i \beta _j y_i y_j K({\varvec{x}}_j,{\varvec{x}}_i) + \frac{w^2_{\text {MAX}}}{2} - \kappa \end{aligned}$$
(100)

Since, from the slackness conditions, we have that \(\kappa \gamma = 0\) and since, in the cases of interest, \(\gamma > 0\), it must be \(\kappa = 0\) and we find the following updating rule for \(\gamma _0\):

$$\begin{aligned} \gamma _0 = \frac{\sqrt{\sum _{i = 1}^n \sum _{j = 1}^n \beta _i \beta _j y_i y_j K({\varvec{x}}_j,{\varvec{x}}_i)}}{w_{MAX}}. \end{aligned}$$
(101)

We proceed iteratively, solving Problem (98) and updating the value of \(\gamma _0\), until the termination condition is met:

$$\begin{aligned} \left| \gamma _0 -\gamma _0^{\text {old}} \right| \le \tau , \end{aligned}$$
(102)

where \(\tau \) is a user-defined tolerance.

[Algorithm 4]
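As a rough illustration of the iterative phase of Algorithm 4 (i.e. the case in which the norm constraint is active; the LP pre-check of Problem (86) is omitted here), the following sketch, which assumes the cvxopt package and uses toy data in place of a real problem, alternates between solving Problem (98) for a fixed \(\gamma _0\) (with a generic QP solver instead of SMO) and updating \(\gamma _0\) through Eq. (101), until the stopping rule of Eq. (102) is met.

```python
# A hedged sketch of the QP phase of Algorithm 4 (norm constraint assumed active).
# Problem (98) is solved with cvxopt for a fixed gamma0, which is then updated via
# Eq. (101) until the termination condition of Eq. (102) holds.  Data, kernel,
# w_MAX and the iteration cap are toy placeholders.
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False
rng = np.random.default_rng(0)

X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = (y[:, None] * y[None, :]) * np.exp(-0.5 * d2)     # Q_ij = y_i y_j K(x_i, x_j)
n, w_max, tau = len(y), 2.0, 1e-6

def solve_problem_98(gamma0):
    # min 1/2 b'Qb - gamma0 * sum(b)   s.t.  y'b = 0,  0 <= b_i <= 1
    P, q = matrix(Q), matrix(-gamma0 * np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.concatenate([np.zeros(n), np.ones(n)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)
    return np.asarray(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

gamma0 = 1.0
for _ in range(100):
    beta = solve_problem_98(gamma0)
    gamma_new = np.sqrt(beta @ Q @ beta) / w_max      # update rule of Eq. (101)
    converged = abs(gamma_new - gamma0) <= tau        # stopping rule of Eq. (102)
    gamma0 = gamma_new
    if converged:
        break

# at convergence ||w|| = sqrt(beta' Q beta) / gamma0 sits on the constraint w_MAX
print(gamma0, np.sqrt(beta @ Q @ beta) / gamma0, w_max)
```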

The main disadvantage of Algorithm 4 is that it requires two different solvers (an LP and a QP routine). By exploiting the results of Sect. 4, we can avoid this problem: in fact, we showed that the Lagrange multiplier \(\gamma \) of the quadratic constraint in Problem (73) plays the same role as the regularization parameter C in Problem (71). In particular, there exists a value of \(\gamma > 0\) for which the solution of Problem (73) coincides with the solution of Problem (71) for \(C = C_\infty \). Consequently, only the QP solver is necessary to train a classifier with I-SVM, as shown in Algorithm 5.

[Algorithm 5]

6.3 Training a classifier with M-SVM

In order to propose an effective method to solve M-SVM by exploiting the approaches designed for T-SVM, as a first step we can apply the general-purpose approach of Algorithm 2. Obviously, this technique is characterized by the same drawbacks highlighted in Sect. 6.2 for I-SVM.

A second possibility consists, analogously to I-SVM, in exploiting the results of Hastie et al. (2004) to improve the performance of Algorithm 2. The entire regularization path is obtained by:

$$\begin{aligned} \left\{ {\varvec{w}}^*(C), b^*(C), {\varvec{\xi }}^*(C), \quad C \in \left[ 0,C_\infty \right] \right\} = RP_{[33]}, \end{aligned}$$
(103)

where the computational cost is comparable to that of a single optimization of Problem (71), since all the variables (\({\varvec{\alpha }}\), b, \({\varvec{\xi }}\), etc.) are piecewise linear in C. Then, Algorithm 2 can be modified accordingly, as shown in Algorithm 6.

[Algorithm 6]
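Since all the variables of the path of Eq. (103), and hence the total slack \(\sum _i \xi _i^*(C)\), are piecewise linear in C, the key step behind Algorithm 6 amounts to a linear interpolation between two consecutive change-points. A minimal sketch with fictitious breakpoint values (our illustration, not the published pseudo-code):

```python
# A hedged sketch of the interpolation step behind Algorithm 6: given the
# change-points of the regularization path and the (piecewise-linear,
# non-increasing) total slack at each of them, find by linear interpolation the
# value of C at which the empirical hinge loss equals L_MAX.  Values are made up.
import numpy as np

def c_for_target_loss(C_points, loss_points, L_max):
    C_points = np.asarray(C_points, dtype=float)
    loss_points = np.asarray(loss_points, dtype=float)
    for k in range(1, len(C_points)):
        lo, hi = loss_points[k], loss_points[k - 1]     # loss decreases with C
        if lo <= L_max <= hi:
            t = (hi - L_max) / (hi - lo) if hi > lo else 0.0
            return C_points[k - 1] + t * (C_points[k] - C_points[k - 1])
    return None                                         # L_max not reached on the path

C_points = [0.0, 0.5, 1.2, 3.0, 10.0]      # fictitious change-points C_0, ..., C_S
loss_points = [40.0, 22.0, 9.0, 4.0, 2.5]  # fictitious sum of slacks at each C_k
print(c_for_target_loss(C_points, loss_points, L_max=6.0))   # a value in [1.2, 3.0]
```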

A third possibility consists in solving Problem (74) through an ad hoc procedure, which nevertheless allows us to exploit the large amount of work devoted to the design of effective solvers for T-SVM. For this purpose, we start by deriving the dual formulation of M-SVM:

$$\begin{aligned} {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},\gamma ,{\varvec{\beta }},{\varvec{\mu }} \right) =&\; \frac{1}{2} \left\| {\varvec{w}} \right\| ^2 - \gamma \left( \hat{L}_{\text {MAX}} - \sum _{i = 1}^n \xi _i \right) \nonumber \\&- \sum _{i = 1}^n \beta _i \left[ y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}_i) + b\right) - 1 + \xi _i \right] - \sum _{i = 1}^n \mu _i \xi _i. \end{aligned}$$
(104)

Then we can compute its KKT and the complementary conditions:

$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},\gamma ,{\varvec{\beta }},{\varvec{\mu }} \right) }{\partial {\varvec{w}}} =&\; {\varvec{w}} - \sum _{i = 1}^n \beta _i y_i {\varvec{\phi }}({\varvec{x}}_i) = 0 \ \rightarrow \ {\varvec{w}} = \sum _{i = 1}^n \beta _i y_i {\varvec{\phi }}({\varvec{x}}_i) \end{aligned}$$
(105)
$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},\gamma ,{\varvec{\beta }},{\varvec{\mu }} \right) }{\partial b} =&- \sum _{i = 1}^n \beta _i y_i = 0 \end{aligned}$$
(106)
$$\begin{aligned} \frac{\partial {\mathcal {L}}\left( {\varvec{w}},b,{\varvec{\xi }},\gamma ,{\varvec{\beta }},{\varvec{\mu }} \right) }{\partial \xi _i} =&\; \gamma - \beta _i - \mu _i = 0 \ \rightarrow \ \beta _i \le \gamma \end{aligned}$$
(107)
$$\begin{aligned}&\gamma ,{\varvec{\beta }},{\varvec{\mu }},\xi _i \ge 0 \end{aligned}$$
(108)
$$\begin{aligned}&\sum _{i = 1}^n \xi _i \le \hat{L}_{\text {MAX}} \end{aligned}$$
(109)
$$\begin{aligned}&y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}) + b\right) \ge 1 - \xi _i \end{aligned}$$
(110)
$$\begin{aligned}&\beta _i \left[ y_i \left( {\varvec{w}} \cdot {\varvec{\phi }}({\varvec{x}}) + b\right) - 1 + \xi _i\right] = 0 \end{aligned}$$
(111)
$$\begin{aligned}&\mu _i \xi _i = 0 \end{aligned}$$
(112)
$$\begin{aligned}&\gamma \left( \hat{L}_{\text {MAX}} - \sum _{i = 1}^n \xi _i \right) = 0, \quad \forall i \in \left\{ 1,\ldots ,n\right\} \end{aligned}$$
(113)

Finally, we get:

$$\begin{aligned} \min _{{\varvec{\beta }},\gamma } \quad&CD({\varvec{\beta }},\gamma ) = \frac{1}{2} \sum _{i = 1}^n \sum _{j = 1}^n \beta _i \beta _j y_i y_j K({\varvec{x}}_i,{\varvec{x}}_j) - \sum _{i = 1}^n \beta _i + \gamma \hat{L}_{\text {MAX}} \nonumber \\&\sum _{i = 1}^n \beta _i y_i = 0 \nonumber \\&\gamma \ge 0 \nonumber \\&0 \le \beta _i \le \gamma , \quad \quad i \in \left\{ 1,\ldots , n\right\} , \end{aligned}$$
(114)

with

$$\begin{aligned} h({\varvec{x}}) = \sum _{i = 1}^n \beta _i y_i K({\varvec{x}}_i,{\varvec{x}}) + b. \end{aligned}$$
(115)

Note that, if we fix \(\gamma = \gamma _0\), where \( \gamma _0\) is a constant value, Problem (114) becomes:

$$\begin{aligned} \min _{{\varvec{\beta }}} \quad&\frac{1}{2} \sum _{i = 1}^n \sum _{j = 1}^n \beta _i \beta _j y_i y_j K({\varvec{x}}_i,{\varvec{x}}_j) - \sum _{i = 1}^n \beta _i \nonumber \\&\sum _{i = 1}^n \beta _i y_i = 0 \nonumber \\&0 \le \beta _i \le \gamma _0, \quad \quad i \in \left\{ 1,\ldots , n\right\} , \end{aligned}$$
(116)

which can thus be solved with the procedures designed for T-SVM. As the problem is also convex with respect to \(\gamma \), an iterative procedure, described in Algorithm 7 and based on Flannery et al. (1992), can be used to optimize it. It is worth underlining that the resulting procedure requires only a QP solver, analogously to Algorithm 5.

[Algorithm 7]
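Following the text above, a hedged sketch of the idea behind Algorithm 7 is reported below: for a fixed \(\gamma _0\) the inner Problem (116) is a standard T-SVM dual, solved here with a generic QP solver (cvxopt) in place of SMO, and the dual objective \(CD\) of Eq. (114), evaluated at the inner solution, is minimized over \(\gamma \) with a bounded one-dimensional method (in the spirit of the Flannery et al. (1992) routines). Data, kernel, \(\hat{L}_{\text {MAX}}\) and the search interval are toy placeholders.

```python
# A hedged sketch of the idea behind Algorithm 7: solve the inner Problem (116)
# for fixed gamma0 (here with cvxopt instead of SMO) and minimize the dual
# objective CD of Eq. (114) over gamma with a bounded 1-D method.
import numpy as np
from cvxopt import matrix, solvers
from scipy.optimize import minimize_scalar

solvers.options["show_progress"] = False
rng = np.random.default_rng(0)

X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = (y[:, None] * y[None, :]) * np.exp(-0.5 * d2)     # Q_ij = y_i y_j K(x_i, x_j)
n, L_max = len(y), 5.0                                # L_max is an arbitrary choice

def solve_problem_116(gamma0):
    # min 1/2 b'Qb - sum(b)   s.t.  y'b = 0,  0 <= b_i <= gamma0
    P, q = matrix(Q), matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.concatenate([np.zeros(n), gamma0 * np.ones(n)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)
    return np.asarray(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

def outer_objective(gamma0):
    beta = solve_problem_116(gamma0)
    return 0.5 * beta @ Q @ beta - beta.sum() + gamma0 * L_max   # CD of Eq. (114)

res = minimize_scalar(outer_objective, bounds=(1e-3, 100.0), method="bounded")
print("gamma* =", res.x, "  CD =", res.fun)
```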

7 Benchmarking the performance of I-SVM and M-SVM solvers

In the following, we exploit a real-world dataset to compare the algorithms presented in the previous sections. We make use of the DaimlerChrysler dataset (Munder and Gavrila 2006), where half of the 9800 images, each consisting of \(d=36 \times 18 = 648\) pixels, contain the picture of a pedestrian, while the other half contains only general background or other objects. In order to derive statistically more meaningful results than a single run would provide, we create 100 replicates of the dataset, where the values of the input patterns are left unchanged while a random array of labels is assigned to the samples, thus emulating a conventional setup for model selection and error estimation through the Rademacher Complexity (Bartlett and Mendelson 2003; Koltchinskii 2006; Anguita et al. 2012).

Fig. 9: Distributions of the values \(\left( \left\| {\varvec{w}}^*_T \right\| , \hat{L}^*_T\right) \), \(\left( \left\| {\varvec{w}}^*_I \right\| , \hat{L}^*_I\right) \) and \(\left( \left\| {\varvec{w}}^*_M \right\| , \hat{L}^*_M\right) \) for T-SVM, I-SVM and M-SVM on the datasets used for the experiments. a T-SVM Problem (71), b I-SVM Problem (73), c M-SVM Problem (74)

The algorithms are implemented in FORTRAN 90, compiled with the Intel Visual Fortran Composer XE (2012) compiler, and are run on a Microsoft Windows Server 2008 R2 machine with 16 GB of RAM and two Intel Xeon E5320 1.86 GHz CPUs.

For our experiments, we exploit a Gaussian kernel function (Keerthi and Lin 2003):

$$\begin{aligned} K\left( {\varvec{x}}_i,{\varvec{x}}_j\right) = e^{- \frac{\left\| {\varvec{x}}_i - {\varvec{x}}_j \right\| _2^2}{2 \sigma ^2} }, \end{aligned}$$
(117)

where the Gaussian width parameter \(\sigma \) is estimated by computing the average distance between patterns belonging to the two classes, according to the rule-of-thumb proposed in Milenova et al. (2005). The SVM regularization parameters C, \(w_{\text {MAX}}\) and \(\hat{L}_{\text {MAX}}\) for T-SVM, I-SVM and M-SVM, respectively, are set by exploiting one-shot procedures as well. In particular, for C we adopt the procedure proposed in Milenova et al. (2005), which returns a regularization parameter value that we denote by \(C^{[46]}\) for the sake of simplicity. Consequently, \(w_{\text {MAX}}\) and \(\hat{L}_{\text {MAX}}\) are set to \(w^{[46]}_{\text {MAX}}\) and \(\hat{L}_{\text {MAX}}^{[46]}\), respectively, which are computed by solving T-SVM Problem (71) with \(C = C^{[46]}\) on the original DaimlerChrysler dataset. Note that the regularization parameter and the parameter of the Gaussian kernel are kept constant for all 100 random replicates.
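For illustration only (the actual experiments were run with the authors' FORTRAN 90 code, and the precise rule-of-thumb is the one of Milenova et al. (2005)), the sketch below shows the general idea described above: set \(\sigma \) from the average distance between patterns of the two classes and build the Gaussian kernel matrix of Eq. (117); the random data stand in for the 648-pixel DaimlerChrysler patterns.

```python
# A hedged sketch of the kernel setup described above: sigma is taken as the
# average Euclidean distance between patterns of the two classes (the exact
# rule-of-thumb of Milenova et al. (2005) may differ in its details) and the
# Gaussian kernel matrix of Eq. (117) is then built.  Random placeholder data.
import numpy as np

def sigma_rule_of_thumb(X, y):
    Xp, Xn = X[y == +1], X[y == -1]
    d = np.sqrt(((Xp[:, None, :] - Xn[None, :, :]) ** 2).sum(-1))
    return d.mean()

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))            # Eq. (117)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 648))                    # placeholders for the patterns
y = np.where(rng.random(100) < 0.5, -1.0, 1.0)
sigma = sigma_rule_of_thumb(X, y)
K = gaussian_kernel(X, X, sigma)
print(sigma, K.shape)
```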

As highlighted in the previous sections, suitable solvers are needed in order to derive the solutions to T-SVM, I-SVM and M-SVM. In particular, whenever a QP solver is required (namely, in Algorithms 1–7), the Sequential Minimal Optimization (SMO) procedure is exploited (Platt 1998; Keerthi et al. 2001); whenever an LP solver is required instead (e.g. in Algorithm 4), we exploit the simplex method proposed in Flannery et al. (1992).

Figure 9 shows the distributions of the values \(\left( \left\| {\varvec{w}}^*_T \right\| , \hat{L}^*_T\right) \), \(\left( \left\| {\varvec{w}}^*_I \right\| , \hat{L}^*_I\right) \) and \(\left( \left\| {\varvec{w}}^*_M \right\| , \hat{L}^*_M\right) \) for T-SVM, I-SVM and M-SVM: they are useful for comparing the solutions obtained in the three cases. It is worth noting that, when solving a T-SVM learning problem, neither \(\left\| {\varvec{w}}^*_T \right\| \) nor \(\hat{L}^*_T\) remains fixed: both vary depending on the random labels assigned to the samples for computing the Rademacher Complexity of the hypothesis space. This highlights the problematic behavior of the T-SVM approach, which does not allow precise control of the size of the hypothesis space. I-SVM, instead, uses by construction a fixed hypothesis space, while M-SVM shows, as predicted, a fixed empirical error on the training data.

Table 1 shows the average training time needed by the algorithms detailed in this work to solve the T-SVM, I-SVM and M-SVM problems described above. As expected, T-SVM is the fastest to solve, though Algorithms 5 and 7 achieve comparable performance. Given that I-SVM offers a clear advantage in terms of hypothesis-space control at the cost of only a slight increase in computation time, we believe it should be the preferred approach.

Table 1 Training time for T-SVM, I-SVM and M-SVM

8 Concluding remarks

In this paper, we proved that the Tikhonov, Ivanov and Morozov regularization paths are equivalent, provided that mild and easily satisfied conditions hold (such as the convexity of the loss function). In other words, they describe the same learning problem seen from three different perspectives.

Traditionally, computational convenience motivated the adoption of the Tikhonov approach at the expense of the Ivanov and Morozov ones: by leaving both the empirical error and the hypothesis-space size unconstrained, the Tikhonov formulation is the easiest to solve, and several approaches have appeared in the literature for this purpose over the last decades. However, the capability of fixing one of the two quantities that control the learning process is important in order to derive more insight, as well as more refined approaches to the performance assessment of learnt models, especially taking into account recent advances and refinements of the SRM principle (Vapnik 1998; Bartlett and Mendelson 2003; Koltchinskii 2006; Bousquet et al. 2004; Shawe-Taylor et al. 1998).

This is particularly true for Ivanov regularization, which represents the most intuitive and direct implementation of the SRM principle, as also underlined by Vapnik in his seminal work (Vapnik 1998). When the SVM was introduced as a Tikhonov formulation, a gap was created between the capability of effectively and intuitively controlling the hypothesis space size, typical of the Ivanov approach, and the efficiency of model training, which is the very reason why Tikhonov regularization was chosen. This is what led us to study how this gap could be filled: in particular, we proposed effective and easy-to-implement approaches to solve the I-SVM, without neglecting the huge amount of work dedicated in recent years to solving T-SVM.