# Comments on: Support Vector Machines Maximizing Geometric Margins for Multi-class Classification


## 1 Introduction

The article deals with multi-class discrimination with support vector machines (SVMs). The authors present multi-class SVMs (MSVMs) which they have introduced in recent years: multiobjective MSVMs (MMSVMs). Those machines are based on the same functional class as that of the standard MSVMs (Guermeur 2012). They differ in the nature of the learning problem, which is no longer a standard optimization problem (convex quadratic programming problem), but a multiobjective optimization problem (taking the form of a second-order cone programming problem). The aim is to maximize exactly all geometric margins, so as to improve generalization performance. This performance is assessed empirically, through experiments performed on data sets from the UCI benchmark repository. In our comments, we make use of the latest results of the statistical theory of large margin multi-category classifiers to study the connection between the (width of the) geometric margins and the generalization performance.
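The notion at the heart of the article can be made concrete for a linear model. The sketch below (the weight vectors, biases, and helper name are illustrative assumptions, not the authors' parameterization) computes the geometric margin of an example: its smallest score difference with the other categories, normalized by the norm of the difference of the corresponding weight vectors.

```python
import numpy as np

def geometric_margin(W, b, x, y):
    """Geometric margin of example x of category y for the linear model
    f_k(x) = W[k] @ x + b[k]: the smallest normalized score difference
    between category y and every other category. A positive value means
    x is correctly classified with at least that margin."""
    scores = W @ x + b
    margins = []
    for k in range(len(W)):
        if k == y:
            continue
        # Euclidean distance of x to the boundary separating y from k
        margins.append((scores[y] - scores[k]) / np.linalg.norm(W[y] - W[k]))
    return min(margins)

# toy 3-class example with hand-picked weights
W = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
b = np.zeros(3)
print(geometric_margin(W, b, np.array([1.0, 0.2]), 0))
```

Maximizing all such margins simultaneously, rather than a single surrogate objective, is what turns the learning problem into a multiobjective one.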

The organization of these comments is as follows. Section 2 discusses the characteristics of the *all-together* (AT) MSVMs. Section 3 is devoted to the theoretical study of the generalization performance of these machines and the MMSVMs. Finally, Sect. 4 discusses the options available to bridge the gap between theory and practice.

## 2 On the standard MSVMs

The theoretical framework of the standard MSVMs is that of *agnostic learning* (Kearns et al. 1992). We assume that \(\left( \mathcal {X}, \mathcal {A} \right) \) and \(\left( M, \mathcal {B} \right) \) are measurable spaces and that the link between descriptions and categories can be characterized by an unknown probability measure \(P\) on the measurable space \(\left( \mathcal {X} \times M, \mathcal {A} \otimes \mathcal {B} \right) \). Obviously, the statistical properties of the MMSVMs should be studied in the same framework.

## 3 Dependence of the guaranteed risks on the geometric margins

In the framework of pattern recognition, irrespective of the class of functions involved, all the guaranteed risks can be written as a sum of two terms: a sample-based estimate of performance and a control term which is an increasing function of the *capacity* of the class (see for instance, Vapnik 1998). In the case of large margin multi-category classifiers, the central capacity measure is a covering number. The nature of this number varies as a function of the pathway followed to derive the bound. We now discuss the characteristics of the bounds available, focusing on their dependence on the sample size \(l\) and the number of categories \(m\). In the specific case when the classifier is an MSVM or an MMSVM, we establish the way the covering numbers of interest can be upper bounded as a function of restrictions imposed on the corresponding functional class, restrictions precisely related to the width of the geometric margins. This calls for the introduction of standard definitions, starting with margin operators.
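The two-term structure described above can be sketched numerically. The constants and the exact form of the control term below are illustrative assumptions (the actual bounds depend on the capacity measure used); only the shape matters: the control term grows with the capacity and decreases with the sample size \(l\).

```python
import math

def guaranteed_risk(empirical_risk, capacity, l, delta=0.05):
    """Generic shape of a guaranteed risk: a sample-based estimate of
    performance plus a control term that is increasing in the capacity
    (here, a covering number) and decreasing in the sample size l.
    The constants are illustrative, not those of any specific theorem."""
    control = math.sqrt((math.log(capacity) + math.log(1.0 / delta)) / l)
    return empirical_risk + control

# the bound tightens as l grows and loosens as the capacity grows
print(guaranteed_risk(0.08, capacity=1e6, l=1000))
print(guaranteed_risk(0.08, capacity=1e6, l=100000))
```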

### **Definition 1**

### **Definition 2**

### **Definition 3**

### **Definition 4**

Let \(\left( E, \rho \right) \) be a pseudo-metric space, let \(E'\) be a subset of \(E\), and let \(\epsilon \in \mathbb {R}_+^*\). An \(\epsilon \)-*cover* of \(E'\) is a coverage of \(E'\) with open balls of radius \(\epsilon \) the centers of which belong to \(E\). These centers form an \(\epsilon \)-*net* of \(E'\). A *proper* \(\epsilon \)-*net* of \(E'\) is an \(\epsilon \)-*net* of \(E'\) included in \(E'\). If \(E'\) has an \(\epsilon \)-net of finite cardinality, then its *covering number* \(\mathcal {N} \left( \epsilon , E', \rho \right) \) is the smallest cardinality of its \(\epsilon \)-nets:

\[ \mathcal {N} \left( \epsilon , E', \rho \right) = \min \left\{ \left| \bar{E} \right| : \bar{E} \text { is an } \epsilon \text {-net of } E' \right\} . \]

The definition of a covering number thus involves the specification of a (pseudo-) metric. We will make use of two of them.
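For a finite set of points under the Euclidean metric, a proper \(\epsilon \)-net, and hence an upper bound on the covering number, can be computed greedily; a minimal sketch (the point set and radius are illustrative):

```python
import numpy as np

def greedy_epsilon_net(points, eps):
    """Greedy proper eps-net of a finite point set under the Euclidean
    metric: every point lies within eps of some retained center, so the
    centers form an eps-net and their count upper-bounds the covering
    number N(eps, E', rho)."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) >= eps for c in centers):
            centers.append(p)
    return centers

pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.05, 0.1]])
net = greedy_epsilon_net(pts, eps=0.5)
print(len(net))  # 2 centers suffice here
```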

### **Definition 5**

### **Definition 6**

The standard way to derive an upper bound on a covering number consists in establishing a generalized Sauer–Shelah lemma (Alon et al. 1997; Mendelson and Vershynin 2003) involving an extension of the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis 1971). In Guermeur (2007), it was proved that the extensions characterizing the learnability of large margin multi-category classifiers are the \(\gamma \)–\(\Psi \)-dimensions.

### **Definition 7**

Let \(\gamma \in \mathbb {R}_+^*\). A subset \(\left\{ x_i : 1 \leqslant i \leqslant n \right\} \) of \(\mathcal {X}\) is said to be \(\gamma \)–\(\Psi \)-*shattered* (\(\Psi \)-*shattered with margin* \(\gamma \)) by \(\left( \mathcal {F}, \Delta ^{\#} \right) \) if there is a mapping \(\psi ^n = \left( \psi ^{(i)} \right) _{1 \leqslant i \leqslant n}\) in \(\Psi ^n\) and a vector \(\mathbf {b}_n = \left( b_i \right) _{1 \leqslant i \leqslant n}\) in \(\mathbb {R}^n\) such that, for each vector \(\mathbf {k}_n = \left( k_i \right) _{1 \leqslant i \leqslant n}\) in \(\left\{ -1, 1 \right\} ^n\), there is a function \(f_{\mathbf {k}_n}\) in \(\mathcal {F}\) satisfying

\[ \forall i \in \left\{ 1, \ldots , n \right\} , \quad k_i \left( \psi ^{(i)} \left( \Delta ^{\#} f_{\mathbf {k}_n} \left( x_i \right) \right) - b_i \right) \geqslant \gamma . \]

The \(\gamma \)–\(\Psi \)-*dimension*, or \(\Psi \)-*dimension with margin* \(\gamma \), of \(\left( \mathcal {F}, \Delta ^{\#} \right) \), denoted by \(\gamma \)–\(\Psi \)-dim\( \left( \mathcal {F}, \Delta ^{\#} \right) \), is the maximal cardinality of a subset of \(\mathcal {X}\) \(\gamma \)–\(\Psi \)-shattered by \(\left( \mathcal {F}, \Delta ^{\#} \right) \), if this cardinality is finite. If no such maximum exists, \(\left( \mathcal {F}, \Delta ^{\#} \right) \) is said to have infinite \(\gamma \)–\(\Psi \)-dimension.

These dimensions can be seen either as scale-sensitive extensions of the \(\Psi \)-dimensions (Ben-David et al. 1995), or as multivariate extensions of the fat-shattering dimension (Kearns and Schapire 1994). One of them, the extension of the Natarajan dimension (Natarajan 1989) (the margin Natarajan dimension, denoted \(\gamma \text{-N-dim }\)), appears easier to handle, due to its connection with the one-against-one decomposition scheme.
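In the scalar case, shattering with margin \(\gamma \) reduces to fat-shattering, which can be checked by brute force for a finite function class on a small sample. The sketch below is an illustrative toy (exponential in the sample size, so only suited to such toys); the threshold class and the restriction of candidate witnesses are assumptions made for self-containedness.

```python
from itertools import product

def is_gamma_shattered(sample, functions, gamma):
    """Brute-force test of fat-shattering with margin gamma for a finite
    function class: is there a witness vector b such that every sign
    pattern k over the sample is realized by some f in the class with
    k_i * (f(x_i) - b_i) >= gamma for all i?  If any witness works, a
    witness with b_i in {f(x_i) - gamma} works (raise each b_i until a
    positive constraint becomes tight), so the search is exhaustive."""
    n = len(sample)
    candidates = [sorted({f(x) - gamma for f in functions}) for x in sample]
    patterns = list(product([-1, 1], repeat=n))
    for b in product(*candidates):
        if all(any(all(k[i] * (f(sample[i]) - b[i]) >= gamma for i in range(n))
                   for f in functions)
               for k in patterns):
            return True
    return False

# two threshold functions shatter one point with margin 0.5, but not two
fs = [lambda x, t=t: 1.0 if x >= t else -1.0 for t in (0.5, 1.5)]
print(is_gamma_shattered([1.0], fs, 0.5))        # True
print(is_gamma_shattered([1.0, 2.0], fs, 0.5))   # False
```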

For an \(m\)-category classifier computing a class of functions \(\mathcal {G}\) from \(\mathcal {X}\) into \(\mathbb {R}^m\), Theorem 22 in Guermeur (2007), an extension of Corollary 9 in Bartlett (1998) and Theorem 4.1 in Vapnik (1998), provides us with a guaranteed risk whose control term grows as the square root of the logarithm of \(\mathcal {N}^{(p)} \left( \epsilon , \Delta ^{\#} \mathcal {G}, 2l \right) \), the supremum over \(\mathcal {X}^{2l}\) of \(\mathcal {N}^{(p)} \left( \epsilon , \Delta ^{\#} \mathcal {G}, d_{ \Delta ^{\#} \mathcal {G}, \mathbf {x}_{2l}, \infty } \right) \). This covering number (with \(\Delta ^*\) as margin operator) can be bounded from above by means of a generalized Sauer–Shelah lemma involving the margin Natarajan dimension of \(\left( \mathcal {G}, \Delta \right) \) (Lemma 39 in Guermeur 2007). Thus, characterizing the connection between the generalization performance of an MSVM (or an MMSVM) and its geometric margins can boil down to deriving an upper bound on its margin Natarajan dimension in terms of those margins. This is precisely what we get with the following theorem, a straightforward multi-class extension of Theorem 4.6 in Bartlett and Shawe-Taylor (1999):

### **Theorem 1**

The bound sketched above is not utterly satisfactory due to its suboptimal dependence on the sample size \(l\). Indeed, its control term decreases with \(l\) as a \(O \left( \frac{\ln \left( l \right) }{\sqrt{l}} \right) \). The optimal convergence rate, \(\frac{1}{\sqrt{l}}\), can be obtained by following a more direct path involving a different capacity measure: the Rademacher average (Bartlett et al. 2005). To the best of our knowledge, the first result of this kind is Corollary 8.1 in Mohri et al. (2012). This bound is not utterly satisfactory either, since its control term grows quadratically with \(m\). The reason for this drawback rests on the fact that, even in the case of kernel machines, the Rademacher averages associated with multivariate models cannot be bounded as straightforwardly as those associated with univariate models. The property used by Mohri and his co-authors to adapt the bi-class line of reasoning to the multi-class case, that is, to cope with this difficulty, appears in the proof of Theorem 8.1 in Mohri et al. (2012): it is the sub-additivity of the supremum. The quadratic dependence can be seen as an artifact of this choice. To make the best of both worlds, so as to optimize both dependences (on \(l\) and \(m\)), we propose a hybrid approach. In short, it consists in following the proof of Theorem 8.1 in Mohri et al. (2012) up to the point where the Rademacher average appears, and then applying Dudley’s integral inequality (see for instance Theorem 11.17 in Ledoux and Talagrand 1991) to switch back to a covering number. The corresponding covering number is \(\mathcal {N} ( \epsilon , \mathcal {F}_{\mathcal {G}}^{\#}, l )\), the supremum over \( ( \mathcal {X} \times M )^l\) of \(\mathcal {N} ( \epsilon , \mathcal {F}_{\mathcal {G}}^{\#}, d_{ \mathcal {F}_{\mathcal {G}}^{\#}, \mathbf {z}_l, 2} )\), with \(\mathbf {z}_l = ( ( x_i, y_i ) )_{i \in I} \in ( \mathcal {X} \times M )^l\).
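The Rademacher average invoked in this hybrid route can be estimated by Monte Carlo when the function class is finite and its values on the sample are tabulated. The sketch below rests on that finite-class restriction, an illustrative assumption; it exhibits the key property that richer classes correlate better with random signs.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(values, n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher average of a
    finite function class: values[j, i] = f_j(z_i). For each draw of
    Rademacher signs sigma, take the supremum over functions of the
    average correlation (1/l) * sum_i sigma_i * f_j(z_i)."""
    n_funcs, l = values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)
        total += np.max(values @ sigma) / l
    return total / n_draws

# a richer class fits random signs better, hence a larger average
small_class = rng.standard_normal((2, 50))
large_class = rng.standard_normal((200, 50))
print(empirical_rademacher(small_class) < empirical_rademacher(large_class))  # True
```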
Bounding this covering number from above as a function of the margin Natarajan dimension of \( ( \mathcal {G}, \Delta )\) remains an open problem. The only solution available so far to make use of a generalized Sauer–Shelah lemma consists in treating separately the classes of functions to which the component functions of the model of interest belong. To that end, one can apply the following lemma, whose proof raises no difficulty.

### **Lemma 1**

To bound the covering numbers of the classes \(\mathcal {G}_p\) from above in terms of a generalized VC dimension, one can use a variant of Theorem 1 in Mendelson and Vershynin (2003). Then, one can easily verify that the convergence rate obtained is (at worst) \(\sqrt{\frac{\ln \left( l \right) }{l}}\) (“halfway” between that of the two previous bounds), while the control term only grows as the square root of \(m\). To finish the derivation of the guaranteed risk, it remains to bound \(m\) fat-shattering dimensions. Turning back to the case of the MSVMs and MMSVMs, this can be done by means of a result already mentioned: Theorem 4.6 in Bartlett and Shawe-Taylor (1999). The problem is that the resulting bound cannot be used to provide a theoretical justification for the MMSVMs. Indeed, by handling independently the component functions of the classifier, we have lost most of the connection between the control term of the guaranteed risk and the geometric margins (whose definition involves the difference of two component functions).
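The three convergence rates discussed in this section can be compared numerically (up to constants, which are ignored here); a minimal sketch:

```python
import math

def rates(l):
    """Control-term decay rates, up to constants: ln(l)/sqrt(l)
    (covering-number route), sqrt(ln(l)/l) (hybrid route), and
    1/sqrt(l) (optimal rate, via Rademacher averages)."""
    return (math.log(l) / math.sqrt(l),
            math.sqrt(math.log(l) / l),
            1.0 / math.sqrt(l))

for l in (10 ** 3, 10 ** 6):
    a, b, c = rates(l)
    assert a > b > c  # the hybrid rate sits between the other two
    print(l, a, b, c)
```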

## 4 Discussion

Tatsumi and Tanino have introduced multi-class support vector machines which are based on a principle in full accordance with the intuition borrowed from the bi-class case: a direct maximization of the geometric margins. The experimental evidence they provide is very promising. However, strange as it may seem, the statistical theory of large margin multi-category classifiers still fails to fully justify their choices. This justification could come as the byproduct of the derivation of sharper bounds on the risk. We conjecture that a bound exhibiting the optimal convergence rate with a control term growing only as the square root of the number of categories could be obtained from an appropriate implementation of the generic chaining method (Talagrand 2005).

## Notes

### Acknowledgments

The author would like to thank E. Didiot for carefully reading this manuscript.

## References

- Alon N, Ben-David S, Cesa-Bianchi N, Haussler D (1997) Scale-sensitive dimensions, uniform convergence, and learnability. J ACM 44(4):615–631
- Bartlett P (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans Inf Theory 44(2):525–536
- Bartlett P, Shawe-Taylor J (1999) Generalization performance of support vector machines and other pattern classifiers. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods—support vector learning, chap 4. The MIT Press, Cambridge, pp 43–54
- Bartlett P, Bousquet O, Mendelson S (2005) Local Rademacher complexities. Ann Stat 33(4):1497–1537
- Ben-David S, Cesa-Bianchi N, Haussler D, Long P (1995) Characterizations of learnability for classes of \(\{0,\ldots, n\}\)-valued functions. J Comput Syst Sci 50(1):74–86
- Berlinet A, Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probability and statistics. Kluwer, Boston
- Bonidal R (2013) Sélection de modèle par chemin de régularisation pour les machines à vecteurs support à coût quadratique. Ph.D. thesis, Université de Lorraine
- Bredensteiner E, Bennett K (1999) Multicategory classification by support vector machines. Comput Optim Appl 12(1/3):53–79
- Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2:265–292
- Guermeur Y (2002) Combining discriminant models with new multi-class SVMs. Pattern Anal Appl 5(2):168–179
- Guermeur Y (2007) VC theory of large margin multi-category classifiers. J Mach Learn Res 8:2551–2594
- Guermeur Y (2012) A generic model of multi-class support vector machine. Int J Intell Inf Database Syst 6(6):555–577
- Guermeur Y, Monfrini E (2011) A quadratic loss multi-class SVM for which a radius-margin bound applies. Informatica 22(1):73–96
- Kearns M, Schapire R (1994) Efficient distribution-free learning of probabilistic concepts. J Comput Syst Sci 48(3):464–497
- Kearns M, Schapire R, Sellie L (1992) Toward efficient agnostic learning. In: COLT’92, pp 341–352
- Kolmogorov A, Tihomirov V (1961) \(\epsilon \)-entropy and \(\epsilon \)-capacity of sets in functional spaces. Am Math Soc Transl Ser 2(17):277–364
- Ledoux M, Talagrand M (1991) Probability in Banach spaces: isoperimetry and processes. Springer, Berlin
- Lee Y, Lin Y, Wahba G (2004) Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J Am Stat Assoc 99(465):67–81
- Mendelson S, Vershynin R (2003) Entropy and the combinatorial dimension. Invent Math 152:37–55
- Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. The MIT Press, Cambridge
- Natarajan B (1989) On learning sets and functions. Mach Learn 4(1):67–97
- Talagrand M (2005) The generic chaining: upper and lower bounds of stochastic processes. Springer, Berlin
- Tewari A, Bartlett P (2007) On the consistency of multiclass classification methods. J Mach Learn Res 8:1007–1025
- Vapnik V (1998) Statistical learning theory. Wiley, New York
- Vapnik V, Chervonenkis A (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Appl 16(2):264–280
- Wahba G (1992) Multivariate function and operator estimation, based on smoothing splines and reproducing kernels. In: Casdagli M, Eubank S (eds) Nonlinear modeling and forecasting, SFI studies in the sciences of complexity, vol XII, pp 95–112
- Weston J, Watkins C (1998) Multi-class support vector machines. Tech. Rep. CSD-TR-98-04, Royal Holloway, University of London, Department of Computer Science
- Zhang T (2004) Statistical analysis of some multi-category large margin classification methods. J Mach Learn Res 5:1225–1251