
Machine Learning, Volume 99, Issue 2, pp 327–351

Unconfused ultraconservative multiclass algorithms

  • Ugo Louche
  • Liva Ralaivola

Abstract

We tackle the problem of learning linear classifiers from noisy datasets in a multiclass setting. The two-class version of this problem was studied a few years ago; the approaches proposed there to combat the noise revolve around a Perceptron learning scheme fed with peculiar examples computed through a weighted average of points from the noisy training set. We propose to build upon these approaches and we introduce a new algorithm called Unconfused Multiclass additive Algorithm (UMA), which may be seen as a generalization of the previous approaches to the multiclass setting. In order to characterize the noise, we use the confusion matrix as a multiclass extension of the classification noise studied in the aforementioned literature. Theoretically well-founded, UMA furthermore displays very good empirical noise robustness, as evidenced by numerical simulations conducted on both synthetic and real data.

Keywords

Multiclass classification Perceptron Noisy labels Confusion Matrix Ultraconservative algorithms 

1 Introduction

Context This paper deals with linear multiclass classification problems defined on an input space \({\mathcal {X}}\) (e.g., \({\mathcal {X}}={\mathbb {R}}^d\)) and a set of classes
$$\begin{aligned} \mathcal{Q}\doteq \{1,\ldots , Q\}. \end{aligned}$$
In particular, we are interested in establishing the robustness of ultraconservative additive algorithms (Crammer and Singer 2003) to label noise in the multiclass setting; in order to lighten notation, we will from now on refer to these algorithms as ultraconservative algorithms. We study whether it is possible to learn a linear predictor from a training set made of independent realizations of a pair \((X,Y)\) of random variables:
$$\begin{aligned} {\mathcal {S}}\doteq \{(\varvec{x}_i,y_i)\}_{i=1}^n \end{aligned}$$
where \(y_i\in \mathcal{Q}\) is a corrupted version of a true label, i.e. a deterministically computed class \(t(\varvec{x}_i)\in \mathcal{Q}\) associated with \(\varvec{x}_i\) according to some concept \(t\). The random noise process \(Y\) that corrupts the labels to provide the \(y_i\)’s given the \(\varvec{x}_i\)’s is assumed to be uniform within each pair of classes; it is thus fully described by a confusion matrix \(C=(C_{pq})_{p,q}\in {\mathbb {R}}^{Q\times Q}\) so that
$$\begin{aligned} \forall \varvec{x},C_{pt(\varvec{x})}=\mathbb {P}_{Y}(Y=p|\varvec{x}). \end{aligned}$$
The goal that we would like to achieve is to provide a learning procedure able to deal with the confusion noise present in the training set \({\mathcal {S}}\) to give rise to a classifier \(h\) with small risk
$$\begin{aligned} R(h)\doteq \mathbb {P}_{X\sim \mathcal {D}}(h(X)\ne t(X)), \end{aligned}$$
\(\mathcal {D}\) being the distribution according to which the \(\varvec{x}_i\)’s are obtained. As we want to recover from the confusion noise, i.e., we want to achieve low risk on uncorrupted/non-noisy data, we use the term unconfused to characterize the procedures we propose.

Ultraconservative learning procedures are online learning algorithms that output linear classifiers. They display nice theoretical properties regarding their convergence in the case of linearly separable datasets, provided a sufficient separation margin is guaranteed (as formalized in Assumption 1 below). In turn, these convergence-related properties yield generalization guarantees about the quality of the predictor learned. We build upon these nice convergence properties to show that ultraconservative algorithms are robust to a confusion noise process, provided that: i) \(C\) is invertible and can be accessed, ii) the original dataset \(\lbrace (\varvec{x}_i, t(\varvec{x}_i) ) \rbrace _{i=1}^n\) is linearly separable. This paper is essentially devoted to proving how/why ultraconservative multiclass algorithms are indeed robust to such situations. To some extent, the results provided in the present contribution may be viewed as a generalization of the contributions on learning binary perceptrons under misclassification noise (Blum et al. 1996; Bylander 1994).

Beside the theoretical questions raised by the learning setting considered, we may depict the following example of an actual learning scenario where learning from noisy data is relevant. This learning scenario will be further investigated from an empirical standpoint in the section devoted to numerical simulations (Sect. 4).

Example 1

One situation where coping with mislabelled data is required arises in (partially supervised) scenarios where labelling data is very expensive. Imagine a task of text categorization from a training set \({\mathcal {S}}={\mathcal {S}}_{\ell }{{\mathrm{\cup }}}{\mathcal {S}}_u\), where \({\mathcal {S}}_{\ell }=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) is a set of \(n\) labelled training examples and \({\mathcal {S}}_u=\{\varvec{x}_{n+i}\}_{i=1}^m\) is a set of \(m\) unlabelled vectors; in order to fall back to realistic training scenarios where more labelled data cannot be acquired, we may assume that \(n\ll m\). A possible three-stage strategy to learn a predictor is as follows: first, learn a predictor \(f_{\ell }\) on \({\mathcal {S}}_{\ell }\) and estimate its confusion error \(C\) via a cross-validation procedure (\(f_{\ell }\) is assumed to make mistakes evenly over the class regions); second, use \(f_{\ell }\) to label all the data in \({\mathcal {S}}_u\) to produce the labelled training set \(\widehat{{\mathcal {S}}}=\{(\varvec{x}_{n+i},t_{n+i}:=f_{\ell }(\varvec{x}_{n+i}))\}_{i=1}^m\); and finally, learn a classifier \(f\) from \(\widehat{{\mathcal {S}}}\) and the confusion information \(C\).
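The first stage of this strategy requires an estimate of the confusion matrix. As an illustration only, the following Python sketch computes such an estimate from held-out predictions, under the convention that entry \(C_{pq}\) is the fraction of examples of true class \(q\) that are labelled \(p\). The function name `estimate_confusion`, the 0-based class indices and the toy label lists are assumptions made for this example.

```python
def estimate_confusion(true_labels, predicted_labels, num_classes):
    """Return C with C[p][q] ~ P(predicted = p | true = q); classes are 0..Q-1."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, y in zip(true_labels, predicted_labels):
        counts[y][t] += 1
    # Each column q is normalized by the number of examples of true class q.
    col_totals = [sum(counts[p][q] for p in range(num_classes))
                  for q in range(num_classes)]
    return [
        [counts[p][q] / col_totals[q] if col_totals[q] else 0.0
         for q in range(num_classes)]
        for p in range(num_classes)
    ]

# Hypothetical held-out labels and the predictions of a first-stage classifier.
held_out_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
held_out_pred = [0, 0, 0, 1, 1, 1, 1, 0, 2, 2]
C_hat = estimate_confusion(held_out_true, held_out_pred, 3)
```

With this convention every column of `C_hat` that corresponds to an observed class sums to one.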

This introductory example pertains to semi-supervised learning and this is only one possible learning scenario where the contribution we propose, UMA, might be of some use. Still, it is essential to understand right away that one key feature of UMA, which sets it apart from many contributions encountered in the realm of semi-supervised learning, is that we do provide theoretical bounds on the sample complexity and running time required by our algorithm to output an effective predictor.

The present paper is an extended version of Louche and Ralaivola (2013). Compared with the original paper, it provides a more detailed introduction of the tools used in the paper, a more thorough discussion on related work as well as more extensive numerical results (which confirm the relevance of our findings). A strategy to make use of kernels for nonlinear classification has also been added.

Contributions Our main contribution is to show that it is both practically and theoretically possible to learn a multiclass classifier on noisy data if some information on the noise process is available. We propose a way to generate new points for which the true class is known. Hence we can iteratively populate a new unconfused dataset to learn from. This allows us to handle a massive amount of mislabelled data with only a very slight loss of accuracy. We embed our method into ultraconservative algorithms and provide a thorough analysis of it, in which we show that the strong theoretical guarantees that characterize the family of ultraconservative algorithms carry over to the noisy scenario.

Related work Learning from mislabelled data in an iterative manner has a long-standing history in the machine learning community. The first contributions on this topic, based on the Perceptron algorithm (Minsky and Papert 1969), are those of Bylander (1994), Blum et al. (1996), Cohen (1997), which promoted the idea utilized here that a sample average may be used to construct update vectors relevant to a Perceptron learning procedure. These first contributions were focused on the binary classification case and, for Blum et al. (1996), Cohen (1997), tackled the specific problem of strong-polynomiality of the learning procedure in the probably approximately correct (PAC) framework (Kearns and Vazirani 1994). Later, Stempfel and Ralaivola (2007) proposed a binary learning procedure making it possible to learn a kernel Perceptron in a noisy setting; an interesting feature of this work is the recourse to random projections in order to lower the capacity of the class of kernel-based classifiers. Meanwhile, many advances were witnessed in the realm of online multiclass learning procedures. In particular, Crammer and Singer (2003) proposed families of learning procedures subsuming the Perceptron algorithm, dedicated to tackle multiclass prediction problems. A sibling family of algorithms, the passive-aggressive online learning algorithms (Crammer et al. 2006), inspired both by the previous family and the idea of minimizing instantaneous losses, were designed to tackle various problems, among which multiclass linear classification. Sometimes, learning with partially labelled data might be viewed as a problem of learning with corrupted data (if, for example, all the unlabelled data are randomly or arbitrarily labelled) and it makes sense to mention the works Kakade et al. (2008) and Ralaivola et al. (2011) as distant relatives to the present work.

Organization of the paper Section 2 formally states the setting we consider throughout this paper. Section 3 provides the details of our main contribution: the UMA algorithm and its detailed theoretical analysis. Section 4 presents numerical simulations that support the soundness of our approach.

2 Setting and problem

2.1 Noisy labels with underlying linear concept

The probabilistic setting we consider hinges on the existence of two components. On the one hand, we assume an unknown (but fixed) probability distribution \(\mathcal {D}\) on the input space \({\mathcal {X}}\doteq {\mathbb {R}}^d\). On the other hand, we also assume the existence of a deterministic labelling function \(t:{\mathcal {X}}\rightarrow \mathcal{Q}\), where \(\mathcal{Q}\doteq \{1,\ldots , Q\}\), which associates a label \(t(\varvec{x})\) to any input example \(\varvec{x}\); in the Probably Approximately Correct (PAC) literature, \(t\) is sometimes referred to as a concept (Kearns and Vazirani 1994; Valiant 1984).

In the present paper, we focus on learning linear classifiers, defined as follows.

Definition 1

(Linear classifiers) The linear classifier \(f_W:{\mathcal {X}}\rightarrow \mathcal{Q}\) is a classifier that is associated with a set of vectors \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\), which predicts the label \(f_W(\varvec{x})\) of any vector \(\varvec{x}\in {\mathcal {X}}\) as
$$\begin{aligned} f_W(\varvec{x})=\mathop {{{\mathrm{\text {argmax}\;}}}}\limits _{q\in \mathcal{Q}}\left\langle \varvec{w}_q,\varvec{x}\right\rangle . \end{aligned}$$
(1)
Additionally, without loss of generality, we suppose that
$$\begin{aligned} \mathbb {P}_{X\sim \mathcal {D}}\left( \left\| X\right\| =1\right) =1, \end{aligned}$$
where \(\Vert \cdot \Vert \) is the Euclidean norm. This allows us to introduce the notion of margin.
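To make decision rule (1) concrete, here is a minimal Python sketch. It assumes that \(W\) is stored as a list of \(Q\) weight vectors, that classes are indexed from 0, and that ties are broken in favour of the smallest index (the text does not specify a tie-breaking rule). The name `predict` and the toy values are hypothetical.

```python
def predict(W, x):
    """Return the 0-based index q maximizing <w_q, x>.

    W is a list of Q weight vectors (the columns w_1, ..., w_Q of the matrix W).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    # max() returns the first maximizer, so ties go to the smallest index.
    return max(range(len(W)), key=lambda q: scores[q])

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
label = predict(W, [0.6, 0.8])   # scores are 0.6, 0.8 and -1.4
```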

Definition 2

(Margin of a linear classifier) Let \(c:{\mathcal {X}}\rightarrow \mathcal{Q}\) be some fixed concept. Let \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) be a set of \(Q\) weight vectors. Linear classifier \(f_W\) is said to have margin \(\theta >0\) with respect to \(c\) (and distribution \(\mathcal {D}\)) if the following holds:
$$\begin{aligned} \mathbb {P}_{X\sim \mathcal {D}}\left\{ \exists p\ne c(X):\left\langle \varvec{w}_{c(X)}-\varvec{w}_{p},X\right\rangle \le \theta \right\} =0. \end{aligned}$$
Note that if \(f_W\) has margin \(\theta >0\) with respect to \(c\) then
$$\begin{aligned} \mathbb {P}_{X\sim \mathcal {D}}(f_W(X)\ne c(X))=0. \end{aligned}$$
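On a finite sample, the quantity controlled by Definition 2 can be computed directly. The sketch below is only an empirical analogue, under the assumption that the sample stands in for the distribution \(\mathcal {D}\): it returns the smallest value of \(\left\langle \varvec{w}_{c(\varvec{x})}-\varvec{w}_{p},\varvec{x}\right\rangle \) over the sample, so that \(f_W\) has margin \(\theta \) on the sample for any \(\theta \) strictly below the returned value. The names `dot` and `empirical_margin` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def empirical_margin(W, points, labels):
    """Smallest value of <w_c(x) - w_p, x> over sample points x and classes p != c(x)."""
    return min(
        dot(W[c], x) - dot(W[p], x)
        for x, c in zip(points, labels)
        for p in range(len(W))
        if p != c
    )

# Toy data: two classes in R^2, unit-norm inputs, 0-based labels.
# W is not rescaled to unit Frobenius norm here; the margin scales with W.
W = [[1.0, 0.0], [-1.0, 0.0]]
points = [[0.6, 0.8], [-1.0, 0.0]]
labels = [0, 1]
margin = empirical_margin(W, points, labels)   # approximately 1.2
```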

Equipped with this definition, we shall consider that the following assumption of linear separability with margin \(\theta \) of concept \(t\) holds throughout.

Assumption 1

(Linear Separability of t with Margin \(\theta \)) There exist \(\theta > 0\) and \(W^*=[\varvec{w}_1^*\cdots \varvec{w}_Q^*]\in {\mathbb {R}}^{d\times Q}\), with \(\Vert W^* \Vert _F^2 = 1\) (\(\Vert \cdot \Vert _F\) denotes the Frobenius norm) such that \(f_{W^*}\) has margin \(\theta \) with respect to the concept \(t\).

In a conventional setting, one would be asked to learn a classifier \(f\) from a training set
$$\begin{aligned} {\mathcal {S}}_{\text {true}}\doteq \{(\varvec{x}_i,t(\varvec{x}_i))\}_{i=1}^n \end{aligned}$$
made of \(n\) labelled pairs from \({\mathcal {X}}\times \mathcal{Q}\) such that the \(\varvec{x}_i\)’s are independent realizations of a random variable \(X\) distributed according to \(\mathcal {D}\), with the objective of minimizing the true risk or misclassification error \(R_{\text {error}}(f)\) of \(f\) given by
$$\begin{aligned} R_{\text {error}}(f) \doteq \mathbb {P}_{X\sim \mathcal {D}}(f(X) \ne t(X)). \end{aligned}$$
(2)
In other words, the objective is for \(f\) to have a prediction behavior as close as possible to that of \(t\). As announced in the introduction, there is however a little twist in the problem that we are going to tackle. Instead of having direct access to \({\mathcal {S}}_{\text {true}}\), we assume that we only have access to a corrupted version
$$\begin{aligned} {\mathcal {S}}\doteq \lbrace (\varvec{x}_i, y_i) \rbrace _{i=1}^n \end{aligned}$$
(3)
where each \(y_i\) is the realization of a random variable \(Y\) whose distribution agrees with the following assumption:

Assumption 2

The law \(\mathcal {D}_{Y|X}\) of \(Y\) is the same for all \(x \in {\mathcal {X}}\) and its conditional distribution
$$\begin{aligned} \mathbb {P}_{Y\sim \mathcal {D}_{Y|X=\varvec{x}}}(Y|X=\varvec{x}) \end{aligned}$$
is fully summarized into a known confusion matrix \(C\) given by
$$\begin{aligned} \forall \varvec{x},\; C_{pt(\varvec{x})} \doteq \mathbb {P}_{Y\sim \mathcal {D}_{Y|X=\varvec{x}}}(Y = p |X = \varvec{x})=\mathbb {P}_{Y\sim \mathcal {D}_{Y|X=\varvec{x}}}(Y = p |t(\varvec{x}) = q). \end{aligned}$$
(4)

Alternatively put, the noise process that corrupts the data is uniform within each class and its level does not depend on the precise location of \(\varvec{x}\) within the region that corresponds to class \(t(\varvec{x})\). The noise process \(Y\) is both a) aggressive, as it does not only apply, as one might expect, to regions close to the boundaries between classes, and b) regular, as the mislabelling rate is piecewise constant. Nonetheless, this setting can account for many real-world problems, as numerous noisy phenomena can be summarized by a simple confusion matrix. Moreover, it has been proved (Blum et al. 1996) that robustness to classification noise generalizes robustness to monotonic noise where, for each class, the noise rate is a monotonically decreasing function of the distance to the class boundaries.
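The noise model of Assumption 2 is straightforward to simulate, which may help in picturing it. The Python sketch below draws each noisy label from the column of \(C\) indexed by the true class, independently of the location of the point. It is an illustration under stated assumptions: the function name `corrupt_labels`, the 0-based class indices and the particular matrix `C` are hypothetical.

```python
import random

def corrupt_labels(true_labels, C, rng):
    """Draw y_i with P(y_i = p | t(x_i) = q) = C[p][q]; the draw ignores x_i itself."""
    num_classes = len(C)
    noisy = []
    for q in true_labels:
        column = [C[p][q] for p in range(num_classes)]
        noisy.append(rng.choices(range(num_classes), weights=column, k=1)[0])
    return noisy

# Hypothetical confusion matrix: each column sums to one.
C = [[0.8, 0.1, 0.0],
     [0.1, 0.9, 0.3],
     [0.1, 0.0, 0.7]]
rng = random.Random(0)
true_labels = [0] * 1000 + [1] * 1000 + [2] * 1000
noisy_labels = corrupt_labels(true_labels, C, rng)
```

Since the draw depends on \(\varvec{x}\) only through \(t(\varvec{x})\), points far from any class boundary are mislabelled as often as points close to one, which is the "aggressive" aspect of the noise mentioned above.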

Remark 1

The confusion matrix \(C\) should not be confused with the matrix \(\tilde{C}\) of general term \(\tilde{C}_{ij} \doteq \mathbb {P}_{X\sim \mathcal {D}_{X|Y=j}}(t(X) = i | Y = j)\), which is the class-conditional distribution of \(t(X)\) given \(Y\). The problem of learning from a noisy training set and \(\tilde{C}\) is a different problem from the one we aim to solve. In particular, \(\tilde{C}\) can be used to define cost-sensitive losses rather directly whereas doing so with \(C\) is far less obvious. Anyhow, this second problem of learning from \(\tilde{C}\) is far from trivial and very interesting, and it falls way beyond the scope of the present work.

Finally, we assume the following from here on:

Assumption 3

\(C\) is invertible.

Note that this assumption is not as restrictive as it may appear. For instance, if we consider the learning setting depicted in Example 1 and implemented in the numerical simulations, then the confusion matrix obtained from the first predictor \(f_{\ell }\) is often diagonally dominant, i.e. the magnitude of each diagonal entry is larger than the sum of the magnitudes of the other entries in its row, and \(C\) is therefore invertible. Generally speaking, the problems that we are interested in (i.e. problems where the true classes seem to be recoverable) tend to have invertible confusion matrices. It is most likely that invertibility is merely a sufficient condition on \(C\) that allows us to establish learnability in the sequel. Identifying less stringent conditions on \(C\), or conditions stated in a different way (which would for instance be based on the condition number of \(C\)) for learnability to remain, is a research issue of its own that we leave for future investigations.
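Strict diagonal dominance is easy to test, and it is a sufficient (not necessary) condition for invertibility. The following Python sketch checks it row by row; the function name `is_strictly_diagonally_dominant` and the two example matrices are assumptions made for illustration.

```python
def is_strictly_diagonally_dominant(C):
    """True if |C[i][i]| > sum of |C[i][j]| over j != i, for every row i."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(C)
    )

# A mildly noisy confusion matrix (hypothetical values).
C_mild = [[0.8, 0.1, 0.0],
          [0.1, 0.9, 0.3],
          [0.1, 0.0, 0.7]]
# Labels drawn uniformly at random whatever the true class: a singular matrix.
C_uniform = [[1 / 3] * 3 for _ in range(3)]

mild_ok = is_strictly_diagonally_dominant(C_mild)
uniform_ok = is_strictly_diagonally_dominant(C_uniform)
```

A matrix that fails this test may still be invertible; the check only certifies invertibility when it succeeds.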

The setting we have just presented allows us to view \({\mathcal {S}}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) as the realization of a random sample \(\{(X_i,Y_i)\}_{i=1}^n\), where each pair \((X_i,Y_i)\) is an independent copy of the random pair \((X,Y)\) of law \(\mathcal {D}_{XY}\doteq \mathcal {D}_X\mathcal {D}_{Y|X}.\)

2.2 Problem: learning a linear classifier from noisy data

The problem we address is the learning of a classifier \(f\) from \({\mathcal {S}}\) and \(C\) so that the error rate
$$\begin{aligned} R_{\text {error}}(f)=\mathbb {P}_{X\sim \mathcal {D}}(f(X)\ne t(X)) \end{aligned}$$
of \(f\) is as small as possible: the usual goal of learning a classifier \(f\) with small risk is preserved, while now the training data is only made of corrupted labelled pairs.

Building on Assumption 1, we may refine our learning objective by restricting ourselves to linear classifiers \(f_W\), for \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) (see Definition 1). Our goal is thus to learn a relevant matrix \(W\) from \({\mathcal {S}}\) and the confusion matrix \(C\). More precisely, we achieve risk minimization through classic additive methods and the core of this work is focused on computing noise-free update points such that the properties of said methods are unchanged.

3 UMA: unconfused ultraconservative multiclass algorithm

This section presents the main result of the paper, that is, the UMA procedure, which is a response to the problem posed above: UMA makes it possible to learn a multiclass linear predictor from \({\mathcal {S}}\) and the confusion information \(C\). In addition to the algorithm itself, this section provides theoretical results regarding the convergence and sample complexity of UMA.

As UMA is a generalization of the ultraconservative additive online algorithms proposed in Crammer and Singer (2003) to the case of noisy labels, we first and foremost recall the essential features of this family of algorithms. The rest of the section is then devoted to the presentation and analysis of UMA.

3.1 A brief reminder on ultraconservative additive algorithms

Ultraconservative additive online algorithms were introduced by Crammer and Singer (2003). As already stated, these algorithms output multiclass linear predictors \(f_W\) as in Definition 1 and their purpose is therefore to compute a set \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) of \(Q\) weight vectors from some training sample \({\mathcal {S}}_{\text {true}}=\{(\varvec{x}_i,t( \varvec{x}_i ) )\}_{i=1}^n\). To do so, they implement the procedure depicted in Algorithm 1, which centrally revolves around the identification of an error set and its simple update: when processing a training pair \((\varvec{x},y)\), they perform updates of the form
$$\begin{aligned} \varvec{w}_q\leftarrow \varvec{w}_q+\tau _q\varvec{x}, \; q=1,\ldots , Q, \end{aligned}$$
whenever the error set \(\mathcal{E}(\varvec{x},y)\) defined as
$$\begin{aligned} \mathcal{E}(\varvec{x},y)\doteq \left\{ r\in \mathcal{Q}\backslash \{y\}: \langle \varvec{w}_r,\varvec{x}\rangle -\langle \varvec{w}_{y},\varvec{x}\rangle \ge 0\right\} \end{aligned}$$
(5)
is not empty, with the constraint for the family \(\{\tau _q\}_{q\in \mathcal{Q}}\) of step sizes to fulfill
$$\begin{aligned} \left\{ \begin{array}{l} \tau _{y}=1\\ \tau _r\le 0,\quad \text { if } r\in \mathcal{E}(\varvec{x},y)\\ \tau _r=0,\quad \text { otherwise}\end{array}\right. \quad \text { and }\quad \sum _{r=1}^Q\tau _r=0. \end{aligned}$$
(6)
The term ultraconservative refers to the fact that only those prototype vectors \(\varvec{w}_r\) which achieve a larger inner product \(\langle \varvec{w}_r,\varvec{x}\rangle \) than \(\langle \varvec{w}_y,\varvec{x}\rangle \), that is, the vectors that can entail a prediction mistake when decision rule (1) is applied, may be affected by the update procedure. The term additive conveys the fact that the updates consist in modifying the weight vectors \(\varvec{w}_r\)’s by adding a portion of \(\varvec{x}\) to them (which is to be opposed to multiplicative update schemes). Again, as we only consider these additive types of updates in what follows, it will have to be implicitly understood even when not explicitly mentioned.
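As an illustration, one update of this kind may be sketched in Python as follows. Constraint (6) leaves the step sizes on the error set free; the sketch picks the uniform choice \(\tau _r=-1/|\mathcal{E}(\varvec{x},y)|\), which is one admissible option among many and an assumption of this example. Class indices are 0-based and the names `dot` and `ultraconservative_update` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ultraconservative_update(W, x, y):
    """Apply one update for the pair (x, y) in place; return True if W changed.

    W is a list of Q weight vectors and y a 0-based class index.
    Step sizes: tau_y = 1, tau_r = -1/|E| for r in the error set E, 0 elsewhere
    (the uniform choice is an assumption; any family satisfying (6) is allowed).
    """
    score_y = dot(W[y], x)
    # Error set (5): classes scoring at least as high as the target class y.
    error_set = [r for r in range(len(W))
                 if r != y and dot(W[r], x) - score_y >= 0]
    if not error_set:
        return False
    tau = [0.0] * len(W)
    tau[y] = 1.0
    for r in error_set:
        tau[r] = -1.0 / len(error_set)
    for q in range(len(W)):
        W[q] = [w + tau[q] * xi for w, xi in zip(W[q], x)]
    return True

W = [[0.0, 0.0] for _ in range(3)]
changed = ultraconservative_update(W, [0.6, 0.8], 0)
```

Because the step sizes sum to zero, the sum of the weight vectors is left unchanged by every update.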

One of the main results regarding ultraconservative algorithms, which we extend in our learning scenario is the following.

Theorem 1

(Mistake bound for ultraconservative algorithms Crammer and Singer 2003) Suppose that concept \(t\) is in accordance with Assumption 1. The number of mistakes/updates made by one pass over \({\mathcal {S}}\) by any ultraconservative procedure is upper-bounded by \(2/\theta ^2\).

This result is essentially a generalization of the well-known Block–Novikoff theorem (Block 1962; Novikoff 1963), which establishes a mistake bound for the Perceptron algorithm (an ultraconservative algorithm itself).

3.2 Main result and high level justification

This section presents our main contribution, UMA, a theoretically grounded noise-tolerant multiclass algorithm depicted in Algorithm 2. UMA learns and outputs a matrix \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) from a noisy training set \({\mathcal {S}}\) to produce the associated linear classifier
$$\begin{aligned} f_W(\cdot )={{\mathrm{\text {argmax}\;}}}_q\langle \varvec{w}_q,\cdot \rangle \end{aligned}$$
(7)
by iteratively updating the \(\varvec{w}_q\)’s, whilst maintaining \(\sum _q\varvec{w}_q=0\) throughout the learning process. UMA is a new member of the family of multiclass additive algorithms: steps 8 through 10 of Algorithm 2 readily exhibit the generic step sizes \(\{\tau _q\}_{q\in \mathcal{Q}}\) promoted by ultraconservative algorithms (see Algorithm 1). An important feature of UMA is that it only uses information provided by \({\mathcal {S}}\) and makes no assumption about access to the noise-free dataset \({\mathcal {S}}_{\text {true}}\): the pivotal difference with regular ultraconservative algorithms is that the update points used are now the computed vectors \(\varvec{z}_{pq}\) (steps 4 through 7) instead of the \(\varvec{x}_i\)’s. Establishing that, under some conditions, UMA stops and provides a classifier with small risk when those update points are used is the purpose of the following subsections; we will also discuss step 3, the selection step, which is left unspecified here.

For the impatient reader, we may already leak some of the ingredients we use to prove the relevance of our procedure. Theorem 1, which shows the convergence of ultraconservative algorithms, rests on the analysis of the updates made when training examples are misclassified by the current classifier. The conveyed message is therefore that examples that are erred upon are central to the convergence analysis. It turns out that steps 4 through 7 of UMA (cf. Algorithm 2) construct a point \(\varvec{z}_{pq}\) that is, with high probability, erred upon. More precisely, the true class \(t(\varvec{z}_{pq})\) of \(\varvec{z}_{pq}\) is \(q\) and it is predicted to be of class \(p\) by the current classifier; at the same time, these update vectors are guaranteed to realize a positive margin condition with respect to \(W^*\): \(\langle \varvec{w}_q^*,\varvec{z}_{pq}\rangle >\langle \varvec{w}_k^*,\varvec{z}_{pq}\rangle \) for all \(k\ne q\). The ultraconservative feature of the algorithm is carried by step 8 and step 10, which make it possible to update any prototype vector \(\varvec{w}_r\) with \(r\ne q\) having an inner product \(\langle \varvec{w}_r,\mathbf{z}_{pq}\rangle \) with \(\mathbf{z}_{pq}\) larger than \(\langle \varvec{w}_q,\mathbf{z}_{pq}\rangle \) (which should be the largest if a correct prediction were made). The reason why we have results ‘with high probability’ is that the \(\varvec{z}_{pq}\)’s are sample-based estimates of update vectors known to be of class \(q\) but predicted as being of class \(p\), with \(p\ne q\); computing the accuracy of the sample estimates is one of the important exercises of what follows. A control on the accuracy makes it possible for us to then establish the convergence of the proposed algorithm.

3.3 With high probability, \(\mathbf{z}_{pq}\) is a mistake with positive margin

Here, we prove that the update vector \(\varvec{z}_{pq}\) given in step 7 is, with high probability, a point on which the current classifier errs.

Proposition 1

Let \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) and \(\alpha > 0\) be fixed. Let \({\mathcal {A}}_{p}^{\alpha }\) be defined as in step 7 of Algorithm 2, i.e.:
$$\begin{aligned} {\mathcal {A}}_{p}^{\alpha }\doteq \left\{ \varvec{x} |\varvec{x}\in {\mathcal {S}}, \left\langle {\varvec{w}_p} , {\varvec{x}} \right\rangle - \left\langle {\varvec{w}_k} , {\varvec{x}} \right\rangle \ge \alpha ,\;\forall k\ne p\right\} \!. \end{aligned}$$
(8)
For \(k\in \mathcal{Q}\), \(p\ne k\), consider the random variable \(\gamma _{k}^p\) (\(\gamma _{k}^p\) in step 5 of Algorithm 2 is a realization of this variable, hence the overloading of notation \(\gamma _{k}^p\)):
$$\begin{aligned} \gamma _{k}^p\doteq \frac{1}{n}\sum _{i}{\mathbb {I}}{\left\{ Y_i=k \right\} } {\mathbb {I}}{\left\{ X_i\in {\mathcal {A}}_{p}^{\alpha } \right\} }X_i^{\top }. \end{aligned}$$
The following holds, for all \(k\in \mathcal{Q}\):
$$\begin{aligned} {\mathbb {E}}_{{\mathcal {S}}}\left\{ \gamma _{k}^p\right\} = {\mathbb {E}}_{\{(X_i,Y_i)\}_{i=1}^n}\left\{ \gamma _{k}^p\right\} = \sum _{q=1}^QC_{kq}\mu _q^p, \end{aligned}$$
(9)
where
$$\begin{aligned} \mu _q^p\doteq {\mathbb {E}}_{\mathcal {D}_X}\left\{ {\mathbb {I}}{\left\{ t(X)=q \right\} } {\mathbb {I}}{\left\{ X\in {\mathcal {A}}_{p}^{\alpha } \right\} }X^{\top }\right\} . \end{aligned}$$
(10)

Proof

Let us compute \({\mathbb {E}}_{\mathcal {D}_{XY}}\left\{ {\mathbb {I}}{\left\{ Y=k \right\} } {\mathbb {I}}{\left\{ X\in {\mathcal {A}}_{p}^{\alpha } \right\} }X^{\top }\right\} \):
$$\begin{aligned} {\mathbb {E}}_{\mathcal {D}_{XY}}&\left\{ {\mathbb {I}}{\left\{ Y=k \right\} }{\mathbb {I}}{\left\{ X\in {\mathcal {A}}_{p}^{\alpha } \right\} }X^{\top }\right\} \\&=\int _{{\mathcal {X}}}\sum _{q=1}^Q{\mathbb {I}}{\left\{ q=k \right\} }{\mathbb {I}}{\left\{ \varvec{x}\in {\mathcal {A}}_{p}^{\alpha } \right\} }\varvec{x}^{\top }\mathbb {P}_{Y}(Y=q|X=\varvec{x})d\mathcal {D}_{X}(\varvec{x})\\&=\int _{{\mathcal {X}}}{\mathbb {I}}{\left\{ \varvec{x}\in {\mathcal {A}}_{p}^{\alpha } \right\} }\varvec{x}^{\top }\mathbb {P}_{Y}(Y=k|X=\varvec{x})d\mathcal {D}_{X}(\varvec{x})\\&=\int _{{\mathcal {X}}}{\mathbb {I}}{\left\{ \varvec{x}\in {\mathcal {A}}_{p}^{\alpha } \right\} }\varvec{x}^{\top }C_{kt(\varvec{x})}d\mathcal {D}_{X}(\varvec{x})\\&=\int _{{\mathcal {X}}}\sum _{q=1}^Q{\mathbb {I}}{\left\{ t(\varvec{x})=q \right\} }{\mathbb {I}}{\left\{ \varvec{x}\in {\mathcal {A}}_{p}^{\alpha } \right\} }\varvec{x}^{\top }C_{kq}d\mathcal {D}_{X}(\varvec{x})\\&=\sum _{q=1}^Q C_{kq}\int _{{\mathcal {X}}}{\mathbb {I}}{\left\{ t(\varvec{x})=q \right\} }{\mathbb {I}}{\left\{ \varvec{x}\in {\mathcal {A}}_{p}^{\alpha } \right\} }\varvec{x}^{\top }d\mathcal {D}_{X}(\varvec{x}) =\sum _{q=1}^Q C_{kq}\mu _q^p, \end{aligned}$$
where the step introducing \(C_{kt(\varvec{x})}\) follows from (4) and the last line comes from the fact that the classes are non-overlapping. The \(n\) pairs \((X_i,Y_i)\) being identically and independently distributed gives the result. \(\square \)
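Identity (9) can be checked exactly on a small finite distribution. In the Python sketch below, \(X\) is assumed uniform over four unit-norm points of \({\mathbb {R}}^2\), and the concept `t`, the matrix `C`, the current classifier `W` and the values of `p` and `alpha` are all hypothetical choices made for the example (classes are 0-based). The left-hand side enumerates the noisy label explicitly, while the right-hand side combines the vectors \(\mu _q^p\) of (10) with the entries of \(C\).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_A(W, x, p, alpha):
    """Membership test for the set A_p^alpha of (8)."""
    return all(dot(W[p], x) - dot(W[k], x) >= alpha
               for k in range(len(W)) if k != p)

# Hypothetical finite setting: X uniform over four unit-norm points of R^2.
points = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
t = [0, 0, 1, 1]                       # true classes t(x)
C = [[0.8, 0.3], [0.2, 0.7]]           # C[k][q] = P(Y = k | t(x) = q)
W = [[1.0, 0.5], [0.0, 0.0]]           # current (imperfect) classifier
p, alpha, Q, d = 0, 0.1, 2, 2
prob_x = 1.0 / len(points)

# mu[q] as in (10): expectation of I{t(X) = q} I{X in A} X.
mu = [[0.0] * d for _ in range(Q)]
for x, q in zip(points, t):
    if in_A(W, x, p, alpha):
        mu[q] = [m + prob_x * xi for m, xi in zip(mu[q], x)]

# Left-hand side of (9): expectation of I{Y = y} I{X in A} X for each y.
lhs = [[0.0] * d for _ in range(Q)]
for x, q in zip(points, t):
    for y in range(Q):
        if in_A(W, x, p, alpha):
            lhs[y] = [v + prob_x * C[y][q] * xi for v, xi in zip(lhs[y], x)]

# Right-hand side of (9): sum over q of C[k][q] * mu[q].
rhs = [[sum(C[k][q] * mu[q][j] for q in range(Q)) for j in range(d)]
       for k in range(Q)]
```

Here `lhs` and `rhs` agree up to floating-point rounding, and `mu[1]` is a vector built from points of true class 1 that the current classifier assigns to class 0.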

Intuitively, \(\mu _q^p\) must be seen as an example of class \(q\) which is erroneously predicted as being of class \(p\). Such an example is precisely what we are looking for to update the current classifier; as expectations cannot be computed, the estimate \(\varvec{z}_{pq}\) of \(\mu _q^p\) is used instead of \(\mu _q^p\).

Proposition 2

Let \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) and \(\alpha \ge 0\) be fixed. For \(p,q\in \mathcal{Q}\), \(p\ne q\), \({\varvec{z}}_{pq}\in {\mathbb {R}}^d\) is such that
$$\begin{aligned}&{\mathbb {E}}_{\mathcal {D}_{XY}}{{\varvec{z}}_{pq}}=\mu _{q}^p \end{aligned}$$
(11)
$$\begin{aligned}&\langle \varvec{w}_q^*,\mu _{q}^p\rangle - \langle \varvec{w}_k^*,\mu _{q}^p\rangle \ge \theta ,\;\forall k\ne q, \end{aligned}$$
(12)
$$\begin{aligned}&\langle \varvec{w}_p,\mu _{q}^p\rangle - \langle \varvec{w}_k,\mu _{q}^p\rangle \ge \alpha ,\;\forall k\ne p. \end{aligned}$$
(13)
(Normally, we should consider the transpose of \(\mu _{q}^p\), but since we deal with vectors of \({\mathbb {R}}^d\)—and not matrices—we abuse the notation and omit the transpose.)
This means that
  i) \(t(\mu _{q}^p)=q\), i.e. the ‘true’ class of \(\mu _{q}^p\) is \(q\);
  ii) \(f_W(\mu _{q}^p)=p\); \(\mu _{q}^p\) is therefore misclassified by the current classifier \(f_W\).

Proof

According to Proposition 1,
$$\begin{aligned} {\mathbb {E}}_{\mathcal {D}_{XY}}\left\{ \Gamma ^{p}\right\} = {\mathbb {E}}_{\mathcal {D}_{XY}}\left\{ \left[ \begin{array}{c}\gamma _1^p\\ \vdots \\ \gamma _Q^p\end{array}\right] \right\} = \left[ \begin{array}{c} {\mathbb {E}}_{\mathcal {D}_{XY}}\left\{ \gamma _1^p\right\} \\ \vdots \\ {\mathbb {E}}_{\mathcal {D}_{XY}}\left\{ \gamma _Q^p\right\} \end{array}\right] = \left[ \begin{array}{c}\sum _{q=1}^QC_{1q}\mu _q^p\\ \vdots \\ \sum _{q=1}^QC_{Qq}\mu _q^p\end{array}\right] = C\left[ \begin{array}{c}\mu _1^p\\ \vdots \\ \mu _Q^p\end{array}\right] . \end{aligned}$$
Hence, inverting \(C\) and extracting the \(q\)th row of the resulting matrix equality gives that \({\mathbb {E}}\left\{ {\varvec{z}}_{pq}\right\} =\mu _{q}^p\).

Equation (12) is obtained thanks to Assumption 1 combined with (10) and the linearity of the expectation. Equation (13) is obtained thanks to the definition (8) of \({\mathcal {A}}_{p}^{\alpha }\) (made of points that are predicted to be of class \(p\)) and the linearity of the expectation. \(\square \)

The attentive reader may notice that Proposition 2 or, equivalently, step 7, is precisely the reason for requiring \(C\) to be invertible, as the computation of \(\varvec{z}_{pq}\) hinges on the resolution of a system of equations based on \(C\).
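The computation suggested by the proof of Proposition 2 can be sketched as follows: form the matrix \(\Gamma ^p\) whose rows are the \(\gamma _k^p\), solve the linear system in \(C\), and keep row \(q\). The Python code below is a minimal sketch under stated assumptions, not the authors' implementation: it solves the system by Gauss-Jordan elimination, uses 0-based class indices, and the names `solve` and `update_vector` as well as the toy sample are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(C, G):
    """Solve C M = G for M by Gauss-Jordan elimination with partial pivoting."""
    n = len(C)
    A = [list(C[i]) + list(G[i]) for i in range(n)]   # augmented matrix [C | G]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is (numerically) singular")
        A[col], A[pivot] = A[pivot], A[col]
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [row[n:] for row in A]

def update_vector(sample, W, C, p, q, alpha=0.0):
    """Estimate z_pq from noisy pairs (x, y) as row q of C^{-1} Gamma^p."""
    Q, d, n = len(W), len(W[0]), len(sample)
    gamma = [[0.0] * d for _ in range(Q)]             # row k is gamma_k^p
    for x, y in sample:
        scores = [dot(w, x) for w in W]
        # Keep only points predicted as class p with margin at least alpha.
        if all(scores[p] - scores[k] >= alpha for k in range(Q) if k != p):
            gamma[y] = [g + xi / n for g, xi in zip(gamma[y], x)]
    return solve(C, gamma)[q]

# Toy noisy sample whose label frequencies match C exactly (hypothetical data).
C = [[0.8, 0.3], [0.2, 0.7]]
W = [[1.0, 0.5], [0.0, 0.0]]
sample = ([([1.0, 0.0], 0)] * 8 + [([1.0, 0.0], 1)] * 2
          + [([0.6, 0.8], 0)] * 8 + [([0.6, 0.8], 1)] * 2
          + [([0.0, 1.0], 0)] * 3 + [([0.0, 1.0], 1)] * 7
          + [([-1.0, 0.0], 0)] * 3 + [([-1.0, 0.0], 1)] * 7)
z_01 = update_vector(sample, W, C, p=0, q=1, alpha=0.1)   # approximately [0.0, 0.25]
```

In this toy sample the points \((1,0)\) and \((0.6,0.8)\) have true class 0 and the other two have true class 1. The recovered `z_01` points in the direction of \((0,1)\), the only point of true class 1 that `W` assigns to class 0 with margin at least `alpha`.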

Proposition 3

Let \(\varepsilon >0\) and \(\delta \in (0;1]\). There exists a number
$$\begin{aligned} n_0(\varepsilon , \delta , d, Q) = \mathcal {O}\left( \frac{1}{\varepsilon ^2} \left[ \ln \frac{1}{\delta } + \ln Q + d\ln \frac{1}{\varepsilon } \right] \right) \end{aligned}$$
such that if the number of training samples is greater than \(n_0\) then, with probability at least \(1-\delta \),
$$\begin{aligned}&\langle \varvec{w}_q^*,\varvec{z}_{pq}\rangle - \langle \varvec{w}_k^*,\varvec{z}_{pq}\rangle \ge \theta - \varepsilon ,\;\forall k\ne q, \end{aligned}$$
(14)
$$\begin{aligned}&\langle \varvec{w}_p,\varvec{z}_{pq}\rangle - \langle \varvec{w}_k,\varvec{z}_{pq}\rangle \ge 0,\;\forall k\ne p. \end{aligned}$$
(15)

Proof

The existence of \(n_0\) relies on pseudo-dimension arguments. We defer this part of the proof to the “Appendix” and directly assume here that if \(n \ge n_0\), then, with probability at least \(1 - \delta \), for any \(W\) and the corresponding \(\varvec{z}_{pq}\):
$$\begin{aligned} \left| \left\langle {\varvec{w}_p - \varvec{w}_q} , {\varvec{z}_{pq}} \right\rangle - \left\langle {\varvec{w}_p - \varvec{w}_q} , {\mu _{q}^{p}} \right\rangle \right| \le \varepsilon . \end{aligned}$$
(16)
Proving (14) then proceeds by observing that
$$\begin{aligned} \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\varvec{z}_{pq}} \right\rangle = \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\mu _{q}^{p}} \right\rangle + \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\varvec{z}_{pq} - \mu _{q}^{p}} \right\rangle \end{aligned}$$
bounding the first part using Proposition 2:
$$\begin{aligned} \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\mu _{q}^{p}} \right\rangle \ge \theta \end{aligned}$$
and the second one with (16). A similar reasoning allows us to get (15) by setting \(\alpha \doteq \varepsilon \) in \({\mathcal {A}}_{p}^{\alpha }\). \(\square \)

This last proposition essentially says that the update vectors \(\mathbf{z}_{pq}\) that we compute are, with high probability, erred upon and realize a margin condition \(\theta - \varepsilon \).
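The margin \(\theta - \varepsilon \) in (14) comes from chaining the two bounds used in the proof. The following is only a sketch: it assumes that (16) may be applied with \(\varvec{w}_q^* - \varvec{w}_k^*\) in place of \(\varvec{w}_p - \varvec{w}_q\), which the "for any \(\varvec{W}\)" clause suggests but which is established only in the Appendix.
$$\begin{aligned} \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\varvec{z}_{pq}} \right\rangle&= \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\mu _{q}^{p}} \right\rangle + \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\varvec{z}_{pq} - \mu _{q}^{p}} \right\rangle \\&\ge \theta - \left| \left\langle {\varvec{w}_q^* - \varvec{w}_k^*} , {\varvec{z}_{pq} - \mu _{q}^{p}} \right\rangle \right| \\&\ge \theta - \varepsilon . \end{aligned}$$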

Note that \(\alpha \) is needed to cope with the imprecision incurred by the use of empirical estimates. Indeed, we can only approximate \(\langle \varvec{w}_p,\varvec{z}_{pq}\rangle - \langle \varvec{w}_k,\varvec{z}_{pq}\rangle \) in (15) up to a precision of \(\varepsilon \). Thus, for the result to hold, we need to have \(\langle \varvec{w}_p,\mu _{q}^p\rangle - \langle \varvec{w}_k,\mu _{q}^p\rangle \ge \varepsilon \), which is obtained from (13) when \(\alpha = \varepsilon \). In practice, this just says that the points used in the computation of \(\varvec{z}_{pq}\) are at a distance of at least \(\alpha \) from any decision boundary.

Remark 2

It is important to understand that the parameter \(\alpha \) helps us derive sample complexity results by allowing us to retrieve a linearly separable training dataset with positive margin from the noisy dataset. The theoretical results we prove hold for any such \(\alpha >0\) parameter and the smaller this parameter, the larger the sample complexity, i.e., the harder it is for the algorithm to take advantage of a training sample that meets the sample complexity requirements. In other words, the smaller \(\alpha \), the less likely it is for U MA to succeed; yet, as shown in the experiments, where we use \(\alpha =0\), U MA continues to perform quite well.

3.4 Convergence and stopping criterion

We arrive at our main result, which provides both convergence and a stopping criterion.

Proposition 4

Under Assumptions 1, 2 and 3 there exists a number \(n\), polynomial in \(d, 1/\theta , Q, 1/\delta \), such that if the training sample is of size at least \(n\), then, with high probability (\(1 - \delta \)), U MA makes at most \(\mathcal {O}(1/{\theta }^2)\) updates.

Proof

Let \({\mathcal {S}}_{\varvec{z}_{}}\) be the set of all the update vectors \(\varvec{z}_{pq}\) generated during the execution of U MA and labeled with their true class \(q\). Observe that, in this context, U MA (Alg. 2) behaves like a regular ultraconservative algorithm run on \({\mathcal {S}}_{\varvec{z}_{}}\). Namely: a) lines 4 through 7 compute a new point in \({\mathcal {S}}_{\varvec{z}_{}}\), and b) lines 8 through 10 perform an ultraconservative update step.

From Proposition 3, we know that, with high probability, \(w^*\) is a classifier with positive margin \(\theta - \varepsilon \) on \({\mathcal {S}}_{\varvec{z}_{}}\), and it follows from Theorem 1 that U MA does not make more than \(\mathcal {O}(1/{\theta }^2)\) mistakes on such a dataset.

Because, by construction, each element of \({\mathcal {S}}_{\varvec{z}_{}}\) is, with high probability, erred upon, we have \(\vert {\mathcal {S}}_{\varvec{z}_{}} \vert \in \mathcal {O}(1/{\theta }^2)\); that is, with high probability, U MA does not make more than \(\mathcal {O}(1/{\theta }^2)\) updates.

All in all, after \(\mathcal {O}(1/{\theta }^2)\) updates, there is a high probability that we are not able to construct examples on which U MA makes a mistake or, equivalently, the conditional misclassification errors \(\mathbb {P}(f_{W}(X)=p|Y=q)\) are all small. \(\square \)

Even though U MA operates in a batch setting, it 'internally' simulates the execution of an online algorithm that encounters a new training point (\(\varvec{z}_{pq} \in {\mathcal {S}}_{\varvec{z}_{}}\)) at each time step. To see more precisely how U MA can be viewed as an online algorithm, it suffices to imagine it being run in a way where each vector update is made after a chunk of \(n\) training points (where \(n\) is as in Proposition 4) has been encountered and used to compute the next element of \({\mathcal {S}}_{\varvec{z}_{}}\). Repeating this process \(\mathcal {O}(1/{\theta }^2)\) times then guarantees convergence with high probability. Note that, in this scenario, U MA requires \(n' = \mathcal {O}(n/{\theta }^2)\) training points to converge, which might be far more than the sample complexity exhibited in Proposition 4. Nonetheless, \(n'\) still remains polynomial in \(d\), \(1/{\theta }\), \(Q\) and \(1/{\delta }\). For more detail on this (online to batch conversion) approach, we refer the interested reader to Blum et al. (1996).

3.5 Selecting \(p\) and \(q\)

So far, the question of selecting good pairs of values \(p\) and \(q\) to perform updates has been left unanswered. Indeed, our results hold for any pair \((p,q)\) and convergence is guaranteed even when \(p\) and \(q\) are arbitrarily selected as long as \(\varvec{z}_{pq}\) is not \(\mathbf {0}\). Nonetheless, it is reasonable to use heuristics for selecting \(p\) and \(q\) with the hope that it might improve the practical convergence speed.

On the one hand, we may focus on the pairs \((p,q)\) for which the empirical misclassification rate
$$\begin{aligned} \hat{\mathbb {P}}_{X\sim {\mathcal {S}}}\left\{ f_W(X) \ne t(X)\right\} \doteq \frac{1}{n}\sum _{i=1}^n{\mathbb {I}}{\left\{ f_W(\varvec{x}_i)\ne t(\varvec{x}_i) \right\} } \end{aligned}$$
(17)
is the highest (\(X\sim {\mathcal {S}}\) means that \(X\) is randomly drawn from the uniform distribution of law \(\varvec{x}\mapsto n^{-1}\sum _{i=1}^n{\mathbb {I}}{\left\{ \varvec{x}=\varvec{x}_i \right\} }\) defined with respect to training set \({\mathcal {S}}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\)). We want to favor those pairs \((p,q)\) because, i) the induced update may lead to a greater reduction of the error and ii) more importantly, because \(\varvec{z}_{pq}\) may be more reliable, as \({\mathcal {A}}_{p}^{\alpha }\) will be bigger.
On the other hand, recent advances in the passive aggressive literature (Ralaivola 2012) have emphasized the importance of minimizing the empirical confusion rate, given for a pair \((p,q)\) by the quantity
$$\begin{aligned} \hat{\mathbb {P}}_{X\sim {\mathcal {S}}}\left\{ f_W(X)=p|t(X)=q\right\} \doteq \frac{1}{n_q}\sum _{i=1}^n{\mathbb {I}}{\left\{ t(\varvec{x}_i)=q, f_W(\varvec{x}_i)=p \right\} }, \end{aligned}$$
(18)
where
$$\begin{aligned} n_q\doteq \sum _{i=1}^n{\mathbb {I}}{\left\{ t(\varvec{x}_i)=q \right\} }. \end{aligned}$$
This approach is especially worthwhile when dealing with imbalanced classes, where one might want to optimize the selection of \((p,q)\) with respect to the confusion rate.
Obviously, since the true labels in the training data cannot be accessed, neither of the quantities defined in (17) and (18) can be computed. Using a result provided in Blum et al. (1996), which states that the norm of an update vector computed as \(\mathbf{z}_{pq}\) directly provides an estimate of (17), we devise two possible strategies for selecting \((p,q)\):
$$\begin{aligned} (p,q)_{\text {error}}&\doteq \mathop {{{\mathrm{\text {argmax}\;}}}}\limits _{(p,q)} \Vert \mathbf{z}_{pq}\Vert \end{aligned}$$
(19)
$$\begin{aligned} (p,q)_{\text {conf}}&\doteq \mathop {{{\mathrm{\text {argmax}\;}}}}\limits _{(p,q)}\frac{\Vert \mathbf{z}_{pq}\Vert }{\hat{\pi }_q}, \end{aligned}$$
(20)
where \(\hat{\pi }_q\) is the estimated proportion of examples of true class \(q\) in the training sample. In a way similar to the computation of \(\mathbf{z}_{pq}\) in Algorithm 2, \(\hat{\pi }_q\) may be estimated as follows:
$$\begin{aligned} \hat{\pi }_q=\frac{1}{n}[C^{-1}{\hat{\varvec{y}}}]_q, \end{aligned}$$
where \(\hat{\varvec{y}}\in {\mathbb {R}}^Q\) is the vector containing the number of examples from \({\mathcal {S}}\) having noisy labels \(1,\ldots ,Q\), respectively.

The second selection criterion is intended to normalize the number of errors with respect to the proportions of different classes and aims at being robust to imbalanced data. Our goal here is to provide a way to take into account the class distribution for the selection of \((p,q)\). Note that this might be a first step towards transforming U MA into an algorithm for minimizing the confusion risk, even though additional (and significant) work is required to provably provide U MA with this feature.

On a final note, we remark that \((p,q)_{\text {conf}}\) requires additional precautions when used: when \((p,q)_{\text {error}}\) is implemented, \(\varvec{z}_{pq}\) is guaranteed to be the update vector of maximum norm among all possible update vectors, whereas this no longer holds true when \((p,q)_{\text {conf}}\) is used and if \(\varvec{z}_{pq}\) is close to \(\mathbf {0}\) then there may exist another possibly more informative—from the standpoint of convergence speed—update vector \(\varvec{z}_{p'q'}\) for some \((p',q')\ne (p,q).\)
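The two selection rules (19) and (20) can be illustrated with a minimal sketch. It assumes the update vectors \(\mathbf{z}_{pq}\) have already been computed and are stored in a dictionary keyed by \((p,q)\); the function names `solve_linear_system`, `estimate_proportions` and `select_pair` are hypothetical helpers introduced for this illustration only. The estimate of \(\hat{\pi }_q\) follows the formula as written above; whether \(C\) or its transpose should be inverted depends on the orientation convention chosen for \(C\), which this sketch does not settle.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def estimate_proportions(C, noisy_counts):
    """Estimate pi_hat = (1/n) * C^{-1} y_hat, following the text literally."""
    n = sum(noisy_counts)
    return [v / n for v in solve_linear_system(C, noisy_counts)]


def select_pair(Z, proportions=None):
    """Return the pair (p, q) maximizing ||z_pq|| (rule 19) or
    ||z_pq|| / pi_hat_q (rule 20) when proportions are given."""
    def score(pair):
        norm = math.sqrt(sum(v * v for v in Z[pair]))
        if proportions is None:
            return norm
        return norm / proportions[pair[1]]
    return max(Z, key=score)
```

With a symmetric confusion matrix the orientation question does not arise, so such a matrix is convenient for a quick check: the two rules can then be compared on the same dictionary of update vectors, and they may well select different pairs when the classes are imbalanced.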

3.6 U MA and kernels

Thus far, we have only considered the situation where linear classifiers are learned. There are however many learning problems that cannot be handled effectively without going beyond linear classification. A popular strategy to deal with such a situation is to make use of kernels (Schölkopf and Smola 2002). In this direction, there are (at least) two paths that can be taken. The first one is to revisit U MA and provide a kernelized algorithm based on a dual representation of the weight vectors, as is done with the kernel Perceptron (see Cristianini and Shawe-Taylor 2000) or its close cousins (see, e.g. Friess et al. 1998; Dekel et al. 2005; Freund and Schapire 1999). Doing so would raise the question of finding sparse expansions of the weight vectors with respect to the training data in order to contain the prediction time and to derive generalization guarantees based on such sparsity: this is an interesting and ambitious research program on its own. A second strategy, which we make use of in the numerical simulations, is simply to build upon the idea of Kernel Projection Machines (Blanchard and Zwald 2008; Takerkart and Ralaivola 2011): first, perform a Kernel Principal Component Analysis (abbreviated as kernel-PCA hereafter) with \(D\) principal axes; second, project the data onto the principal \(D\)-dimensional subspace; and, finally, run U MA on the obtained data. The availability of numerous methods to efficiently extract the principal subspaces (or approximations thereof) (Bach and Jordan 2002; Drineas et al. 2006; Drineas and Mahoney 2005; Stempfel and Ralaivola 2007; Williams and Seeger 2000) makes this path a viable strategy to render U MA usable for nonlinearly separable concepts. This explains why we decided to use this strategy in the present paper.

4 Experiments

In this section, we present results from numerical simulations of our approach and we discuss different practical aspects of U MA. The ultraconservative step sizes retained are those corresponding to a regular Perceptron: \(\tau _p=-1\) and \(\tau _q=+1\), the other values of \(\tau _r\) being equal to 0.

Section 4.1 discusses robustness results, based on simulations conducted on synthetic data while Section 4.2 takes it a step further and evaluates our algorithm on real data, with a realistic noise process related to Example 1 (cf. Sect. 1).

We essentially use what we call the confusion rate as a performance measure, which is:
$$\begin{aligned} \frac{1}{\sqrt{Q}} \Vert {{\widehat{C}}} \Vert _F \end{aligned}$$
where \(\Vert {{\widehat{C}}} \Vert _F\) is the Frobenius norm of the confusion matrix \(\widehat{C}\) computed on a test set \(S_{\text {test}}\) (independent from the training set), i.e.:
$$\begin{aligned} \Vert {{\widehat{C}}} \Vert _F^2 = \sum _{i,j} \widehat{C}_{ij}^2,\text { with }\widehat{C}_{pq} \doteq \left\{ \begin{array}{ll} 0 &{}\quad \text { if } p=q,\\ \displaystyle \frac{\sum _{\varvec{x}_i \in S_{\text {test}}} {\mathbb {I}}{\left\{ \widehat{y}_i = p \text { and } t_i = q \right\} }}{\sum _{\varvec{x}_i \in S_{\text {test}}} {\mathbb {I}}{\left\{ t_i = q \right\} }}&{}\quad \text { otherwise,} \end{array}\right. \end{aligned}$$
with \(\widehat{y}_i\) the label predicted for the test instance \(\mathbf{x}_i\) by the learned predictor. \(\widehat{C}\) is much akin to a recall matrix, and the \(1/\sqrt{Q}\) factor ensures that the confusion rate lies between 0 and 1.
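The definition above translates directly into a few lines of code. The sketch below is a hedged illustration, not the implementation used for the experiments: it assumes labels are integers in \(0,\ldots ,Q-1\), and the function name `confusion_rate` is hypothetical. Classes absent from the test set are skipped, which is an assumption of this sketch since the definition leaves that case unspecified.

```python
import math


def confusion_rate(predicted, true, num_classes):
    """Return (1 / sqrt(Q)) * ||C_hat||_F for integer labels in 0..Q-1."""
    Q = num_classes
    # counts[p][q] = number of test points of true class q predicted as p
    counts = [[0] * Q for _ in range(Q)]
    class_sizes = [0] * Q
    for y_hat, t in zip(predicted, true):
        counts[y_hat][t] += 1
        class_sizes[t] += 1
    squared_norm = 0.0
    for p in range(Q):
        for q in range(Q):
            # Diagonal entries are zero by definition; classes with no
            # test example are skipped (assumption of this sketch).
            if p == q or class_sizes[q] == 0:
                continue
            squared_norm += (counts[p][q] / class_sizes[q]) ** 2
    return math.sqrt(squared_norm) / math.sqrt(Q)
```

Each column of \(\widehat{C}\) sums to at most 1, so its squared entries also sum to at most 1; this is why dividing the Frobenius norm by \(\sqrt{Q}\) keeps the value in \([0,1]\).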

4.1 Toy dataset

We use a 10-class dataset with a total of roughly 1000 2-dimensional examples uniformly distributed according to \(\mathcal {U}\), which is the uniform distribution over the unit circle centered at the origin. Labelling is achieved according to (1) given a set of 10 weight vectors \(\varvec{w}_1,\ldots ,\varvec{w}_{10}\), which are also randomly generated according to \(\mathcal {U}\); all these weight vectors have therefore norm 1. A margin \(\theta = 0.025\) is enforced in the generated data by removing examples that are too close to the decision boundaries—practically, with this value of \(\theta \), the case where three classes are so close to each other that no training example from one of the classes remained after enforcing the margin never occurred.
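A dataset of this kind can be generated along the following lines. This is a minimal sketch under stated assumptions: it takes the labelling rule (1) to be the usual argmax over \(\langle \varvec{w}_k,\varvec{x}\rangle \), it enforces the margin by discarding points whose best and second-best scores differ by less than \(\theta \), and the names `random_unit_vector` and `make_toy_dataset` are hypothetical.

```python
import math
import random


def random_unit_vector(rng):
    """Draw a point uniformly on the unit circle centered at the origin."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (math.cos(angle), math.sin(angle))


def make_toy_dataset(n_points, n_classes, margin, seed=0):
    """Return (weights, points, labels) with the given margin enforced."""
    rng = random.Random(seed)
    weights = [random_unit_vector(rng) for _ in range(n_classes)]
    points, labels = [], []
    for _ in range(n_points):
        x = random_unit_vector(rng)
        scores = [w[0] * x[0] + w[1] * x[1] for w in weights]
        ranked = sorted(range(n_classes), key=lambda k: scores[k], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        # Keep the point only if it is far enough from the decision boundary.
        if scores[best] - scores[runner_up] >= margin:
            points.append(x)
            labels.append(best)
    return weights, points, labels
```

Because points are rejected rather than resampled, the returned dataset is somewhat smaller than `n_points`, in line with the "roughly 1000" examples mentioned above.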

The learned classifiers are tested against a dataset of 10,000 points that are distributed according to the training distribution. The results reported in the tables and graphics are averaged over 10 runs.

The noise is generated from the sole confusion matrix. This situation can be tough to handle and is rarely met with real data but we stick with it as it is a good example of a worst-case scenario.
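Generating noise "from the sole confusion matrix" means that the noisy label depends only on the true class, not on the position of the point. A minimal sketch, assuming that row \(p\) of \(C\) holds the distribution of the noisy label given true class \(p\) (consistent with \(C\) being row-stochastic) and using the hypothetical name `corrupt_labels`:

```python
import random


def corrupt_labels(true_labels, C, seed=0):
    """Draw one noisy label per example from the row of C indexed by
    its true label (assumed convention: C[p][q] = P(noisy = q | true = p))."""
    rng = random.Random(seed)
    classes = list(range(len(C)))
    return [rng.choices(classes, weights=C[t], k=1)[0] for t in true_labels]
```

With \(C = I\) the labels are left untouched, which gives a convenient sanity check.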

Robustness to noise We first (Fig. 1a) evaluate the robustness to noise of U MA by running our algorithm with various confusion matrices. We uniformly draw a reference nonnegative square matrix \(M\) and normalize its rows, i.e., each entry of \(M\) is divided by the sum of the elements of its row, so that \(M\) is a stochastic matrix. If \(M\) is not invertible, it is rejected and we draw a new matrix until we have an invertible one. Then, we define \(N\) such that \(N = {(M - I)}/10\), where \(I\) is the identity matrix of order \(Q\); typically \(N\) has nonpositive diagonal entries and nonnegative off-diagonal coefficients. We use \(N\) to parametrize a family of confusion matrices whose dominant coefficients move from the diagonal to the off-diagonal parts. Namely, we run U MA 20 times with confusion matrices \(C\in \{C_i\doteq \Omega (I + iN)\}_{i=1}^{20}\), where \(\Omega \) is a matrix operator which outputs a (row-)stochastic matrix: when applied to a matrix \(A\), \(\Omega \) replaces the negative elements of \(A\) by zeros and normalizes the rows of the obtained matrix; note that \(i = 10\) corresponds to the case where \(C= M\). Equivalently, one can think of \(C_i\) as the weighted average between \(I\) and \(\Omega (N)\) where \(I\) has a constant weight of 1 and \(\Omega (N)\) is weighted by \(i\). Note that, after some point, further increasing \(i\) has little effect on \(C_i\) as it eventually converges to \(\Omega (N)\). Figure 1a plots our results against the Frobenius norm of the diagonal-free confusion matrix \(C\), that is: \(\Vert {C- \mathtt{diag}(C)} \Vert _F\) where \(\mathtt{diag}(C)\) denotes the diagonal matrix with the same diagonal values as \(C\). For the sake of comparison, we have also run U MA with a fixed confusion matrix \(C= I\) on the same data.
This amounts to running a Perceptron through the data multiple times and it allows us to have a baseline for measuring the improvement induced by the use of the confusion matrix.
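The construction of the family \(\{C_i\}\) can be sketched as follows. The code takes an already drawn row-stochastic reference matrix \(M\) as input (the random draw and the invertibility check are left out), and the names `omega`, `confusion_family` and `offdiag_frobenius` are hypothetical helpers chosen for this illustration.

```python
import math


def omega(A):
    """Replace negative entries by zero, then normalize each row to sum to 1."""
    clipped = [[max(a, 0.0) for a in row] for row in A]
    return [[a / sum(row) for a in row] for row in clipped]


def confusion_family(M, steps=20):
    """Return [C_1, ..., C_steps] with C_i = omega(I + i * N), N = (M - I) / 10."""
    Q = len(M)
    identity = [[1.0 if p == q else 0.0 for q in range(Q)] for p in range(Q)]
    N = [[(M[p][q] - identity[p][q]) / 10.0 for q in range(Q)] for p in range(Q)]
    family = []
    for i in range(1, steps + 1):
        A = [[identity[p][q] + i * N[p][q] for q in range(Q)] for p in range(Q)]
        family.append(omega(A))
    return family


def offdiag_frobenius(C):
    """Frobenius norm of C with its diagonal removed (x-axis of Fig. 1a)."""
    Q = len(C)
    return math.sqrt(sum(C[p][q] ** 2 for p in range(Q) for q in range(Q) if p != q))
```

Since the rows of \(N\) sum to zero, the rows of \(I + iN\) already sum to one, so \(\Omega \) only has an effect once a diagonal entry becomes negative; in particular the tenth element of the returned list coincides with \(M\) up to rounding.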
Fig. 1

a Evolution of the confusion rate (y-axis) for different noise levels (x-axis); b evolution of the same quantity with respect to errors in the confusion matrix \(C\) (x-axis) measured by the approximation factor (see text)

Robustness to the incorrect estimation of the confusion matrix The second experiment (Fig. 1b) evaluates the robustness of U MA to the use of a confusion matrix that is not exactly the confusion matrix that describes the noise process corrupting the data; this will allow us to measure the extent to which a confusion matrix (inaccurately) estimated from the training data can be dealt with by U MA. Using the same notation as before, and the same idea of generating a random stochastic reference matrix \(M\), we proceed as follows: we use the given matrix \(M\) to corrupt the noise-free dataset and then, each confusion matrix from the family \(\{C_i\}_{i=1}^{20}\) is fed to U MA as if it were the confusion matrix governing the noise process. We introduce the notion of approximation factor \(\rho \) as \(\rho (i)\doteq 1-i/10\), so that \(\rho \) takes values in the set \(\{-1,-0.9,\ldots ,0.9\}\). As reference, the limit case where \(\rho = 1\)—that is, \(i = 0\)—corresponds to the case where U MA is fed with the identity matrix \(I\), effectively being oblivious of any noise in the training set. More generally, the values of \(C\) are being shifted away from the diagonal as \(\rho \) decreases, the equilibrium point being \(\rho = 0\) where \(C\) is equal to the true confusion matrix \(M\). Consequently, a positive (resp. negative) approximation factor means that the noise is underestimated (resp. overestimated), in the sense that the noise process described by \(C\) would corrupt a lower (resp. higher) fraction of labels from each class than the true noise process applied on the training set, and corresponding to \(M\). Figure 1b plots the confusion rate against this approximation factor.

On Fig. 1a we observe that U MA clearly provides improvement over the Perceptron algorithm for every noise level tested, as it achieves lower confusion rates. Nonetheless, its performance degrades as the noise level increases, going from a confusion rate of \(0.5\) for small noise levels (that is, when \(\Vert {C- \mathtt{diag}(C)} \Vert _F\) is small) to roughly \(2.25\) when the noise is the strongest. Comparatively, the Perceptron algorithm follows the same trend, but with higher confusion rate, ranging from \(1.7\) to \(2.75\).

The second simulation (Fig. 1b) points out that, in addition to being robust to the noise process itself, U MA is also robust to underestimated (approximation factor \(\rho > 0\)) noise levels, but not to overestimated (approximation factor \(\rho < 0\)) noise levels. Unsurprisingly, the best confusion rate corresponds to an approximation factor of 0, which means that U MA is using the true confusion matrix and can achieve a confusion rate as low as \(1.8\). There is a clear gap between positive and negative approximation factors, the former yielding confusion rates around \(2.6\) while the latter’s are slightly lower, around \(2.15\). From these observations, it is clear that the approximation factor has a major influence on the performances of U MA.

4.2 Real data

4.2.1 Experimental protocol

In addition to the results on synthetic data, we also perform simulations in a realistic learning scenario. In this section we are going to assume that labelling examples is very expensive and we implement the strategy evoked in Example 1. More precisely, for a given dataset \({\mathcal {S}}\), proceed as follows:
  1. Ask for a small number \(m\) of examples for each of the \(Q\) classes.
  2. Learn a rough classifier\(^{1}\) \(g\) from these \(Q \times m\) points.
  3. Estimate the confusion \(C\) of \(g\) on a small labelled subset \({\mathcal {S}}_{\text {conf}}\) of \({\mathcal {S}}\).
  4. Predict the missing labels \(\varvec{y}\) of \({\mathcal {S}}\) using \(g\); thus, \(\varvec{y}\) is a sequence of noisy labels.
  5. Learn the final classifier \(f_{\mathtt{UMA}}\) from \({\mathcal {S}}\), \(\varvec{y}\), \(C\) and measure its error rate.

One might wonder why we do not simply sample a very small portion of \({\mathcal {S}}\) in the first step. The reason is that, in the case of very uneven class proportions, some of the classes may be missing in this first sampling. This is problematic when estimating \(C\) as it leads to a non-invertible confusion matrix. Moreover, the purpose of \(g\) is only to provide a baseline for the computation of \(\varvec{y}\), hence tweaking the class (im)balance in this step is not a problem.
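Step 3 of the protocol, and the invertibility issue just mentioned, can be made concrete with a short sketch. It assumes the row convention \(C_{pq}\) = fraction of examples of true class \(p\) that \(g\) labels as \(q\), and the names `estimate_confusion` and `unpredicted_classes` are hypothetical. A class that \(g\) never predicts yields an all-zero column, which is enough to make \(C\) singular.

```python
def estimate_confusion(true_labels, predicted_labels, num_classes):
    """Row-normalized confusion matrix of g on a labelled subset.

    Assumed convention: C[p][q] is the fraction of examples of true
    class p that g labels as q.
    """
    Q = num_classes
    counts = [[0] * Q for _ in range(Q)]
    for t, y in zip(true_labels, predicted_labels):
        counts[t][y] += 1
    C = []
    for p in range(Q):
        total = sum(counts[p])
        if total == 0:
            raise ValueError(f"class {p} is absent from the labelled subset")
        C.append([c / total for c in counts[p]])
    return C


def unpredicted_classes(C):
    """Classes that g never outputs; each gives a zero column, so C is singular."""
    Q = len(C)
    return [q for q in range(Q) if all(C[p][q] == 0 for p in range(Q))]
```

Checking `unpredicted_classes(C)` before inverting \(C\) is a cheap way to detect the failure mode described above, although an empty result does not by itself guarantee invertibility.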

In order to put our results into perspective, we compare them with results obtained from various algorithms. This allows us to give a precise idea of the benefits and limitations of U MA. Namely, we learn four additional classifiers: \(f_{\varvec{y}}\) is a regular Perceptron learned on \({\mathcal {S}}\) labelled with noisy labels \(\varvec{y}\), \(f_{\text {conf}}\) and \(f_{\text {full}}\) are trained with the correctly labelled training sets \({\mathcal {S}}_{\text {conf}}\) and \({\mathcal {S}}\) respectively and, lastly, \(f_{{\mathtt{S3VM}}}\) is a classifier produced by a multiclass semi-supervised SVM algorithm (S3VM, Bennett and Demiriz 1998) run on \({\mathcal {S}}\) where only the labels of \({\mathcal {S}}_{\text {conf}}\) are provided. The performances achieved by \(f_{\varvec{y}}\) and \(f_{\text {full}}\) provide bounds for the error rates of U MA: on the one hand, \(f_{\varvec{y}}\) corresponds to a worst-case situation, as we simply ignore the confusion matrix and use the regular Perceptron instead (arguably, U MA should perform better than this); on the other hand, \(f_{\text {full}}\) represents the best-case scenario for learning, when all the correct labels are available (the performance of \(f_{\text {full}}\) should always top that of U MA and of the other classifiers). The last two classifiers, \(f_{\text {conf}}\) and \(f_{{{\mathtt{S3VM}}}}\), provide us with objective comparison measures. They are learned from the same data as U MA but use them differently: \(f_{\text {conf}}\) is learned from the reduced training set \({\mathcal {S}}_{\text {conf}}\), whereas \(f_{{{\mathtt{S3VM}}}}\) is output by a semi-supervised learning strategy that infers both \(f_{{{\mathtt{S3VM}}}}\) and the missing labels of \({\mathcal {S}}\), and totally ignores the predictions \(\varvec{y}\) made by \(g\). Note that according to the learning scenario we implement, we assume \(C\) to be estimated from raw data.
This might not always be the case with real-world problems and \(C\) might be easier and/or less expensive to get than raw data; for instance, it might be deduced from expert knowledge on the studied domain. In that case, \(f_{\text {conf}}\) and \(f_{{{\mathtt{S3VM}}}}\) may suffer from not taking full advantage of the accurate information about the confusion.

4.2.2 Datasets

Our simulations are conducted on three different datasets, each with different features. For the sake of reproducibility, we used datasets that can be easily found on the UCI Machine Learning Repository (Bache and Lichman 2013). Moreover, these datasets correspond to tasks for which generating a complete, labelled training set is typically costly, because of the necessity of human supervision, and subject to classification noise. The datasets used and their main features are as follows.

Optical recognition of handwritten digits This well-known dataset is composed of \(8\times 8\) grey-level images of handwritten digits, ranging from 0 to 9. The dataset is composed of 3823 images of 64 features for training, and 1797 for the test phase. We set \(m\) to 10 for this dataset, which means that \(g\) is learned from 100 examples only. \({\mathcal {S}}_{\text {conf}}\) is a sampling of 5 % of \({\mathcal {S}}\). The classes are evenly distributed (see Fig. 2a). We handle the nonlinearity through the use of a Gaussian kernel-PCA (see Sect. 3.6) to project the data onto a feature space of dimension 640.
Fig. 2

Class distribution for the three datasets. a Handwritten digits, b letter recognition, c Reuters

Letter recognition The Letter Recognition dataset is another well-known pattern recognition dataset. The images of the letters are summarized into a vector of 16 attributes, which correspond to various primitives computed on the raw data. With 20,000 examples, this dataset is much larger than the previous one. As for the handwritten digits dataset, the examples are evenly spread across the 26 classes (see Fig. 2b). We uniformly select 15,000 examples for training and the remaining 5000 are used for test. We set \(m\) to 50 as it seems that smaller values do not yield usable confusion matrices. We again sample 5 % of the dataset to form \({\mathcal {S}}_{\text {conf}}\) and use, as before, a Gaussian kernel-based Kernel-PCA to (nonlinearly) expand the dimension of the data to 1600.

Reuters The Reuters dataset is a nearly linearly separable document categorization dataset of more than 300,000 instances of nearly 47,000 features each. For size reasons we restrict ourselves to roughly 15,000 examples for training, and 15,000 others for testing. It turns out that some classes are so underrepresented that they are flooded by the noise process and/or do not appear in \({\mathcal {S}}_{\text {conf}}\), which may lead to a non-invertible confusion matrix. We therefore restrict the dataset to the nine largest classes. One might wonder whether doing so erases class imbalance. This is not the case as, even this way, the least represented class accounts for roughly 500 examples while this number reaches nearly 4000 for the most represented one (see Fig. 2c). Actually, these 9 classes represent more than 70 % of the dataset, reducing the training and test sets to approximately 11,000 examples each. We do not use any kernel for this dataset, the data being already nearly linearly separable. Also, we sample \({\mathcal {S}}_{\text {conf}}\) on 5 % of the training set and we set \(m = 20\).

4.2.3 Results

Table 1 presents the misclassification error rates averaged over 10 runs. Keep in mind that we have not conducted a very thorough optimization of the hyper-parameters as the point here is essentially to compare U MA with the other algorithms. Additionally, we also report the error rates of \(f_{{{\mathtt{S3VM}}}}\) when trained on the kernelized data with all dimensions, that is, the kernelized data before we project them onto their \(D\) principal components. Because the projection step is indeed unnecessary with S3VM, this will give us insights on the error due to the Kernel-PCA step. Comparing the first and the fifth columns of Table 1, it appears that U MA always induces a performance gain, i.e. a decrease of the misclassification rate, with respect to \(f_{\varvec{y}}\).

From the second and third columns of Table 1, it is clear that the reduced number of examples available to \(f_{\text {conf}}\) induces a drastic increase in the misclassification rate with respect to \(f_{\text {full}}\) which is allowed to use the totality of the dataset during the training phase.
Table 1

Misclassification rates of different algorithms

| Dataset | \({f}_{{{\mathbf {y}}}}\) | \({f_{\mathtt{conf}}}\) | \({f_{\mathtt{full}}}\) | \({f_{\mathtt{S3VM}}}\) | U MA | \({f}_{\mathtt{S3VM}}\) (no K-PCA) |
| --- | --- | --- | --- | --- | --- | --- |
| Handwritten digits | \(0.25\) | \(0.21\) | \(0.04\) | \(0.15\) | \(0.16\) | \(0.07\) |
| Letter recognition | \(0.35\) | \(0.36\) | \(0.23\) | \(0.49\) | \(0.33\) | \(0.18\) |
| Reuters | \(0.30\) | \(0.17\) | \(0.01\) | \(0.22\) | \(0.21\) | \(0.22\) |

Comparing U MA and \(f_{\text {conf}}\) in Table 1 (fifth and second columns), we observe that U MA achieves lower misclassification rates on the Handwritten Digits and Letter Recognition datasets but a higher misclassification rate on Reuters. This is likely related to the strong class imbalance in that dataset. Indeed, some classes are overly represented, accounting for the vast majority of the whole dataset (see Fig. 2c). Because \({\mathcal {S}}_{\text {conf}}\) is uniformly sampled from the main dataset, \(f_{\text {conf}}\) is trained with a lot of examples from the overrepresented classes and therefore it is very effective, in the sense that it achieves a low misclassification rate, for these overrepresented classes; this, in turn, induces a (global) low misclassification rate, as possibly high misclassification rates on underrepresented classes are offset by the small portion of the data that these classes account for. On the other hand, because of this disparity in class representation, the slightest error in the confusion matrix, provided it involves one of these overrepresented classes, may lead to a significant increase of the misclassification rate. In this regard, U MA is strongly disadvantaged with respect to \(f_{\text {conf}}\) on the Reuters dataset, and this is the likely cause of the reported results.

The error rates for the S3VM and U MA classifiers are close for the Reuters and Handwritten Digits datasets whereas U MA has a clear advantage on the Letter Recognition problem. On the other hand, note that we used the S3VM method in conjunction with a Kernel-PCA for the sake of comparison with U MA in its kernelized form. The last column of Table 1 tends to confirm that this projection strategy increases the error rate of \(f_{{{\mathtt{S3VM}}}}\). Also, recall that the value of \(m\) does not impact the performances of \(f_{{{\mathtt{S3VM}}}}\) but has a significant effect on U MA, even though U MA never uses these labelled data. For instance, on the Reuters dataset, increasing \(m\) from 20 to 70 reduces the error rate of U MA by nearly \(0.1\) (see the error rates of Fig. 3 (\(m=70\)) when the size of labelled data is close to 550, that is 5 % of the whole dataset). Despite our efforts to keep \(m\) as small as possible, we could not go under \(m = 50\) for the Letter Recognition dataset without compromising the invertibility of the confusion matrix. The simple fact that an unusually high number of examples is required to simply learn a rough classifier attests to the complexity of this dataset. Moreover, the fact that \(f_{\varvec{y}}\) also outperforms \(f_{{{\mathtt{S3VM}}}}\) implies that the labels fed to U MA are already mostly correct, and, according to our working assumptions, this is the most favorable setting for U MA.
Fig. 3

Error rate of U MA and \(f_{\text {conf}}\) with respect to the sampling size. Reuters dataset with \(m=70\) for the sake of the figure's readability

Nonetheless, the disparities between U MA and \(f_{\text {conf}}\) deserve more attention. Indeed, the same data are being used by both algorithms, and one could expect more closeness in the results. To get a better insight on what is occurring, we have reported the evolution of the error rate of these two algorithms with respect to the sampling size of \({\mathcal {S}}_{\text {conf}}\) in Fig. 3. We can see that U MA is unaffected by the size of the sample, essentially ignoring the possible errors in the confusion matrix on small samples. This reinforces our previous results showing that U MA is robust to errors in the confusion matrix. On the other hand, with the addition of more samples, the refinement of the confusion matrix does not allow U MA to compete with the value of additional (correctly) labelled data and eventually, when the size of \({\mathcal {S}}_{\text {conf}}\) grows, \(f_{\text {conf}}\) performs better than U MA. This points towards the idea that the aggregated nature of the confusion matrix incurs some loss of relevant information for the classification task at hand, and that a more accurate estimate of the confusion matrix, as induced by, e.g., the use of larger \({\mathcal {S}}_{\text {conf}}\), may not compensate for the information provided by additional raw data.

Building on this observation, we go a step further and replicate this experiment for all three datasets; only this time we track the performances of \(f_{{{\mathtt{S3VM}}}}\) instead. The results are plotted in Fig. 4. For the three datasets, we observe the same behavior as before. Namely, U MA is able to maintain a low error rate even with a very small size of \({\mathcal {S}}_{\text {conf}}\). On the other hand, U MA does not benefit as much as other methods from a large pool of labelled examples. In this case, U MA quickly stabilizes while, on the contrary, the S3VM method starts at a fairly high error rate and keeps improving as more labelled examples become available.
Fig. 4

Error rates for the reuters (left), optical digit recognition (center) and letter (right) datasets with respect to the size of \({\mathcal {S}}_{\text {conf}}\). Average over 15 runs

Beyond this, it is important to recall that U MA never uses the labels of \({\mathcal {S}}_{\text {conf}}\) (those are only used to estimate the confusion matrix, not the classifier—refer to Sect. 4.2.1 for the detailed learning protocol). While refining the estimation of \(C\) is undoubtedly useful, a direction toward substantial performance gains should revolve around the combination of both this refined estimation of \(C\) and the use of the correctly labelled training set \({\mathcal {S}}_{\text {conf}}\). This is a research subject on its own that we leave for future work.
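To make the role of the labels of \({\mathcal {S}}_{\text {conf}}\) concrete, the following is a minimal sketch of how a confusion matrix could be estimated from a small set of examples for which both a trusted label and an observed (possibly noisy) label are available. It is not the protocol of Sect. 4.2.1: the function name `estimate_confusion`, the toy labels, and the convention that entry \((p,q)\) estimates the probability of observing label \(p\) when the true class is \(q\) are all assumptions made for illustration (transpose the result if the opposite convention is needed).

```python
import numpy as np

def estimate_confusion(true_labels, observed_labels, n_classes):
    """Column-normalised count matrix (hypothetical helper).

    Entry (p, q) estimates P(observed label = p | true label = q).
    This indexing convention is an assumption of the sketch.
    """
    true_labels = np.asarray(true_labels, dtype=int)
    observed_labels = np.asarray(observed_labels, dtype=int)
    counts = np.zeros((n_classes, n_classes))
    # np.add.at accumulates correctly when the same (p, q) pair repeats.
    np.add.at(counts, (observed_labels, true_labels), 1.0)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Columns of classes never seen as a true label stay at zero
    # instead of triggering a division by zero.
    return np.divide(counts, col_sums,
                     out=np.zeros_like(counts), where=col_sums > 0)

# Toy data: six examples, three classes, two corrupted labels.
true_toy = [0, 0, 1, 1, 2, 2]
observed_toy = [0, 1, 1, 1, 2, 0]
C_hat = estimate_confusion(true_toy, observed_toy, n_classes=3)
print(C_hat)
```

With very few labelled examples the estimate is coarse (here each column rests on two examples only), which is the regime in which the text above reports that U MA still behaves well.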

All in all, the reported results advise us to prefer U MA over other available methods when the amount of labelled data is particularly small, in addition, obviously, to the motivating case of the present work where the training data are corrupted and the confusion matrix is known. Another interesting finding is that even a rough estimation of the confusion matrix is sufficient for U MA to behave well.

Finally, we investigate the impact of the selection strategy of \((p,q)\) on the convergence speed of U MA (see Sect. 3.5). We use three variations of U MA with different strategies for selecting \((p,q)\) (error, confusion, and random) and monitor each one along the learning process on the reuters dataset. The error and confusion strategies are described in Sect. 3.5 and the random strategy simply selects \(p\) and \(q\) at random.

From Fig. 5, which reports the misclassification rate and the confusion rate along the iterations, we observe that both performance measures evolve similarly, attaining a stable state around the 30th iteration. The best strategy depends on the performance measure used. Regardless of the measure, however, the random selection strategy never yields the best-performing predictor (there is always a curve beneath that of the random selection procedure), which shows that it is not an optimal selection strategy.
Fig. 5

Error and confusion risk on reuters dataset with various update strategies

As one might expect, the confusion-based strategy performs better than the error-based strategy when the confusion rate is retained as a performance measure, while the converse holds when using the error rate. This observation motivates us to thoroughly study the confusion-based strategy in the near future, as being able to propose methods robust to class imbalance is a particularly interesting challenge of multiclass classification.
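The reason why the two performance measures can rank strategies differently is easiest to see under class imbalance. The sketch below contrasts the plain error rate with the mean of the per-class error rates. The latter is only a stand-in, chosen for illustration, for a confusion-based measure; it is not claimed to be the confusion rate used in the experiments, and the function names and toy labels are hypothetical.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of misclassified examples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def mean_per_class_error(y_true, y_pred, n_classes):
    """Average of the per-class error rates (illustrative stand-in
    for a confusion-based measure; an assumption of this sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rates = []
    for q in range(n_classes):
        mask = y_true == q
        if mask.any():  # skip classes absent from y_true
            rates.append(np.mean(y_pred[mask] != q))
    return float(np.mean(rates))

# Imbalanced toy problem: nine examples of class 0, one of class 1,
# and a predictor that always answers 0.
y_true = np.array([0] * 9 + [1])
y_pred = np.zeros(10, dtype=int)

print(error_rate(y_true, y_pred))               # 0.1
print(mean_per_class_error(y_true, y_pred, 2))  # 0.5
```

The constant predictor looks good in terms of error rate but poor once every class is weighted equally, which is why a strategy tuned for one measure need not be the best for the other.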

The plateau reached around the 30th iteration may be puzzling, since the studied dataset presents no positive margin and convergence is therefore not guaranteed. One possible explanation is to see the reuters dataset as a linearly separable problem corrupted by a noise process, which we call the intrinsic noise process, that has structural features ‘compatible’ with the classification noise. By this, we mean that there must be features of the intrinsic noise such that, when additional classification noise is added, the resulting noise that characterizes the data is similar to a classification noise, or at least, to a noise that can be naturally handled by U MA. Finding out the family of noise processes that can be combined with the classification noise (or, more generally, the family of noise processes themselves) without hindering the effectiveness of U MA is one research direction that we aim to explore in the near future.

5 Conclusion

In this paper, we have proposed a new algorithm, U MA (for Unconfused Multiclass Additive algorithm), to cope with noisy training examples in multiclass linear problems. As its name indicates, it is a learning procedure that extends the (ultraconservative) additive multiclass algorithms proposed by Crammer and Singer (2003); to handle noisy datasets, it only requires the information about the confusion matrix that characterizes the mislabelling process. This is, to the best of our knowledge, the first time the confusion matrix is used as a way to handle noisy labels in multiclass problems.

One of the core ideas behind U MA, namely, the computation of the update vector \(\varvec{z}_{pq}\), is not tied to the additive update scheme. Thus, as long as the assumption of linear separability holds, the very same idea can be used to render a wide variety of algorithms robust to noise by iteratively generating a noise-free training set with the consecutive values of \(\varvec{z}_{pq}\). However, every computation of a new \(\varvec{z}_{pq}\) requires learning a new classifier to start with. This may eventually incur prohibitive computational costs when applied to batch methods (as opposed to online methods), which are designed to process the entirety of the dataset at once.2

U MA takes advantage of the online scheme of additive algorithms and avoids this problem completely. Moreover, additive algorithms are designed to directly handle multiclass problems rather than having recourse to a bi-class mapping. The end results of this are tightened theoretical guarantees and a convergence rate that does not depend on \(Q\), the number of classes. Besides, U MA can be directly used with any additive algorithm, which makes it possible to handle noise with multiple methods without further computational burden.

While we provide a sample complexity analysis, it should be noted that a tighter bound can be derived with specific multiclass tools, such as the Natarajan dimension (see Daniely et al. 2011 for example), which allows one to better specify the expressiveness of a multiclass classifier. However, this is not the main focus of this paper and our results are based on simpler tools.

To complement this work, we want to investigate a way to properly tackle near-linear problems (such as reuters). For now, the algorithm already does a very good job thanks to its noise robustness. However, more work has to be done to derive a proper way to handle cases where a perfect classifier does not exist. We think there are great avenues for interesting research in this domain with an algorithm like U MA, and we are curious to see how the present work may carry over to more general problems.

Footnotes

  1. For the sake of self-containedness, we use U MA for this task (with \(C\) being the identity matrix). Recall that, when used this way, U MA acts as a regular Perceptron algorithm.

  2. Nonetheless, from a purely theoretical point of view, U MA makes at most \(O(1/\theta ^2)\) mistakes (see Proposition 4) and computing \(\varvec{z}_{pq}\) can be done in \(O(n)\) time. Therefore, polynomial batch methods do not suffer much from this, as their overall execution time is still polynomial.

  3. Note that in some references the right-hand side of (24) might be viewed as a probability measure over \(m\) independent Rademacher variables.

Notes

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful and extremely valuable feedback on earlier versions of this paper. This work is partially supported by the Agence Nationale de la Recherche (ANR), project GRETA 12-BS02-004-01.

References

  1. Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
  2. Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml
  3. Bennett, K. P., & Demiriz, A. (1998). Semi-supervised support vector machines. In Advances in Neural Information Processing Systems, Vol. 11, Papers from Neural Information Processing Systems (NIPS) 1998 (pp. 368–374), Denver, CO, USA. http://papers.nips.cc/paper/1582-semi-supervised-support-vectormachines.
  4. Blanchard, G., & Zwald, L. (2008). Finite-dimensional projection for classification and statistical learning. IEEE Transactions on Information Theory, 54(9), 4169–4182.
  5. Block, H. (1962). The perceptron: A model for brain functioning. Reviews of Modern Physics, 34, 123–135.
  6. Blum, A., Frieze, A. M., Kannan, R., & Vempala, S. (1996). A polynomial-time algorithm for learning noisy linear threshold functions. In Proceedings of 37th IEEE symposium on foundations of computer science (pp. 330–338).
  7. Bylander, T. (1994). Learning linear threshold functions in the presence of classification noise. In Proceedings of 7th annual workshop on computational learning theory (pp. 340–347). New York, NY: ACM Press.
  8. Cohen, E. (1997). Learning noisy perceptrons by a perceptron in polynomial time. In Proceedings of 38th IEEE symposium on foundations of computer science (pp. 514–523).
  9. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.
  10. Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951–991.
  11. Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
  12. Daniely, A., Sabato, S., Ben-David, S., & Shalev-Shwartz, S. (2011). Multiclass learnability and the ERM principle. Journal of Machine Learning Research Proceedings Track, 19, 207–232.
  13. Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2005). The forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems, Vol. 18, Papers from Neural Information Processing Systems (NIPS) 2005 (pp. 259–266), Vancouver, BC, Canada. http://papers.nips.cc/paper/2806-the-forgetron-a-kernel-basedperceptron-on-a-fixed-budget.
  14. Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer.
  15. Drineas, P., Kannan, R., & Mahoney, M. W. (2006). Fast Monte Carlo algorithms for matrices II: Computing a low rank approximation to a matrix. SIAM Journal on Computing, 36(1), 158–183.
  16. Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.
  17. Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
  18. Friess, T., Cristianini, N., & Campbell, N. (1998). The kernel-adatron algorithm: A fast and simple learning procedure for support vector machines. In J. Shavlik (Ed.), Machine learning: Proceedings of the 15th international conference. Morgan Kaufmann Publishers.
  19. Kakade, S. M., Shalev-Shwartz, S., & Tewari, A. (2008). Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th international conference on machine learning, ICML ’08 (pp. 440–447). New York, NY: ACM.
  20. Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
  21. Louche, U., & Ralaivola, L. (2013). Unconfused ultraconservative multiclass algorithms. In JMLR workshop & conference proceedings 29 (Proceedings of ACML 13) (pp. 309–324).
  22. Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. Cambridge: MIT Press.
  23. Novikoff, A. (1963). On convergence proofs for perceptrons. In Proceedings of the symposium on the mathematical theory of automata (Vol. 12, pp. 615–622).
  24. Ralaivola, L. (2012). Confusion-based online learning and a passive-aggressive scheme. In NIPS (pp. 3293–3301).
  25. Ralaivola, L., Favre, B., Gotab, P., Bechet, F., & Damnati, G. (2011). Applying multiclass bandit algorithms to call-type classification. In ASRU (pp. 431–436).
  26. Schölkopf, B., & Smola, A. J. (2002). Learning with kernels, support vector machines, regularization, optimization and beyond. MIT Press. http://www.learning-with-kernels.org
  27. Stempfel, G., & Ralaivola, L. (2007). Learning kernel perceptron on noisy data and random projections. In Proceedings of algorithmic learning theory (ALT 07).
  28. Takerkart, S., & Ralaivola, L. (2011). MKPM: A multiclass extension to the kernel projection machine. In CVPR (pp. 2785–2791). http://dblp.uni-trier.de/db/conf/cvpr/cvpr2011.html#TakerkartR11
  29. Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
  30. Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, Vol. 13, Papers from Neural Information Processing Systems (NIPS) 2000 (pp. 682–688), Denver, CO, USA. http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-upkernel-machines.

Copyright information

© The Author(s) 2015

Authors and Affiliations

  1. Qarma, Lab. d’Informatique Fondamentale de Marseille, CNRS, Université d’Aix-Marseille, Marseille, France
