Adaptive regularization of weight vectors
Abstract
We present AROW, an online learning algorithm for binary and multiclass problems that combines large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive mistake bounds for the binary and multiclass settings that are similar in form to the second-order perceptron bound. Our bounds do not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques. Empirical evaluations show that AROW achieves state-of-the-art performance on a wide range of binary and multiclass tasks, as well as robustness in the face of non-separable data.
Keywords
Online learning · Supervised learning · Text classification · Adaptive regularization

1 Introduction
Online learning algorithms are fast and simple, make few statistical assumptions, and perform well in a wide variety of settings. The Perceptron algorithm is perhaps the oldest online machine learning algorithm, tracing its origins back to the 1950s. The Perceptron, which uses a gradual additive update based on stochastic gradient descent, has been supported by numerous mistake bound analyses (Littlestone 1988). Despite its age, it is still widely used for modern problems, including complex structured learning tasks (Collins 2002).
While popular, the Perceptron suffers from several well-known problems. First, while guaranteed to converge on linearly separable data sets, convergence can be slow, often taking many iterations over the data set. This slow convergence is a consequence of non-aggressive updates, which take a fixed step towards a better solution but offer no guarantees about the improvement on the current training example. In many difficult learning settings, even after an update, the Perceptron will still classify an example incorrectly. Similarly, the Perceptron is a strictly mistake-driven learning algorithm, meaning that correctly classified training examples with ambiguous scores (the correct label is only slightly preferred) trigger no update at all. A common solution to this problem is the Perceptron with margin, which also updates on correctly classified examples whose margin is too small; a hyperparameter controls how aggressive these updates are (Freund and Schapire 1999).
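To make the contrast concrete, a minimal sketch of the Perceptron-with-margin step just described (the `margin` and `lr` names are ours, purely for illustration):

```python
import numpy as np

def perceptron_margin_update(w, x, y, margin=1.0, lr=1.0):
    """One margin-Perceptron step: update whenever the signed score
    fails to exceed the margin, not only on outright mistakes.
    With margin=0 this reduces to the classic Perceptron."""
    if y * np.dot(w, x) <= margin:
        w = w + lr * y * x
    return w
```

Note the step size is fixed; nothing guarantees the example is classified correctly after the update, which is exactly the slow-convergence issue discussed above.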
A second common problem is the Perceptron's erratic behavior in high-noise settings. Since the Perceptron treats every example equally, examples with label noise still receive a full update. In contrast, batch algorithms, such as Support Vector Machines, can sacrifice accuracy on noisy examples in favor of improving performance on the majority of training examples through the use of slack variables. The result is that the Perceptron's weights oscillate, sometimes dramatically, as it constantly struggles with noisy labels. A number of strategies exist for addressing this problem, including those proposed by Krogh and Hertz (1992), Krogh (1992), Khardon and Wachman (2007), and the voted or averaged Perceptron proposed by Freund and Schapire (1999). This problem can be particularly dramatic in highly non-separable structured learning tasks, which, as a result, almost universally rely on the averaged Perceptron.
A variety of new algorithms have been proposed to address the shortcomings of the Perceptron. One is the Passive-Aggressive (PA) algorithm (Crammer et al. 2003, 2006), also known as MIRA, which is based on the same additive update form as the Perceptron. In PA, however, the strength of the update comes from a per-example learning rate that is the solution to a convex optimization problem: for each example, PA enforces a prediction margin based on the hinge loss, updating the algorithm's parameters accordingly. The result is that, after the update, the example is guaranteed to be classified correctly. The algorithm's name comes from this behavior: it aggressively updates on each example to enforce a margin and passively ignores examples that are already correctly predicted with a margin. The result is significantly faster convergence, and PA has been widely used in vision (Frome et al. 2007; Jie et al. 2010; Chechik et al. 2010), natural language processing (McDonald et al. 2005; Chiang et al. 2008), and bioinformatics (Bernal et al. 2007). Unfortunately, the aggressive nature of the updates has a significant negative impact on learning in the noisy setting, where incorrectly labeled examples force large parameter updates in order to satisfy the margin requirement. The standard solution relies on slack variables, which effectively clip the update to prevent dramatic parameter changes based on a single example.
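A minimal sketch of the PA step described above; the optional `C` argument illustrates the slack-variable clipping (the PA-I style mentioned as the standard remedy for noise), and all names are ours:

```python
import numpy as np

def pa_update(w, x, y, C=None):
    """Passive-Aggressive update with the hinge loss. With C=None this
    is the pure PA step, whose closed-form learning rate guarantees a
    margin of 1 on (x, y) after the update; a finite C clips the step,
    trading that guarantee for robustness to mislabeled examples."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w                      # passive: margin already achieved
    tau = loss / np.dot(x, x)         # per-example learning rate
    if C is not None:
        tau = min(tau, C)             # slack variable clips the update
    return w + tau * y * x
```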
Recently, research on online learning has returned to this issue of convergence, seeking even more aggressive learning algorithms. One effective source for this aggressiveness has been parameter confidence, which has been shown to effectively guide online learning. Parameter confidence is encoded using an additional set of variables that measure the algorithm's confidence in its current parameter estimates. These algorithms make larger updates to less confident parameters, and then increase the confidence of a parameter each time it is updated. It is also possible to model second-order feature interactions (Cesa-Bianchi et al. 2005; Ma et al. 2010). Tracking parameter confidence in this manner generally increases the rate of training convergence.
One implementation of this idea is Confidence-Weighted (CW) learning, which is based on the passive-aggressive update. CW maintains parameter confidence through a Gaussian distribution over linear classifier hypotheses, which is then used to control the direction and scale of parameter updates (Dredze et al. 2008; Crammer et al. 2012). Updates not only fix learning mistakes but also increase confidence. In many settings, CW has been shown to be significantly more aggressive than PA, leading to much faster convergence rates. In addition to formal guarantees in the mistake bound model (Littlestone 1988), CW learning has achieved state-of-the-art performance, as well as faster learning rates, on a variety of tasks.
However, the strict update criterion used by CW learning is very aggressive and can overfit (Crammer et al. 2008). As a result, the most popular versions of CW rely on approximate solutions that effectively regularize the update and improve results. However, current analyses of CW learning still assume that the data are separable. It is not immediately clear how to relax this assumption for noisy data.
In this paper, we introduce an algorithm that addresses the need for both faster convergence and resistance to training noise. The core idea is to maintain the formalization of parameter confidence and second-order feature interactions introduced by CW, but to forgo the aggressiveness of both CW and PA. Parameter confidence provides its own form of aggressiveness, so softening the margin requirement allows for robustness to training noise without sacrificing convergence speed. The resulting algorithm adaptively regularizes the prediction function upon seeing each new instance, making it robust to sudden changes in the classification function due to label noise. We call our algorithm AROW: Adaptive Regularization Of Weights. We emphasize that this approach is quite different from simply introducing slack variables, which merely modulate the strength of the update. Instead, we derive a completely new update rule that results in non-matching updates to the model parameters and confidence parameters. If we think of CW as enforcing statements of probability, then AROW can be seen as controlling model expectations. The result is an online learning algorithm that combines the attractive properties of large margin training, confidence weighting, and the capacity to handle non-separable data.
After deriving AROW, we provide a mistake bound analysis, similar in form to the second-order Perceptron bound, that does not assume data separability. Our previous work (Crammer et al. 2009b) focused on the binary case of AROW. In this work, we derive an additional binary version, detail a fuller analysis, extend the algorithm to the multiclass setting, and provide new empirical results. We demonstrate that, for clean data, AROW achieves performance similar to CW, already state of the art for many tasks, and that AROW maintains its convergence rate while significantly improving performance in the presence of label noise. We believe this second property will be of critical importance in many real-world applications.
The paper proceeds as follows. In Sect. 2 we give a brief introduction to confidence weighted online methods. In Sect. 3 we introduce AROW and derive updates for binary and multiclass settings, and in Sect. 4 we provide a theoretical analysis of its behavior. Section 5 contains empirical evaluations of AROW using a variety of binary and multiclass applications. We conclude with a discussion of related work in Sect. 6 and summarize our contributions in Sect. 7.
2 Confidence weighted online learning of linear classifiers
Recently Dredze, Crammer, and Pereira (Dredze et al. 2008; Crammer et al. 2008) proposed confidence-weighted (CW) learning, an algorithmic framework for online learning of classification problems. CW learning incorporates a notion of confidence in the current classifier by maintaining a Gaussian distribution over the weights; its mean is given by \({\boldsymbol{\mu}}\in\mathbb{R}^{d}\), and its covariance matrix by \(\varSigma\in\mathbb{R}^{d\times d}\). Intuitively, μ_p encodes the learner's knowledge of the weight for feature p, and Σ_{p,p} encodes its confidence in that weight. Small Σ_{p,p} indicates that the learner is certain that the true weight is near μ_p. The off-diagonal covariance terms Σ_{p,q} (p≠q) capture interactions between weights, though they are often unused in practice for reasons of efficiency (Ma et al. 2010).
In theory, a CW classifier labels an instance x by first drawing a parameter vector \({\boldsymbol{w}}\sim{\mathcal{N}} ({{\boldsymbol{\mu}}, \varSigma} )\) and then applying the prediction rule h_{\boldsymbol{w}}. In practice, however, it can be easier to simply use the average weight vector \(\operatorname{E} [{{\boldsymbol{w}}} ]={\boldsymbol{\mu}}\) to make predictions. This is similar to the approach taken by Bayes point machines (Herbrich et al. 2001), where a single weight vector is used to approximate a distribution. Furthermore, for binary classification, the prediction given by the mean weight vector turns out to be Bayes optimal.
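The two prediction modes just described can be sketched as follows (a toy illustration with our own example values, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # mean weight vector
Sigma = np.diag([0.5, 0.1])       # confidence (diagonal for simplicity)
x = np.array([2.0, 0.5])          # instance to label

# In theory: draw a classifier from the Gaussian, then predict.
w = rng.multivariate_normal(mu, Sigma)
y_sampled = np.sign(np.dot(w, x))

# In practice: predict with the mean weight vector E[w] = mu,
# which for binary classification is Bayes optimal.
y_mean = np.sign(np.dot(mu, x))
```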
Confidence-weighted algorithms have been shown to perform well in practice (Crammer et al. 2008; Dredze et al. 2008), but they suffer from several problems. First, the update is quite aggressive, forcing the probability of predicting each example correctly to be at least η > 1/2 regardless of the cost to the objective. This may cause severe overfitting when labels are noisy; indeed, current analyses of the CW algorithm assume that the data are linearly separable (Crammer et al. 2008). Second, CW methods are appropriate only for zero-one loss classification problems due to the form of the constraint in Eq. (2). It is not clear how to usefully generalize the CW approach to alternative loss functions or settings such as regression. In this work we address both shortcomings, developing a CW-like algorithm that copes effectively with label noise and generalizes the advantages of CW learning in an extensible way. We also present an analysis for the general non-separable case.
3 Adaptive regularization of weights
The objective in Eq. (3) balances three desires. First, the parameters should not change radically on each round, since the current parameters contain information about previous examples (first term). Second, the new mean parameters should predict the current example with low loss (second term). Finally, as we see more examples, our confidence in the parameters should grow (third term). Note that this objective is not simply the dualization of the CW constraint, but a new formulation inspired by the previous discussion.
3.1 Binary classification with the squared hinge loss
Claim
3.2 Binary classification with the hinge loss
Pseudocode for binary AROW with the hinge loss also appears in Fig. 1. Comparing this update to the squared hinge update, we observe that the update to the mean parameters μ takes a common additive form: μ_{t−1} plus a scalar multiple of Σ_{t−1}x. The difference lies in the exact value of the scalar (compare Eq. (16) with Eq. (23)).
Finally, we note from Eq. (5) and Eq. (6) that the update for the covariance Σ_t is performed using the same rule as in the squared hinge case, given in Eq. (15).
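Putting the pieces together, a sketch of one binary AROW step with the squared hinge loss, using the standard closed form of the update (our variable names; r is the regularization parameter):

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """One binary AROW step (squared hinge). Both the mean and the
    confidence are updated, and the two updates intentionally do not
    match: the mean moves by alpha * y * Sigma x, while the confidence
    shrinks by beta * Sigma x x^T Sigma."""
    margin = y * np.dot(mu, x)        # signed margin under the mean
    v = x @ Sigma @ x                 # confidence along this instance
    if margin < 1.0:                  # hinge loss is positive: update
        beta = 1.0 / (v + r)
        alpha = max(0.0, 1.0 - margin) * beta
        sx = Sigma @ x
        mu = mu + alpha * y * sx              # mean update
        Sigma = Sigma - beta * np.outer(sx, sx)  # confidence update
    return mu, Sigma
```

Low-confidence directions (large v) yield large mean updates, while r softens the step so a single noisy label cannot force the margin at any cost.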
3.3 Multiclass classification
We now present two updates for the multiclass setting. The first is based on directly minimizing the hinge loss defined above (see, for example, Crammer and Singer 2003); the second update, motivated by the high time complexity of the first, is based on a top-1 reduction (e.g., see Collins 2002). We start with the update that minimizes the multiclass hinge loss.
4 Analysis
We start our analysis by showing that AROW can be combined with Mercer kernels (Mercer 1909) using the following representer theorem.
Lemma 1
(Representer Theorem)
Assume that Σ_0 = I and μ_0 = 0. The mean parameters μ_t and confidence parameters Σ_t produced by updating via Eq. (12) and Eq. (15) can be written as linear combinations of the input vectors (resp. outer products of the input vectors with themselves) with coefficients depending only on inner products of input vectors.
Proof
From these equations we see that \(\{\pi_{p,q}^{(t)}\}\) can be computed via inner products as long as β_t can, which is implied by Eq. (16). □
We now turn to analyzing the number of mistakes AROW makes in the binary case. Denote by \(\mathcal{M}\) the set of example indices for which the algorithm makes a mistake (that is, where y_t(μ_{t−1}⋅x_t) ≤ 0) and let \(M= \vert\mathcal{M}\vert\). Similarly, denote by \(\mathcal{U}\) the set of example indices for which there is an update but not a mistake (0 < y_t(μ_{t−1}⋅x_t) ≤ 1) and let \(U= \vert \mathcal{U} \vert\). The remaining examples, for which the algorithm attains a margin of at least one (1 < y_t(μ_{t−1}⋅x_t)), do not affect its behavior and can be ignored. We denote the sum of outer products over the mistakes by \(\mathbf{X}_{\mathcal{M}} = \sum_{t\in\mathcal{M}} {\boldsymbol{x}}_{t} {{\boldsymbol{x}}}^{\top}_{t}\), over the margin errors by \(\mathbf{X}_{\mathcal{U}} = \sum_{t\in\mathcal{U}} {\boldsymbol{x}}_{t}{{\boldsymbol{x}}}^{\top}_{t}\), and their sum by \(\mathbf{X}_{\mathcal{A}} = \mathbf{X}_{\mathcal{M}} + \mathbf{X}_{\mathcal{U}}\).
Theorem 1
Before turning to the proof we highlight a few properties of the bound.
Remark 1
The bound compares the number of mistakes the algorithm makes to the hinge loss of a reference vector. This asymmetry is typical of mistake bounds, e.g., those for the second-order perceptron (Cesa-Bianchi et al. 2005) and the Perceptron (Gentile 2003).
Remark 2
The two square root terms of the bound depend on r in opposite ways: the first is monotonically increasing, while the second is monotonically decreasing. One could expect to optimize the bound by minimizing over r. However, the bound also depends on r indirectly via other quantities (e.g., \(\mathbf{X}_{\mathcal{A}}\)), so there is no direct way to do so.
Remark 3
If all of the updates are associated with errors, that is, \(\mathcal{U}=\emptyset\), then the bound reduces to the bound of the second-order perceptron (Cesa-Bianchi et al. 2005). In general, however, the bounds are not comparable, since each depends on the actual run-time behavior of its algorithm.
Remark 4
Remark 5
The bound has a nontrivial dependency on the number of updates. If \({{\boldsymbol{u}}}^{\top}\mathbf{X}_{\mathcal{A}} {\boldsymbol {u}}\) is small, then making updates may reduce the bound, since it increases in \(\sqrt{U}\) and decreases in U.
Remark 6
Remark 7
We do not know of a bound for AROW with standard hinge loss, and leave it as an open problem.
We now prove the theorem. We first prove two auxiliary lemmas.
Lemma 2
The proof appears in the Appendix.
Lemma 3
Proof
We are now ready to prove Theorem 1.
Proof
4.1 Multiclass problems
In the multiclass setting, where there are K>2 possible labels, we analyze the top-1 version of AROW shown in Fig. 3, which reduces the many-way decision at each iteration to a binary choice between the true label and its current closest competitor.
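The reduction can be sketched as follows, using a 1-of-K feature encoding (our encoding choice for illustration; Δf_t is the paper's notation for the difference vector fed to the binary update):

```python
import numpy as np

def top1_difference(W, x, y_true):
    """Top-1 reduction: score all K labels under the current mean
    weights, find the highest-scoring label other than the true one,
    and return the difference vector Delta-f between the true label's
    1-of-K features and the competitor's. The binary AROW update is
    then applied to Delta-f as if it were a positive example."""
    K, d = W.shape
    scores = W @ x
    competitors = [k for k in range(K) if k != y_true]
    k_hat = max(competitors, key=lambda k: scores[k])  # closest competitor
    # Delta-f lives in R^{K*d}: +x in the true label's block,
    # -x in the competitor's block, zero elsewhere.
    delta_f = np.zeros(K * d)
    delta_f[y_true * d:(y_true + 1) * d] = x
    delta_f[k_hat * d:(k_hat + 1) * d] = -x
    return k_hat, delta_f
```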
We first state the analogue of Lemma 2.
Lemma 4
These lemmas suffice to prove the following mistake bound for AROW using the top-1 reduction from multiclass to binary classification. The proof exactly mirrors that of Theorem 1, replacing the vector y_t x_t with the vector Δf_t, and using Lemma 4 instead of Lemma 2.
Theorem 2
5 Empirical evaluation
Our empirical evaluation investigates the effectiveness of AROW as both a binary and multiclass classification algorithm. We consider how AROW performs compared with state-of-the-art online classification algorithms in both clean and noisy settings. We also consider several types of data: synthetic, binary document classification and digit recognition (OCR), and multiclass document classification.
5.1 Setup

Passive-Aggressive (PA) (Crammer et al. 2006): A large-margin online method that uses additive updates to enforce a fixed margin on each training example. Updates are made on margin violations (aggressive) but not otherwise (passive).

Second-Order Perceptron (SOP) (Cesa-Bianchi et al. 2005): An extension of the Perceptron algorithm that captures second-order information. It is similar to AROW, with the important distinction that it only updates on mistakes.

Confidence-Weighted (CW) learning (Dredze et al. 2008): Similar to PA, except that a distribution over weight vectors replaces the single weight vector hypothesis. CW is the inspiration for AROW and is discussed in detail in Sect. 2. We use the "variance" version developed by Dredze et al. (2008).
Since we consider high-dimensional datasets, it is computationally infeasible to model all second-order feature interactions for SOP, CW, and AROW. Instead, we drop cross-feature terms by projecting onto the set of diagonal matrices, following the approach of Dredze et al. (2008). While this may reduce performance, we make the same approximation for all evaluated algorithms. We found that this method performed similarly to other projection schemes (Crammer et al. 2008).
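Under the diagonal projection, the binary AROW step reduces to O(d) vector operations; a sketch under our own naming (the full-matrix variant is described in Sect. 3):

```python
import numpy as np

def arow_update_diag(mu, sigma_diag, x, y, r=1.0):
    """Diagonal AROW step: cross-feature terms of Sigma are dropped,
    so the confidence is a length-d vector rather than a d x d matrix
    and each step costs O(d) instead of O(d^2)."""
    margin = y * np.dot(mu, x)
    v = np.dot(sigma_diag * x, x)     # x^T Sigma x with diagonal Sigma
    if margin < 1.0:
        beta = 1.0 / (v + r)
        alpha = max(0.0, 1.0 - margin) * beta
        mu = mu + alpha * y * sigma_diag * x
        sigma_diag = sigma_diag - beta * (sigma_diag * x) ** 2
    return mu, sigma_diag
```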
All hyperparameters (including r for AROW) and the number of online iterations (up to 10) were optimized using a single randomized run. We used 2,000 instances from each dataset and report all results over 10-fold cross-validation unless otherwise noted.
5.2 Synthetic data
Our synthetic data experiments follow the setting of Crammer et al. (2008). We generated 5,000 training examples in \(\mathbb{R}^{20}\), where the first two coordinates were drawn from a 45° rotated Gaussian distribution with standard deviation 1. The remaining 18 coordinates were drawn from independent Gaussian distributions \({\mathcal{N}} ({0,2} )\). Each point's label depended on the first two coordinates using a separator parallel to the long axis of the ellipsoid, yielding a linearly separable set. (See Fig. 4 (left) for an illustration.) To evaluate performance degradation in noisy label settings, we randomly inverted the labels on 10 % of the training examples. Note that evaluation is still with respect to the correct labels, so we can evaluate the true error rate. Since our synthetic data is low dimensional, we consider the full second-order versions of CW, SOP and AROW, as well as the diagonalized versions. Algorithm parameters were tuned and results are reported over 100 runs.
5.2.1 Results
5.3 Binary classification

Amazon: This dataset contains product reviews from Amazon.com that are labeled with both a domain (e.g., books or music) and a star rating for the product (Blitzer et al. 2007). This data set is commonly used to evaluate multidomain sentiment classification. We used this data to create domain classification tasks, in which each task required the classification of a document into one of two domains. We took all pairs of the six domains to yield 15 tasks. Feature extraction follows Blitzer et al. (2007), representing each document as a set of bigram counts.
20 Newsgroups: 20 Newsgroups is a commonly used text classification dataset that includes approximately 20,000 newsgroup messages mined from 20 different newsgroups.^{1} The dataset is a popular choice for binary and multiclass text classification as well as unsupervised clustering. We used the version of the data with duplicates removed. Following common practice, we created binary problems of choosing between two similar groups:
comp: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware
sci: sci.electronics vs. sci.med
talk: talk.politics.guns vs. talk.politics.mideast
These distinctions involve neighboring categories, so they are fairly difficult to make. This yielded 3 tasks. Each message was represented as a binary bag-of-words, and there were between 1,850 and 1,971 instances per task.
Reuters (RCV1-v2/LYRL2004): The Reuters Corpus Volume 1 contains over 800,000 manually categorized newswire stories (Lewis et al. 2004). Labels describing the general topic, industry, and region of each article are provided. We created binary decision tasks by deciding between two industry labels, for a total of 3 tasks:
Insurance: Life (I82002) vs. Non-Life (I82003)
Business Services: Banking (I81000) vs. Financial (I83000)
Retail Distribution: Specialist Stores (I65400) vs. Mixed Retail (I65600)
Like 20 Newsgroups, these distinctions involve neighboring categories, so they are fairly hard to make. Details on document preparation and feature extraction are given by Lewis et al. (2004). For each problem we selected 2,000 instances using a bag-of-words representation with binary features.

Sentiment: Using the same Amazon product reviews data, the goal is to classify a product review as having either positive or negative sentiment. Feature representations are the same as described above. We created a separate binary task for each of the 6 domains, yielding 6 tasks.

Spam: The ECML/PKDD Spam Challenge^{2} provides spam and ham emails for traditional spam classification. Data is provided for different users in two different tasks. We selected three task A users and classify emails as spam or ham (3 tasks). The provided data is already represented as features using bagofwords.
In addition to binary document classification tasks, we also evaluate on two well-known digit recognition (OCR) tasks: MNIST^{3} and USPS. For each of the data sets, we created 45 binary all-pairs tasks, plus an additional 10 one-vs-all tasks from the MNIST data (100 tasks in total). For these experiments, we report results using standard training and test splits instead of 10-fold cross-validation.
For every data set, we introduce noise at various levels by randomly and independently flipping each binary label with a fixed probability (0, 0.05, 0.1, 0.15, 0.2, 0.3).
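The noise injection procedure can be sketched as follows (the random seed is our addition, for reproducibility):

```python
import numpy as np

def flip_labels(y, p, seed=0):
    """Independently flip each binary label in {-1, +1} with a fixed
    probability p, as in the noise experiments described above."""
    rng = np.random.default_rng(seed)
    flips = rng.random(len(y)) < p
    return np.where(flips, -y, y)
```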
5.3.1 Results
To capture the large number of results for the four algorithms on multiple tasks at varying noise levels, we summarize our results in two ways. First, we compute the mean rank of each algorithm on all of the tasks. That is, for each task, we rank the four algorithms according to their performance on that task, with a rank of 1 indicating an algorithm outperformed all others and a rank of 4 for the worst algorithm. We then average these ranks over all of the tasks and report the mean rank. While this obscures the raw accuracy differences between each algorithm, it indicates general trends in the results.
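The mean-rank summary can be computed as follows (a sketch; tie handling here is our simplification, with ties broken arbitrarily but consistently):

```python
import numpy as np

def mean_ranks(accuracy):
    """Mean rank per algorithm over tasks. `accuracy` is a
    (tasks x algorithms) array; on each task the best algorithm gets
    rank 1 and the worst rank K. Returns the average rank per column."""
    order = np.argsort(-accuracy, axis=1)   # columns sorted best-first per task
    ranks = np.empty_like(order)
    rows = np.arange(accuracy.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, accuracy.shape[1] + 1)  # invert the permutation
    return ranks.mean(axis=0)
```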
Mean rank (out of 4, over all datasets) at different noise levels. A rank of 1 on a task indicates that an algorithm outperformed all the others, while a rank of 2 indicates that it was the second best performing algorithm. These ranks are averaged over all tasks. With no noise, AROW is the best algorithm, outperforming CW on a narrow majority of the tasks. As the noise level increases, the difference between AROW and the other algorithms increases. Additionally, CW does worse and is eventually overtaken by PA
Algorithm  Noise level: 0.0  0.05  0.1  0.15  0.2  0.3
AROW  1.51  1.44  1.38  1.42  1.25  1.25 
CW  1.63  1.87  1.95  2.08  2.42  2.76 
PA  2.95  2.83  2.78  2.61  2.33  2.08 
SOP  3.91  3.87  3.89  3.89  4.00  3.91 
Real differences emerge as we consider noisier settings. As the noise level increases, CW gradually worsens relative to PA and AROW. At the 20 % noise level, CW and PA are comparable, and at the 30 % noise level, PA easily outranks CW. Across all of these settings, AROW improves with respect to the other methods.
Our second summarization of the results allows for direct comparison between pairs of algorithms. We present the paired scores for each task as a point in a scatter plot with the xaxis indicating the baseline accuracy and the yaxis indicating AROW accuracy. A line with a slope of 1 allows for a direct comparison; for points above the line, AROW obtains higher accuracy, while points below the line favor the baseline. We include scatter plots at three different noise levels (0 %, 10 % and 30 %); see Fig. 8.
5.4 Multiclass classification

Amazon: Using the Amazon data, we created two domain classification tasks from seven product domains: apparel, books, dvds, electronics, kitchen, music, video. In Amazon 7, we include all seven domains and in Amazon 3 we select the three most common: books, dvds, and music. Feature extraction is the same as above.

20 Newsgroups: We use all messages from the 20 newsgroups, classifying them into the newsgroups from which they originate. Feature extraction is the same as above.

Enron: The Enron email data set contains hundreds of thousands of email messages for over 100 users.^{4} The data set has been used for numerous classification tasks. We consider the task of automatically sorting emails into folders (Klimt and Yang 2004; Bekkerman et al. 2004). We selected two users with many email folders and messages: farmer-d (Enron A) and kaminski-v (Enron B). We used the ten largest folders for each user, excluding non-archival email folders such as "inbox," "deleted items," and "discussion threads." Emails were represented as binary bags-of-words with stop words removed.

NY Times: The New York Times Annotated Corpus contains 1.8 million articles that appeared from 1987 to 2007 (Sandhaus 2008). In addition to being one of the largest collections of raw news text, it is possibly the largest collection of publicly released annotated news text. Among other annotations, each article is labeled with the desk that produced the story (Financial, Sports, etc.) (NYTD), the online section to which the article was posted (NYTO), and the section in which the article was printed (NYTS). Articles were represented as bags-of-words with feature counts (stop words removed).

Reuters: We used the Reuters corpus described above along with the four general topic labels for topic classification: corporate, economic, government, and markets. Feature extraction follows the setup above.
A summary of the nine tasks, including the number of instances, features, and labels, and whether the numbers of examples in each class are balanced
Task  Instances  Features  Labels  Bal. 

20 News  18,828  252,115  20  Y 
Amazon 7  13,580  686,724  7  Y 
Amazon 3  7,000  494,481  3  Y 
Enron A  3,000  13,559  10  N 
Enron B  3,000  18,065  10  N 
NYTD  10,000  108,671  26  N 
NYTO  10,000  108,671  34  N 
NYTS  10,000  114,316  20  N 
Reuters  4,000  23,699  4  N 
5.4.1 Results

Perceptron: A multiclass Perceptron using a 1-of-k encoding.

PA: A PA classifier with a 1-of-k encoding using a single constraint. Crammer et al. (2009a) found that a single constraint did well on these tasks.

CW: A diagonal CW classifier with a single constraint.

AdaGrad: A diagonally regularized dual averaging (RDA) version with L_1 regularization (Duchi et al. 2011).
All experimental details (number of folds, parameter optimization, etc.) are the same as the binary experiments above. For multiclass data, we randomly select another label to replace the correct label in the case of label noise.
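The multiclass corruption just described can be sketched as (seed and names are ours):

```python
import numpy as np

def corrupt_multiclass(y, n_classes, p, seed=0):
    """With probability p, replace each label with a uniformly random
    *different* label, as in the multiclass noise experiments above."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    for i in range(len(y)):
        if rng.random() < p:
            choices = [k for k in range(n_classes) if k != y[i]]
            y[i] = rng.choice(choices)
    return y
```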
Accuracy on the multiclass data sets with no label noise
Data  Percep.  MIRA  CW  AdaGrad  AROW-H  AROW 1  AROW 5  AROW all
20 News  78.74  88.41  92.76  87.52  90.91  91.85  92.74  93.39 
Enron Kaminski  62.83  67.77  72.27  68.77  71.07  71.73  72.03  71.97 
Enron Farmer  75.8  82.83  83.07  79.7  83.53  84.33  84.83  83.6 
NYT Desk  77.7  81.22  82.43  74.7  81.19  81.74  81.97  81.98 
NYT Online  76.91  81.11  82.62  75.25  81.97  82.7  82.67  82.59 
NYT Section  48.53  56.42  54.62  54.69  56.71  55.49  56.05  56.65 
Reuters  92.25  93.4  92.4  88.45  93.73  93.35  92.95  92.95 
Sentiment 3  92.16  93.46  94.49  93.24  93.93  94.21  94.07  94.07 
Sentiment All  74.06  76.31  78.41  74.78  77.64  77.84  78.4  78.49 
Accuracy on the multiclass data sets with label noise of 0.1
Data  Percep.  MIRA  CW  AdaGrad  AROW-H  AROW 1  AROW 5  AROW all
20 News  66.77  85.27  88.77  86.68  88.08  86.31  88.3  89.33 
Enron Kaminski  55.43  68.07  67.03  67.37  68.8  69.07  68.93  69.27 
Enron Farmer  67.33  69.73  74.5  78.87  80.07  80.23  80.87  80.93 
NYT Desk  69.72  78.43  79.25  75.43  78.58  80.04  79.73  80.21 
NYT Online  69.9  78.59  79.46  74.8  79.61  80.35  80.58  80.75 
NYT Section  45.41  53.24  51.72  52.98  53.63  54.76  54.93  54.93 
Reuters  82.1  88.38  90.63  87.97  91.33  91.63  92.1  92.1 
Sentiment 3  86.41  92.19  93.43  93.09  93.87  93.49  93.53  93.53 
Sentiment All  69.34  74.71  77.54  74.46  75.42  75.42  75.86  76.19 
Accuracy on the multiclass data sets with label noise of 0.15
Data  Percep.  MIRA  CW  AdaGrad  AROW-H  AROW 1  AROW 5  AROW all
20 News  63.08  73.65  86.19  85.78  86.81  86.76  87.13  88.27 
Enron Kaminski  52.4  65.87  63.93  65.57  68.13  67.43  67.4  66.53 
Enron Farmer  64.07  76.93  70.9  78.4  78.83  79.5  80.23  78.6 
NYT Desk  67.09  77.46  77.63  75.59  77.92  78.83  78.8  79.46 
NYT Online  66.14  78.16  77.97  76.03  78.61  79.1  79.26  79.8 
NYT Section  42.6  54.11  50.12  52.51  54.84  52.98  52.66  53.59 
Reuters  75.8  88.33  88.92  88.85  91.02  91.33  91.8  91.8 
Sentiment 3  82.86  90.36  92.29  93.5  92.03  92.7  92.49  92.49 
Sentiment All  66.07  75.98  77.05  73.88  76.17  74.15  75.9  76.04 
Accuracy on the multiclass data sets with label noise of 0.2
Data  Percep.  MIRA  CW  AdaGrad  AROW-H  AROW 1  AROW 5  AROW all
20 News  56.51  78.02  82.89  84.34  86.1  84.76  83.73  86.45 
Enron Kaminski  48.8  63.6  60.1  64.2  64.07  65.47  65.33  66.3 
Enron Farmer  58.53  74.37  67.17  78.23  76.37  78.43  77.63  77.67 
NYT Desk  64.11  77.51  74.82  74.77  78.65  77.39  77.76  77.51 
NYT Online  63.11  77.9  76.21  75.29  77.33  78.03  76.97  78.09 
NYT Section  39.41  53.35  48.41  50.58  52.79  53.37  53.7  53.78 
Reuters  72.78  87.92  86.88  88.77  88.5  89.9  90.3  90.3 
Sentiment 3  80.39  92.59  91.76  93.29  93.27  92.74  92.8  92.8 
Sentiment All  62.45  74.9  75.72  73.61  75.09  75.13  75.65  75.46 
Accuracy on the multiclass data sets with label noise of 0.25
Data  Percep.  MIRA  CW  AdaGrad  AROW-H  AROW 1  AROW 5  AROW all
20 News  51.38  80.16  76.6  82.17  84.95  82.37  82.72  82.36 
Enron Kaminski  48.27  59.17  57.6  61.6  63.67  58.3  59.07  62.13 
Enron Farmer  53.57  73.1  61.93  76.07  75.37  77.5  76.67  77.8 
NYT Desk  58.28  74.11  74.2  74.44  76.65  77.07  75.87  75.68 
NYT Online  57.73  76.69  72.46  73.26  77.29  76.66  75.91  75.78 
NYT Section  36.98  47.13  46.63  49.31  50.01  50.84  50.63  50.74 
Reuters  68.17  87.38  82.78  88.67  89.0  89.2  88.77  88.77 
Sentiment 3  75.97  91.5  89.19  92.41  92.79  91.63  91.24  91.24 
Sentiment All  60.27  72.05  75.15  72.19  75.08  74.01  75.1  73.33 
Accuracy on the multiclass data sets with label noise of 0.5
Data  Label Noise = 0.5  

Percep.  MIRA  CW  AdaGrad  AROW-H  AROW 1  AROW 5  AROW all  
20 News  34.21  50.46  56.0  55.95  64.82  62.01  64.15  65.6 
Enron Kaminski  31.23  46.23  37.4  41.93  46.67  45.7  44.73  43.97 
Enron Farmer  38.2  58.27  41.07  54.33  58.83  58.7  58.17  58.53 
NYT Desk  37.98  56.41  57.04  56.81  64.69  63.64  59.21  62.28 
NYT Online  40.87  64.25  56.66  56.6  61.86  63.47  64.55  63.88 
NYT Section  27.27  37.02  34.47  30.17  38.42  37.73  38.89  39.04 
Reuters  47.38  75.58  58.93  73.32  75.5  75.2  75.73  75.73 
Sentiment 3  53.89  72.0  68.86  72.53  69.3  59.04  79.39  79.39 
Sentiment All  42.63  61.09  62.88  54.2  62.84  61.58  62.69  62.66 
5.5 Discussion
Online algorithm properties overview
Algorithm  Large Margin  Confidence  Non-Separable  Adaptive Margin 

Perceptron  No  No  Yes  No 
PA  Yes  No  Yes  No 
SOP  No  Yes  Yes  No 
CW  Yes  Yes  No  Yes 
AROW  Yes  Yes  Yes  No 
AROW-H  Yes  Yes  Yes  No 
AdaGrad  Yes  Yes  Yes  No 
Based on the results, it is clear that the combination of confidence information and large margin learning is powerful when label noise is low. CW easily outperforms the other baselines in such situations, as it has been shown to do in previous work. However, as noise increases, the separability assumption inherent in CW appears to reduce its performance considerably.
AROW, by combining the large margin and confidence weighting of CW with a soft update rule that accommodates non-separable data, matches CW’s performance in general while avoiding degradation under noise. AROW lacks the adaptive margin of CW, suggesting that this characteristic is not crucial to achieving strong performance. We note that AROW-H and AdaGrad have similar properties; however, we leave open for future work the possibility that an algorithm with all four properties might have unique advantages.
6 Related work
Online additive algorithms have a long history, from the Perceptron (Rosenblatt 1958) to more recent methods (Kivinen and Warmuth 1997; Crammer et al. 2006). Our update has a more general form in which the input vector x_i is linearly transformed using the covariance matrix, both rotating the input and assigning weight-specific learning rates.
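To make the "weight-specific learning rates" reading concrete, consider the following toy NumPy sketch (variable names and values are ours, purely illustrative): with a diagonal covariance matrix, the transformed input Σx reduces to an elementwise per-feature rescaling, while a full matrix would additionally rotate the input.

```python
import numpy as np

# Illustrative diagonal "confidence" matrix: one rate per feature.
Sigma = np.diag([2.0, 0.5])
x = np.array([1.0, 1.0])

direction = Sigma @ x              # covariance-transformed input
per_feature = np.diag(Sigma) * x   # elementwise per-feature rates

# For a diagonal Sigma the two coincide: each weight gets its own rate.
assert np.allclose(direction, per_feature)
```

A non-diagonal Σ would mix coordinates, so the update direction is no longer axis-aligned with the input; that is the "rotation" referred to above.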
Confidence weighted (CW) (Dredze et al. 2008; Crammer et al. 2008) algorithms, from which AROW was developed, update the mean and confidence parameters simultaneously, while AROW makes a decoupled update and softens the hard constraint of CW. The AROW algorithm can be seen as a variant of the PA-II algorithm by Crammer et al. (2006), where the regularization is modified according to the data. Additionally, future work might include developing a batch version of AROW, for instance, in the way the Gaussian Margin Machines of Crammer et al. (2009c) act as a batch version of CW. It might also be worthwhile to explore the performance of AROW with an (approximated) full covariance matrix, which has been shown to improve performance in some tasks (Ma et al. 2010).
AROW is perhaps most similar to the second-order Perceptron (SOP) (Cesa-Bianchi et al. 2005). SOP, CW, and AROW all maintain second-order information. SOP performs the same type of update as AROW, but only in the case of a true error. AROW, on the other hand, updates even when its prediction is correct, so long as there is insufficient margin. Furthermore, SOP uses the current example in the correlation matrix for prediction, while AROW updates after prediction. Fundamentally, CW and AROW have a probabilistic motivation, while the SOP is geometric, the idea being to replace the ball around an example with a refined ellipsoid. However, a variant of CW similar to SOP follows from our derivation if we set α_i = 1 in Eq. (16). Shivaswamy and Jebara (2007) have applied a similar motivation to batch learning.
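The update-timing contrast with SOP can be illustrated with a minimal NumPy sketch of an AROW-style binary update, written in our own notation rather than the paper's exact pseudocode (the names `arow_step`, `mu`, `Sigma`, and the regularization parameter `r` are ours): the parameters change whenever the margin is insufficient, not only on true errors.

```python
import numpy as np

def arow_step(mu, Sigma, x, y, r=1.0):
    """One AROW-style binary update (sketch). Updates whenever the
    margin y * (mu . x) is below 1 -- i.e. also on correctly
    classified examples with insufficient margin, unlike SOP,
    which updates only on true errors."""
    margin = y * (mu @ x)
    if margin >= 1.0:                        # confident correct prediction
        return mu, Sigma
    Sx = Sigma @ x
    beta = 1.0 / (x @ Sx + r)                # confidence-scaled step size
    alpha = (1.0 - margin) * beta            # larger loss -> larger step
    mu = mu + alpha * y * Sx                 # mean moves along Sigma @ x
    Sigma = Sigma - beta * np.outer(Sx, Sx)  # confidence grows (Sigma shrinks)
    return mu, Sigma

# Example: one positive example, starting from the zero vector.
mu, Sigma = arow_step(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), y=1)
# mu is now [0.5, 0.0]; Sigma's first diagonal entry shrinks to 0.5.
```

As noted in the text, replacing the loss-scaled step α_i with the constant 1 would instead yield an SOP-like variant of CW.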
The idea of using weight-specific variable learning rates has a long history in neural network learning (Sutton 1992), although we do not know of a previous model that specifically models confidence in a way that takes into account the frequency of features.
Ensemble learning shares the idea of combining multiple classifiers. Gaussian process classification (GPC) maintains a Gaussian distribution over weight vectors (primal) or over regressor values (dual). Our algorithm uses a different update criterion than the standard GPC Bayesian updates (Rasmussen and Williams 2006, Chap. 3), avoiding the challenge of approximating posteriors. Bayes point machines (Herbrich et al. 2001) maintain a collection of weight vectors consistent with the training data, and use the single linear classifier which best represents the collection. Conceptually, the collection is a nonparametric distribution over the weight vectors. Its online version (Harrington et al. 2003) maintains a set of weight vectors that are updated simultaneously. The relevance vector machine (Tipping 2001) incorporates probability into the dual formulation of SVMs. As in our work, the dual parameters are random variables distributed according to a diagonal Gaussian with example-specific variance. The weighted-majority algorithm (Littlestone and Warmuth 1994) and later improvements (Cesa-Bianchi et al. 1997) combine the output of multiple arbitrary classifiers, maintaining a multinomial distribution over the experts. In this work, we assume linear classifiers as experts and maintain a Gaussian distribution over their weight vectors.
With the growth of available data there is an increasing need for algorithms that process training data very efficiently. A similar approach to ours is to train classifiers incrementally (Bordes and Bottou 2005). The extreme case is to use each example once, without repetitions, as in the multiplicative update method of Carvalho and Cohen (2006).
In Bayesian modeling, there are several existing approaches that use parameterized distributions over weight vectors. Borrowing concepts from support vector machines, Jaakkola et al. (1999) developed maximum entropy discrimination methods, which employ a generative model for each class. The models are specified by distributions over weights as well as margin thresholds, and the weights are learned using the maximum-entropy principle. In a more recent approach, Minka et al. (2009) proposed using additional virtual vectors to allow more expressive power beyond a Gaussian prior and posterior.
Passing the output of a linear model through a logistic function has a long history in the statistical literature, and is extensively covered in many textbooks (e.g. Hastie et al. 2001). Platt (1998) used similar ideas to convert the output of a support vector machine into probabilistic quantities.
Hazan (2006) described a framework for gradient descent algorithms with logarithmic regret in which a quantity similar to Σ_t plays an important role. Our algorithm differs in several ways. First, Hazan (2006) considered gradient algorithms, while we derive and analyze algorithms that directly solve an optimization problem. Second, we bound the loss directly, not the cumulative sum of regularization and loss. Third, the gradient algorithms perform a projection after making an update (not before), since the norm of the weight vector is kept bounded.
Since the conference version of this work was published, several algorithms related to CW and AROW have been proposed. Duchi et al. (2011) and McMahan and Streeter (2010) proposed replacing the standard Euclidean distance in stochastic gradient descent with general Mahalanobis distances defined by second-order feature information. Their analysis suggests a logarithmic regret under some conditions, similar to our bounds here. However, the precise forms of the bounds are not comparable in general. Recently, Orabona and Crammer (2010) proposed a framework for online learning that includes an algorithm similar to AROW as a special case. From a different perspective, Crammer and Lee (2010) proposed a “microscopic” view of learning, tracking individual weight vectors as opposed to just macroscopic quantities such as mean and covariance. Their update has a similar form to that of AROW (Eq. (16)), but with different rates.
Shivaswamy and Jebara (2010a, 2010b) proposed using second-order information in the batch setting, where an independent and identically distributed set of training examples is assumed. Their algorithm maximizes the (average) margin while also minimizing its variance. However, they do not maintain a distribution over weight vectors, and the probability space is induced using the distribution over training examples.
Finally, there have been several additional applications of AROW. Mejer and Crammer (2010) formulated a structured prediction learning algorithm based on CW, including different strategies for estimating confidence in a prediction label. These same ideas can be applied to AROW. Crammer (2010) applied CW to the common speech task of phone recognition, which might likewise benefit from AROW due to inherent noise. Saha et al. (2011) developed a multitask online learning framework based on the AROW objective. Finally, AROW and the idea of confidence have been used for detecting phishing URLs (Le et al. 2010) and for learning language models (Ha-Thuc and Cancedda 2011).
7 Summary
We have presented AROW, an online learning algorithm that improves performance in noisy settings. Building on previous work on Confidence Weighted learning, AROW combines several desirable properties of online learning algorithms: large margin training, confidence weighting, and the capacity to handle non-separable data. The result is an algorithm that outperforms existing online learning algorithms, especially in the presence of label noise. Empirically, these trends hold up on a number of binary and multiclass data sets. Additionally, we derive a mistake bound that does not assume separability. Finally, our results suggest that future research into an algorithm that maintains the benefits of AROW while also using an adaptive margin could lead to a new robust method with potentially even better performance.
Acknowledgements
Part of this research was done when all authors were affiliated with the department of information and computer science, the University of Pennsylvania. This work was partly supported by the Israeli Science Foundation grant ISF1567/10. This publication only reflects the authors’ views.
References
Bekkerman, R., McCallum, A., & Huang, G. (2004). Automatic categorization of email into folders: benchmark experiments on Enron and SRI corpora (Technical Report IR 418:1). Center for Intelligent Information Retrieval.
Bernal, A., Crammer, K., Hatzigeorgiou, A., & Pereira, F. (2007). Global discriminative learning for higher-accuracy computational gene prediction. PLoS Computational Biology, 3(3), e54.
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In ACL.
Bordes, A., & Bottou, L. (2005). The Huller: a simple and efficient online SVM. In LNAI: Vol. 3720. European Conference on Machine Learning (ECML).
Carvalho, V. R., & Cohen, W. W. (2006). Single-pass online learning: performance, voting schemes and online feature selection. In KDD 2006.
Censor, Y., & Zenios, S. (1997). Parallel optimization: theory, algorithms, and applications. New York: Oxford University Press.
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. Journal of the Association for Computing Machinery, 44(3), 427–485.
Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2005). A second-order perceptron algorithm. SIAM Journal on Computing, 34.
Chechik, G., Sharma, V., Shalit, U., & Bengio, S. (2010). Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11, 1109–1135.
Chiang, D., Marton, Y., & Resnik, P. (2008). Online large-margin training of syntactic and structural translation features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 224–233). Stroudsburg: Association for Computational Linguistics.
Collins, M. (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (Vol. 10, pp. 1–8). Stroudsburg: Association for Computational Linguistics.
Crammer, K. (2010). Efficient online learning with individual learning rates for phoneme sequence recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Crammer, K., & Lee, D. D. (2010). Learning via Gaussian herding. In Advances in Neural Information Processing Systems 24.
Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951–991.
Crammer, K., Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2003). Online passive-aggressive algorithms. In Advances in Neural Information Processing Systems 16.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.
Crammer, K., Dredze, M., & Pereira, F. (2008). Exact convex confidence-weighted learning. In Neural Information Processing Systems (NIPS).
Crammer, K., Dredze, M., & Kulesza, A. (2009a). Multi-class confidence weighted algorithms. In Empirical Methods in Natural Language Processing (EMNLP).
Crammer, K., Kulesza, A., & Dredze, M. (2009b). Adaptive regularization of weight vectors. In Advances in Neural Information Processing Systems 23 (pp. 414–422).
Crammer, K., Mohri, M., & Pereira, F. (2009c). Gaussian margin machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS).
Crammer, K., Dredze, M., & Pereira, F. (2012). Confidence-weighted linear classification for text categorization. Journal of Machine Learning Research.
Dredze, M., Crammer, K., & Pereira, F. (2008). Confidence-weighted linear classification. In International Conference on Machine Learning.
Duchi, J. C., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
Frome, A., Singer, Y., Sha, F., & Malik, J. (2007). Learning globally-consistent local distance functions for shape-based image retrieval and classification. In IEEE 11th International Conference on Computer Vision.
Gentile, C. (2003). The robustness of the p-norm algorithms. Machine Learning, 53(3), 265–299.
Ha-Thuc, V., & Cancedda, N. (2011). Confidence-weighted learning of factored discriminative language models. In Association for Computational Linguistics (ACL).
Harrington, E., Herbrich, R., Kivinen, J., Platt, J., & Williamson, R. (2003). Online Bayes point machines. In 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: data mining, inference, and prediction. Berlin: Springer.
Haykin, S. (1996). Adaptive filter theory. New York: Prentice Hall.
Hazan, E. (2006). Efficient algorithms for online convex optimization and their applications. PhD thesis, Princeton University.
Herbrich, R., Graepel, T., & Campbell, C. (2001). Bayes point machines. Journal of Machine Learning Research, 1, 245–279.
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination.
Jie, L., Orabona, F., & Caputo, B. (2010). An online framework for learning novel concepts over multiple cues. In H. Zha, R. Taniguchi, & S. Maybank (Eds.), Lecture Notes in Computer Science: Vol. 5994. Computer Vision – ACCV 2009 (pp. 269–280). Berlin: Springer. doi:10.1007/978-3-642-12307-8_25.
Khardon, R., & Wachman, G. (2007). Noise tolerant variants of the perceptron algorithm. Journal of Machine Learning Research, 8, 227–248.
Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–64.
Klimt, B., & Yang, Y. (2004). The Enron corpus: a new dataset for email classification research. In Machine Learning: ECML 2004 (pp. 217–226).
Krogh, A. (1992). Learning with noise in a linear perceptron. Journal of Physics A: Mathematical and General, 25, 1119.
Krogh, A., & Hertz, J. (1992). Generalization in a linear perceptron in the presence of noise. Journal of Physics A: Mathematical and General, 25, 1135.
Le, A., Markopoulou, A., & Faloutsos, M. (2010). PhishDef: URL names say it all. arXiv preprint arXiv:1009.2275.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2, 285–318.
Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
Ma, J., Kulesza, A., Crammer, K., Dredze, M., Saul, L., & Pereira, F. (2010). Exploiting feature covariance in high-dimensional online learning. In AISTATS.
McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 91–98). Stroudsburg: Association for Computational Linguistics.
McMahan, H. B., & Streeter, M. (2010). Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty-Third Annual Conference on Learning Theory.
Mejer, A., & Crammer, K. (2010). Confidence in structured prediction using confidence-weighted models. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10) (pp. 971–981). Stroudsburg: Association for Computational Linguistics. http://portal.acm.org/citation.cfm?id=1870658.1870753.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London A, 209, 415–446.
Minka, T. P., Xiang, R., & Qi, Y. A. (2009). Virtual vector machine for Bayesian online classification. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.
Orabona, F., & Crammer, K. (2010). New adaptive algorithms for online classification. In Advances in Neural Information Processing Systems 24.
Platt, J. C. (1998). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In P. Bartlett, B. Schölkopf, D. Schuurmans, & A. J. Smola (Eds.), Advances in Large Margin Classifiers. Cambridge: MIT Press.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge: MIT Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–407. (Reprinted in Neurocomputing, MIT Press, 1988.)
Saha, A., Daumé, H. III, & Venkatasubramanian, S. (2011). Online learning of multiple tasks and their relationships. In AISTATS.
Sandhaus, E. (2008). The New York Times annotated corpus. Philadelphia: Linguistic Data Consortium.
Shivaswamy, P., & Jebara, T. (2007). Ellipsoidal kernel machines. In Artificial Intelligence and Statistics (AISTATS).
Shivaswamy, P., & Jebara, T. (2010a). Empirical Bernstein boosting. In Y. Teh & M. Titterington (Eds.), JMLR W&CP: Vol. 9. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (pp. 733–740).
Shivaswamy, P. K., & Jebara, T. (2010b). Maximum relative margin and data-dependent regularization. Journal of Machine Learning Research, 11, 747–788.
Sutton, R. S. (1992). Adapting bias by gradient descent: an incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 171–176). Cambridge: MIT Press.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Vaits, N., & Crammer, K. (2011). Re-adapting the regularization of weights for non-stationary regression. In The 22nd International Conference on Algorithmic Learning Theory (ALT ’11).