Making Learners (More) Monotone

Learning performance can show non-monotonic behavior. That is, more data does not necessarily lead to better models, even on average. We propose three algorithms that take a supervised learning model and make it perform more monotone. We prove consistency and monotonicity with high probability, and evaluate the algorithms on scenarios where non-monotone behaviour occurs. Our proposed algorithm $\text{MT}_{\text{HT}}$ makes less than $1\%$ non-monotone decisions on MNIST while staying competitive in terms of error rate compared to several baselines.


Introduction
It is a widely held belief that more training data usually results in better generalizing machine learning models; see, for example, popular machine learning textbooks on the topic [1]. However, several learning problems have illustrated that more training data can lead to worse generalization performance [2,3,4]. For the peaking phenomenon [2], this occurs exactly at the transition from the underparametrized to the overparametrized regime. This phenomenon has regained interest in the machine learning community in the context of deep neural networks [5,6], since these models are typically overparametrized. Recently, several new examples have also been found where, in quite simple settings, more data results in worse generalization performance [7,8].
In practice, it would be very hard to explain why a machine learning model performs worse when more data, which is typically expensive to collect, has been used for training. Besides that point, it seems generally desirable to have algorithms that guarantee increased performance with more data. How can we get such a guarantee? This is the question we investigate in this work. We study it using the learning curve: a curve that plots the expected performance of a learning algorithm versus the amount of training data. In that context the question becomes: how can we make the learning curve monotonic?
The core requirement to make learners monotone is that, when more data is gathered and a new model is trained, this newly trained model should be compared to the older model that was trained on less data. Only if the new model performs better should it be used. We introduce several wrapper algorithms for supervised classification techniques that use the holdout set or cross-validation to do this comparison. Our proposed algorithm $\text{MT}_{\text{HT}}$ uses a hypothesis test and switches only if the new model improves significantly upon the old model. Using guarantees from the hypothesis test we can prove that the resulting learning curve is monotone with high probability. We empirically study the effect of the parameters of the algorithms and benchmark them on several datasets, including MNIST [9], to check to what degree the learning curves become monotone.
This paper is organized as follows. The setting and the concept of monotonicity of learning curves are reviewed in Section 2. We introduce the algorithms in Section 3, and prove consistency and monotonicity with high probability in Section 4. Section 5 provides the empirical evaluation. We discuss the main findings of our results in Section 6 and end with the most important conclusions.

The Setting and the Definition of Monotonicity
We consider the setting where we have a learner that now and then receives data and that is evaluated over time. The question is then how to make sure that the performance of this learner over time is monotone, or in other words, how can we guarantee that this learner improves its performance over time?
We analyze this question in a (frequentist) classification framework. We assume there exists an (unknown) distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space (features) and $\mathcal{Y}$ is the output space (classification labels). To simplify the setup we operate in rounds indicated by $i$, where $i \in \{1, \ldots, n\}$. In each round, we receive a batch of samples $S^i$ that is sampled i.i.d. from $P$. The learner can use this data in combination with data from previous rounds to come up with a hypothesis $h_i$ in round $i$. The hypothesis comes from a hypothesis space $\mathcal{H}$. We consider learners that, as a subroutine, use a supervised learner $A : \mathcal{S} \to \mathcal{H}$, where $\mathcal{S}$ is the space of all possible training sets.
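We measure performance by the error rate. The true error rate of a hypothesis $h_i$ on $P$ equals
$$\epsilon(h_i) = \mathbb{E}_{(x,y) \sim P}\left[ l_{0-1}(h_i(x), y) \right],$$
where $l_{0-1}$ is the zero-one loss. We indicate the empirical error rate of $h_i$ on a sample $S$ as $\hat{\epsilon}(h_i, S)$. We call $n$ rounds a run. All the $\epsilon_i = \epsilon(h_i)$ of a run form a learning curve. Averaging multiple runs gives the expected learning curve, $\bar{\epsilon}_i$.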
The goal for the learner is twofold. The error rates $\epsilon_i$ of the returned models should (1) be as small as possible, and (2) be monotonically decreasing. These goals may be at odds with one another; for example, always returning a fixed model ensures monotonicity but incurs large error rates. To measure (1), we summarize the performance of a learning curve using the Area Under the Learning Curve (AULC) [10,11,12]. The AULC averages all $\epsilon_i$ of a run. A low AULC indicates that a learner manages to quickly reduce the error rate.
Monotone in round $i$ means that $\epsilon_{i+1} \le \epsilon_i$. We may care about monotonicity of the expected learning curve or of individual learning curves. In practice, we may only get one chance to gather data and submit models. In that case, we would want to make sure that any additional data also leads to better performance during a single run. Therefore, we are mainly concerned with monotonicity of individual learning curves. We quantify monotonicity of a run by the fraction of non-monotone transitions in an individual curve.
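The two summary measures defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used for the experiments; the function names `aulc` and `fraction_non_monotone` and the example error rates are assumptions made for this sketch.

```python
from typing import Sequence


def aulc(errors: Sequence[float]) -> float:
    """Area under the learning curve: the mean error rate over all rounds of one run."""
    return sum(errors) / len(errors)


def fraction_non_monotone(errors: Sequence[float]) -> float:
    """Fraction of transitions i -> i+1 in which the error rate strictly increases."""
    transitions = list(zip(errors[:-1], errors[1:]))
    if not transitions:
        return 0.0  # a single round has no transitions
    worse = sum(1 for prev, nxt in transitions if nxt > prev)
    return worse / len(transitions)


# Hypothetical error rates of one run with n = 5 rounds.
run = [0.40, 0.30, 0.35, 0.25, 0.25]
print(aulc(run))
print(fraction_non_monotone(run))
```

In the example run only the transition from the second to the third round is non-monotone, so one out of four transitions is counted.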

Algorithms
We will introduce three algorithms that wrap around supervised learners with the aim of making them monotone. First, we provide some intuition on how to achieve this: ideally, during the generation of the learning curve, we would check whether $\epsilon(h_{i+1}) \le \epsilon(h_i)$. A fix to make a learner monotone would be to output $h_i$ instead of $h_{i+1}$ if the error rate of $h_{i+1}$ is larger. Since learners do not have access to $\epsilon(h_i)$, we have to estimate it using the incoming data. The first two algorithms, $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{HT}}$, use the holdout method to this end; newly arriving data is partitioned into training and validation sets. The third algorithm, $\text{MT}_{\text{CV}}$, makes use of cross validation.

$\text{MT}_{\text{SIMPLE}}$: Monotone Simple
The pseudo-code for $\text{MT}_{\text{SIMPLE}}$ is given by Algorithm 1 in combination with the function UpdateSimple (see Figure 1). We call this algorithm $\text{MT}_{\text{SIMPLE}}$ because the model selection is a bit naive: for small validation sets, the variance in the performance measure could be quite large, leading to many non-monotone decisions. In the limit of infinitely large $S^i_v$, however, this algorithm should always be monotone (and very data hungry).
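The holdout-based wrapper can be sketched as follows. This is a minimal sketch and not the authors' Algorithm 1: the names `mt_simple`, `error_rate` and `n_val`, the toy one-dimensional `Example` type, and the choice to take the first `n_val` samples of each batch as validation data are assumptions made for illustration. Switching on ties (using `<=`) is also an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[float, int]              # (feature, label); toy 1-D inputs (assumption)
Model = Callable[[float], int]           # a hypothesis maps a feature to a label
Learner = Callable[[Sequence[Example]], Model]


def error_rate(model: Model, sample: Sequence[Example]) -> float:
    """Empirical zero-one error of a model on a sample."""
    return sum(model(x) != y for x, y in sample) / len(sample)


def mt_simple(batches: Sequence[Sequence[Example]],
              train: Learner, n_val: int) -> List[Model]:
    """Return one model per round, switching only if the new model is not worse."""
    train_set: List[Example] = []
    best: Optional[Model] = None
    returned: List[Model] = []
    for batch in batches:
        # Split the incoming batch into validation and training parts.
        val, new_train = list(batch[:n_val]), list(batch[n_val:])
        train_set.extend(new_train)
        candidate = train(train_set)
        # Paired comparison: both models are evaluated on the same validation batch.
        if best is None or error_rate(candidate, val) <= error_rate(best, val):
            best = candidate
        returned.append(best)
        # After the comparison the validation data may join the training set.
        train_set.extend(val)
    return returned
```

Any callable that maps a list of examples to a model can be passed as `train`; the wrapper itself never inspects how the model was fitted.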

$\text{MT}_{\text{HT}}$: Monotone Hypothesis Test
The second algorithm, $\text{MT}_{\text{HT}}$, aims to resolve the issues of $\text{MT}_{\text{SIMPLE}}$ with small validation set sizes. In addition, for this algorithm, we will later prove that individual learning curves are monotone with high probability. The choice of hypothesis test depends on the performance measure. For the error rate, McNemar's test can be used (see the experimental setup for more details) [13,14]. For the hypothesis test, there are several requirements: it should use paired data, since we evaluate two models on one sample, and it should be one-tailed. It should be one-tailed because we only want to know whether $h_i$ is better than $h_{\text{best}}$ (a two-tailed test would switch to $h_i$ if its performance is significantly different, which is not what we want). Thus we have two hypotheses:
$$H_0: \epsilon(h_i) = \epsilon(h_{\text{best}}), \qquad H_1: \epsilon(h_i) < \epsilon(h_{\text{best}}).$$
We judge significance using the p-value: the probability of observing a sample at least as extreme, given hypothesis $H_0$. The smaller the p-value, the more evidence we have for $H_1$. The significance level $\alpha \in (0, \frac{1}{2}]$ indicates the threshold. If the p-value is smaller than $\alpha$, we accept $H_1$, and thus we update the model $h_{\text{best}}$. The smaller $\alpha$, the more conservative the hypothesis test, and thus the smaller the chance that a wrong decision is made due to unlucky sampling. More precisely, most hypothesis tests satisfy that the False Positive Rate (FPR, or the probability of making a Type I error) is bounded: $P(p \le \alpha \mid H_0) \le \alpha$.
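A one-tailed, paired test of this kind can be sketched with the exact (binomial) form of McNemar's test. The code below is an illustrative sketch under that assumption, not the authors' implementation; the names `mcnemar_exact_one_sided` and `update_ht` are hypothetical. Only discordant pairs, where exactly one of the two models is correct, carry information; under $H_0$ each discordant pair favours either model with probability one half.

```python
from math import comb
from typing import Sequence


def mcnemar_exact_one_sided(new_correct: Sequence[bool],
                            old_correct: Sequence[bool]) -> float:
    """p-value for H1: the new model has a lower error rate than the old one.

    Both sequences refer to the same validation samples (paired data).
    """
    pairs = list(zip(new_correct, old_correct))
    b = sum(1 for n, o in pairs if n and not o)   # only the new model is correct
    c = sum(1 for n, o in pairs if o and not n)   # only the old model is correct
    m = b + c
    if m == 0:
        return 1.0  # no discordant pairs: no evidence for H1
    # Under H0, b ~ Binomial(m, 1/2); the p-value is P(X >= b).
    return sum(comb(m, k) for k in range(b, m + 1)) / 2 ** m


def update_ht(p_value: float, alpha: float) -> bool:
    """Switch to the new model only if the test rejects H0 at level alpha."""
    return p_value <= alpha


# Hypothetical validation batch: 5 pairs favour the new model, 2 favour the old one.
new_ok = [True] * 6 + [False] * 2
old_ok = [False] * 5 + [True] * 3
p = mcnemar_exact_one_sided(new_ok, old_ok)
print(p, update_ht(p, alpha=0.05))
```

In this example the new model wins five of the seven discordant pairs, which is not enough evidence to switch at $\alpha = 0.05$.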

$\text{MT}_{\text{CV}}$: Monotone Cross Validation
In practice, $K$-fold cross validation (CV) is often used to estimate model performance instead of the holdout. This is what $\text{MT}_{\text{CV}}$ does; otherwise it is similar to $\text{MT}_{\text{SIMPLE}}$. As described in Algorithm 2, for each incoming sample an index in $I$ records the fold to which it belongs. These indices are used to generate the folds for the $K$-fold cross validation.
During CV, $K$ models are trained and evaluated on the validation sets. We now have to memorize the $K$ previously best models, one for each fold. We average the performance of the newly trained models over the $K$ folds, and compare that to the average of the best previous $K$ models. This averaging over folds is essential, as it reduces the variance of the model selection step compared to $\text{MT}_{\text{SIMPLE}}$. As with $\text{MT}_{\text{SIMPLE}}$, paired samples are used for the comparison.
After the comparison we know which training size was better. Our framework requires us to return a single model in each iteration. We choose to return the model with the optimal training set size that performed best during CV, as this may further improve the performance.
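The cross-validation variant can be sketched as below. This is an illustrative sketch and not a transcription of Algorithm 2: fold indices are assigned round-robin instead of by stratified sampling, each batch is assumed to contain at least `k` samples, ties lead to a switch, and the names `mt_cv`, `mean_error` and `error_rate` are hypothetical. Because every sample keeps its fold index forever, an older model for fold `f` has never been trained on any sample that later appears in the validation set of fold `f`.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[float, int]              # (feature, label); toy 1-D inputs (assumption)
Model = Callable[[float], int]
Learner = Callable[[Sequence[Example]], Model]


def error_rate(model: Model, sample: Sequence[Example]) -> float:
    """Empirical zero-one error of a model on a sample."""
    return sum(model(x) != y for x, y in sample) / len(sample)


def mean_error(models: Sequence[Model], val_sets: Sequence[Sequence[Example]]) -> float:
    """Average validation error over folds; model f is scored on fold f."""
    return sum(error_rate(m, v) for m, v in zip(models, val_sets)) / len(models)


def mt_cv(batches: Sequence[Sequence[Example]], train: Learner, k: int) -> List[Model]:
    samples: List[Example] = []
    folds: List[int] = []                # persistent fold index of every sample
    best_models: Optional[List[Model]] = None
    returned: List[Model] = []
    for batch in batches:
        # Round-robin fold assignment (simplification of stratified indices).
        folds.extend((len(samples) + j) % k for j in range(len(batch)))
        samples.extend(batch)
        val_sets = [[s for s, f in zip(samples, folds) if f == fold]
                    for fold in range(k)]
        new_models = [train([s for s, f in zip(samples, folds) if f != fold])
                      for fold in range(k)]
        # Compare the fold-averaged errors of the new and the stored models.
        if (best_models is None or
                mean_error(new_models, val_sets) <= mean_error(best_models, val_sets)):
            best_models = new_models
        # Return the single fold model with the lowest validation error.
        errs = [error_rate(m, v) for m, v in zip(best_models, val_sets)]
        returned.append(best_models[errs.index(min(errs))])
    return returned
```

Retraining all `k` fold models in every round makes this wrapper roughly `k` times as expensive as the holdout variants.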

Theoretical Analysis
In this section we derive the probability of a monotone learning curve for the algorithms $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{HT}}$, and we prove that all algorithms are consistent as long as they are guaranteed to update the model often enough.
Theorem 1 Assume that the hypothesis test HT satisfies $P(p \le \alpha \mid H_0) \le \alpha$. Then running Algorithm $\text{MT}_{\text{HT}}$ with parameter $\alpha$ guarantees that the individual learning curve of $n$ rounds is monotone with probability at least $(1 - \alpha)^n$.
Proof The probability of making a non-monotone decision in round $i$ is at most $\alpha$; this is guaranteed by the hypothesis test. To see this, assume $\epsilon(h_i) \ge \epsilon(h_{\text{best}})$. Let $p$ be the p-value returned by HT, as before. The probability of accepting $H_1$ is larger if $H_1$ is true: $P(p \le \alpha \mid H_1) \ge P(p \le \alpha \mid H_0)$. From this it should be clear that if $\epsilon(h_i) \ge \epsilon(h_{\text{best}})$, the probability of accepting will be even smaller: $P(p \le \alpha \mid \epsilon(h_i) \ge \epsilon(h_{\text{best}})) \le P(p \le \alpha \mid H_0) \le \alpha$. In the worst case, $\epsilon(h_i) \ge \epsilon(h_{\text{best}})$ holds every round. Note that these guarantees on the probability of failure hold for any models $h_i$, $h_{\text{best}}$ and regardless of anything that happened before round $i$. Since the $S^i_v$ are independent samples, being non-monotone in each round can be seen as independent events. Each round is therefore monotone with probability at least $1 - \alpha$, and multiplying these probabilities over the $n$ rounds results in $(1 - \alpha)^n$.
If the learning curve should be monotone in all $n$ rounds with probability at least $\beta$, we may set $\alpha = 1 - \beta^{\frac{1}{n}}$ to fulfill this condition, since then $(1 - \alpha)^n = \beta$. Note that this analysis also holds for $\text{MT}_{\text{SIMPLE}}$, since running $\text{MT}_{\text{HT}}$ with $\alpha = \frac{1}{2}$ results in the same algorithm as $\text{MT}_{\text{SIMPLE}}$ if HT satisfies $P(p \le \alpha \mid H_0) \le \alpha$. We will now argue that all proposed algorithms are consistent under some conditions. First we revisit the definition of consistency as given by Shalev-Shwartz and Ben-David [1].
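To get a feeling for the numbers, the bound of Theorem 1 and the corresponding choice of $\alpha$ can be evaluated directly. The sketch below assumes that $\beta$ denotes the target probability that the whole run is monotone, which is the reading under which $\alpha = 1 - \beta^{1/n}$ follows from $(1 - \alpha)^n$; the function names are hypothetical.

```python
def monotone_probability_bound(alpha: float, n: int) -> float:
    """Lower bound from Theorem 1 on the probability that all n rounds are monotone."""
    return (1 - alpha) ** n


def alpha_for_target(beta: float, n: int) -> float:
    """Per-round level alpha such that (1 - alpha)**n equals the target beta."""
    return 1 - beta ** (1 / n)


n = 150  # number of rounds used for the peaking dataset in the experiments
print(monotone_probability_bound(0.05, n))   # guarantee for a fixed alpha = 0.05
print(alpha_for_target(0.95, n))             # alpha needed for a 95% monotone run
```

For long runs the bound becomes very loose at a fixed level: with $\alpha = 0.05$ and $n = 150$ it is below $0.001$, so a meaningful guarantee on the whole curve requires a much smaller per-round $\alpha$.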
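Definition 1 (Consistency [1]) Let $\mathcal{H}$ be the hypothesis class and let $A$ be the learner. Suppose that for all $\epsilon_{\text{excess}} \in (0, 1)$, for all distributions $D$ over $\mathcal{X} \times \mathcal{Y}$, and for all $\delta \in (0, 1)$, there exists an $n(\epsilon_{\text{excess}}, D, \delta)$ such that for all $m \ge n(\epsilon_{\text{excess}}, D, \delta)$, if $A$ is trained on a sample $S$ of size $m$, the following holds with probability (over the choice of $S$) at least $1 - \delta$:
$$\epsilon(A(S)) \le \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{\text{excess}}.$$
Then $A$ is said to be consistent.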
Before we can state the main result, we have to introduce a bit of notation. $U_i$ will indicate the event that the algorithm updates $h_{\text{best}}$ (or, in the case of $\text{MT}_{\text{CV}}$, that it updates the variable $b$). We use $H_i^{i+z}$ to indicate the event $\neg U_i \cap \neg U_{i+1} \cap \ldots \cap \neg U_{i+z}$, or in words, that in rounds $i$ to $i + z$ there has been no update. To fulfill consistency, we need that when the number of rounds grows to infinity, the probability of updating will be large enough. Then consistency of $A$ will make sure that $h_{\text{best}}$ has sufficiently low error. For this analysis it is assumed that the number of rounds of the algorithms is not fixed.

Theorem 2
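The algorithms $\text{MT}_{\text{SIMPLE}}$, $\text{MT}_{\text{HT}}$ and $\text{MT}_{\text{CV}}$ are consistent, if $A$ is consistent and if there exists a $C_z > 0$ such that for all $i$ we have $P(H_i^{i+z}) \le$ [bound missing in the source].
Proof Let $A$ be consistent with $n_A(\epsilon_{\text{excess}}, D, \delta)$ samples. Let us analyze round $i$, where $i$ is big enough such that $|S_t| > n_A(\epsilon_{\text{excess}}, D, \frac{\delta}{2})$. Assume that $h_{\text{best}}$ does not yet have sufficiently low error; otherwise the proof is trivial. Since $|S_t| > n_A(\epsilon_{\text{excess}}, D, \frac{\delta}{2})$, we have for any round $j \ge i$ that
$$\epsilon(h_j) \le \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{\text{excess}}$$
holds with probability of at least $1 - \frac{\delta}{2}$. Thus the algorithm should now update. The probability that we do not update in the next $z$ rounds is, by assumption, bounded by [bound missing in the source]. Thus the probability of not updating after $z$ more rounds is at most $\frac{\delta}{2}$, and we have a probability of $\frac{\delta}{2}$ that the model after updating is not good enough. Applying the union bound, we find that the probability of failure is at most $\delta$, as desired.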
A few remarks about the assumption. It tells us that an update becomes more and more likely the more consecutive rounds there have been without an update. This holds, for example, if there are enough rounds where the probability of an update is nonzero. A weaker but also sufficient assumption would be that $\forall i: \lim_{z \to \infty} P(H_i^{i+z}) = 0$.
For $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{CV}}$ the assumption is always satisfied, because these algorithms look directly at the mean error rate, and due to fluctuations in the sampling there is always a non-zero probability that $\hat{\epsilon}(h_i) \le \hat{\epsilon}(h_{\text{best}})$. However, for $\text{MT}_{\text{HT}}$ this may not always be satisfied. Especially if the validation batch size $N_v$ is small, the hypothesis test may not be able to detect small differences in error; the test then has zero power. If $N_v$ stays small, the power may stay zero even in future rounds, and then the learner is not consistent.
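The zero-power regime can be made concrete for an exact one-sided binomial (McNemar-type) test, which is assumed here for illustration. With $N_v$ validation samples, the most extreme outcome is that all $N_v$ pairs are discordant in favour of the new model, which gives the smallest attainable p-value $0.5^{N_v}$. If this value exceeds $\alpha$, no sample can ever trigger an update. The function names below are hypothetical.

```python
def min_p_value(n_val: int) -> float:
    """Smallest p-value an exact one-sided binomial test can return on n_val pairs."""
    return 0.5 ** n_val


def can_ever_update(n_val: int, alpha: float) -> bool:
    """Necessary (not sufficient) condition for the test to have non-zero power."""
    return min_p_value(n_val) <= alpha


def smallest_usable_n_val(alpha: float) -> int:
    """Smallest validation batch size for which an update is possible at all."""
    if not 0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 1/2]")
    n_val = 1
    while not can_ever_update(n_val, alpha):
        n_val += 1
    return n_val


print(can_ever_update(4, 0.05))      # four validation samples at alpha = 0.05
print(smallest_usable_n_val(0.05))
print(smallest_usable_n_val(0.01))
```

This condition is only necessary: even when it holds, the two models must disagree on enough validation samples for the test to reject in practice.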

Experiments
We evaluate $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{HT}}$ on artificial datasets to understand the influence of their parameters. Afterward we perform a benchmark where we also include $\text{MT}_{\text{CV}}$ and a baseline that uses validation data to tune the regularization strength. This last experiment is also performed on the MNIST dataset to get an impression of the practicality of the proposed algorithms. First we describe the experimental setup in more detail.

Experimental Setup
The peaking dataset [2] and dipping dataset [4] are artificial datasets that cause non-monotone behaviour. We use stratified sampling to obtain batches $S^i$ for the peaking and dipping datasets; for MNIST we use random sampling. For simplicity all batches have the same size. $N$ indicates the batch size, and $N_v$ and $N_t$ indicate the sizes of the validation and training sets.
As the model we use least squares classification [15,16]. This is ordinary linear least squares regression on the classification labels $\{-1, +1\}$ with an intercept. For MNIST, one-versus-all is used to train a multi-class model. In case there are fewer training samples than dimensions, the required inverse of the covariance matrix is ill-defined and we resort to the Moore-Penrose pseudo-inverse.
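A least squares classifier of this kind can be sketched with NumPy as follows. This is an illustrative sketch and not the authors' code: the function names are hypothetical, the unnormalized matrix $X^\top X$ stands in for the covariance estimate, and the optional `lam` term (ridge regularization by adding $\lambda I$) is applied to the intercept as well, which is an assumption.

```python
import numpy as np


def add_intercept(X: np.ndarray) -> np.ndarray:
    """Append a constant column so that the model includes an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_least_squares(X: np.ndarray, y: np.ndarray, lam: float = 0.0) -> np.ndarray:
    """Least squares regression on labels in {-1, +1}.

    The pseudo-inverse handles the case of fewer samples than dimensions,
    where the matrix Xb.T @ Xb is singular.
    """
    Xb = add_intercept(X)
    cov = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.pinv(cov) @ Xb.T @ y


def predict(w: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Classify by the sign of the regression output."""
    return np.where(add_intercept(X) @ w >= 0, 1, -1)


# Hypothetical 1-D toy data.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, 1, 1])
w = fit_least_squares(X, y)
print(predict(w, X))
```

With `lam = 0` and fewer samples than dimensions, the pseudo-inverse yields the minimum-norm solution that interpolates the training labels.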
Monotonicity is calculated as the fraction of non-monotone iterations per run. AULC is also calculated per run. We do 100 runs with different batches and average to reduce variation from the randomness in the batches. Each run uses a newly sampled test set consisting of 10000 samples. The test set is used to estimate the true error rate and is not accessible by any of the algorithms.
We evaluate $\text{MT}_{\text{SIMPLE}}$, $\text{MT}_{\text{HT}}$ and $\text{MT}_{\text{CV}}$ and several baselines. The standard learner just trains on all received data. A second baseline, $\lambda_S$, splits the data into training and validation sets like $\text{MT}_{\text{SIMPLE}}$ and uses the validation data to select the optimal $L_2$ regularization parameter $\lambda$ for the least squares classifier. Regularization is implemented by adding $\lambda I$ to the estimate of the covariance matrix.
Several versions of McNemar's test can be used to compare models [17,13,14]. We use McNemar's exact conditional test [17], since for this test all assumptions are satisfied, and as such $P(p \le \alpha \mid H_0) \le \alpha$ is guaranteed.
In the first experiment we investigate the influence of $N_v$ and $\alpha$ on the decisions of $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{HT}}$. A complicating factor is that if $N_v$ changes, not only the decisions change, but also the training set sizes, because $S_v$ is appended to the training set (see line 7 of Algorithm 1). This makes interpretation of the results difficult because decisions are then made in a different context. Therefore, for the first set of experiments, we do not add $S_v$ to the training sets, also not for the standard learner. For this set of experiments we use $N_t = 4$, $n = 150$, $d = 200$ for the peaking dataset, and we vary $\alpha$ and $N_v$.

Results
We perform a preliminary investigation of the algorithms $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{HT}}$ and the influence of the parameters $N_v$ and $\alpha$. We show several learning curves in Figures 2a and 2d. For small $N_v$ and $\alpha$ we observe that $\text{MT}_{\text{HT}}$ gets stuck: it does not switch models anymore. This indicates that the assumption required for consistency is indeed not satisfied.
In Figure 2b and Figure 2e we give a more complete picture of all tried hyperparameters in terms of the AULC. In Figure 2c and Figure 2f we plot the fraction of non-monotone decisions during a run. Observe that this is a log-log plot. In some cases zero non-monotone decisions were observed, and thus the log-log plot misses a value. This occurs, for example, if $\text{MT}_{\text{HT}}$ always sticks to the same model; then no non-monotone decisions will be made.
Results of the benchmark are shown in Figure 3. The AULC and fraction of non-monotone decisions are given in Table 1.

Discussion
First Experiment: Tuning $\alpha$ and $N_v$

As predicted, $\text{MT}_{\text{SIMPLE}}$ typically performs worse than $\text{MT}_{\text{HT}}$ in terms of AULC and monotonicity unless $N_v$ is very large. The variance in the estimate of the error rates on $S^i_v$ is so large that in most cases the algorithm doesn't switch to the correct model.
Larger $N_v$ typically leads to improved AULC. $\alpha \in [0.05, 0.1]$ seems to work best in terms of AULC for most values of $N_v$. If $\alpha$ is too small, $\text{MT}_{\text{HT}}$ can get stuck; if $\alpha$ is too large, it switches models too often and non-monotone behaviour occurs. If $\alpha \to \frac{1}{2}$, $\text{MT}_{\text{HT}}$ becomes increasingly similar to $\text{MT}_{\text{SIMPLE}}$, as predicted by the theory. The fraction of non-monotone decisions of $\text{MT}_{\text{HT}}$ is much lower than $\alpha$. This is in agreement with Theorem 1, but may indicate that the hypothesis test is rather pessimistic. The standard learner and $\text{MT}_{\text{SIMPLE}}$ often make non-monotone decisions; in some cases almost 50% of the decisions are non-monotone.

Second Experiment: Benchmark on Peaking, Dipping, MNIST
Interestingly, for the peaking and MNIST datasets any non-monotonicity in the expected learning curve completely disappears for $\lambda_S$, which tunes the regularization parameter. However, for the dipping dataset this is not the case; thus regularization may not always avoid non-monotone behaviour. Furthermore, the fraction of non-monotone decisions per run is largest for this learner.
For the dipping dataset $\text{MT}_{\text{CV}}$ has a large advantage in terms of AULC. We hypothesize that this is largely due to tie breaking and small training set sizes due to the 5 folds. Surprisingly, on the peaking dataset it seems to learn quite slowly. The expected learning curves of $\text{MT}_{\text{HT}}$ look better than those of $\text{MT}_{\text{SIMPLE}}$; however, in terms of AULC the difference is quite small.
Again the fraction of non-monotone decisions for $\text{MT}_{\text{HT}}$ per run is very small, as guaranteed. However, it is interesting to note that this does not always translate to monotonicity in the expected learning curve. For example, for peaking and dipping the expected curve doesn't seem entirely monotone. But $\text{MT}_{\text{CV}}$, which makes many non-monotone decisions per run, still seems to have a monotone expected learning curve. This does seem to indicate that monotonicity in individual curves and monotonicity in the expected curve are not necessarily related goals. This raises the question: under what conditions do we have monotonicity in the expected learning curve?

General Remarks
That the fraction of non-monotone decisions of $\text{MT}_{\text{HT}}$ is so much smaller than $\alpha$ may indicate that the hypothesis test is too pessimistic. Fagerland et al. [17] indicate that the asymptotic McNemar test may have more power. For this test the guarantee $P(p \le \alpha \mid H_0) \le \alpha$ can be violated, but in light of the monotonicity results we have obtained this may not be a problem in practice. The added power could further improve the AULC.
We would like to argue that possible inconsistency of $\text{MT}_{\text{HT}}$ is not so problematic. If one knows the desired error rate, this can be used to estimate a minimum $N_v$ that ensures the hypothesis test will not get stuck before reaching that error rate. Another way to get around this issue is to make the size $N_v$ dependent on $i$: if $N_v$ is monotonically increasing, this directly leads to consistency of $\text{MT}_{\text{HT}}$. It would be ideal if $N_v$ could somehow be automatically tuned to trade off sample size requirements, consistency and monotonicity. For future work we also intend to investigate how to combine $\text{MT}_{\text{HT}}$ and $\text{MT}_{\text{CV}}$, since for CV $N_v$ automatically grows, which also directly implies consistency.
We suspect that the peak in the feature curves that Belkin et al. [5] observe is due to the same peaking phenomenon as seen by Duin [2]. We therefore wonder whether optimal tuning of the regularization parameter also eliminates the double-descent curve, as Belkin et al. [5] call this behaviour, as it does in our setup.
Devroye et al. [18] conjectured that it would be impossible to construct a learner that is monotone as judged by the expected learning curve and that is also consistent. While our work does not disprove this conjecture, as we look at monotonicity of individual curves, some of us suspect this is a first step in that direction. First, however, we require a better understanding of the relation between monotonicity of individual curves and of the expected learning curve.

Conclusion
We have introduced three algorithms to make learners more monotone. We proved under which conditions the algorithms are consistent and we have shown for $\text{MT}_{\text{HT}}$ that the learning curve is monotone with high probability. If one cares only about monotonicity of the expected learning curve, $\text{MT}_{\text{SIMPLE}}$ with very large $N_v$ or $\text{MT}_{\text{CV}}$ may prove sufficient, as shown by our experiments. However, they come without any theoretical guarantees. If $N_v$ is small, or one desires monotonicity of individual learning curves (which is practically most relevant), $\text{MT}_{\text{HT}}$ is the right choice.
Our algorithms are a first step towards developing learners that, given more data, will improve their performance in expectation.

Figure 1: The algorithm combined with UpdateSimple gives $\text{MT}_{\text{SIMPLE}}$; the algorithm combined with UpdateHT gives $\text{MT}_{\text{HT}}$. Note that $\text{MT}_{\text{HT}}$ requires the additional input parameters $\alpha$ and HT, which are not needed by $\text{MT}_{\text{SIMPLE}}$.

Algorithm 2: $\text{MT}_{\text{CV}}$
input: $K$ folds, learner $A$, rounds $n$, batches $S^i$
$b \leftarrow 1$ // keeps track of best round
$S = \{\}$, $I = \{\}$
for $i = 1, \ldots, n$ do
    Generate stratified CV indices for $S^i$ and put in $I^i$. Each index in $I^i$ indicates to which validation fold the corresponding sample belongs.
    Append to $S$: $S \leftarrow [S; S^i]$
    Append to $I$: $I \leftarrow [I; I^i]$

Figure 2: Several experiments on the Peaking and Dipping datasets to investigate the influence of $N_v$ and $\alpha$ for $\text{MT}_{\text{SIMPLE}}$ and $\text{MT}_{\text{HT}}$.
The validation set $S^i_v$ is used to estimate the performance of $h_i$. At all times the algorithm stores the previously best performing model, $h_{\text{best}}$, and compares its performance to that of $h_i$. If the new model $h_i$ is better, it is returned in round $i$ and $h_{\text{best}}$ is updated; otherwise the algorithm returns $h_{\text{best}}$. In each iteration the performance estimate of $h_{\text{best}}$ is also updated (see line 2 in UpdateSimple) using $S^i_v$. Thus $h_i$ and $h_{\text{best}}$ are both compared on $S^i_v$, resulting in a more accurate comparison (because the comparison is paired). After the comparison $S^i_v$ can safely be added to the training set (line 7 of Algorithm 1).

Table 1: Results of the benchmark. SL is the Standard Learner. AULC is the Area Under the Learning Curve of the error rate. Fraction indicates the average fraction of non-monotone decisions during a single run. Standard deviation shown in (braces). Best monotonicity result is underlined.
A PREPRINT - NOVEMBER 26, 2019
It is strange that for the MNIST dataset the $\lambda_S$ learner starts lagging behind other learners for large sample sizes.