Abstract
Neural networks converge faster with the help of a smart batch selection strategy. In this regard, we propose AdaBoundary, a novel and simple adaptive batch selection algorithm that constructs an effective mini-batch according to the learning progress of the model. Our key idea is to exploit confusing samples for which the model cannot predict labels with high confidence. Thus, samples near the current decision boundary are considered to be the most effective for expediting convergence. Taking advantage of this design, AdaBoundary maintains its advantage across various degrees of training difficulty. We demonstrate the advantage of AdaBoundary through extensive experiments using CNNs on five benchmark data sets. AdaBoundary was shown to produce a relative improvement in test error of up to 31.80% over the baseline for a fixed wall-clock training time, thereby achieving a faster convergence speed.
Introduction
Deep neural networks (DNNs) have achieved remarkable performance in many fields, especially in computer vision and natural language processing (Goodfellow et al. 2016). Nevertheless, as data sets grow in size, the training step via stochastic gradient descent (SGD) based on mini-batches suffers from extremely high computational cost, which is mainly due to slow convergence. Common approaches for expediting convergence include SGD variants (Zeiler 2012; Kingma and Ba 2015), which maintain individual learning rates for parameters, and batch normalization (Ioffe and Szegedy 2015), which stabilizes the gradient variance.
Recently, considering the fact that not all samples have an equal impact on training, many studies have attempted to design sampling schemes based on sample importance (Wu et al. 2017; Fan et al. 2017; Katharopoulos and Fleuret 2018). Curriculum learning (Bengio et al. 2009), inspired by human learning, is one of the representative methods for speeding up training by gradually increasing the difficulty level of the training samples. In contrast, deep learning studies often focus on giving higher weights to harder samples throughout the entire training process. When the model requires many epochs to converge, it is known to converge faster with batches of hard samples than with randomly selected batches (Schaul et al. 2016; Loshchilov and Hutter 2016; Gao and Jojic 2017; Song et al. 2020). There are various criteria for judging the hardness of a sample, e.g., the rank of the loss computed in previous epochs (Loshchilov and Hutter 2016).
Here, a natural question arises: Does “hard” batch selection always speed up DNN training? Our answer is partially yes: it is helpful only when training an easy data set. According to our in-depth analysis, as demonstrated in Fig. 1a, the hardest samples in a hard data set (e.g., CIFAR-10) were too hard to learn. They were highly likely to bias the decision boundary towards themselves, as shown in Fig. 1b. On the other hand, in an easy data set (e.g., MNIST), the hardest samples, though only moderately hard, provided useful information for training. In practice, it has been reported that hard batch selection successfully sped up training only for the easy MNIST data set (Loshchilov and Hutter 2016; Gao and Jojic 2017), and our experiments presented in Sect. 5 also confirm the previous findings. This limitation calls for a new sampling scheme that supports both easy and hard data sets.
In this paper, we propose a novel and simple adaptive batch selection strategy, called AdaBoundary, that accelerates training and generalizes better to hard data sets. As opposed to existing hard batch selection, AdaBoundary selects the samples of the most appropriate difficulty, considering the learning progress of the model. The samples near the current decision boundary are selected with high probability, as shown in Fig. 2a. Intuitively speaking, the samples far from the decision boundary are not that helpful because they are either too hard or too easy: those on the incorrect side are too hard, and those on the correct side are too easy. This is why we regard the confusing samples around the decision boundary, which are moderately hard, as having the appropriate difficulty level.
Overall, the key idea of AdaBoundary is to use the distance of a sample to the decision boundary as the hardness of the sample. The beauty of this design is that it does not require human intervention: the current decision boundary is directly influenced by the learning progress of the model. The decision boundary of a DNN moves towards eliminating incorrect samples as the training step progresses, so the difficulty of the samples near the decision boundary gradually increases as the model learns. The decision boundary is thus continually updated to identify the confusing samples, as illustrated in Fig. 2b. This approach accelerates convergence by providing samples suited to the model at every SGD iteration, and it is less prone to overfitting.
We conducted extensive experiments to demonstrate the superiority of AdaBoundary. A popular convolutional neural network (CNN)^{Footnote 1} model was trained on five benchmark data sets for the image classification task. Compared with random batch selection, AdaBoundary produced a relative improvement in test error of up to \(31.80\%\) for a fixed wall-clock training time. Compared with the two state-of-the-art algorithms, online batch (Loshchilov and Hutter 2016) and active bias (Chang et al. 2017), it improved the test error by up to \(8.14\%\) and \(10.07\%\), respectively, within the same time frame. Moreover, AdaBoundary generalizes well across different gradient optimizers and CNN models.
Related work
There have been numerous attempts to understand which samples contribute the most during training. Curriculum learning (Bengio et al. 2009), inspired by the perceived way that humans and animals learn, first takes easy samples and then gradually increases the difficulty of samples using a manual method. Self-paced learning (Kumar et al. 2010) uses the prediction error to determine the easiness of samples in order to alleviate the limitation of curriculum learning. Both approaches assume that importance is determined by how easy the samples are. However, easiness does not sufficiently determine when a sample should be introduced to a learner (Gao and Jojic 2017).
Recently, Tsvetkov et al. (2016) used Bayesian optimization to optimize a curriculum for training dense, distributed word representations. Sachan and Xing (2016) emphasized that the right curriculum not only has to arrange data samples in order of difficulty, but must also introduce a small number of samples that are dissimilar to those previously seen. Shrivastava et al. (2016) proposed a hard-example mining method to eliminate several heuristics and hyperparameters commonly used to select hard examples. However, these algorithms are designed to support only a designated task, such as natural language processing or object detection. The neural data filter proposed by Fan et al. (2017) is orthogonal to our work because it aims to filter redundant samples from streaming data. As mentioned earlier, AdaBoundary generally follows the philosophy of curriculum learning because it exploits the samples of the most appropriate difficulty at the current training progress.
More closely related to adaptive batch selection, Loshchilov and Hutter (2016) stored the history of losses for previously seen samples and computed the sampling probability based on the loss rank. The probability of a sample being selected for the subsequent mini-batch decayed exponentially with its rank, which allowed the samples with low ranks (i.e., high losses) to be considered more frequently for the subsequent mini-batch. Gao and Jojic (2017)’s work is similar to that of Loshchilov and Hutter (2016), except that gradient norms are used instead of losses to compute the probability. In contrast to curriculum learning, both methods focus only on hard samples for training. Also, they ignore the difference in actual losses or gradient norms by transforming the values into ranks. Similar to our work, the usefulness of exploiting uncertain samples was witnessed by active bias (Chang et al. 2017), albeit for a different purpose. Their main contribution lies in producing a more accurate and robust model by choosing samples with high prediction variance, whereas ours lies in training faster by using confusing samples that have softmax distributions of low variance. According to our experiments presented in Sect. 5.1, the samples selected by active bias slowed down the convergence in training loss, though they reduced the generalization error.
To complement this survey, we mention work done to accelerate the optimization process of algorithms based on importance sampling. Needell et al. (2014) reweighted the obtained gradients by the inverse of their sampling probabilities to reduce the variance. Schmidt et al. (2015) biased the sampling toward the Lipschitz constant to quickly find the solution to a strongly convex optimization problem arising from the training of conditional random fields.
AdaBoundary Components
The main challenge for AdaBoundary is to evaluate how close a sample is to the decision boundary. In this section, we introduce a novel distance measure, and present a method for computing the sampling probability based on the measure.
Sample distance based on the softmax distribution
To evaluate the distance from a sample to the decision boundary, we note that the softmax distribution, i.e., the output of the softmax layer of a neural network, clearly indicates how confidently the model predicts the true label, as demonstrated in Fig. 3.
Let \(h(y \mid x_i; \theta^t)\) be the softmax distribution of a given sample \(x_i\) over \(y \in \{1,2,\ldots,k\}\) labels, where \(\theta^t\) is the parameter of a neural network at time t. Then, the distance from a sample \(x_i\) with the true label \(y_i\) to the decision boundary of the neural network with \(\theta^t\) is defined by the directional distance function,
\[
dist(x_i, y_i; \theta^t) = F(x_i, y_i; \theta^t) \cdot std\big(h(y \mid x_i; \theta^t)\big), \quad
F(x_i, y_i; \theta^t) = \left\{\begin{array}{ll} +1, & \text{if } \mathop{\mathrm{argmax}}_{y}\, h(y \mid x_i; \theta^t) = y_i\\ -1, & \text{otherwise.} \end{array}\right. \qquad (1)
\]
More specifically, the function consists of two terms related to the direction and the magnitude of the distance, determined by the model’s correctness and confidence, respectively. The correctness is determined by verifying whether the label with the highest probability matches the true label \(y_i\), and the confidence is computed as the standard deviation of the softmax distribution. Intuitively, the standard deviation is a good indicator of confidence because its value gets closer to zero when the model is confused.
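As an illustration, the directional distance described above can be sketched in a few lines of NumPy; the function name `boundary_distance` and the 4-class example values are ours, not from the paper.

```python
import numpy as np

def boundary_distance(softmax, true_label):
    """Directional distance of a sample to the decision boundary:
    sign from the model's correctness (top-1 prediction vs. true label),
    magnitude from the standard deviation of the softmax distribution."""
    direction = 1.0 if int(np.argmax(softmax)) == true_label else -1.0
    return direction * float(np.std(softmax))

# A confused (near-uniform) prediction lies close to the boundary,
# while a confident correct one lies far on the positive side.
confused = boundary_distance(np.array([0.26, 0.25, 0.25, 0.24]), 0)
confident = boundary_distance(np.array([0.97, 0.01, 0.01, 0.01]), 0)
```

Note how a nearly uniform softmax output yields a distance close to zero, matching the intuition that confusing samples sit near the boundary.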
One might argue that the cross-entropy loss, \(H(p, q) = -\sum p(x_i)\log(q(x_i))\), where \(p(x_i)\) and \(q(x_i)\) are the true and softmax distributions for \(x_i\), could be adopted as the distance function. However, because \(p(x_i)\) is formulated as a one-hot true label vector, the cross-entropy loss cannot capture the prediction probabilities of the false labels, which are an important factor in identifying confusing samples.
Another advantage is that our distance function is bounded, as opposed to the loss. For k labels, the maximum value of \(std(h(y \mid x_i; \theta^t))\) is \(k^{-1}\sqrt{k-1}\), attained when \(h(m \mid x_i; \theta^t) = 1\) and \(\forall_{l \ne m}\, h(l \mid x_i; \theta^t) = 0\). Thus, \(dist(x_i, y_i; \theta^t)\) is bounded by
\[
-k^{-1}\sqrt{k-1} \,\le\, dist(x_i, y_i; \theta^t) \,\le\, k^{-1}\sqrt{k-1}. \qquad (2)
\]
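A quick numerical check of this bound, as a sketch under the assumption that the standard deviation is taken over the k softmax outputs:

```python
import numpy as np

k = 10
bound = np.sqrt(k - 1) / k  # k^{-1} * sqrt(k - 1)

# The standard deviation is maximized by a one-hot softmax output ...
one_hot = np.zeros(k)
one_hot[3] = 1.0
max_std = np.std(one_hot)

# ... and stays below the bound for arbitrary probability vectors.
rng = np.random.default_rng(0)
stds = [np.std(rng.dirichlet(np.ones(k))) for _ in range(1000)]
```

For k = 10 the bound evaluates to 0.3, and no valid probability vector exceeds it.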
Sampling probability based on quantization index
The rank-based approach introduced by Loshchilov and Hutter (2016) is a common way to assign the probability of a sample being selected for the next mini-batch. This approach sorts the samples by a certain importance measure in descending order, and exponentially decays the sampling probability of a given sample according to its rank. Let N denote the total number of samples. Then, each r-th ranked sample is selected with the probability p(r), which drops by a factor of \(\exp(\log(s_e)/N)\). Here, \(s_e\) is the selection pressure parameter that affects the probability gap between the most and the least important samples. When normalized to sum to 1.0, the probability that the r-th ranked sample is selected is defined by
\[
p(r) = \frac{1/\exp(\log(s_e)/N)^{r}}{\sum_{j=1}^{N} 1/\exp(\log(s_e)/N)^{j}}. \qquad (3)
\]
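The rank-based probability can be sketched as follows (`rank_probabilities` is our illustrative helper name); note that the ratio between the most and least important samples comes out at roughly \(s_e\):

```python
import numpy as np

def rank_probabilities(N, s_e):
    """p(r) proportional to 1 / exp(log(s_e)/N)^r for r = 1..N,
    normalized so the probabilities sum to 1."""
    decay = np.exp(np.log(s_e) / N)
    p = 1.0 / decay ** np.arange(1, N + 1)
    return p / p.sum()

p = rank_probabilities(N=1000, s_e=32.0)
```

With \(s_e = 32\), the top-ranked sample is about 32 times more likely to be drawn than the bottom-ranked one.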
In the existing rank-based approach, the rank of a sample is determined by \(dist(x_i, y_i; \theta^t)\) in ascending order, because the distance is inversely proportional to the sample importance. However, if the mass of the true sample distribution is skewed to one side (e.g., the easy side), as shown in Fig. 4, the mini-batch samples are selected with high probability from the skewed side rather than from around the decision boundary, where \(dist(x_i, y_i; \theta^t)\) is very small. This problem is attributed to the probability of a given rank being fixed unconditionally; in other words, samples with similar ranks are selected with similar probabilities, regardless of the magnitude of their distance values.
To incorporate the impact of the distance into batch selection, we adopt the quantization method (Gray and Neuhoff 1998; Chen and Wornell 2001) and use the quantization index q instead of the rank r. Let \(\varDelta\) be the quantization step size and d be the output of the function \(dist(x_i, y_i; \theta^t)\) for a given sample \(x_i\). Then, the index q is obtained by the simple quantizer Q(d),
\[
q = Q(d) = \lceil \, |d| \, / \, \varDelta \, \rceil. \qquad (4)
\]
The quantization index gets larger as a sample moves away from the decision boundary. In addition, the difference between two indexes reflects the difference in the actual distances.
In Eq. (4), we set \(\varDelta\) to \(k^{-1}\sqrt{k-1}/N\) such that the index q is bounded by N (the total number of samples) according to Eq. (2). Then, the sampling probability of a given sample \(x_i\) with the true label \(y_i\) is defined by
\[
P(x_i, y_i; \theta^t) = \frac{1/\exp(\log(s_e)/N)^{\,Q(dist(x_i, y_i; \theta^t))}}{\sum_{j=1}^{N} 1/\exp(\log(s_e)/N)^{\,Q(dist(x_j, y_j; \theta^t))}}. \qquad (5)
\]
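A minimal sketch of the quantization-based selection, assuming a quantizer of the form \(q = \lceil |d|/\varDelta \rceil\) with \(\varDelta = k^{-1}\sqrt{k-1}/N\); the helper names `quantize` and `selection_probabilities` are ours:

```python
import numpy as np

def quantize(d, k, N):
    """Map a signed distance to a quantization index in {1, ..., N}.
    The index grows with the distance from the boundary and is
    bounded by N because |d| <= sqrt(k-1)/k."""
    delta = np.sqrt(k - 1) / (k * N)
    return max(1, int(np.ceil(abs(d) / delta)))

def selection_probabilities(dists, k, s_e):
    """Sampling probabilities computed from quantization indexes
    (not ranks), so ties in distance yield identical probabilities."""
    N = len(dists)
    decay = np.exp(np.log(s_e) / N)
    q = np.array([quantize(d, k, N) for d in dists])
    p = 1.0 / decay ** q
    return p / p.sum()

# Samples closer to the boundary (small |d|) get higher probability,
# on both the correct (+) and incorrect (-) sides.
dists = np.array([0.01, -0.01, 0.25, -0.25])
probs = selection_probabilities(dists, k=10, s_e=32.0)
```

Unlike the rank-based scheme, two samples at the same distance receive exactly the same probability, regardless of how many other samples lie between them in rank.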
As shown in Fig. 4, our quantization-based method produces a well-balanced distribution, even if the true sample distribution is skewed.
AdaBoundary Algorithm
Main proposed algorithm
Algorithm 1 describes the overall procedure of AdaBoundary. The input to the algorithm consists of the training data set of size N, the mini-batch size b, the selection pressure \(s_e\), and the threshold \(\gamma\) used to decide the warm-up period. In the early stages of training, because the quantization index of each sample is not confirmed yet, the algorithm requires a warm-up period of \(\gamma\) epochs. Randomly selected mini-batch samples are used during the warm-up period (Lines 6–7), and their quantization indexes are updated (Lines 12–18). After the warm-up epochs, the algorithm computes the sampling probability of each sample by Eq. (5) and selects mini-batch samples based on that probability (Lines 8–11). Then, the quantization indexes are updated in the same way (Lines 12–18). Here, we compute the indexes using the previous model with \(\theta^{t}\) after every SGD step, rather than the latest model with \(\theta^{t+1}\), in order to reuse the previously computed softmax distributions. In addition, we asynchronously update the indexes of only the samples included in the mini-batch, to avoid the additional forward propagation over all samples, which would incur a high computational cost.
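The procedure can be sketched as a framework-agnostic training loop; `model.softmax` and `model.sgd_step` are hypothetical interfaces standing in for forward propagation and an SGD update, not the paper's actual API:

```python
import numpy as np

def adaboundary_training(model, X, y, num_epochs, b, s_e, gamma, k):
    """Illustrative sketch of Algorithm 1 (our reading of the paper).
    Assumes `model.softmax(batch)` returns per-sample softmax outputs and
    `model.sgd_step(batch_x, batch_y)` performs one SGD update."""
    N = len(X)
    q = np.full(N, N)                  # quantization indexes, updated lazily
    rng = np.random.default_rng(0)
    decay = np.exp(np.log(s_e) / N)
    delta = np.sqrt(k - 1) / (k * N)   # step size bounding q by N

    for epoch in range(num_epochs):
        for _ in range(N // b):
            if epoch < gamma:          # warm-up: uniform random mini-batches
                idx = rng.choice(N, size=b, replace=False)
            else:                      # quantization-based sampling
                p = 1.0 / decay ** q
                idx = rng.choice(N, size=b, replace=False, p=p / p.sum())
            # Asynchronous index update: reuse the softmax outputs of the
            # current parameters for the mini-batch samples only, instead
            # of a full forward pass over the entire data set.
            probs = model.softmax(X[idx])
            correct = probs.argmax(axis=1) == y[idx]
            d = np.where(correct, 1.0, -1.0) * probs.std(axis=1)
            q[idx] = np.maximum(1, np.ceil(np.abs(d) / delta)).astype(int)
            model.sgd_step(X[idx], y[idx])
    return model
```

The key points mirrored from Algorithm 1 are the warm-up branch, the quantization-based sampling afterwards, and the asynchronous index update restricted to the current mini-batch.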
Variants of AdaBoundary for comparison
For a more sophisticated analysis of sampling, we present two heuristic sampling strategies: (1) AdaHard is similar to the existing hard batch strategy (Loshchilov and Hutter 2016), but it uses our distance function instead of the loss; that is, AdaHard focuses on the samples far from the decision boundary in the negative direction. (2) AdaUniform is designed to select samples with a wide range of difficulty, so it samples uniformly over the distance range regardless of the sample distribution.
We modified a few lines of Algorithm 1 to implement the two variants. In detail, for AdaHard, the quantization index q should be small for samples located far in the negative direction. Thus, AdaHard can be implemented by modifying the quantizer Q(d) in Line 16 of Algorithm 1. When we set \(\varDelta = k^{-1}\sqrt{k-1}/N\) to bound the index q by N, the quantizers of AdaHard are defined by
AdaUniform can be implemented by using \(F^{-1}(x)\) to compute the sampling probability in Line 10 of Algorithm 1, where F(x) is the empirical distribution of the samples’ distances to the decision boundary. Note that the computational cost of AdaUniform is much higher than those of AdaBoundary and AdaHard because computing the empirical distribution requires time linear in the total number of training samples (i.e., O(N)) in every update iteration.
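Under our reading of \(F^{-1}(x)\), AdaUniform amounts to inverse-CDF sampling over the empirical distance distribution; `adauniform_pick` is an illustrative name, and re-sorting the distances on every draw is what makes the variant O(N) per update:

```python
import numpy as np

def adauniform_pick(dists, rng):
    """Sketch of AdaUniform: draw a target distance uniformly over the
    observed distance range, then return the index of the sample whose
    distance is nearest, i.e., apply the empirical inverse CDF."""
    order = np.argsort(dists)     # rebuild the empirical distribution F
    sorted_d = dists[order]
    target = rng.uniform(sorted_d[0], sorted_d[-1])
    j = min(int(np.searchsorted(sorted_d, target)), len(dists) - 1)
    return int(order[j])
```

Because the target distance is uniform over the range, a lone sample on the sparse side of a skewed distribution is picked far more often than its share of the data, which matches the oversampling behavior discussed in Sect. 5.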
Figure 5 shows the distributions of mini-batch samples drawn by these two variants. The distribution of AdaHard is skewed to the hard side, and that of AdaUniform tends to be uniform.
Evaluation
To validate the superiority of AdaBoundary, we performed an image classification task on five benchmark data sets with varying difficulty levels: MNIST (easy),^{Footnote 2} classification of handwritten digits (LeCun 1998), with 60,000 training and 10,000 testing images; Fashion-MNIST (relatively easy),^{Footnote 3} classification of various clothing (Xiao et al. 2017), with 60,000 training and 10,000 testing images; CIFAR-10 (relatively hard) and CIFAR-100 (hard),^{Footnote 4} classification of a subset of 80 million categorical images (Krizhevsky et al. 2014), with 50,000 training and 10,000 testing images; and Tiny-ImageNet (hard),^{Footnote 5} classification of a subset of large-scale categorical images (Krizhevsky et al. 2012), with 100,000 training and 10,000 testing images. We did not apply any data augmentation or preprocessing procedures.
We experimentally analyzed the performance improvement of AdaBoundary over not only random batch selection but also four other adaptive batch selection algorithms: random batch selection selects the next batch uniformly at random from the entire data set; online batch selects hard samples based on the rank of the loss computed in previous epochs; active bias selects uncertain samples with high variance of the true label probability; AdaHard and AdaUniform are the two variants of AdaBoundary introduced in Sect. 4.2. All the algorithms were implemented using TensorFlow 2.1.0^{Footnote 6} and executed using eight NVIDIA Titan Volta GPUs. For reproducibility, we provide the source code at https://github.com/kaistdmlab/AdaBoundary.
For the classification task, we mainly used a densely connected neural network (DenseNet) (Huang et al. 2017), which is widely known to achieve good generalization performance on data sets with varying difficulty levels (Song et al. 2019). In support of reliable evaluation, we repeated every task five times and report the average with its standard error; that is, the training loss (or test error) was averaged over all trials at every epoch. To compare the convergence speed among the methods, we plotted the averaged training loss and test error for an equivalent wall-clock training time. In addition, because the best test error within a given time has been widely used in studies on fast and accurate training (Loshchilov and Hutter 2016; Chang et al. 2017), we report the average of the best test errors in tabular form.
Analysis on hard data sets
We used six batch selection strategies to train a DenseNet on the hard data sets: CIFAR-10, CIFAR-100, and Tiny-ImageNet. Specifically, we trained a DenseNet (\(L=25\), \(k=12\)) with a momentum optimizer. We used batch normalization (Ioffe and Szegedy 2015), a momentum of 0.9, and a batch size of 128. As for the algorithm parameters, we used the best selection pressure \(s_e\), obtained from \(s_e \in \{2, 8, 32\}\) (see Sect. 5.5 for details), and set the warm-up threshold \(\gamma\) to 15. Technically, a small \(\gamma\) is enough for warm-up, but to reduce the performance variance caused by randomly initialized parameters, we used a larger \(\gamma\) and shared the model parameters across all strategies during the warm-up period. For online batch selection, we recomputed all the losses at every epoch to reflect the latest losses. Regarding the training schedule, following the experimental setup of Huang et al. (2017), we trained the model for 100 epochs and used an initial learning rate of 0.1, which was divided by 5 at \(50\%\) and \(75\%\) of the total number of training iterations. Because the baseline strategy required about 2,300 seconds for the two CIFAR data sets and 14,500 seconds for the Tiny-ImageNet data set, we excluded the results of the other strategies beyond those times.
Figure 6 shows the convergence curves of training loss and test error for the six batch selection strategies on the three hard data sets: CIFAR-10, CIFAR-100, and Tiny-ImageNet. To improve legibility, only the curves of the baseline and proposed strategies are dark colored. The best test errors in Fig. 6 are summarized in Table 1. We conducted a convergence analysis of the six batch selection strategies, as follows:

CIFAR-10 (relatively hard): Except for AdaUniform and active bias, all adaptive batch selections achieved faster convergence than random batch selection in training loss, but only AdaBoundary converged faster than random batch selection in test error. This means that strategies focusing on hard samples result in overfitting to “too hard” samples, as indicated by a larger converged test error. Meanwhile, active bias tended to generalize the network better on test data, considering that its test error was comparable to that of random batch selection despite its much higher training loss. That is, active bias resulted in better generalization but slowed down the training process. Quantitatively, AdaBoundary achieved a test error relatively lower by \(3.79\%\) (\(8.71\%\!\rightarrow \!8.38\%\)) than random batch selection. In contrast, the test errors of AdaHard, online batch selection, and active bias were relatively higher by \(2.18\%\) (\(8.71\%\!\rightarrow \!8.90\%\)), \(0.80\%\) (\(8.71\%\!\rightarrow \!8.78\%\)), and \(1.84\%\) (\(8.71\%\!\rightarrow \!8.87\%\)), respectively.

CIFAR-100 (hard): In both training loss and test error, the convergence curves of all strategies showed trends similar to those on CIFAR-10. However, as the training difficulty increased from CIFAR-10 to CIFAR-100, the overfitting of AdaHard and online batch selection was further exacerbated. This emphasizes the need to consider samples of appropriate difficulty rather than simply hard samples. Compared with random batch selection, AdaBoundary achieved a test error relatively lower by \(2.39\%\) (\(33.54\%\!\rightarrow \!32.74\%\)). On the other hand, the test errors of AdaHard, online batch selection, and active bias were relatively higher by \(5.55\%\) (\(33.54\%\!\rightarrow \!35.40\%\)), \(6.26\%\) (\(33.54\%\!\rightarrow \!35.64\%\)), and \(2.80\%\) (\(33.54\%\!\rightarrow \!34.48\%\)), respectively.

Tiny-ImageNet (hard): The convergence trend was consistent even when the data set became larger and harder. Only AdaBoundary achieved a test error relatively lower by \(0.97\%\) (\(51.67\%\!\rightarrow \!51.17\%\)) than random batch selection. On the other hand, the test errors of AdaHard, online batch selection, and active bias were relatively higher by \(1.86\%\) (\(51.67\%\!\rightarrow \!52.63\%\)), \(1.92\%\) (\(51.67\%\!\rightarrow \!52.66\%\)), and \(3.39\%\) (\(51.67\%\!\rightarrow \!53.42\%\)), respectively.
In all the cases, the large performance gap between AdaUniform and the other methods was attributed to the high computational cost of updating its empirical sampling distribution and to the oversampling of too hard samples caused by the plethora of easy ones.
Analysis on easy data sets
We also trained a DenseNet (\(L=25\), \(k=12\)) with momentum on the easy data sets: MNIST and Fashion-MNIST. We used the same experimental configuration as in Sect. 5.1, except for the training schedule. Because a small learning rate without decay is generally preferred for easy data sets (Loshchilov and Hutter 2016; Gao and Jojic 2017), we used a constant learning rate of 0.01 over 80 epochs. Here, the baseline strategy required about 1,880 seconds in all cases.
Figure 7 shows the convergence curves of training loss and test error for the six batch selection strategies on MNIST and Fashion-MNIST, and the best test errors in Fig. 7 are summarized in Table 2. We conducted a convergence analysis of the six batch selection strategies, as follows:

MNIST (easy): As we clarified in Sect. 1, the hard batch selections, AdaHard and online batch selection, worked well on the easy MNIST data set. They converged faster than random batch selection in both training loss and test error. AdaBoundary showed a fast convergence comparable to that of online batch selection; the absolute difference in test error between them was only \(0.02\%\), which is almost negligible. Quantitatively, AdaBoundary, AdaHard, online batch selection, and active bias achieved test errors relatively lower by \(14.58\%\) (\(0.48\%\!\rightarrow \!0.41\%\)), \(4.17\%\) (\(0.48\%\!\rightarrow \!0.46\%\)), \(18.75\%\) (\(0.48\%\!\rightarrow \!0.39\%\)), and \(4.17\%\) (\(0.48\%\!\rightarrow \!0.46\%\)) than random batch selection, respectively.

Fashion-MNIST (relatively easy): In both training loss and test error, AdaBoundary achieved a significantly faster convergence speed than random batch selection. AdaHard and online batch selection tended to weakly overfit to “too hard” samples; their test errors approached that of random batch selection even though they converged much faster in training loss. In summary, AdaBoundary, online batch selection, and active bias achieved test errors relatively lower by \(8.05\%\) (\(7.08\%\!\rightarrow \!6.51\%\)), \(1.84\%\) (\(7.08\%\!\rightarrow \!6.95\%\)), and \(2.82\%\) (\(7.08\%\!\rightarrow \!6.88\%\)) than random batch selection, respectively. On the other hand, the test error of AdaHard was relatively higher by \(3.80\%\) (\(7.08\%\!\rightarrow \!7.36\%\)).
Generalization of the gradient optimizer
To validate generality over the optimizer, we repeated the experiments in Sects. 5.1 and 5.2 using the SGD optimizer. Figure 8 shows the convergence curves of the six batch selection strategies with the SGD optimizer on all the data sets, and we conducted their convergence analysis as follows:

MNIST (easy): All adaptive batch selection strategies converged faster than random batch selection in both training loss and test error. The convergence curve of AdaUniform tended to fluctuate.

Fashion-MNIST (relatively easy): Except for AdaUniform, all strategies showed comparable performance in both training loss and test error, converging slightly faster than random batch selection.

CIFAR-10 (relatively hard): Except for AdaUniform and active bias, all adaptive batch selections converged significantly faster than random batch selection in training loss. However, owing to overfitting to “too hard” samples, only AdaBoundary achieved a much faster convergence speed than the other adaptive batch selections in test error.

CIFAR-100 (hard): AdaBoundary converged faster than random batch selection in both training loss and test error. In contrast, AdaHard and online batch selection suffered from the overfitting issue in test error, even though they converged faster than random batch selection in training loss.

Tiny-ImageNet (hard): Again, similar to the CIFAR data sets, AdaBoundary achieved the lowest test error while expediting the convergence of the training loss. Active bias converged the slowest in training loss, but its test error was comparable to that of the hard batch selections.
In summary, only AdaBoundary succeeded in increasing the convergence speed for all data sets, regardless of the difficulty level. Quantitatively, compared with random batch selection, AdaBoundary achieved significant reductions in test error of \(31.80\%\) (\(3.43\%\!\rightarrow \!2.34\%\)), \(12.85\%\) (\(14.47\%\!\rightarrow \!12.61\%\)), \(3.26\%\) (\(14.72\%\!\rightarrow \!14.24\%\)), \(2.11\%\) (\(42.63\%\!\rightarrow \!41.73\%\)), and \(0.58\%\) (\(56.70\%\!\rightarrow \!56.37\%\)) on MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively.
Generalization of the model
To show generality over the model, we trained a WideResNet 16-8 (Zagoruyko and Komodakis 2016) on the CIFAR-10 data set for 50 epochs. For this experiment, we used the SGD optimizer and a constant learning rate of 0.01; the other configurations were the same as in Sect. 5.1. Figure 9 shows the convergence curves of the six batch selection strategies on CIFAR-10. Here, AdaBoundary also outperformed the other batch selection strategies in both training loss and test error. AdaBoundary significantly reduced the test error by \(8.38\%\) (\(17.06\%\!\rightarrow \!15.63\%\)) compared with random batch selection. Owing to the overfitting issue, AdaHard converged slower than AdaBoundary in test error, even though it achieved a low training loss comparable to that of AdaBoundary. The slow convergence of online batch selection in test error is explained by the same reasoning.
Impact of selection pressure \(s_e\)
The selection pressure \(s_e\) determines how strongly the boundary samples are selected. The larger the \(s_e\) value, the larger the sampling probabilities of the boundary samples, so more boundary samples are chosen for the next mini-batch. On the other hand, a smaller \(s_e\) value brings AdaBoundary closer to random batch selection. To analyze the impact of the selection pressure \(s_e\) and determine the best value, we trained a DenseNet (\(L=25\), \(k=12\)) with momentum on four benchmark data sets with varying \(s_e\) values, in the same configuration as outlined in Sects. 5.1 and 5.2.
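The effect of \(s_e\) on the probability gap can be checked numerically (a sketch; `probability_gap` is our helper name): under the exponential decay, the most likely sample is about \(s_e\) times more likely to be drawn than the least likely one, and \(s_e \rightarrow 1\) approaches uniform random selection.

```python
import numpy as np

def probability_gap(N, s_e):
    """Ratio between the most and least likely samples under the
    exponential decay with selection pressure s_e: approximately s_e."""
    decay = np.exp(np.log(s_e) / N)
    p = 1.0 / decay ** np.arange(1, N + 1)
    p /= p.sum()
    return p[0] / p[-1]

# A larger selection pressure widens the gap between boundary samples
# and the rest (N = 50,000 matches the CIFAR training set size).
gap_strong = probability_gap(50000, 32.0)
gap_weak = probability_gap(50000, 2.0)
```

This makes the trade-off discussed below concrete: \(s_e = 32\) concentrates sampling about 16 times more sharply on the boundary samples than \(s_e = 2\) does.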
Tables 3 and 4 respectively show the converged training loss and the best test error of AdaBoundary with varying \(s_e\) values on the four benchmark data sets. For training loss, the convergence speed was accelerated as the \(s_e\) value increased; that is, a lower training loss was achieved with a larger \(s_e\) value (see Table 3). Similarly, for test error, this trend was observed on the easy data sets (see MNIST and Fashion-MNIST in Table 4). However, overexposure to the boundary samples under a large \(s_e\) incurred overfitting on the hard data sets (see CIFAR-10 and CIFAR-100 in Table 4): the test error increased even as the training loss decreased. This means that overexposure to only a part of the training samples is not beneficial for generalization on hard data sets. Therefore, we used \(s_e=32.0\) for the easy data sets and \(s_e=2.0\) for the hard data sets in all experiments.
Conclusion and future work
In this paper, we proposed a novel and simple adaptive batch selection algorithm, AdaBoundary, that presents the most appropriate samples according to the learning progress of the model. Toward this goal, we defined the distance from a sample to the decision boundary and introduced a quantization method for selecting the samples near the boundary with high probability. We performed extensive experiments using a DenseNet on five benchmark data sets with varying difficulty levels. The results showed that AdaBoundary significantly accelerated the training process and generalized better on hard data sets. When training an easy data set, AdaBoundary showed a fast convergence comparable to that of the state-of-the-art algorithm; when training hard data sets, only AdaBoundary converged significantly faster than both random batch selection and the state-of-the-art algorithm.
The most exciting benefit of AdaBoundary is its potential to reduce the time needed to train a DNN. This becomes more important as the size and complexity of data increase, and it can be boosted by recent advances in hardware technology. Owing to its simplicity, AdaBoundary can easily be combined with various optimizers. Our immediate future work is to apply AdaBoundary to other types of DNNs, such as RNNs (Mikolov et al. 2010) and LSTMs (Hochreiter and Schmidhuber 1997), whose neural structures are completely different from that of a CNN. In addition, we plan to investigate the relationship between the power of a DNN and the improvement achieved by AdaBoundary.
Notes
1. The idea is also applicable to DNNs other than CNNs, and we leave this extension to future work.
References
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In International Conference on Machine Learning (ICML), pp. 41–48.
Chang, H.-S., Learned-Miller, E., & McCallum, A. (2017). Active Bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1002–1012.
Chen, B., & Wornell, G. W. (2001). Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. Transactions on Information Theory, 47(4), 1423–1443.
Fan, Y., Tian, F., Qin, T., & Liu, T.-Y. (2017). Neural data filter for bootstrapping stochastic gradient descent. In International Conference on Learning Representation (ICLR).
Gao, T., & Jojic, V. (2017). Sample importance in training deep neural networks. https://openreview.net/forum?id=r1IRctqxg.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.
Gray, R. M., & Neuhoff, D. L. (1998). Quantization. Transactions on Information Theory, 44(6), 2325–2383.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456.
Katharopoulos, A., & Fleuret, F. (2018). Not all samples are created equal: Deep learning with importance sampling. In International Conference on Machine Learning (ICML), pp. 2525–2534.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representation (ICLR).
Krizhevsky, A., Nair, V., & Hinton, G. (2014). CIFAR-10 and CIFAR-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
Kumar, M. P., Packer, B., & Koller, D. (2010). Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1189–1197.
LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist.
Loshchilov, I., & Hutter, F. (2016). Online batch selection for faster training of neural networks. In International Conference on Learning Representation (ICLR).
Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1045–1048.
Needell, D., Ward, R., & Srebro, N. (2014). Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1017–1025.
Sachan, M., & Xing, E. (2016). Easy questions first? A case study on curriculum learning for question answering. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 453–463.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In International Conference on Learning Representation (ICLR).
Schmidt, M., Babanezhad, R., Ahmed, M., Defazio, A., Clifton, A., & Sarkar, A. (2015). Non-uniform stochastic average gradient method for training conditional random fields. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 819–828.
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 761–769.
Song, H., Kim, M., & Lee, J.-G. (2019). SELFIE: Refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning (ICML), pp. 5907–5915.
Song, H., Kim, M., Kim, S., & Lee, J.-G. (2020). Carpe diem, seize the samples uncertain "at the moment" for adaptive batch selection. In International Conference on Information and Knowledge Management (CIKM).
Tsvetkov, Y., Faruqui, M., Ling, W., MacWhinney, B., & Dyer, C. (2016). Learning the curriculum with bayesian optimization for taskspecific word representation learning. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 130–139.
Wu, C.-Y., Manmatha, R., Smola, A. J., & Krähenbühl, P. (2017). Sampling matters in deep embedding learning. In International Conference on Computer Vision (ICCV), pp. 2840–2848.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. arXiv:1605.07146.
Zeiler, M. D. (2012). Adadelta: An adaptive learning rate method. arXiv:1212.5701.
Acknowledgements
This work was partly supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (Ministry of Science and ICT) (No. 2017R1E1A1A01075927) and Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the Korea Government (MSIT) (No. 2020000862, DB4DL: High-Usability and Performance In-Memory Distributed DBMS for Deep Learning).
Editors: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
Song, H., Kim, S., Kim, M. et al. AdaBoundary: accelerating DNN training via adaptive boundary batch selection. Machine Learning, 109, 1837–1853 (2020). https://doi.org/10.1007/s10994-020-05903-6.
Keywords
 Batch selection
 Acceleration
 Convergence
 Decision boundary