1 Introduction

Curriculum learning (Bengio et al., 2009) refers to efficiently training effective neural networks by mimicking how humans learn, from easy to hard. As originally introduced by Bengio et al. (2009), curriculum learning is a training procedure that first organizes the examples in their increasing order of difficulty, then starts the training of the neural network on the easiest examples, gradually adding increasingly more difficult examples along the way, until all training examples are fed into the network. The success of the approach lies in avoiding the learning of very difficult examples right from the beginning, instead guiding the model on the right path through the imposed curriculum. This type of curriculum is later referred to as data-level curriculum learning (Soviany et al., 2022). Indeed, Soviany et al. (2022) identified several types of curriculum learning approaches in the literature, dividing them into four categories based on the components involved in the definition of machine learning given by Mitchell (1997). The four categories are: data-level curriculum (examples are presented from easy to hard), model-level curriculum (the modeling capacity of the network is gradually increased), task-level curriculum (the complexity of the learning task is increased during training), and objective-level curriculum (the model optimizes towards an increasingly more complex objective). While data-level curriculum is the most natural and direct way to employ curriculum learning, its main disadvantage is that it requires a way to determine the difficulty of data samples. Despite having many successful applications (Soviany et al., 2022; Wang et al., 2022), there is no universal way to determine the difficulty of the data samples, making the data-level curriculum less applicable to scenarios where the difficulty is hard to estimate, e.g. classification of radar signals. The task-level and objective-level curriculum learning strategies suffer from similar issues, e.g. it is hard to create a curriculum when the model has to learn an easy task (binary classification) or the objective function is already convex.

Fig. 1 Training based on Learning Rate Curriculum

Considering the above observations, we recognize the potential of model-level curriculum learning strategies to be applicable across a wider range of domains and tasks. To date, there are only a few works (Burduja, 2021; Karras et al., 2018; Sinha et al., 2020) in the category of pure model-level curriculum learning methods. However, these methods have some drawbacks caused by their domain-dependent or architecture-specific design. To benefit from the full potential of the model-level curriculum learning category, we propose LeRaC (Learning Rate Curriculum), a novel and simple curriculum learning approach that uses a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. This reduces the propagation of noise caused by the multiplication operations inside the network, a phenomenon that is more prevalent when the weights are randomly initialized. The learning rates increase at various paces during the first training iterations, until they all reach the same value, as illustrated in Fig. 1. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that is applicable to any domain and compatible with any neural network, generating higher performance levels regardless of the architecture, without adding any extra training time. To the best of our knowledge, we are the first to employ a different learning rate per layer to achieve the same effect as conventional (data-level) curriculum learning.

Fig. 2 Convolving an image of a car with random noise filters progressively increases the level of noise in the features. A theoretical proof of this observation is given in “Appendix A”

As hinted above, the underlying hypothesis that justifies the use of LeRaC is that the level of noise grows from one neural layer to the next, especially when the input is multiplied with randomly initialized weights having low signal-to-noise ratios. We briefly illustrate this phenomenon through an example. Suppose an image x is successively convolved with a set of random filters \(c_1, c_2,\ldots , c_n\). Since the filters are uncorrelated, each filter distorts the image in a different way, degrading the information in x with each convolution. The information in x is gradually replaced by noise (see Fig. 2), i.e. the signal-to-noise ratio decreases with each layer. Optimizing the filter \(c_n\) to learn a pattern from the image convolved with \(c_1, c_2,\ldots , c_{n-1}\) is suboptimal, because the filter \(c_{n}\) will adapt to the noisy (biased) activation map induced by filters \(c_1, c_2,\ldots , c_{n-1}\). This suggests that earlier filters need to be optimized sooner to reduce the level of noise of the activation map passed to layer n. In general, this phenomenon becomes more obvious as the layers get deeper, since the number of multiplication operations grows along the way. Hence, in the initial training stages, it makes sense to use gradually lower learning rates, as the layers get farther away from the input. Our hypothesis is theoretically supported by Theorem 1, and empirically validated in “Appendix B”.
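The simulation below mimics this chain of multiplications with a sequence of element-wise products between noisy factors and reports the signal-to-noise ratio after each step. It is a minimal sketch of the phenomenon, not the protocol behind Fig. 2; the per-factor noise level and the number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000        # number of samples per signal
noise_std = 0.1    # per-factor SNR of 1 / 0.1**2 = 100 (20 dB)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, defined as the power ratio of the two components."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

clean_prod = np.ones(m)   # clean part of the accumulated product
noisy_prod = np.ones(m)   # actual accumulated product, including noise
for layer in range(1, 6):
    u = rng.choice([-1.0, 1.0], size=m)      # clean component of the current factor
    z = rng.normal(0.0, noise_std, size=m)   # noise component of the current factor
    clean_prod *= u
    noisy_prod *= u + z
    print(f"after {layer} multiplications: SNR = "
          f"{snr_db(clean_prod, noisy_prod - clean_prod):.1f} dB")
```

The printed SNR drops monotonically with the number of multiplications, consistent with the qualitative trend shown in Fig. 2.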

We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet (Russakovsky et al., 2015), ImageNet-1K (Russakovsky et al., 2015), Food-101 (Bossard et al. 2014), UTKFace (Zhang et al., 2017), PASCAL VOC (Everingham et al., 2010)), language (BoolQ (Clark et al., 2019), QNLI (Wang et al., 2019), RTE (Wang et al., 2019)) and audio (ESC-50 (Piczak, 2015), CREMA-D (Cao et al., 2014)) domains, considering various convolutional (ResNet-18 (He et al., 2016), Wide-ResNet-50 (Zagoruyko & Komodakis, 2016), DenseNet-121 (Huang et al., 2017), YOLOv5 (Jocher et al., 2022)), recurrent (LSTM (Hochreiter & Schmidhuber, 1997)) and transformer (CvT (Wu et al., 2021), BERT (Devlin et al., 2019), SepTr (Ristea et al., 2022)) architectures. We compare our approach with the conventional training regime and Curriculum by Smoothing (CBS) (Sinha et al., 2020), our closest competitor. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time, since there is no additional cost over the conventional training regime for LeRaC, whereas CBS adds Gaussian smoothing layers. We also compare with several data-level and task-level curriculum learning methods (Dogan et al., 2020; Wang et al., 2023; Khan et al., 2024, 2023a, b), and show that our method scores best in most of the experiments.

In summary, our contribution is threefold:

  • We propose a novel and simple model-level curriculum learning strategy that creates a curriculum by updating the weights of each neural layer with a different learning rate, considering higher learning rates for the low-level feature layers and lower learning rates for the high-level feature layers.

  • We empirically demonstrate the applicability to multiple domains (image, audio and text), the compatibility to several neural network architectures (convolutional neural networks, recurrent neural networks and transformers), and the time efficiency (no extra training time added) of LeRaC through a comprehensive set of experiments.

  • We demonstrate our underlying hypothesis stating that the level of noise increases from one neural layer to another, both theoretically and empirically.

2 Related Work

2.1 Curriculum Learning

Curriculum learning was initially introduced by Bengio et al. (2009) as a training strategy that helps machine learning models to generalize better when the training examples are presented in the ascending order of their difficulty. Extensive surveys on curriculum learning methods, including the most recent advancements on the topic, were conducted by Soviany et al. (2022) and Wang et al. (2022). In the former survey, Soviany et al. (2022) emphasized that curriculum learning is not only applied at the data level, but also with respect to the other components involved in a machine learning approach, namely at the model level, the task level and the objective (loss) level. Regardless of the component on which curriculum learning is applied, the technique has demonstrated its effectiveness on a broad range of machine learning tasks, from computer vision (Bengio et al., 2009; Gui et al., 2017; Jiang et al., 2018; Shi & Ferrari, 2016; Soviany et al., 2021; Chen & Gupta, 2015; Sinha et al., 2020; Khan et al., 2024, 2023a, b) to natural language processing (Platanios et al., 2019; Kocmi & Bojar, 2017; Spitkovsky et al., 2009; Liu et al., 2018; Bengio et al., 2009) and audio processing (Ranjan & Hansen, 2018; Amodei et al., 2016).

The main challenge for the methods that build the curriculum at the data level is measuring the difficulty of the data samples, which is required to order the samples from easy to hard. Most studies have addressed the problem with human input (Pentina et al., 2015; Jiménez-Sánchez et al., 2019; Wei et al., 2021) or metrics based on domain-specific heuristics. For instance, the text length (Kocmi & Bojar, 2017; Cirik et al., 2016; Tay et al., 2019; Zhang et al., 2021) and the word frequency (Bengio et al., 2009; Liu et al., 2018) have been employed in natural language processing. In computer vision, the samples containing fewer and larger objects have been considered to be easier in some works (Soviany et al., 2021; Shi & Ferrari, 2016). Other solutions employed difficulty estimators (Ionescu et al., 2016) or even the confidence level of the predictions made by the neural network (Gong et al., 2016; Hacohen & Weinshall, 2019) to approximate the complexity of the data samples. Other studies (Khan et al., 2024, 2023a, b) used the error of a previously trained model to estimate the difficulty of each sample. Such solutions have shown their utility in specific application domains. Nonetheless, measuring the difficulty remains problematic when implementing standard (data-level) curriculum learning strategies, at least in some application domains. Therefore, several alternatives have emerged over time, addressing this drawback and improving the conventional curriculum learning approach. In (Kumar et al., 2010), the authors introduced self-paced learning to evaluate the learning progress when selecting training samples. The method was successfully employed in multiple settings (Kumar et al., 2010; Gong et al., 2019; Fan et al., 2017; Li et al., 2016; Zhou et al., 2018; Jiang et al., 2015; Ristea & Ionescu, 2021). Furthermore, some studies combined self-paced learning with the traditional pre-computed difficulty metrics (Jiang et al., 2015; Ma et al., 2017). An additional advancement related to self-paced learning is the approach called self-paced learning with diversity (Jiang et al., 2014). The authors demonstrated that enforcing a certain level of variety among the selected examples can improve the final performance. Another set of methods that bypass the need for predefined difficulty metrics is known as teacher-student curriculum learning (Zhang et al., 2019; Wu et al., 2018). In this setting, a teacher network learns a curriculum to supervise a student neural network.

Closer to our work, a few methods (Karras et al., 2018; Sinha et al., 2020; Burduja, 2021) proposed to apply curriculum learning at the model level, by gradually increasing the learning capacity (complexity) of the neural architecture. Such curriculum learning strategies do not need to know the difficulty of the data samples, thus having a great potential to be useful in a broad range of tasks. For example, Karras et al. (2018) proposed to gradually add layers to generative adversarial networks during training, while increasing the resolution of the input images at the same time. They are thus able to generate realistic high-resolution images. However, their approach is not applicable to every domain, since there is no notion of resolution for some input data types, e.g. text. Sinha et al. (2020) presented a strategy that blurs the activation maps of the convolutional layers using Gaussian kernel layers, reducing the noisy information caused by the network initialization. The blur level is progressively reduced to zero by decreasing the standard deviation of the Gaussian kernels. With this mechanism, they obtain a training procedure that allows the neural network to see simple information at the start of the process and more intricate details towards the end. Curriculum by Smoothing (CBS) (Sinha et al., 2020) was only shown to be useful for convolutional architectures applied in the image domain. Although we found that CBS is applicable to transformers by blurring the tokens, it is not necessarily applicable to any neural architecture, e.g. standard feed-forward neural networks. As an alternative to CBS, Burduja (2021) proposed to apply the same smoothing process on the input image instead of the activation maps. The method was applied with success in medical image alignment. However, this approach is not applicable to natural language input, as it is not clear how to apply the blurring operation on the input text.

Different from Burduja (2021) and Karras et al. (2018), our approach is applicable to various domains, including but not limited to natural language processing, as demonstrated throughout our experiments. To the best of our knowledge, the only competing model-level curriculum method which is applicable to various domains is CBS (Sinha et al., 2020). Unlike CBS, LeRaC does not introduce new operations, such as smoothing with Gaussian kernels, during training. As such, our approach does not increase the training time with respect to the conventional training regime, as later shown in the experiments included in Sect. 4.

To classify our approach as a curriculum learning framework, we consider the extreme case when the learning rate is set to zero for later layers, which is equivalent to freezing those layers. This clearly reduces the learning capacity of the model. If layers are unfrozen one by one, the capacity of the model grows. LeRaC can be seen as a soft version of the model-level curriculum method described above. We thus classify LeRaC as a model-level curriculum method. However, our method can also be seen as a curriculum learning strategy that simplifies the optimization (Pentina et al., 2015; Jiménez-Sánchez et al., 2019; Wei et al., 2021; Kocmi & Bojar, 2017; Cirik et al., 2016; Tay et al., 2019; Zhang et al., 2021; Bengio et al., 2009; Liu et al., 2018) in the early training stages by restricting the model updates (in a soft manner) to certain directions (corresponding to the weights of the earlier layers). Due to the imposed soft restrictions (lower learning rates for deeper layers), the optimization is easier at the beginning. As the training progresses, all directions become equally important, and the network is permitted to optimize the loss function in any direction. As the number of directions grows, the optimization task becomes more complex (it is harder to find the optimum). Hence, a relationship to curriculum learning can be discovered by noting that the complexity of the optimization increases over time, just as in curriculum learning.

In summary, we consider that the simplicity of our approach comes with many important advantages: applicability to any domain and task, compatibility with any neural network architecture, and time efficiency (adds no extra training time). We support all these claims through the comprehensive experiments presented in Sect. 4.

2.2 Learning Rate Schedulers

There are some contributions (Singh et al., 2015; You et al., 2017) showing that using adaptive learning rates can lead to improved results. We explain how our method is different below. In (Singh et al., 2015), the main goal is increasing the learning rate of certain layers as necessary, to escape saddle points. Different from Singh et al. (2015), our strategy reduces the learning rates of deeper layers, introducing soft optimization restrictions in the initial training epochs. You et al. (2017) proposed to train models with very large batches using a learning rate for each layer, by scaling the learning rate with respect to the norms of the gradients. The goal of You et al. (2017) is to specifically learn models with large batch sizes, e.g. formed of 8K samples. Unlike You et al. (2017), we propose a more generic approach that can be applied to multiple architectures (convolutional, recurrent, transformer) under unrestricted training settings.

Gotmare et al. (2019) point out that learning rate with warm-up and restarts is an effective strategy to improve stability of training neural models using large batches. Different from LeRaC, this approach does not employ a different learning rate for each layer. Moreover, the strategy restarts the learning rate at different moments during the entire training process, while LeRaC is applied only during the first few training epochs.

2.3 Optimizers

We consider Adam (Kingma & Ba, 2015) and related optimizers as orthogonal approaches that perform the optimization rather than setting the learning rate. Our approach, LeRaC, only aims to guide the optimization during the initial training iterations by reducing the relevance of optimizing deeper network layers. Most of the baseline architectures used in our experiments are already based on Adam or some of its variations, e.g. AdaMax, AdamW (Loshchilov & Hutter, 2019). LeRaC is applied in conjunction with these optimizers, showing improved performance over various architectures and application domains. This supports our claim that LeRaC is an orthogonal contribution to the family of Adam optimizers.

3 Method

Deep neural networks are commonly trained on a set of labeled data samples denoted as:

$$\begin{aligned} S\!=\!\{(x_i, y_i) | x_i\!\in \!X, y_i\!\in \!Y, \forall i \in \{1,2,\ldots ,m \} \}, \end{aligned}$$
(1)

where m is the number of examples, \(x_i\) is a data sample and \(y_i\) is the associated label. The training process of a neural network f with parameters \(\theta \) consists of minimizing some objective (loss) function \({\mathcal {L}}\) that quantifies the differences between the ground-truth labels and the predictions of the model f:

$$\begin{aligned} \min _{\theta } \frac{1}{m} \sum _{i=1}^m {\mathcal {L}}\left( y_i, f(x_i, \theta ) \right) . \end{aligned}$$
(2)

The optimization is generally performed by some variant of Stochastic Gradient Descent (SGD), where the gradients are back-propagated from the neural layers closer to the output towards the neural layers closer to the input through the chain rule. Let \(f_1, f_2,\ldots , f_n\) and \(\theta _1, \theta _2,\ldots , \theta _n\) denote the neural layers and the corresponding weights of the model f, such that the weights \(\theta _j\) belong to the layer \(f_j\), \(\forall j \in \{1, 2,\ldots ,n\}\). The output of the neural network for some training data sample \(x_i \in X\) is formally computed as follows:

$$\begin{aligned} {\hat{y}}_i\!=\!f (x_i, \theta )\!=\!f_n\!\left( \ldots f_2 \left( f_1 \left( x_i, \theta _1 \right) , \theta _2 \right) \ldots , \theta _n \right) \!. \end{aligned}$$
(3)

To optimize the model via SGD, the weights are updated as follows:

$$\begin{aligned} \theta _j^{(t+1)} = \theta _j^{(t)} - \eta ^{(t)} \cdot \frac{\partial {\mathcal {L}}}{\partial \theta _j^{(t)}}, \forall j \in \{1, 2,\ldots ,n\}, \end{aligned}$$
(4)

where t is the index of the current training iteration, \(\eta ^{(t)} > 0\) is the learning rate at iteration t, and the gradient of \({\mathcal {L}}\) with respect to \(\theta _j^{(t)}\) is computed via the chain rule. Before starting the training process, the weights \(\theta _j^{(0)}\) are commonly initialized with random values, e.g. using Glorot initialization (Glorot & Bengio, 2010).

Sinha et al. (2020) suggested that the random initialization of the weights produces a large amount of noise in the information propagated through the neural model during the early training iterations, which can negatively impact the learning process. Due to the feed-forward processing that involves several multiplication operations, we argue that the noise level grows with each neural layer, from \(f_j\) to \(f_{j+1}\). This statement is confirmed by the following theorem:

Theorem 1

Let \(s_1=u_1+z_1\) and \(s_2=u_2+z_2\) be two signals, where \(u_1\) and \(u_2\) are the clean components, and \(z_1\) and \(z_2\) are the noise components. The signal-to-noise ratio of the product between the two signals is lower than the signal-to-noise ratios of the two signals, i.e.:

$$\begin{aligned} {{\,\textrm{SNR}\,}}(s_1\cdot s_2) \le {{\,\textrm{SNR}\,}}(s_i), \forall i \in \{1, 2\}. \end{aligned}$$
(5)

Proof

The proof is given in “Appendix A”. \(\square \)
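As a quick numerical sanity check of Eq. (5), under the assumption of independent, zero-mean Gaussian noise components, the snippet below estimates the three SNR values from random samples; the product consistently has the lowest ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200_000

u1, z1 = rng.normal(0, 1.0, m), rng.normal(0, 0.2, m)   # SNR(s1) = 1 / 0.2**2 = 25
u2, z2 = rng.normal(0, 1.0, m), rng.normal(0, 0.5, m)   # SNR(s2) = 1 / 0.5**2 = 4

def snr(u, z):
    """Empirical signal-to-noise ratio, computed as a power ratio."""
    return np.mean(u ** 2) / np.mean(z ** 2)

# The clean component of the product is u1 * u2; everything else counts as noise.
noise_prod = (u1 + z1) * (u2 + z2) - u1 * u2
print(f"SNR(s1)    = {snr(u1, z1):.2f}")
print(f"SNR(s2)    = {snr(u2, z2):.2f}")
print(f"SNR(s1*s2) = {snr(u1 * u2, noise_prod):.2f}")   # lower than both, as stated in Eq. (5)
```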

The same issue can occur if the weights are pre-trained on a distinct task, where the misalignment of the weights with a new task is likely higher for the high-level (specialized) feature layers. To alleviate this problem, we propose to introduce a curriculum learning strategy that assigns a different learning rate \(\eta _j\) to each layer \(f_j\), as follows:

$$\begin{aligned} \theta _j^{(t+1)} = \theta _j^{(t)} - \eta _j^{(t)} \cdot \frac{\partial {\mathcal {L}}}{\partial \theta _j^{(t)}}, \forall j \in \{1, 2,\ldots ,n\}, \end{aligned}$$
(6)

such that:

$$\begin{aligned} \eta ^{(0)} \ge \eta _1^{(0)} \ge \eta _2^{(0)} \ge \cdots \ge \eta _n^{(0)}, \end{aligned}$$
(7)
$$\begin{aligned} \eta ^{(k)} = \eta _1^{(k)} = \eta _2^{(k)} = \cdots = \eta _n^{(k)}, \end{aligned}$$
(8)

where \(\eta _j^{(0)}\) are the initial learning rates and \(\eta _j^{(k)}\) are the updated learning rates at iteration k. The condition formulated in Eq. (7) indicates that the initial learning rate \(\eta _j^{(0)}\) of a neural layer \(f_j\) gets lower as the level of the respective neural layer becomes higher (farther away from the input). With each training iteration \(t \le k\), the learning rates are gradually increased, until they become equal, according to Eq. (8). Thus, our curriculum learning strategy is only applied during the early training iterations, where the noise caused by the misfit (randomly initialized or pre-trained) weights is most prevalent. Hence, k is a hyperparameter of LeRaC that is usually adjusted such that \(k\ll T\), where T is the total number of training iterations.

At this point, various schedulers can be used to increase each learning rate \(\eta _j\) from iteration 0 to iteration k. We empirically observed that an exponential scheduler is a better option than linear or logarithmic schedulers. We thus propose to employ the exponential scheduler, which is based on the following rule:

$$\begin{aligned} \eta _j^{(l)}\!=\!\eta _j^{(0)}\!\cdot \!c^{\frac{l}{k} \cdot \left( \log _c \eta _j^{(k)} - \log _c \eta _j^{(0)} \right) }\!, \forall l\!\in \!\{0,1,\ldots ,k \}. \end{aligned}$$
(9)

We set \(c=10\) in Eq. (9) across all our experiments. This is because learning rates are usually expressed as a power of \(c=10\), e.g. \(10^{-4}\). If we start with a learning rate of \(\eta _j^{(0)}=10^{-8}\) for some layer j and we want to increase it to \(\eta _j^{(k)}=10^{-4}\) during the first 5 epochs (\(k=4\)), the intermediate learning rates generated via Eq. (9) are \(\eta _j^{(1)}\!=\!10^{-7}\), \(\eta _j^{(2)}\!=\!10^{-6}\), \(\eta _j^{(3)}\!=\!10^{-5}\) and \(\eta _j^{(4)}\!=\!10^{-4}\). We thus believe it is more intuitive to understand what happens when setting \(c=10\) in Eq. (9), as opposed to using some tuned value for c. To this end, we refrain from tuning c and fix it to \(c=10\).
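The rule in Eq. (9) can be implemented in a few lines. The sketch below defines a hypothetical helper (not the authors' code) and reproduces the numerical example above for a single layer:

```python
import numpy as np

def lerac_lr(eta_0, eta_k, l, k, c=10.0):
    """Learning rate at iteration l for a layer with initial rate eta_0,
    target rate eta_k and curriculum length k, following Eq. (9)."""
    exponent = (l / k) * (np.log(eta_k) / np.log(c) - np.log(eta_0) / np.log(c))
    return eta_0 * c ** exponent

# Example from the text: raise the learning rate from 1e-8 to 1e-4 over the first 5 epochs (k = 4).
for l in range(5):
    print(f"epoch {l}: lr = {lerac_lr(1e-8, 1e-4, l, k=4):.0e}")
# Prints 1e-08, 1e-07, 1e-06, 1e-05, 1e-04.
```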

In practice, we obtain optimal results by initializing the lowest learning rate \(\eta _n^{(0)}\) with a value that is around five or six orders of magnitude lower than \(\eta ^{(0)}\), while the highest learning rate \(\eta _1^{(0)}\) is always equal to \(\eta ^{(0)}\). Apart from such general practical notes, the exact LeRaC configuration for each neural architecture is established by tuning its two hyperparameters (k, \(\eta _n^{(0)}\)) on the available validation sets.

We underline that the output feature maps of a layer j are affected (i) by the misfit weights \(\theta _j^{(0)}\) of the respective layer, and (ii) by the input feature maps, which are in turn affected by the misfit weights of the previous layers \(\theta _1^{(0)},\ldots , \theta _{j-1}^{(0)}\). Hence, the noise affecting the feature maps increases with each layer processing the feature maps, being multiplied with the weights from each layer along the way. Our curriculum learning strategy imposes the training of the earlier layers at a faster pace, transforming the noisy weights into discriminative patterns. As noise from the earlier layer weights is eliminated, we train the later layers at faster and faster paces, until all learning rates become equal at epoch k.

From a technical point of view, we note that our approach can also be regarded as a way to guide the optimization, which we see as an alternative to loss function smoothing. The link between curriculum learning and loss smoothing is discussed by Soviany et al. (2022), who suggest that curriculum learning strategies induce a smoothing of the loss function, where the smoothing is higher during the early training iterations (simplifying the optimization) and lower to non-existent during the late training iterations (restoring the complexity of the loss function). LeRaC is aimed at producing a similar effect, but in a softer manner by dampening the importance of optimizing the weights of high-level layers in the early training iterations. Additionally, we empirically observe (see results in “Appendix B”) that LeRaC tends to balance the training pace of low-level and high-level features, while the conventional regime seems to update the high-level layers at a faster pace. This could provide an additional intuitive explanation of why our method works better.

4 Experiments

4.1 Data Sets

We perform experiments on 12 benchmarks: CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet (Russakovsky et al., 2015), ImageNet-1K (Russakovsky et al., 2015), Food-101 (Bossard et al., 2014), UTKFace (Zhang et al., 2017), PASCAL VOC 2007+2012 (Everingham et al., 2010), BoolQ (Clark et al., 2019), QNLI (Wang et al., 2019), RTE (Wang et al., 2019), CREMA-D (Cao et al., 2014), and ESC-50 (Piczak, 2015). We adopt the official data splits for the 12 benchmarks considered in our experiments. When a validation set is not available, we keep \(10\%\) of the training data for validation.

CIFAR-10. CIFAR-10 (Krizhevsky, 2009) is a popular data set for object recognition in images. It consists of 60,000 color images with a resolution of \(32 \times 32\) pixels. An image depicts one of 10 object classes, each class having 6000 examples. We use the official data split with a training set of 50,000 images and a test set of 10,000 images.

CIFAR-100. The CIFAR-100 (Krizhevsky, 2009) data set is similar to CIFAR-10, except that it has 100 classes with 600 images per class. There are 50,000 training images and 10,000 test images.

Tiny ImageNet. Tiny ImageNet is a subset of ImageNet-1K (Russakovsky et al., 2015) which provides 100,000 training images, 25,000 validation images and 25,000 test images representing objects from 200 different classes. The size of each image is \(64 \times 64\) pixels.

ImageNet. ImageNet-1K (Russakovsky et al., 2015) is the most popular benchmark in computer vision, comprising about 1.2 million images from 1000 object categories. We set the resolution of all images to \(224 \times 224\) pixels.

Food-101. Food-101 (Bossard et al., 2014) is a data set that contains images from 101 food categories. For each category, there are 750 training images and 250 test images. Thus, the total number of images is 101,000. We resize all images to \(224 \times 224\) pixels. The test set is manually cleaned, while the training set is purposely left uncurated, being affected by labeling noise. This makes Food-101 suitable for testing the robustness of models to labeling noise.

UTKFace. The UTKFace data set (Zhang et al., 2017) contains face images representing various gender, age and ethnic groups. It consists of 23,709 images of \(200 \times 200\) pixels. The data set is divided into 16,597 training images, 3556 validation images, and 3556 test images. Each image is annotated with the corresponding age and gender label, which makes UTKFace suitable for evaluating models in a multi-task learning setup.

PASCAL VOC 2007+2012. One of the most popular benchmarks for object detection is PASCAL VOC (Everingham et al., 2010). The data set consists of 21,503 images which are annotated with bounding boxes for 20 object categories. The official split has 16,551 training images and 4952 test images.

BoolQ. BoolQ (Clark et al., 2019) is a question answering data set for yes/no questions containing 15,942 examples. The questions are naturally occurring, being generated in unprompted and unconstrained settings. Each example is a triplet of the form: {question, passage, answer}. We use the data split provided in the SuperGLUE benchmark (Wang et al., 2019), containing 9427 examples for training, 3270 for validation and 3245 for testing.

Table 1 Optimal hyperparameter settings for the various neural architectures used in our experiments

QNLI. The QNLI (Question-answering Natural Language Inference) data set (Wang et al., 2019) is a natural language inference benchmark automatically derived from SQuAD (Rajpurkar et al., 2016). The data set contains {question, sentence} pairs and the task is to determine whether the context sentence contains the answer to the question. The data set is constructed on top of Wikipedia documents, each document being accompanied, on average, by 4 questions. We consider the data split provided in the GLUE benchmark (Wang et al., 2019), which comprises 104,743 examples for training, 5463 for validation and 5463 for testing.

RTE. Recognizing Textual Entailment (RTE) (Wang et al., 2019) is a natural language inference data set containing pairs of sentences with the target label indicating if the meaning of one sentence can be inferred from the other. The training subset includes 2490 samples, the validation set 277 samples, and the test set 3000 samples.

CREMA-D. The CREMA-D multi-modal database (Cao et al., 2014) is formed of 7442 videos of 91 actors (48 male and 43 female) of different ethnic groups. The actors perform various emotions while uttering 12 particular sentences that evoke one of the 6 emotion categories: anger, disgust, fear, happy, neutral, and sad. Following previous work (Ristea & Ionescu, 2021), we conduct experiments only on the audio modality, dividing the set of audio samples into \(70\%\) for training, \(15\%\) for validation and \(15\%\) for testing.

ESC-50. The ESC-50 (Piczak, 2015) data set is a collection of 2000 samples of 5 s each, comprising 50 classes of various common sound events. Samples are recorded at a 44.1 kHz sampling frequency, with a single channel. In our evaluation, we employ the 5-fold cross-validation procedure, as described in related works (Piczak, 2015; Ristea et al., 2022).

4.2 Experimental Setup

Architectures. To demonstrate the compatibility of LeRaC with multiple neural architectures, we select several convolutional, recurrent and transformer models. As representative convolutional neural networks (CNNs), we opt for ResNet-18 (He et al., 2016), Wide-ResNet-50 (Zagoruyko & Komodakis, 2016) and DenseNet-121 (Huang et al., 2017). For the object detection experiments on PASCAL VOC, we use the YOLOv5 (Jocher et al., 2022) model based on the CSPDarknet53 (Wang et al., 2020) backbone, which is pre-trained on the MS COCO data set (Lin et al., 2014). As representative transformers, we consider CvT-13 (Wu et al., 2021), \(\textrm{BERT}_{{\mathrm{uncased-large}}}\) (Devlin et al., 2019) and SepTr (Ristea et al., 2022). For CvT, we consider both pre-trained and randomly initialized versions. We use an uncased large pre-trained version of BERT. Following Ristea et al. (2022), we train SepTr from scratch. In addition, we employ a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997) to represent recurrent neural networks (RNNs). The recurrent neural network contains two LSTM layers, each having a hidden dimension of 256 components. These layers are preceded by one embedding layer with the embedding size set to 128 elements. The output of the last recurrent layer is passed to a classifier composed of two fully connected layers. The LSTM is activated by rectified linear units (ReLU). We apply the aforementioned models on distinct input data types, considering the intended application domain of each model. Hence, ResNet-18, Wide-ResNet-50, CvT and YOLOv5 are applied on images, BERT and LSTM are applied on text, and SepTr and DenseNet-121 are applied on audio.
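Since the recurrent baseline is only described in words, we provide a minimal PyTorch sketch of it below; details such as the classifier width, the placement of the ReLU activation and the handling of padding are our assumptions, as they are not fully specified above.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding (128) -> 2 x LSTM (hidden size 256) -> 2 fully connected layers."""
    def __init__(self, vocab_size, num_classes, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        # Two fully connected layers; we assume the ReLU sits between them.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, 128)
        outputs, _ = self.lstm(embedded)         # (batch, seq_len, 256)
        last_state = outputs[:, -1, :]           # output of the last recurrent step
        return self.classifier(last_state)

model = LSTMClassifier(vocab_size=20_000, num_classes=2)
logits = model(torch.randint(0, 20_000, (8, 200)))   # text inputs are capped at 200 tokens
```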

Multi-task architectures. To determine the impact of LeRaC on multi-task learning models, we conduct experiments on the UTKFace data set, where the face images are annotated with gender and age labels. We consider two models for the multi-task learning setup, namely ResNet-18 and CvT-13. Each model is jointly trained on the two tasks (gender prediction and age estimation). To each model, we attach two heads, one for gender classification and one for age estimation, respectively. The classification head is trained using the cross-entropy loss with respect to the gender label, while the regression head uses the mean squared error with respect to the age label. The models are trained using a joint objective defined as follows:

$$\begin{aligned} {\mathcal {L}}_{{\textrm{MTL}}} \!=\! \frac{1}{m} \sum _{i=1}^m {\mathcal {L}}_{{\textrm{CE}}} \left( y^{g}_i, {\hat{y}}^{g}_i \right) \!+\! \lambda \!\cdot \!{\mathcal {L}}_{{\textrm{MSE}}} \left( y^{a}_i, {\hat{y}}^{a}_i \right) , \end{aligned}$$
(10)

where \(y^{g}_i\) and \(y^{a}_i\) are the ground-truth gender and age labels, \({\hat{y}}^{g}_i\) and \({\hat{y}}^{a}_i\) are the predicted gender and age labels, \(\lambda \in {\mathbb {R}}^+\) is a weight factor, and \({\mathcal {L}}_{{\textrm{CE}}}\) is the cross-entropy loss for the gender prediction task, defined as:

$$\begin{aligned} {\mathcal {L}}_{{\textrm{CE}}}\! \left( y^{g}_i, {\hat{y}}^{g}_i \right) \!=\!-\left( y^g_i \log ({\hat{y}}^g_i)\!+\!(1\!-\!y^g_i) \log (1 - {\hat{y}}^g_i) \right) , \end{aligned}$$
(11)

and \({\mathcal {L}}_{{\textrm{MSE}}}\) is the mean squared error for the age estimation task, defined as:

$$\begin{aligned} {\mathcal {L}}_{{\textrm{MSE}}} \left( y^{a}_i, {\hat{y}}^{a}_i \right) = (y^{a}_i - {\hat{y}}^{a}_i)^2. \end{aligned}$$
(12)

The factor \(\lambda \) ensures the two tasks are equally important by weighting \({\mathcal {L}}_{{\textrm{MSE}}}\) to have approximately the same range of values as \({\mathcal {L}}_{{\textrm{CE}}}\). As such, we set \(\lambda =10\).
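For clarity, a minimal PyTorch sketch of the joint objective in Eqs. (10)-(12) is given below; the tensor shapes and the use of the logit-based cross-entropy are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(gender_logits, age_pred, gender_labels, age_labels, lam=10.0):
    """Joint objective from Eq. (10): binary cross-entropy for gender (Eq. 11)
    plus a lambda-weighted mean squared error for age (Eq. 12)."""
    # binary_cross_entropy_with_logits is a numerically stable form of Eq. (11).
    loss_ce = F.binary_cross_entropy_with_logits(gender_logits, gender_labels)
    loss_mse = F.mse_loss(age_pred, age_labels)   # Eq. (12), averaged over the batch
    return loss_ce + lam * loss_mse

# Toy batch of 4 samples, one output per head.
loss = multi_task_loss(gender_logits=torch.randn(4),
                       age_pred=torch.randn(4),
                       gender_labels=torch.tensor([0., 1., 1., 0.]),
                       age_labels=torch.tensor([23., 41., 35., 60.]))
```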

Baselines. We compare LeRaC with two baselines: the conventional training regime (which uses early stopping, reduces the learning rate on plateau, and employs linear warm-up and cosine annealing when required) and the state-of-the-art Curriculum by Smoothing (Sinha et al., 2020). For CBS, we use the official code released by Sinha et al. (2020) at https://github.com/pairlab/CBS, to ensure the reproducibility of their method in our experimental settings, which include a more diverse selection of input data types and neural architectures. In addition, we compare with several data-level and task-level curriculum learning methods (Dogan et al., 2020; Wang et al., 2023; Khan et al., 2023a, b, 2024) on CIFAR-10 and CIFAR-100.

To apply CBS to non-convolutional architectures, we use 1D convolutional layers based on Gaussian filters with a receptive field of 3. For transformers, we integrate a 1D Gaussian layer before each transformer block, so the smoothing is applied on the sequence of tokens. Similarly, for recurrent neural networks, before each LSTM layer, we process the sequence of tokens with 1D convolutional layers based on Gaussian filters. For both transformers and RNNs, we anneal, during training, the standard deviation of the Gaussian filters to enhance the information propagated through the network. This approach mirrors the implementation of CBS for convolutional neural networks.
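For illustration, a sketch of such a 1D Gaussian smoothing layer applied to a token sequence is given below, assuming a kernel with a receptive field of 3 whose standard deviation is annealed from outside the function; this is our own rendering of the adaptation, not the code of Sinha et al. (2020).

```python
import torch
import torch.nn.functional as F

def gaussian_smooth_tokens(tokens, sigma):
    """Smooth a token sequence (batch, seq_len, dim) with a 1D Gaussian kernel of size 3,
    applied depthwise (one identical kernel per embedding dimension)."""
    offsets = torch.tensor([-1.0, 0.0, 1.0])
    kernel = torch.exp(-offsets ** 2 / (2 * sigma ** 2))
    kernel = kernel / kernel.sum()                      # normalize to preserve the scale
    dim = tokens.shape[-1]
    kernel = kernel.view(1, 1, 3).repeat(dim, 1, 1)     # (dim, 1, 3): one kernel per channel
    x = tokens.transpose(1, 2)                          # (batch, dim, seq_len)
    x = F.conv1d(x, kernel, padding=1, groups=dim)      # depthwise smoothing along the sequence
    return x.transpose(1, 2)

tokens = torch.randn(2, 16, 64)
smoothed = gaussian_smooth_tokens(tokens, sigma=1.0)    # sigma is decayed during training
```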

Hyperparameter tuning. We tune all hyperparameters on the validation set of each benchmark. In Table 1, we present the optimal hyperparameters chosen for each architecture. In addition to the standard parameters of the training process, we report the parameters that are specific for the CBS (Sinha et al., 2020) and LeRaC strategies. In the case of CBS, \(\sigma \) denotes the standard deviation of the Gaussian kernel, d is the decay rate for \(\sigma \), and u is the decay step. Regarding the parameters of LeRaC, k represents the number of iterations used in Eq. (9), and \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\) are the initial learning rates for the first and last layers of the architecture, respectively. We set \(\eta _1^{(0)} = \eta ^{(0)}\) and \(c= 10\) in all experiments, without tuning. In addition, the intermediate learning rates \(\eta _j^{(0)}\), \(\forall j \in \{2, 3,\ldots ,n-1\}\), are automatically set to be equally distanced between \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). Moreover, \(\eta _j^{(k)} = \eta ^{(0)}\), i.e. the initial learning rates of LeRaC converge to the original learning rate set for the conventional training regime. All models are trained with early stopping and the learning rate is reduced by a factor of 10 when the loss reaches a plateau. We use linear warm-up with cosine annealing, whenever it is found useful for models based on conventional or CBS training. The learning rate warm-up is switched off for LeRaC to avoid unwanted interactions with our training strategy. Except for the pre-trained models, the weights of all models are initialized with Glorot initialization (Glorot & Bengio, 2010).

We underline that some parameters are the same across all data sets, while others need to be established per data set. For example, the parameter u of CBS and the parameter k of LeRaC are validated on each data set. As such, for the ResNet-18 model, the parameter u of CBS takes one value on each data set (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet, Food-101, UTKFace), but the values of u across these data sets can range between 2 and 5. Similarly, the parameter k of LeRaC takes one value per data set, with the range of values being 5–7. In Table 1, we aggregate the optimal parameters of each model for all data sets. This explains why some hyperparameters are specified in terms of ranges.

Setting the initial learning rates. We should emphasize that the different learning rates \(\eta _j^{(0)}\), \(\forall j \in \{1,2,\ldots ,n \}\), are not optimized nor tuned during training. Instead, we set the initial learning rates \(\eta _j^{(0)}\) through validation, such that \(\eta _n^{(0)}\) is around five or six orders of magnitude lower than \(\eta ^{(0)}\), and \(\eta _1^{(0)}=\eta ^{(0)}\). After initialization, we apply our exponential scheduler, until all learning rates become equal at iteration k. In addition, we would like to underline that the difference \(\delta \) between the initial learning rates of consecutive layers is automatically set based on the range given by \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). For example, let us consider a network with 5 layers. If we choose \(\eta _1^{(0)}=10^{-1}\) and \(\eta _5^{(0)}=10^{-2}\), then the intermediate initial learning rates are automatically set to \(\eta _2^{(0)}=10^{-1.25}\), \(\eta _3^{(0)}=10^{-1.5}\), \(\eta _4^{(0)}=10^{-1.75}\), i.e. \(\delta \) is used in the exponent and is equal to \(-0.25\) in this case. To obtain the intermediate learning rates according to this example, we actually apply the exponential scheduler defined in Eq. (9). This reduces the number of tunable hyperparameters from n (the number of layers) to two, namely \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). We go even further, setting \(\eta _1^{(0)}=\eta ^{(0)}\) without tuning, in all our experiments. Hence, tuning is only performed for the initial learning rate of the last layer, namely \(\eta _n^{(0)}\). Although tuning all \(\eta _j^{(0)}\), \(\forall j \in \{1, 2,\ldots ,n\}\), might lead to better results, we refrain from meticulously tuning every possible value to avoid overfitting in hyperparameter space.
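In other words, the initial per-layer learning rates are obtained by geometric interpolation between \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). The sketch below reproduces the five-layer example and attaches the resulting rates to a model through one optimizer parameter group per layer; the parameter-group mechanism is our assumption about one possible implementation, not necessarily the authors' code.

```python
import numpy as np
import torch

def initial_layer_lrs(eta_first, eta_last, num_layers):
    """Geometrically interpolate the initial learning rates between the first and last layer."""
    exponents = np.linspace(np.log10(eta_first), np.log10(eta_last), num_layers)
    return [10.0 ** e for e in exponents]

lrs = initial_layer_lrs(1e-1, 1e-2, num_layers=5)
print([f"{lr:.4e}" for lr in lrs])   # 10**-1, 10**-1.25, 10**-1.5, 10**-1.75, 10**-2

# One parameter group per layer; each group's lr is then raised over time following Eq. (9).
model = torch.nn.Sequential(*[torch.nn.Linear(8, 8) for _ in range(5)])
param_groups = [{"params": layer.parameters(), "lr": lr} for layer, lr in zip(model, lrs)]
optimizer = torch.optim.SGD(param_groups, lr=lrs[0], momentum=0.9)
```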

Number of hyperparameters. We further emphasize that LeRaC adds only two additional tunable hyperparameters with respect to the conventional training regime. These are the lowest learning rate \(\eta _n^{(0)}\) and the number of iterations k to employ LeRaC. We reduce the number of hyperparameters that require tuning by using a fixed rule to adjust the intermediate learning rates, e.g. by employing an exponential scheduler, or by fixing some hyperparameters, e.g. \(c=10\). In contrast, CBS (Sinha et al., 2020) has three additional hyperparameters, thus having one extra hyperparameter compared with LeRaC. Furthermore, we note that data-level curriculum methods also introduce additional hyperparameters. Even a simple method that splits the examples into easy-to-hard batches that are gradually added to the training set requires at least two hyperparameters: the number of batches, and the number of iterations before introducing a new training batch. We thus believe that, in terms of the number of additional hyperparameters, LeRaC is comparable to CBS and other curriculum learning strategies. We emphasize that the same happens if we look at new optimizers, e.g. Adam (Kingma & Ba, 2015) adds three additional hyperparameters compared with SGD.

Table 2 Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for various neural models based on different training regimes: learning rate decay, linear warm-up, cosine annealing, constant learning rate, and LeRaC

Avoiding too large learning rates. In principle, a larger learning rate implies a larger update. However, if the learning rate is too high, the model can actually diverge. This is because the gradient describes the loss function in the vicinity of the current location, providing no guarantee for the value of the loss outside this vicinity. Our implementation takes this aspect into account. Instead of increasing the learning rates of the earlier layers, we reduce the learning rates of the deeper layers to avoid divergence. More precisely, we set the learning rate for the first layer \(\eta _1^{(0)}\) to the original learning rate \(\eta ^{(0)}\) and the other initial learning rates are gradually reduced with each layer. During training, the lower learning rates are gradually increased, until epoch k. Hence, LeRaC actually slows down the learning for deeper layers, until the earlier layers have learned representative features.

Evaluation. For the classification tasks, we evaluate all models in terms of the accuracy rate. For the regression task (age estimation), we use the mean absolute error. For the object detection task, we employ the mean Average Precision (mAP) at an intersection over union (IoU) threshold of 0.5. We repeat the training process of each model for 5 times and report the average performance and the standard deviation.

Table 3 Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101 for various neural models based on different training regimes: conventional, CBS (Sinha et al., 2020) and LeRaC

4.3 Domain-Specific Preprocessing

Image preprocessing. For the image classification experiments, we apply the same data preprocessing approach as Sinha et al. (2020). Hence, we normalize the images and maintain their original resolution, \(32 \times 32\) pixels for CIFAR-10 and CIFAR-100, \(64 \times 64\) pixels for Tiny ImageNet, \(224 \times 224\) pixels for ImageNet and Food-101, and \(200 \times 200\) pixels for UTKFace. Similar to Sinha et al. (2020), we do not employ data augmentation.

Text preprocessing. For the text classification experiments with BERT, we lowercase all words and add the classification token ([CLS]) at the start of the input sequence. We add the separator token ([SEP]) to delimit sentences. For the LSTM network, we lowercase all words and replace them with indexes from vocabularies constructed from the training set. The input sequence length is limited to 512 tokens for BERT and 200 tokens for LSTM.

Speech preprocessing. The speech preprocessing steps are carried out following Ristea et al. (2022). We thus transform each audio sample into a time-frequency matrix by computing the discrete Short Time Fourier Transform (STFT) with \(N_x\) FFT points, using a Hamming window of length L and a hop size R. For CREMA-D, we first standardize all audio clips to a fixed dimension of 4 seconds by padding or clipping the samples. Then, we apply the STFT with \(N_x=1024\), \(R=64\) and a window size of \(L=512\). For ESC-50, we keep the same values for \(N_x\) and L, but we increase the hop size to \(R = 128\). Next, for each STFT, we compute the square root of the magnitude and map the values to 128 Mel bins. The result is converted to a logarithmic scale and normalized to the interval [0, 1]. Furthermore, in all our speech classification experiments, we use the following data augmentation methods: noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment (Park et al., 2019).
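A sketch of this pipeline for a single CREMA-D clip is given below, using librosa; the exact padding strategy, the choice of log transform (we use log1p) and the sampling rate handling are our assumptions, and the file path is hypothetical.

```python
import numpy as np
import librosa

def preprocess_clip(path, target_seconds=4, n_fft=1024, hop=64, win=512, n_mels=128):
    """STFT -> square root of the magnitude -> 128 Mel bins -> log scale -> [0, 1] normalization."""
    audio, sr = librosa.load(path, sr=None)
    target_len = int(target_seconds * sr)
    # Standardize the clip to a fixed duration by clipping or zero-padding.
    if len(audio) >= target_len:
        audio = audio[:target_len]
    else:
        audio = np.pad(audio, (0, target_len - len(audio)))
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop, win_length=win, window="hamming")
    magnitude = np.sqrt(np.abs(stft))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_basis @ magnitude
    log_spec = np.log1p(mel_spec)
    return (log_spec - log_spec.min()) / (log_spec.max() - log_spec.min() + 1e-8)

features = preprocess_clip("crema_d_sample.wav")   # hypothetical file path
```

For ESC-50, the hop size would be set to 128, matching the larger hop size mentioned above.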

Table 4 Multi-task learning results for ResNet-18 and CvT-13 (pre-trained) on UTKFace, using three different training regimes: conventional, CBS (Sinha et al., 2020) and LeRaC
Table 5 Object detection results of YOLOv5 on PASCAL VOC, using three different training regimes: conventional, CBS (Sinha et al., 2020) and LeRaC
Table 6 Left side: average accuracy rates (in %) over 5 runs on BoolQ, RTE and QNLI for BERT and LSTM
Fig. 3 Validation accuracy (on the y-axis) versus training time (on the x-axis) for four distinct architectures. The number of training epochs is the same for both LeRaC and CBS, the observable time difference being caused by the overhead of CBS due to the Gaussian kernel layers

4.4 Preliminary Results

We present preliminary experiments to show the effect of various learning rate schedulers for different architectures. For each architecture, we compare the constant learning rate scheduler with an adaptive learning rate scheduler. The aim is to find the best scheduler for the conventional training regime, which is used as baseline in the subsequent experiments. Table 2 showcases the preliminary results on CIFAR-10, CIFAR-100 and Tiny ImageNet. We compare the outcomes of the adaptive and constant learning rate schedulers with those of LeRaC. In most cases, the adaptive scheduler yields better results than the constant learning rate. Using a constant learning rate seems to work only for the pre-trained CvT-13. Notably, the analysis also reveals that LeRaC consistently outperforms the other baseline schedulers, achieving the highest accuracy rates across all data sets.

We emphasize that, for the subsequent experiments, the conventional regime is always represented by the best scheduler among the following options: learning rate decay, learning rate warm-up, cosine annealing, or combinations of the aforementioned options.

4.5 Main Results

Image classification. In Table 3, we present the image classification results on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101. Since CvT-13 is pre-trained on ImageNet, it does not make sense to fine-tune it on ImageNet. Thus, the respective results are not reported. On the one hand, there are two scenarios (ResNet-18 on CIFAR-100, and CvT-13 on CIFAR-100) in which CBS provides the largest improvements over the conventional regime, surpassing LeRaC in the respective cases. On the other hand, there are more than 10 scenarios where CBS degrades the accuracy with respect to the standard training regime. This shows that the improvements attained by CBS are inconsistent across models and data sets. Unlike CBS, our strategy surpasses the baseline regime in all 19 cases, thus being more consistent. In 8 of these cases, the accuracy gains of LeRaC are higher than \(1\%\). Moreover, LeRaC outperforms CBS in 17 out of 19 cases. We thus consider that LeRaC can be regarded as a better choice than CBS, bringing consistent performance gains.

Multi-task learning. In Table 4, we include the multi-task learning results on the UTKFace data set (Zhang et al., 2017). We evaluate the multi-task ResNet-18 and \(\text{ CvT-13}_{{\text{ pre-trained }}}\) models under various training regimes, reporting the accuracy rates for gender prediction, and the mean absolute errors for age estimation, respectively. LeRaC achieves the best scores in each and every case, surpassing the other training regimes in the multi-task learning setup. Moreover, its results are statistically better with respect to both competing regimes. In contrast, the CBS regime remains in the statistical margin of the conventional regime for the pre-trained CvT-13 network.

Object detection. In Table 5, we include the object detection results of YOLOv5 (Jocher et al., 2022) based on different training regimes on PASCAL VOC 2007+2012 (Everingham et al., 2010). LeRaC exhibits a superior mAP score, significantly surpassing the other training regimes. In contrast, CBS leads to suboptimal performance, hinting towards the inconsistency of CBS across different evaluation scenarios.

Text classification. In Table 6 (left side), we report the text classification results on BoolQ, RTE and QNLI. Here, there are two cases (BERT on QNLI and LSTM on RTE) where CBS leads to performance drops compared with the conventional training regime. In all other cases, the improvements of CBS are below \(0.6\%\). Just as in the image classification experiments, LeRaC brings accuracy gains for each and every model and data set. In four out of six scenarios, the accuracy gains yielded by LeRaC are higher than \(1.3\%\). Once again, LeRaC proves to be the most consistent regime, generally surpassing CBS by significant margins.

Speech classification. In Table 6 (right side), we present the results obtained on the audio data sets, namely CREMA-D and ESC-50. We observe that the CBS strategy obtains lower results compared with the baseline in two cases (SepTr on CREMA-D and DenseNet-121 on ESC-50), while our method provides superior results for each and every case. By applying LeRaC on SepTr, we set a new state-of-the-art accuracy level (\(70.95\%\)) on the CREMA-D audio modality, surpassing the previous state-of-the-art value attained by Ristea et al. (2022) with SepTr alone. When applied on DenseNet-121, LeRaC brings performance improvements higher than \(1\%\), the highest improvement (\(1.78\%\)) over the baseline being attained on CREMA-D.

Significance testing. To determine if the reported accuracy gains observed for LeRaC with respect to the baseline are significant, we apply McNemar/Cochran Q significance testing (Dietterich, 1998) to the results reported in Tables 3, 4, 5 and 6 on all 12 data sets. In 27 of 34 cases, we found that our results are significantly better than the corresponding baseline, at a p-value of 0.001. This confirms that our gains are statistically significant in the majority of cases.
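As an illustration of the procedure, the snippet below runs a McNemar test on a 2x2 contingency table of paired per-sample outcomes, using statsmodels; the counts are invented for the example and do not correspond to any table in the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: baseline correct / wrong; columns: LeRaC correct / wrong (hypothetical counts).
table = np.array([[8500, 150],
                  [450, 900]])

# Chi-square version of McNemar's test, appropriate for large discordant counts.
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.2f}, p-value = {result.pvalue:.2e}")
```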

Table 7 Average accuracy rates (in %) over 5 runs for ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained) on CIFAR-10 and CIFAR-100 using different training regimes: conventional, CBS (Sinha et al., 2020), LSCL (Dogan et al., 2020), EfficientTrain (Wang et al., 2023), Self-taught (Khan et al., 2024), CLIP (Khan et al., 2023a), LCDnet-CL (Khan et al., 2023b) and LeRaC (ours)

Training time comparison. For a particular model and data set, all training regimes are executed for the same number of epochs, for a fair comparison. However, the CBS strategy adds the smoothing operation at multiple levels inside the architecture, which increases the training time. To this end, we compare the training time (in hours) versus the validation accuracy of CBS and LeRaC. For this experiment, we selected four neural models and illustrate the evolution of the validation accuracy over time in Fig. 3. We underline that LeRaC introduces faster convergence times, being around 7-12% faster than CBS. Naturally, LeRaC requires the same training time as the conventional regime.

4.6 More Comparative Results

Comparing with domain-specific curriculum learning strategies. Although we consider CBS (Sinha et al., 2020) as our closest competitor in terms of applicability across architectures and domains, there are domain-specific curriculum learning methods reporting promising results. To this end, we perform additional experiments on CIFAR-10 and CIFAR-100 with ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained), considering two recent curriculum learning strategies applied in the image domain, namely Label-Similarity Curriculum Learning (LSCL) (Dogan et al., 2020) and EfficientTrain (Wang et al., 2023).

Table 8 Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for CvT-13 based on different training regimes: conventional, LeRaC with logarithmic update, LeRaC with linear update, and LeRaC with exponential update (proposed)

Dogan et al. (2020) proposed LSCL, a strategy that relies on hierarchically clustering the classes (labels) based on inter-label similarities determined with the help of document embeddings representing the Wikipedia pages of the respective classes. The corresponding results shown in Table 7 indicate that label-similarity curriculum is useful for CIFAR-100, but not for CIFAR-10. This suggests that the method needs a sufficiently large number of classes to benefit from the constructed hierarchy of classes. In contrast, LeRaC does not rely on external components, such as the similarity measure used by Dogan et al. (2020) in their strategy. Another important limitation of LSCL (Dogan et al., 2020) is its restricted use, e.g. LSCL is not applicable to regression tasks, where there are no classes. Therefore, we consider LeRaC as a more versatile alternative.

EfficientTrain is an alternative to CBS, which introduces a cropping operation in the Fourier spectrum of the inputs instead of blurring the activation maps. The method is not suitable for text data, so the comparison between EfficientTrain and LeRaC can only be performed in the image domain. Consequently, we compare with EfficientTrain (Wang et al., 2023) on CIFAR-10 and CIFAR-100, and show the corresponding results in Table 7. Notably, our method surpasses EfficientTrain (Wang et al., 2023) in 4 out of 6 evaluation scenarios. These results confirm the competitiveness of LeRaC in comparison to very recent methods, such as EfficientTrain (Wang et al., 2023).

Aside from outperforming EfficientTrain and LSCL in the image domain, our method has another important advantage: it is generally applicable to any domain.

Comparing with data-level curriculum learning methods. In Table 7, we also compare LeRaC with three data-level curriculum learning methods (Khan et al., 2024, 2023a, b). These methods share a common framework, where a scoring function ranks samples based on their difficulty, and a pacing function determines the timing for introducing new batches during training. Khan et al. (2024) examine various pacing functions and classify scoring functions into two categories: self-taught and transfer-scoring functions. Self-taught functions involve training a model on a subset of data batches and then using this model to assess the difficulty of examples. In contrast, transfer-scoring functions utilize a pre-trained model for this purpose. For the results reported in Table 7 for Khan et al. (2024), we use the self-taught scoring function and a linear pacing function. To compare with Khan et al. (2023b), we use a transfer-scoring function and a ResNet-50 model pre-trained on ImageNet. For Khan et al. (2023a), aside from using the pre-trained model for assessing the difficulty of the samples, we also remove the least significant samples during training.

The results reported in Table 7 indicate that LeRaC outperforms the data-level curriculum learning methods. We note that these methods were exclusively tested on crowd density estimation tasks, which may explain why their effectiveness does not generalize to other types of tasks. For instance, the method described by Khan et al. (2023a) is suboptimal even when compared with conventional training, suggesting that the strategy of removing easy examples is not always effective for image classification tasks.

Fig. 4 Average SNR of the feature maps at each layer of the randomly initialized LeNet architecture. The SNR at each layer is averaged for 100 randomly picked images from the CIFAR-100 data set. For the later layers, the SNR is negative because the signal is dominated by noise

4.7 Ablation Studies

Comparing different schedulers. We first aim to establish whether the exponential learning rate scheduler proposed in Eq. (9) is a good choice. To test this, we select the CvT-13 model and modify the LeRaC regime to use linear or logarithmic updates of the learning rates. The corresponding results are shown in Table 8. We observe that both alternative schedulers obtain performance gains, but our exponential learning rate scheduler brings higher gains on all three data sets. We thus conclude that the update rule defined in Eq. (9) is a sound option.
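To make the comparison concrete, the sketch below shows one plausible way to implement the three update rules, interpolating each layer's learning rate from its initial value \(\eta _j^{(0)}\) towards the common target value over the first k iterations. The exact parameterization of Eq. (9) may differ, so this should be read as an illustration rather than a reference implementation.

```python
import numpy as np

def lerac_lr(eta0_j, eta_target, i, k, mode="exp"):
    # Illustrative per-layer learning rate at iteration i (0 <= i <= k), where
    # eta0_j is the initial learning rate of layer j (lower for deeper layers)
    # and eta_target is the common value reached by all layers at iteration k.
    t = min(i / k, 1.0)  # training progress in [0, 1]
    if mode == "linear":  # linear interpolation
        return eta0_j + t * (eta_target - eta0_j)
    if mode == "log":  # logarithmic pace: fast growth early, slower later
        return eta0_j + np.log1p(t * (np.e - 1)) * (eta_target - eta0_j)
    # exponential pace: linear interpolation in log-space
    return eta0_j * (eta_target / eta0_j) ** t

# Example: a layer starting at 1e-5 that reaches 1e-3 after k = 5 iterations.
print([round(lerac_lr(1e-5, 1e-3, i, 5), 6) for i in range(6)])
```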

Our previous ablation study shows that the exponential scheduler leads to higher gains than the linear or the logarithmic schedulers. In general, a suitable scheduler is one that adjusts the learning rate at each layer proportionally to the estimated drop in signal-to-noise ratio (SNR) from one layer to the next. To understand how the average SNR drops from one neural layer to the next, we plot the average SNR of the feature maps at each layer of the randomly initialized LeNet architecture, computed over 100 images from CIFAR-100, in Fig. 4. As anticipated, the average SNR decreases along with the layer index. Notably, we observe that the drop in SNR follows an exponential trend. This can explain why the exponential scheduler is a more suitable choice.
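As an indication of how such a measurement can be reproduced, the sketch below estimates a per-layer SNR on a randomly initialized LeNet-style network, under the assumption that the signal is the response to a clean input and the noise is the response deviation caused by a small input perturbation. The exact SNR estimator and layer configuration behind Fig. 4 may differ, so the snippet should only be taken as a rough illustration of the measured trend.

```python
import torch
import torch.nn as nn

# A LeNet-style stack for 32x32 RGB inputs (layer sizes are illustrative).
layers = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 6, 5), nn.Tanh(), nn.AvgPool2d(2)),
    nn.Sequential(nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2)),
    nn.Sequential(nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.Tanh()),
    nn.Sequential(nn.Linear(120, 84), nn.Tanh()),
    nn.Linear(84, 100),
])

@torch.no_grad()
def layerwise_snr_db(images, noise_std=0.1):
    # Average SNR (in dB) of the feature maps at each layer, where the "noise"
    # is the deviation caused by perturbing the input of the random network.
    clean, noisy = images, images + noise_std * torch.randn_like(images)
    snrs = []
    for layer in layers:
        clean, noisy = layer(clean), layer(noisy)
        signal_power = clean.pow(2).mean()
        noise_power = (noisy - clean).pow(2).mean()
        snrs.append(10 * torch.log10(signal_power / noise_power).item())
    return snrs

# E.g., 100 random tensors standing in for CIFAR-100 images:
print(layerwise_snr_db(torch.rand(100, 3, 32, 32)))
```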

Fig. 5 Test accuracy (on the y-axis) versus training time (on the x-axis) for ResNet-18 on CIFAR-100 with various curriculum schedulers. The dashed line corresponds to the conventional regime, while the continuous lines correspond to LeRaC with various schedulers. Best viewed in color

Fig. 6 Test accuracy (on the y-axis) versus training time (on the x-axis) for the pre-trained CvT-13 on CIFAR-10 with various curriculum schedulers. The dashed line corresponds to the conventional regime, while the continuous lines correspond to LeRaC with various schedulers. Best viewed in color

To further justify our preference for the exponential scheduler, we analyze the training progress of the ResNet-18 and the pre-trained CvT-13 models using the logarithmic, linear and exponential schedulers for LeRaC. Figure 5 shows the results for ResNet-18, while Fig. 6 illustrates the results for CvT-13. In both cases, the exponential scheduler leads to better training progress than the conventional regime, while the linear and logarithmic schedulers are less effective. These results further confirm that the exponential scheduler is the best choice among the considered options.

Table 9 Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 based on different ranges for the initial learning rates

Varying value ranges for initial learning rates. All our hyperparameters are either fixed without tuning or tuned on the validation data. In this ablation experiment, we present results with LeRaC using multiple ranges for \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\) to demonstrate that LeRaC is sufficiently stable with respect to suboptimal hyperparameter choices. We carry out experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100 and report the corresponding results in Table 9. We observe that all hyperparameter configurations surpass the baseline regime. This indicates that LeRaC can bring performance gains even outside the optimal learning rate bounds, demonstrating low sensitivity to suboptimal hyperparameter tuning.

Table 10 Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 using the LeRaC regime until iteration k, while varying k
Table 11 Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D, based on different training regimes: conventional, anti-LeRaC and LeRaC

Varying \({\textbf{k}}\). In Table 10, we present additional results with ResNet-18 and Wide-ResNet-50 on CIFAR-100, considering various values for k (the last iteration for our training regime). We observe that all configurations surpass the baselines on CIFAR-100. Moreover, the optimal value for k obtained on the validation set (\(k=7\) for both ResNet-18 and Wide-ResNet-50) is not the value producing the best results on the test set. This confirms that we did not overfit the hyperparameters of LeRaC.

Anti-curriculum. Since our goal is to perform curriculum learning (from easy to hard), we restrict the settings for \(\eta _j\), \(\forall j \in \{1,2,\ldots ,n \}\), such that deeper layers start with lower learning rates. However, another strategy is to consider the opposite setting, where we use higher learning rates for deeper layers. If we train the later layers at a faster pace (anti-curriculum), we conjecture that the later layers adapt to the noise coming from the early layers, which is likely to lead to local optima or difficult training (due to the need to readapt to the earlier layers once these layers start learning useful features). We tested this approach (anti-LeRaC), which belongs to the category of anti-curriculum learning strategies (Soviany et al., 2022), in a set of new experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D. We report the corresponding results with LeRaC and anti-LeRaC in Table 11. Although anti-curriculum, e.g. hard negative sample mining, was shown to be useful in other tasks (Soviany et al., 2022), our results indicate that learning rate anti-curriculum attains inferior performance compared with our approach. Furthermore, anti-LeRaC also falls below the conventional regime, confirming our conjecture regarding this strategy.

Summary. Notably, our ablation results show that the majority of hyperparameter configurations tested for LeRaC outperform the conventional regime, demonstrating the stability of LeRaC. We present additional experiments in “Appendix C”.

5 Discussion

Interaction with optimization algorithms. Throughout our experiments, we use the same optimizer for a given neural model across all training regimes (conventional, CBS, LeRaC). The best optimizer for each neural model is established under the conventional training regime. We underline that our initial learning rates and scheduler are used independently of the optimizers. Although our learning rate scheduler updates the learning rates at the beginning of every iteration, we did not observe any stability or interaction issues with any of the optimizers (SGD, Adam, AdaMax, AdamW).
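In practice, the per-layer learning rates can be exposed to any such optimizer through parameter groups, as sketched below in PyTorch. The layer list, the initial learning rates and the exponential pace are illustrative values rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in network; the layer list is illustrative
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# One parameter group per parametric layer, with deeper layers starting lower.
layers = [m for m in model if any(p.requires_grad for p in m.parameters(recurse=False))]
initial_lrs = [1e-3, 1e-4, 1e-5]  # illustrative values for eta_j^(0), j = 1..n
target_lr, k = 1e-3, 5            # common learning rate and last LeRaC iteration

optimizer = torch.optim.Adam(
    [{"params": layer.parameters(), "lr": lr} for layer, lr in zip(layers, initial_lrs)]
)

def lerac_step(iteration):
    # Raise each group's learning rate exponentially towards target_lr until iteration k.
    t = min(iteration / k, 1.0)
    for group, lr0 in zip(optimizer.param_groups, initial_lrs):
        group["lr"] = lr0 * (target_lr / lr0) ** t
```

Calling lerac_step at the beginning of every iteration reproduces the behavior described above, regardless of whether SGD, Adam, AdaMax or AdamW is plugged in.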

Interaction with other curriculum learning strategies. Our simple and generic curriculum learning scheme can be integrated into any model for any task, without relying on domain- or task-dependent information, e.g. the data samples. In Table 16 from “Appendix C”, we show that combining LeRaC and CBS can boost performance. In a similar fashion, LeRaC can be combined with data-level curriculum strategies for improved performance. We leave this exploration for future work.

Interaction with other learning rate schedulers. Whenever a learning rate scheduler is used for training a model in our experiments, we simply replace the scheduler with LeRaC until epoch k. For example, all the baseline CvT results are based on linear warm-up with cosine annealing, this being the recommended scheduler for CvT (Wu et al., 2021). When we introduce LeRaC, we simply deactivate alternative schedulers between epochs 0 and k. In general, we recommend deactivating other schedulers while LeRaC is active, both for simplicity and to avoid stability issues.
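A minimal sketch of this handover is given below, assuming a stand-in model whose recommended scheduler is cosine annealing; the epoch budget and the value of k are placeholders.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=200)  # the model's usual scheduler
k = 5  # last LeRaC epoch (placeholder)

for epoch in range(200):
    if epoch <= k:
        # LeRaC regime: set the per-layer learning rates here (see the previous sketch);
        # the alternative scheduler stays inactive during these epochs.
        pass
    # ... run one training epoch over the data loader ...
    if epoch > k:
        scheduler.step()  # resume the standard scheduler once LeRaC ends
```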

Limitations of our work. One limitation is the need to disable other learning rate schedulers while using LeRaC. We already tested this scenario with linear warm-up with cosine annealing, which is removed when using LeRaC, observing consistent performance gains (see Table 3). However, disabling alternative learning rate schedulers might bring performance drops in other cases. Hence, this has to be decided on a case-by-case basis. Another limitation is the possibility of encountering longer training times or poor convergence when the hyperparameters are not properly configured. We recommend hyperparameter tuning on the validation set to avoid this outcome.

6 Conclusion

In this paper, we introduced a novel model-level curriculum learning approach that is based on starting the training process with increasingly lower learning rates per layer, as the layers get closer to the output. We conducted comprehensive experiments on 12 data sets from three domains (image, text and audio), considering multiple neural architectures (CNNs, RNNs and transformers), to compare our novel training regime (LeRaC) with a state-of-the-art regime (CBS (Sinha et al., 2020)), as well as the conventional training regime (based on early stopping and reduce on plateau). The empirical results demonstrate that LeRaC is significantly more consistent than CBS, perhaps being one of the most versatile curriculum learning strategies to date, due to its compatibility with multiple neural models and its usefulness across different domains. Remarkably, all these benefits come for free, i.e. LeRaC does not add any extra training time over the conventional approach.